04 Mar, 2011
1 commit
-
When a DNS resolver key is instantiated with an error indication, attempts to
read that key will result in an oops because user_read() is expecting there to
be a payload - and there isn't one [CVE-2011-1076].Give the DNS resolver key its own read handler that returns the error cached in
key->type_data.x[0] as an error rather than crashing.Also make the kenter() at the beginning of dns_resolver_instantiate() limit the
amount of data it prints, since the data is not necessarily NUL-terminated.The buggy code was added in:
commit 4a2d789267e00b5a1175ecd2ddefcc78b83fbf09
Author: Wang Lei
Date: Wed Aug 11 09:37:58 2010 +0100
Subject: DNS: If the DNS server returns an error, allow that to be cached [ver #2]This can trivially be reproduced by any user with the following program
compiled with -lkeyutils:#include
#include
#include
static char payload[] = "#dnserror=6";
int main()
{
key_serial_t key;
key = add_key("dns_resolver", "a", payload, sizeof(payload),
KEY_SPEC_SESSION_KEYRING);
if (key == -1)
err(1, "add_key");
if (keyctl_read(key, NULL, 0) == -1)
err(1, "read_key");
return 0;
}What should happen is that keyctl_read() reports error 6 (ENXIO) to the user:
dns-break: read_key: No such device or address
but instead the kernel oopses.
This cannot be reproduced with the 'keyutils add' or 'keyutils padd' commands
as both of those cut the data down below the NUL termination that must be
included in the data. Without this dns_resolver_instantiate() will return
-EINVAL and the key will not be instantiated such that it can be read.The oops looks like:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [] user_read+0x4f/0x8f
PGD 3bdf8067 PUD 385b9067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:19.0/irq
CPU 0
Modules linked in:Pid: 2150, comm: dns-break Not tainted 2.6.38-rc7-cachefs+ #468 /DG965RY
RIP: 0010:[] [] user_read+0x4f/0x8f
RSP: 0018:ffff88003bf47f08 EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff88003b5ea378 RCX: ffffffff81972368
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88003b5ea378
RBP: ffff88003bf47f28 R08: ffff88003be56620 R09: 0000000000000000
R10: 0000000000000395 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffffffffa1
FS: 00007feab5751700(0000) GS:ffff88003e000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000003de40000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dns-break (pid: 2150, threadinfo ffff88003bf46000, task ffff88003be56090)
Stack:
ffff88003b5ea378 ffff88003b5ea3a0 0000000000000000 0000000000000000
ffff88003bf47f68 ffffffff811b708e ffff88003c442bc8 0000000000000000
00000000004005a0 00007fffba368060 0000000000000000 0000000000000000
Call Trace:
[] keyctl_read_key+0xac/0xcf
[] sys_keyctl+0x75/0xb6
[] system_call_fastpath+0x16/0x1b
Code: 75 1f 48 83 7b 28 00 75 18 c6 05 58 2b fb 00 01 be bb 00 00 00 48 c7 c7 76 1c 75 81 e8 13 c2 e9 ff 4c 8b b3 e0 00 00 00 4d 85 ed 0f b7 5e 10 74 2d 4d 85 e4 74 28 e8 98 79 ee ff 49 39 dd 48
RIP [] user_read+0x4f/0x8f
RSP
CR2: 0000000000000010Signed-off-by: David Howells
Acked-by: Jeff Layton
cc: Wang Lei
Signed-off-by: James Morris
02 Mar, 2011
2 commits
-
This reverts commit c4ff4b829ef9e6353c0b133b7adb564a68054979.
Ted Ts'o reports:
"TPM is working for me so I can log into employer's network in 2.6.37.
It broke when I tried 2.6.38-rc6, with the following relevant lines
from my dmesg:[ 11.081627] tpm_tis 00:0b: 1.2 TPM (device-id 0x0, rev-id 78)
[ 25.734114] tpm_tis 00:0b: Operation Timed out
[ 78.040949] tpm_tis 00:0b: Operation Timed outThis caused me to get suspicious, especially since the _other_ TPM
commit in 2.6.38 had already been reverted, so I tried reverting
commit c4ff4b829e: "TPM: Long default timeout fix". With this commit
reverted, my TPM on my Lenovo T410 is once again working."Requested-and-tested-by: Theodore Ts'o
Acked-by: Rajiv Andrade
Signed-off-by: Linus Torvalds
01 Mar, 2011
10 commits
-
* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging:
hwmon: (adt7411) add MODULE_DEVICE_TABLE
hwmon: (ad7414) add MODULE_DEVICE_TABLE -
Fix new kernel-doc warning in fs/block_dev.c:
Warning(fs/block_dev.c:937): No description found for parameter 'kill_dirty'
Signed-off-by: Randy Dunlap
Signed-off-by: Linus Torvalds -
Several ACPI drivers fail to build if CONFIG_NET is unset, because
they refer to things depending on CONFIG_THERMAL that in turn depends
on CONFIG_NET. However, CONFIG_THERMAL doesn't really need to depend
on CONFIG_NET, because the only part of it requiring CONFIG_NET is
the netlink interface in thermal_sys.c.Put the netlink interface in thermal_sys.c under #ifdef CONFIG_NET
and remove the dependency of CONFIG_THERMAL on CONFIG_NET from
drivers/thermal/Kconfig.Signed-off-by: Rafael J. Wysocki
Acked-by: Randy Dunlap
Cc: Ingo Molnar
Cc: Len Brown
Cc: Stephen Rothwell
Cc: Luming Yu
Cc: Andrew Morton
Signed-off-by: Linus Torvalds -
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm: fix unsigned vs signed comparison issue in modeset ctl ioctl.
drm/nv50-nvc0: make sure vma is definitely unmapped when destroying bo -
…/git/tmlind/linux-omap-2.6
* 'omap-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6:
omap4: prcm: Fix the CPUx clockdomain offsets
OMAP2+: clocksource: fix crash on boot when !CONFIG_OMAP_32K_TIMER
OMAP2/3: clock: fix fint calculation for DPLL_FREQSEL
OMAP2+: mailbox: fix lookups for multiple mailboxes
OMAP2420: mailbox: fix IVA vs DSP IRQ numbering
mach-omap2: smartreflex: world-writable debugfs voltage files
mach-omap2: pm: world-writable debugfs timer files
mach-omap2: mux: world-writable debugfs files -
…or-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf timechart: Fix max number of cpus
perf timechart: Fix black idle boxes in the title
perf hists: Print number of samples, not the period sum* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Use u32 instead of long to set reset vector back to 0* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clockevents: Prevent oneshot mode when broadcast device is periodic -
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix truncate after open
fuse: fix hang of single threaded fuseblk filesystem -
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Check heartbeat mode for kernel stacks only
Ocfs2/refcounttree: Fix a bug for refcounttree to writeback clusters in a right number.
ocfs2: Fix estimate of necessary credits for mkdir -
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
eukrea-tlv320: fix platform_name
ASoC: correct pxa AC97 DAI names
ALSA: hda - Add support for new IDT 92HD98 and 92HD99 codecs
ALSA: HDA: Add ideapad quirk for two Dell machines
ALSA: HDA: Add a new Conexant codec 506e (20590)
ALSA: usb-audio: fix oops due to cleanup race when disconnecting
ASoC: Hook wm_hubs micbiases up to CLK_SYS
ASoC: Correct definition of WM8903_VMID_RES_5K
ASoC: Fix WM8958 default microphone detection argument ordering
ALSA: HDA: Fix mic initialization in VIA auto parser
ALSA: fix one memory leak in sound jack -
Commit e2cda3226481 ("thp: add pmd mangling generic functions") replaced
some macros in with inline functions.If the functions are to be defined (not all architectures need them)
then struct vm_area_struct must be defined first. So include
.Fixes a build failure seen in Debian:
CC [M] drivers/media/dvb/mantis/mantis_pci.o
In file included from arch/arm/include/asm/pgtable.h:460,
from drivers/media/dvb/mantis/mantis_pci.c:25:
include/asm-generic/pgtable.h: In function 'ptep_test_and_clear_young':
include/asm-generic/pgtable.h:29: error: dereferencing pointer to incomplete typeSigned-off-by: Ben Hutchings
Signed-off-by: Linus Torvalds
28 Feb, 2011
6 commits
-
A customer of ours, complained that when setting the reset
vector back to 0, it trashed other data and hung their box.
They noticed when only 4 bytes were set to 0 instead of 8,
everything worked correctly.Mathew pointed out:
|
| We're supposed to be resetting trampoline_phys_low and
| trampoline_phys_high here, which are two 16-bit values.
| Writing 64 bits is definitely going to overwrite space
| that we're not supposed to be touching.
|So limit the area modified to u32.
Signed-off-by: Don Zickus
Acked-by: Matthew Garrett
Cc:
LKML-Reference:
Signed-off-by: Ingo Molnar -
Currently numcpus is determined in pid_put_sample which is only
called on sched_switch/sched_wakeup sample processing.On a machine with a lot cpus I often saw the last cpu missing.
Check for (max) numcpus on every event happening and in the
beginning. -> fixes the issue for me.Signed-off-by: Thomas Renninger
Cc: Arjan van de Ven
Cc: Arnaldo Carvalho de Melo
Cc: lenb@kernel.org
LKML-Reference:
Signed-off-by: Ingo Molnar -
This fix is needed for eye of gnome and firefox svg viewers.
Only Inkscape can handle the broken case.Compare with the other svg_legenda_box declarations, looks
like a typo slipped in at this place.Signed-off-by: Thomas Renninger
Cc: Arjan van de Ven
Cc: Arnaldo Carvalho de Melo
Cc: lenb@kernel.org
LKML-Reference:
Signed-off-by: Ingo Molnar -
* 'nouveau/drm-nouveau-fixes' of /ssd/git/drm-nouveau-next:
drm/nv50-nvc0: make sure vma is definitely unmapped when destroying bo -
This fixes CVE-2011-1013.
Reported-by: Matthiew Herrb (OpenBSD X.org team)
Cc: stable@kernel.org
Signed-off-by: Dave Airlie -
Somehow fixes a misrendering + hang at GDM startup on my NVA8...
My first guess would have been stale TLB entries laying around that a new
bo then accidentally inherits. That doesn't make a great deal of sense
however, as when we mapped the pages for the new bo the TLBs would've
gotten flushed anyway.Signed-off-by: Ben Skeggs
27 Feb, 2011
2 commits
-
The device table is required to load modules based on modaliases.
Signed-off-by: Axel Lin
Acked-by: Wolfram Sang
Signed-off-by: Guenter Roeck -
The device table is required to load modules based on modaliases.
Signed-off-by: Axel Lin
Signed-off-by: Guenter Roeck
26 Feb, 2011
19 commits
-
When the per cpu timer is marked CLOCK_EVT_FEAT_C3STOP, then we only
can switch into oneshot mode, when the backup broadcast device
supports oneshot mode as well. Otherwise we would try to switch the
broadcast device into an unsupported mode unconditionally. This went
unnoticed so far as the current available broadcast devices support
oneshot mode. Seth unearthed this problem while debugging and working
around an hpet related BIOS wreckage.Add the necessary check to tick_is_oneshot_available().
Reported-and-tested-by: Seth Forshee
Signed-off-by: Thomas Gleixner
LKML-Reference:
Cc: stable@kernel.org # .21 -> -
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
PM: Make ACPI wakeup from S5 work again when CONFIG_PM_SLEEP is unset -
Fixes sysfs config attribute to allow access to entire 16MB maintenance
space of RapidIO devices.Signed-off-by: Alexandre Bounine
Cc: Kumar Gala
Cc: Matt Porter
Cc: Li Yang
Cc: Thomas Moll
Cc: Micha Nelissen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Initialize ts_real.flags to fix compiler warning about possible
uninitialized use of this field.Signed-off-by: Alexander Gordeev
Cc: john stultz
Cc: Rodolfo Giometti
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It seems odd that truncate_inode_pages_range(), called not only when
truncating but also when evicting inodes, has mem_cgroup_uncharge_start
and _end() batching in its second loop to clear up a few leftovers, but
not in its first loop that does almost all the work: add them there too.Signed-off-by: Hugh Dickins
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Acked-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The THP code didn't pass the correct interleaving shift to the memory
policy code. Fix this here by adjusting for the order.Signed-off-by: Andi Kleen
Reviewed-by: Christoph Lameter
Acked-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A race can occur when io_submit() races with io_destroy():
CPU1 CPU2
io_submit()
do_io_submit()
...
ctx = lookup_ioctx(ctx_id);
io_destroy()
Now do_io_submit() holds the last reference to ctx.
...
queue new AIO
put_ioctx(ctx) - frees ctx with active AIOsWe solve this issue by checking whether ctx is being destroyed in AIO
submission path after adding new AIO to ctx. Then we are guaranteed that
either io_destroy() waits for new AIO or we see that ctx is being
destroyed and bail out.Cc: Nick Piggin
Reviewed-by: Jeff Moyer
Signed-off-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
lookup_ioctx doesn't implement the rcu lookup pattern properly.
rcu_read_lock does not prevent refcount going to zero, so we might take
a refcount on a zero count ioctx.Fix the bug by atomically testing for zero refcount before incrementing.
[jack@suse.cz: added comment into the code]
Reviewed-by: Jeff Moyer
Signed-off-by: Nick Piggin
Signed-off-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When pfn_valid_within() failed 'iter' was incremented twice.
Signed-off-by: Namhyung Kim
Reviewed-by: KAMEZAWA Hiroyuki
Reviewed-by: Minchan Kim
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In linux rtc_time struct, tm_mon range is 0~11, tm_wday range is 0~6,
while in RTC HW REG, month range is 1~12, day of the week range is 1~7,
this patch adjusts difference of them.The efect of this bug was that most of month will be operated on as the
next month by the hardware (When in Jan it maybe even worse). For
example, if in May, software wrote 4 to the hardware, which handled it as
April. Then the logic would be different between software and hardware,
which would cause weird things to happen.Signed-off-by: Lei Xu
Cc: Alessandro Zummo
Cc: john stultz
Cc: Jack Lan
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The kernel automatically evaluates partition tables of storage devices.
The code for evaluating LDM partitions (in fs/partitions/ldm.c) contains
a bug that causes a kernel oops on certain corrupted LDM partitions. A
kernel subsystem seems to crash, because, after the oops, the kernel no
longer recognizes newly connected storage devices.The patch changes ldm_parse_vmdb() to Validate the value of vblk_size.
Signed-off-by: Timo Warns
Cc: Eugene Teo
Acked-by: Richard Russon
Cc: Harvey Harrison
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
should_continue_reclaim() for reclaim/compaction allows scanning to
continue even if pages are not being reclaimed until the full list is
scanned. In terms of allocation success, this makes sense but potentially
it introduces unwanted latency for high-order allocations such as
transparent hugepages and network jumbo frames that would prefer to fail
the allocation attempt and fallback to order-0 pages. Worse, there is a
potential that the full LRU scan will clear all the young bits, distort
page aging information and potentially push pages into swap that would
have otherwise remained resident.This patch will stop reclaim/compaction if no pages were reclaimed in the
last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
LRU list may still be scanned.Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
is not set so the following avoids the gfp_mask being examined:if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
return false;A tool was developed based on ftrace that tracked the latency of
high-order allocations while transparent hugepage support was enabled and
three benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with
Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
applied.STREAM Highorder Allocation Latency Statistics
fix-infinite break-early
1 :: Count 10298 10229
1 :: Min 0.4560 0.4640
1 :: Mean 1.0589 1.0183
1 :: Max 14.5990 11.7510
1 :: Stddev 0.5208 0.4719
2 :: Count 2 1
2 :: Min 1.8610 3.7240
2 :: Mean 3.4325 3.7240
2 :: Max 5.0040 3.7240
2 :: Stddev 1.5715 0.0000
9 :: Count 111696 111694
9 :: Min 0.5230 0.4110
9 :: Mean 10.5831 10.5718
9 :: Max 38.4480 43.2900
9 :: Stddev 1.1147 1.1325Mean time for order-1 allocations is reduced. order-2 looks increased but
with so few allocations, it's not particularly significant. THP mean
allocation latency is also reduced. That said, allocation time varies so
significantly that the reductions are within noise.Max allocation time is reduced by a significant amount for low-order
allocations but reduced for THP allocations which presumably are now
breaking before reclaim has done enough work.SysBench Highorder Allocation Latency Statistics
fix-infinite break-early
1 :: Count 15745 15677
1 :: Min 0.4250 0.4550
1 :: Mean 1.1023 1.0810
1 :: Max 14.4590 10.8220
1 :: Stddev 0.5117 0.5100
2 :: Count 1 1
2 :: Min 3.0040 2.1530
2 :: Mean 3.0040 2.1530
2 :: Max 3.0040 2.1530
2 :: Stddev 0.0000 0.0000
9 :: Count 2017 1931
9 :: Min 0.4980 0.7480
9 :: Mean 10.4717 10.3840
9 :: Max 24.9460 26.2500
9 :: Stddev 1.1726 1.1966Again, mean time for order-1 allocations is reduced while order-2
allocations are too few to draw conclusions from. The mean time for THP
allocations is also slightly reduced albeit the reductions are within
varianes.Once again, our maximum allocation time is significantly reduced for
low-order allocations and slightly increased for THP allocations.Anon stream mmap reference Highorder Allocation Latency Statistics
1 :: Count 1376 1790
1 :: Min 0.4940 0.5010
1 :: Mean 1.0289 0.9732
1 :: Max 6.2670 4.2540
1 :: Stddev 0.4142 0.2785
2 :: Count 1 -
2 :: Min 1.9060 -
2 :: Mean 1.9060 -
2 :: Max 1.9060 -
2 :: Stddev 0.0000 -
9 :: Count 11266 11257
9 :: Min 0.4990 0.4940
9 :: Mean 27250.4669 24256.1919
9 :: Max 11439211.0000 6008885.0000
9 :: Stddev 226427.4624 186298.1430This benchmark creates one thread per CPU which references an amount of
anonymous memory 1.5 times the size of physical RAM. This pounds swap
quite heavily and is intended to exercise THP a bit.Mean allocation time for order-1 is reduced as before. It's also reduced
for THP allocations but the variations here are pretty massive due to
swap. As before, maximum allocation times are significantly reduced.Overall, the patch reduces the mean and maximum allocation latencies for
the smaller high-order allocations. This was with Slab configured so it
would be expected to be more significant with Slub which uses these size
allocations more aggressively.The mean allocation times for THP allocations are also slightly reduced.
The maximum latency was slightly increased as predicted by the comments
due to reclaim/compaction breaking early. However, workloads care more
about the latency of lower-order allocations than THP so it's an
acceptable trade-off.Signed-off-by: Mel Gorman
Acked-by: Andrea Arcangeli
Acked-by: Johannes Weiner
Reviewed-by: Minchan Kim
Acked-by: Andrea Arcangeli
Acked-by: Rik van Riel
Cc: Michal Hocko
Cc: Kent Overstreet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The regulator framework is used for power management. The regulators are
only named in the driver code, the actual control stuff is in the board
file for each architecture or use case.The PN544 chip has three regulators that can be controlled or not -
depending on the architecture where the chip is being used. So some of
the regulators may not be controllable. In our current case the third
regulator, which was missing from the code, went unnoticed because we
didn't need to control it. To be as general as possible - in this respect
- the driver needs to list all regulators. Then the board file can be
used to actually set the usage.Signed-off-by: Matti J. Aaltonen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Spell out the NFC acronym when it's shown for the first time.
Signed-off-by: Matti J. Aaltonen
Acked-by: Wolfram Sang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
swiotlb's map_page wrongly calls panic() when it can't find a buffer fit
for device's dma mask. It should return an error instead.Devices with an odd dma mask (i.e. under 4G) like b44 network card hit
this bug (the system crashes):http://marc.info/?l=linux-kernel&m=129648943830106&w=2
If swiotlb returns an error, b44 driver can use the own bouncing
mechanism.Reported-by: Chuck Ebbert
Signed-off-by: FUJITA Tomonori
Tested-by: Arkadiusz Miskiewicz
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
I have translated some kernel documentation so I wish to maintain the
Chinese documentation in our kernel directories.Signed-off-by: Harry Wei
Cc: Joe Perches
Cc: Greg KH
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The move_pages() usage of find_task_by_vpid() requires rcu_read_lock() to
prevent free_pid() from reclaiming the pid.Without this patch, RCU warnings are printed in v2.6.38-rc4 move_pages()
with:CONFIG_LOCKUP_DETECTOR=y
CONFIG_PREEMPT=y
CONFIG_LOCKDEP=y
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU=yPreviously, migrate_pages() went through a similar transformation
replacing usage of tasklist_lock with rcu read lock:commit 55cfaa3cbdd29c4919ecb5fb8965c310f357e48c
Author: Zeng Zhaoming
Date: Thu Dec 2 14:31:13 2010 -0800mm/mempolicy.c: add rcu read lock to protect pid structure
commit 1e50df39f6e2c3a4a3394df62baa8a213df16c54
Author: KOSAKI Motohiro
Date: Thu Jan 13 15:46:14 2011 -0800mempolicy: remove tasklist_lock from migrate_pages
Signed-off-by: Greg Thelen
Cc: Mel Gorman
Cc: Minchan Kim
Cc: Rik van Riel
Cc: KAMEZAWA Hiroyuki
Cc: "Paul E. McKenney"
Cc: Tetsuo Handa
Cc: Sergey Senozhatsky
Cc: Oleg Nesterov
Cc: Zeng Zhaoming
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In several places, an epoll fd can call another file's ->f_op->poll()
method with ep->mtx held. This is in general unsafe, because that other
file could itself be an epoll fd that contains the original epoll fd.The code defends against this possibility in its own ->poll() method using
ep_call_nested, but there are several other unsafe calls to ->poll
elsewhere that can be made to deadlock. For example, the following simple
program causes the call in ep_insert recursively call the original fd's
->poll, leading to deadlock:#include
#includeint main(void) {
int e1, e2, p[2];
struct epoll_event evt = {
.events = EPOLLIN
};e1 = epoll_create(1);
e2 = epoll_create(2);
pipe(p);epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt);
epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt);
write(p[1], p, sizeof p);
epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);return 0;
}On insertion, check whether the inserted file is itself a struct epoll,
and if so, do a recursive walk to detect whether inserting this file would
create a loop of epoll structures, which could lead to deadlock.[nelhage@ksplice.com: Use epmutex to serialize concurrent inserts]
Signed-off-by: Davide Libenzi
Signed-off-by: Nelson Elhage
Reported-by: Nelson Elhage
Tested-by: Nelson Elhage
Cc: [2.6.34+, possibly earlier]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds