04 Mar, 2011

1 commit

  • When a DNS resolver key is instantiated with an error indication, attempts to
    read that key will result in an oops because user_read() is expecting there to
    be a payload - and there isn't one [CVE-2011-1076].

    Give the DNS resolver key its own read handler that returns the error cached in
    key->type_data.x[0] as an error rather than crashing.

    Also make the kenter() at the beginning of dns_resolver_instantiate() limit the
    amount of data it prints, since the data is not necessarily NUL-terminated.

    The buggy code was added in:

    commit 4a2d789267e00b5a1175ecd2ddefcc78b83fbf09
    Author: Wang Lei
    Date: Wed Aug 11 09:37:58 2010 +0100
    Subject: DNS: If the DNS server returns an error, allow that to be cached [ver #2]

    This can trivially be reproduced by any user with the following program
    compiled with -lkeyutils:

    #include <stdlib.h>
    #include <keyutils.h>
    #include <err.h>

    static char payload[] = "#dnserror=6";

    int main()
    {
            key_serial_t key;

            key = add_key("dns_resolver", "a", payload, sizeof(payload),
                          KEY_SPEC_SESSION_KEYRING);
            if (key == -1)
                    err(1, "add_key");
            if (keyctl_read(key, NULL, 0) == -1)
                    err(1, "read_key");
            return 0;
    }

    What should happen is that keyctl_read() reports error 6 (ENXIO) to the user:

    dns-break: read_key: No such device or address

    but instead the kernel oopses.

    This cannot be reproduced with the 'keyctl add' or 'keyctl padd' commands,
    as both of those trim the data to exclude the NUL terminator that must be
    included in it. Without the terminator, dns_resolver_instantiate() returns
    -EINVAL and the key is not instantiated, so it cannot be read.

    The oops looks like:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    IP: [] user_read+0x4f/0x8f
    PGD 3bdf8067 PUD 385b9067 PMD 0
    Oops: 0000 [#1] SMP
    last sysfs file: /sys/devices/pci0000:00/0000:00:19.0/irq
    CPU 0
    Modules linked in:

    Pid: 2150, comm: dns-break Not tainted 2.6.38-rc7-cachefs+ #468 /DG965RY
    RIP: 0010:[] [] user_read+0x4f/0x8f
    RSP: 0018:ffff88003bf47f08 EFLAGS: 00010246
    RAX: 0000000000000001 RBX: ffff88003b5ea378 RCX: ffffffff81972368
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88003b5ea378
    RBP: ffff88003bf47f28 R08: ffff88003be56620 R09: 0000000000000000
    R10: 0000000000000395 R11: 0000000000000002 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffffffffa1
    FS: 00007feab5751700(0000) GS:ffff88003e000000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000010 CR3: 000000003de40000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process dns-break (pid: 2150, threadinfo ffff88003bf46000, task ffff88003be56090)
    Stack:
    ffff88003b5ea378 ffff88003b5ea3a0 0000000000000000 0000000000000000
    ffff88003bf47f68 ffffffff811b708e ffff88003c442bc8 0000000000000000
    00000000004005a0 00007fffba368060 0000000000000000 0000000000000000
    Call Trace:
    [] keyctl_read_key+0xac/0xcf
    [] sys_keyctl+0x75/0xb6
    [] system_call_fastpath+0x16/0x1b
    Code: 75 1f 48 83 7b 28 00 75 18 c6 05 58 2b fb 00 01 be bb 00 00 00 48 c7 c7 76 1c 75 81 e8 13 c2 e9 ff 4c 8b b3 e0 00 00 00 4d 85 ed 0f b7 5e 10 74 2d 4d 85 e4 74 28 e8 98 79 ee ff 49 39 dd 48
    RIP [] user_read+0x4f/0x8f
    RSP
    CR2: 0000000000000010

    Signed-off-by: David Howells
    Acked-by: Jeff Layton
    cc: Wang Lei
    Signed-off-by: James Morris

    David Howells
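
The read-handler fix described in the commit above can be pictured with a small user-space model. This is only a sketch: the struct, its fields, and the function names are hypothetical stand-ins, not the kernel's struct key or key-type operations. It shows a type-specific reader that reports a cached error before any code touches the (absent) payload.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of a key; field names are illustrative only. */
struct model_key {
    const char *payload;      /* NULL when the key was instantiated with an error */
    size_t      datalen;      /* payload length in bytes */
    int         cached_error; /* positive errno value, 0 if none */
};

/* Generic reader: assumes a payload is present (the failing assumption). */
static long model_user_read(const struct model_key *key, char *buf, size_t buflen)
{
    if (buf && buflen >= key->datalen)
        memcpy(buf, key->payload, key->datalen);
    return (long)key->datalen;
}

/* Type-specific reader: report the cached error before touching the payload. */
static long model_dns_read(const struct model_key *key, char *buf, size_t buflen)
{
    if (key->cached_error)
        return -key->cached_error;
    return model_user_read(key, buf, buflen);
}
```

With a key that carries ENXIO and no payload, model_dns_read() returns -ENXIO instead of dereferencing a NULL pointer, which corresponds to the "No such device or address" result the commit says the user should see.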
     

02 Mar, 2011

2 commits

  • Linus Torvalds
     
  • This reverts commit c4ff4b829ef9e6353c0b133b7adb564a68054979.

    Ted Ts'o reports:

    "TPM is working for me so I can log into employer's network in 2.6.37.
    It broke when I tried 2.6.38-rc6, with the following relevant lines
    from my dmesg:

    [ 11.081627] tpm_tis 00:0b: 1.2 TPM (device-id 0x0, rev-id 78)
    [ 25.734114] tpm_tis 00:0b: Operation Timed out
    [ 78.040949] tpm_tis 00:0b: Operation Timed out

    This caused me to get suspicious, especially since the _other_ TPM
    commit in 2.6.38 had already been reverted, so I tried reverting
    commit c4ff4b829e: "TPM: Long default timeout fix". With this commit
    reverted, my TPM on my Lenovo T410 is once again working."

    Requested-and-tested-by: Theodore Ts'o
    Acked-by: Rajiv Andrade
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Mar, 2011

10 commits

  • * 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging:
    hwmon: (adt7411) add MODULE_DEVICE_TABLE
    hwmon: (ad7414) add MODULE_DEVICE_TABLE

    Linus Torvalds
     
  • Fix new kernel-doc warning in fs/block_dev.c:

    Warning(fs/block_dev.c:937): No description found for parameter 'kill_dirty'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Several ACPI drivers fail to build if CONFIG_NET is unset, because
    they refer to things depending on CONFIG_THERMAL that in turn depends
    on CONFIG_NET. However, CONFIG_THERMAL doesn't really need to depend
    on CONFIG_NET, because the only part of it requiring CONFIG_NET is
    the netlink interface in thermal_sys.c.

    Put the netlink interface in thermal_sys.c under #ifdef CONFIG_NET
    and remove the dependency of CONFIG_THERMAL on CONFIG_NET from
    drivers/thermal/Kconfig.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Randy Dunlap
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Stephen Rothwell
    Cc: Luming Yu
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
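
The approach in the thermal commit above, compiling the netlink part only when networking is available, is commonly paired with a no-op stub so that callers need no #ifdef of their own. A minimal sketch follows; MODEL_CONFIG_NET and both function names are hypothetical stand-ins, and the source does not say whether the actual patch adds such a stub.

```c
#include <stdio.h>

/* MODEL_CONFIG_NET stands in for CONFIG_NET; define it to build the real path. */
#ifdef MODEL_CONFIG_NET
static int model_generate_netlink_event(unsigned int zone, int event)
{
    /* A real implementation would build and send a netlink message here. */
    printf("netlink event %d from zone %u\n", event, zone);
    return 0;
}
#else
/* Without networking, callers still compile and the call is a no-op. */
static inline int model_generate_netlink_event(unsigned int zone, int event)
{
    (void)zone;
    (void)event;
    return 0;
}
#endif

/* Core code calls the helper unconditionally, with no #ifdef at the call site. */
static int model_report_trip(unsigned int zone)
{
    return model_generate_netlink_event(zone, 0);
}
```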
     
  • * 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
    drm: fix unsigned vs signed comparison issue in modeset ctl ioctl.
    drm/nv50-nvc0: make sure vma is definitely unmapped when destroying bo

    Linus Torvalds
     
  • …/git/tmlind/linux-omap-2.6

    * 'omap-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6:
    omap4: prcm: Fix the CPUx clockdomain offsets
    OMAP2+: clocksource: fix crash on boot when !CONFIG_OMAP_32K_TIMER
    OMAP2/3: clock: fix fint calculation for DPLL_FREQSEL
    OMAP2+: mailbox: fix lookups for multiple mailboxes
    OMAP2420: mailbox: fix IVA vs DSP IRQ numbering
    mach-omap2: smartreflex: world-writable debugfs voltage files
    mach-omap2: pm: world-writable debugfs timer files
    mach-omap2: mux: world-writable debugfs files

    Linus Torvalds
     
  • …or-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf timechart: Fix max number of cpus
    perf timechart: Fix black idle boxes in the title
    perf hists: Print number of samples, not the period sum

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86: Use u32 instead of long to set reset vector back to 0

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    clockevents: Prevent oneshot mode when broadcast device is periodic

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: fix truncate after open
    fuse: fix hang of single threaded fuseblk filesystem

    Linus Torvalds
     
  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
    ocfs2: Check heartbeat mode for kernel stacks only
    Ocfs2/refcounttree: Fix a bug for refcounttree to writeback clusters in a right number.
    ocfs2: Fix estimate of necessary credits for mkdir

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
    eukrea-tlv320: fix platform_name
    ASoC: correct pxa AC97 DAI names
    ALSA: hda - Add support for new IDT 92HD98 and 92HD99 codecs
    ALSA: HDA: Add ideapad quirk for two Dell machines
    ALSA: HDA: Add a new Conexant codec 506e (20590)
    ALSA: usb-audio: fix oops due to cleanup race when disconnecting
    ASoC: Hook wm_hubs micbiases up to CLK_SYS
    ASoC: Correct definition of WM8903_VMID_RES_5K
    ASoC: Fix WM8958 default microphone detection argument ordering
    ALSA: HDA: Fix mic initialization in VIA auto parser
    ALSA: fix one memory leak in sound jack

    Linus Torvalds
     
  • Commit e2cda3226481 ("thp: add pmd mangling generic functions") replaced
    some macros in <asm-generic/pgtable.h> with inline functions.

    If the functions are to be defined (not all architectures need them)
    then struct vm_area_struct must be defined first. So include
    <linux/mm_types.h>.

    Fixes a build failure seen in Debian:

    CC [M] drivers/media/dvb/mantis/mantis_pci.o
    In file included from arch/arm/include/asm/pgtable.h:460,
    from drivers/media/dvb/mantis/mantis_pci.c:25:
    include/asm-generic/pgtable.h: In function 'ptep_test_and_clear_young':
    include/asm-generic/pgtable.h:29: error: dereferencing pointer to incomplete type

    Signed-off-by: Ben Hutchings
    Signed-off-by: Linus Torvalds

    Ben Hutchings
     

28 Feb, 2011

6 commits


27 Feb, 2011

2 commits


26 Feb, 2011

19 commits

  • Takashi Iwai
     
  • When the per-CPU timer is marked CLOCK_EVT_FEAT_C3STOP, we can only
    switch into oneshot mode when the backup broadcast device
    supports oneshot mode as well. Otherwise we would try to switch the
    broadcast device into an unsupported mode unconditionally. This went
    unnoticed so far because the currently available broadcast devices support
    oneshot mode. Seth unearthed this problem while debugging and working
    around HPET-related BIOS wreckage.

    Add the necessary check to tick_is_oneshot_available().

    Reported-and-tested-by: Seth Forshee
    Signed-off-by: Thomas Gleixner
    LKML-Reference:
    Cc: stable@kernel.org # .21 ->

    Thomas Gleixner
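
The check described in the clockevents commit above can be sketched as a small decision function. This is a user-space model under stated assumptions: the struct, the function name, and the flag values are hypothetical, and only the rule from the commit text is modelled (a C3STOP timer may use oneshot mode only if the broadcast device supports it too).

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative feature bits; the values are assumptions, not the kernel's. */
#define MODEL_FEAT_PERIODIC 0x1u
#define MODEL_FEAT_ONESHOT  0x2u
#define MODEL_FEAT_C3STOP   0x4u

struct model_clock_dev {
    unsigned int features;
};

static bool model_oneshot_available(const struct model_clock_dev *cpu_dev,
                                    const struct model_clock_dev *bc_dev)
{
    if (!cpu_dev || !(cpu_dev->features & MODEL_FEAT_ONESHOT))
        return false;
    /* A timer that keeps running in deep idle needs no broadcast backup. */
    if (!(cpu_dev->features & MODEL_FEAT_C3STOP))
        return true;
    /* Otherwise the backup device must be able to do oneshot as well. */
    return bc_dev && (bc_dev->features & MODEL_FEAT_ONESHOT);
}
```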
     
  • * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
    PM: Make ACPI wakeup from S5 work again when CONFIG_PM_SLEEP is unset

    Linus Torvalds
     
  • Fixes sysfs config attribute to allow access to entire 16MB maintenance
    space of RapidIO devices.

    Signed-off-by: Alexandre Bounine
    Cc: Kumar Gala
    Cc: Matt Porter
    Cc: Li Yang
    Cc: Thomas Moll
    Cc: Micha Nelissen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Initialize ts_real.flags to fix compiler warning about possible
    uninitialized use of this field.

    Signed-off-by: Alexander Gordeev
    Cc: john stultz
    Cc: Rodolfo Giometti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Gordeev
     
  • It seems odd that truncate_inode_pages_range(), called not only when
    truncating but also when evicting inodes, has mem_cgroup_uncharge_start
    and _end() batching in its second loop to clear up a few leftovers, but
    not in its first loop that does almost all the work: add them there too.

    Signed-off-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The THP code didn't pass the correct interleaving shift to the memory
    policy code. Fix this here by adjusting for the order.

    Signed-off-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • A race can occur when io_submit() races with io_destroy():

    CPU1                                            CPU2
    io_submit()
      do_io_submit()
        ...
        ctx = lookup_ioctx(ctx_id);
                                                    io_destroy()
        Now do_io_submit() holds the last reference to ctx.
        ...
        queue new AIO
        put_ioctx(ctx) - frees ctx with active AIOs

    We solve this issue by checking whether ctx is being destroyed in AIO
    submission path after adding new AIO to ctx. Then we are guaranteed that
    either io_destroy() waits for new AIO or we see that ctx is being
    destroyed and bail out.

    Cc: Nick Piggin
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
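
The ordering that the io_submit()/io_destroy() commit above relies on (queue the request first, then check whether the context is being destroyed) can be sketched as below. This is a single-threaded model with hypothetical names; it omits the locking that makes the check race-free in a real implementation, and the error value returned is an assumption.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of an AIO context; not the kernel's own structure. */
struct model_ioctx {
    bool dead;        /* set by the destroy path */
    int  active_reqs; /* requests currently queued on the context */
};

/* Submission path: queue first, then re-check whether destroy has started. */
static int model_submit_one(struct model_ioctx *ctx)
{
    ctx->active_reqs++;      /* the request is now visible to the destroy path */
    if (ctx->dead) {
        ctx->active_reqs--;  /* undo: destroy may not wait for this request */
        return -EINVAL;      /* assumed error code for this sketch */
    }
    return 0;
}
```

Because the request is added before the check, either the destroy path sees it and waits, or the submitter sees the dead flag and backs out; there is no window in which a request is queued on a context that nobody will wait for.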
     
  • aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.

    lookup_ioctx doesn't implement the rcu lookup pattern properly.
    rcu_read_lock does not prevent refcount going to zero, so we might take
    a refcount on a zero count ioctx.

    Fix the bug by atomically testing for zero refcount before incrementing.

    [jack@suse.cz: added comment into the code]
    Reviewed-by: Jeff Moyer
    Signed-off-by: Nick Piggin
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
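
"Atomically testing for zero refcount before incrementing", as the lookup_ioctx commit above puts it, is the usual increment-unless-zero pattern. A minimal sketch using C11 atomics follows; the function name is hypothetical and this is not the kernel's own helper.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still live (count > 0). */
static bool model_get_unless_zero(atomic_int *refcount)
{
    int old = atomic_load(refcount);

    while (old != 0) {
        /* On failure, old is updated to the current value and we retry. */
        if (atomic_compare_exchange_weak(refcount, &old, old + 1))
            return true;
    }
    return false; /* count already hit zero: the object is being freed */
}
```

A lookup that finds an object under a read-side lock would call this and treat a false return the same as "not found", rather than resurrecting an object whose count has already dropped to zero.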
     
  • When pfn_valid_within() failed 'iter' was incremented twice.

    Signed-off-by: Namhyung Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
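
The double increment mentioned in the commit above is an easy mistake in a for loop that also advances its counter by hand before a continue. The sketch below uses hypothetical arrays and function names, not the kernel code in question, and contrasts the two forms.

```c
#include <stdbool.h>
#include <stddef.h>

/* Count entries that are valid and flagged, skipping invalid ones. */
static size_t model_count_flagged(const bool *valid, const bool *flagged, size_t n)
{
    size_t found = 0;

    for (size_t iter = 0; iter < n; iter++) {
        if (!valid[iter])
            continue;   /* the for statement already advances iter */
        if (flagged[iter])
            found++;
    }
    return found;
}

/* Buggy variant: the extra increment skips the entry after each invalid one. */
static size_t model_count_flagged_buggy(const bool *valid, const bool *flagged, size_t n)
{
    size_t found = 0;

    for (size_t iter = 0; iter < n; iter++) {
        if (!valid[iter]) {
            iter++;     /* advanced here and again by the for statement */
            continue;
        }
        if (flagged[iter])
            found++;
    }
    return found;
}
```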
     
  • In the Linux rtc_time struct, the tm_mon range is 0~11 and the tm_wday range
    is 0~6, while in the RTC hardware registers the month range is 1~12 and the
    day-of-week range is 1~7. This patch adjusts for the difference.

    The effect of this bug was that, for most months, the month seen by the
    hardware was off by one from the month software meant (in January it may
    be even worse). For example, in May software wrote 4 to the
    hardware, which handled it as April. The logic then differed between
    software and hardware, which would cause weird things to happen.

    Signed-off-by: Lei Xu
    Cc: Alessandro Zummo
    Cc: john stultz
    Cc: Jack Lan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lei Xu
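
The adjustment described in the RTC commit above amounts to a one-position shift in each direction. A minimal sketch, with hypothetical helper names that are not taken from the driver:

```c
/*
 * struct rtc_time uses tm_mon 0..11 and tm_wday 0..6; the hardware
 * registers described above use 1..12 and 1..7.
 */
static int model_mon_to_hw(int tm_mon)     { return tm_mon + 1; }
static int model_mon_from_hw(int hw_mon)   { return hw_mon - 1; }
static int model_wday_to_hw(int tm_wday)   { return tm_wday + 1; }
static int model_wday_from_hw(int hw_wday) { return hw_wday - 1; }
```

With these helpers, May (tm_mon 4) is written to the hardware as 5, and a hardware month of 1 reads back as tm_mon 0 (January).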
     
  • The kernel automatically evaluates partition tables of storage devices.
    The code for evaluating LDM partitions (in fs/partitions/ldm.c) contains
    a bug that causes a kernel oops on certain corrupted LDM partitions. A
    kernel subsystem seems to crash, because, after the oops, the kernel no
    longer recognizes newly connected storage devices.

    The patch changes ldm_parse_vmdb() to validate the value of vblk_size.

    Signed-off-by: Timo Warns
    Cc: Eugene Teo
    Acked-by: Richard Russon
    Cc: Harvey Harrison
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Timo Warns
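
Validating a size field read from disk before using it, as the LDM commit above does for vblk_size, might look like the sketch below. The commit text does not say which values are rejected; rejecting zero is an assumption made here because a zero record size would break any later division by it. The struct, the sector size, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the on-disk header fields. */
struct model_vmdb {
    uint32_t vblk_size;     /* size of one VBLK record in bytes */
    uint32_t vblk_offset;
    uint32_t last_vblk_seq;
};

/* Assumed sector size for this sketch. */
#define MODEL_SECTOR_SIZE 512u

/* Reject sizes that would make later per-sector arithmetic meaningless. */
static bool model_vmdb_valid(const struct model_vmdb *vm)
{
    if (vm->vblk_size == 0)
        return false;       /* would divide by zero below */
    return true;
}

/* Only divide by the size once it has been validated. */
static uint32_t model_vblks_per_sector(const struct model_vmdb *vm)
{
    return model_vmdb_valid(vm) ? MODEL_SECTOR_SIZE / vm->vblk_size : 0;
}
```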
     
  • should_continue_reclaim() for reclaim/compaction allows scanning to
    continue even if pages are not being reclaimed until the full list is
    scanned. In terms of allocation success, this makes sense but potentially
    it introduces unwanted latency for high-order allocations such as
    transparent hugepages and network jumbo frames that would prefer to fail
    the allocation attempt and fallback to order-0 pages. Worse, there is a
    potential that the full LRU scan will clear all the young bits, distort
    page aging information and potentially push pages into swap that would
    have otherwise remained resident.

    This patch will stop reclaim/compaction if no pages were reclaimed in the
    last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
    hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
    LRU list may still be scanned.

    Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
    is not set so the following avoids the gfp_mask being examined:

    if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
            return false;

    A tool was developed based on ftrace that tracked the latency of
    high-order allocations while transparent hugepage support was enabled and
    three benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with
    Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
    applied.

    STREAM Highorder Allocation Latency Statistics
                  fix-infinite     break-early
    1 :: Count           10298           10229
    1 :: Min            0.4560          0.4640
    1 :: Mean           1.0589          1.0183
    1 :: Max           14.5990         11.7510
    1 :: Stddev         0.5208          0.4719
    2 :: Count               2               1
    2 :: Min            1.8610          3.7240
    2 :: Mean           3.4325          3.7240
    2 :: Max            5.0040          3.7240
    2 :: Stddev         1.5715          0.0000
    9 :: Count          111696          111694
    9 :: Min            0.5230          0.4110
    9 :: Mean          10.5831         10.5718
    9 :: Max           38.4480         43.2900
    9 :: Stddev         1.1147          1.1325

    Mean time for order-1 allocations is reduced. order-2 looks increased but
    with so few allocations, it's not particularly significant. THP mean
    allocation latency is also reduced. That said, allocation time varies so
    significantly that the reductions are within noise.

    Max allocation time is reduced by a significant amount for low-order
    allocations but increased for THP allocations, which presumably are now
    breaking before reclaim has done enough work.

    SysBench Highorder Allocation Latency Statistics
                  fix-infinite     break-early
    1 :: Count           15745           15677
    1 :: Min            0.4250          0.4550
    1 :: Mean           1.1023          1.0810
    1 :: Max           14.4590         10.8220
    1 :: Stddev         0.5117          0.5100
    2 :: Count               1               1
    2 :: Min            3.0040          2.1530
    2 :: Mean           3.0040          2.1530
    2 :: Max            3.0040          2.1530
    2 :: Stddev         0.0000          0.0000
    9 :: Count            2017            1931
    9 :: Min            0.4980          0.7480
    9 :: Mean          10.4717         10.3840
    9 :: Max           24.9460         26.2500
    9 :: Stddev         1.1726          1.1966

    Again, mean time for order-1 allocations is reduced while order-2
    allocations are too few to draw conclusions from. The mean time for THP
    allocations is also slightly reduced, although the reduction is within
    the variance.

    Once again, our maximum allocation time is significantly reduced for
    low-order allocations and slightly increased for THP allocations.

    Anon stream mmap reference Highorder Allocation Latency Statistics
    1 :: Count            1376            1790
    1 :: Min            0.4940          0.5010
    1 :: Mean           1.0289          0.9732
    1 :: Max            6.2670          4.2540
    1 :: Stddev         0.4142          0.2785
    2 :: Count               1               -
    2 :: Min            1.9060               -
    2 :: Mean           1.9060               -
    2 :: Max            1.9060               -
    2 :: Stddev         0.0000               -
    9 :: Count           11266           11257
    9 :: Min            0.4990          0.4940
    9 :: Mean       27250.4669      24256.1919
    9 :: Max     11439211.0000    6008885.0000
    9 :: Stddev    226427.4624     186298.1430

    This benchmark creates one thread per CPU which references an amount of
    anonymous memory 1.5 times the size of physical RAM. This pounds swap
    quite heavily and is intended to exercise THP a bit.

    Mean allocation time for order-1 is reduced as before. It's also reduced
    for THP allocations but the variations here are pretty massive due to
    swap. As before, maximum allocation times are significantly reduced.

    Overall, the patch reduces the mean and maximum allocation latencies for
    the smaller high-order allocations. This was with Slab configured so it
    would be expected to be more significant with Slub which uses these size
    allocations more aggressively.

    The mean allocation times for THP allocations are also slightly reduced.
    The maximum latency was slightly increased as predicted by the comments
    due to reclaim/compaction breaking early. However, workloads care more
    about the latency of lower-order allocations than THP so it's an
    acceptable trade-off.

    Signed-off-by: Mel Gorman
    Acked-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Cc: Michal Hocko
    Cc: Kent Overstreet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
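
The early-exit rule described in the reclaim/compaction commit above can be summarised as a small decision function. The sketch below is a simplified user-space model: the struct, the function name, and the flag values are hypothetical, and any further conditions the real should_continue_reclaim() applies are not modelled.

```c
#include <stdbool.h>

/* Illustrative flag values; names mirror the text, values are assumptions. */
#define MODEL_RECLAIM_MODE_COMPACTION 0x1u
#define MODEL_GFP_REPEAT              0x2u

struct model_scan_control {
    unsigned int reclaim_mode;
    unsigned int gfp_mask;
};

static bool model_should_continue(const struct model_scan_control *sc,
                                  unsigned long nr_reclaimed,
                                  unsigned long nr_scanned)
{
    /* Order-0 reclaim never continues here; gfp_mask is not examined. */
    if (!(sc->reclaim_mode & MODEL_RECLAIM_MODE_COMPACTION))
        return false;

    if (sc->gfp_mask & MODEL_GFP_REPEAT) {
        /* Few fallback options: stop only once nothing is left to scan. */
        if (!nr_reclaimed && !nr_scanned)
            return false;
    } else {
        /* Prefer failing fast: stop when the last batch reclaimed nothing. */
        if (!nr_reclaimed)
            return false;
    }
    return true;
}
```

In this model a THP-style allocation gives up as soon as one batch reclaims nothing, while a __GFP_REPEAT allocation such as hugetlbfs keeps going until a batch neither reclaims nor scans anything.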
     
  • The regulator framework is used for power management. The regulators are
    only named in the driver code, the actual control stuff is in the board
    file for each architecture or use case.

    The PN544 chip has three regulators that can be controlled or not -
    depending on the architecture where the chip is being used. So some of
    the regulators may not be controllable. In our current case the third
    regulator, which was missing from the code, went unnoticed because we
    didn't need to control it. To be as general as possible - in this respect
    - the driver needs to list all regulators. Then the board file can be
    used to actually set the usage.

    Signed-off-by: Matti J. Aaltonen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matti J. Aaltonen
     
  • Spell out the NFC acronym when it's shown for the first time.

    Signed-off-by: Matti J. Aaltonen
    Acked-by: Wolfram Sang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matti J. Aaltonen
     
  • swiotlb's map_page wrongly calls panic() when it can't find a buffer that
    fits the device's DMA mask. It should return an error instead.

    Devices with an odd DMA mask (i.e. under 4G), like the b44 network card, hit
    this bug (the system crashes):

    http://marc.info/?l=linux-kernel&m=129648943830106&w=2

    If swiotlb returns an error, the b44 driver can use its own bouncing
    mechanism.

    Reported-by: Chuck Ebbert
    Signed-off-by: FUJITA Tomonori
    Tested-by: Arkadiusz Miskiewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    FUJITA Tomonori
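
Returning an error instead of panicking, as the swiotlb commit above argues for, lets the caller fall back to its own strategy. The sketch below is a user-space model with hypothetical names; the error sentinel and the 30-bit mask in the usage note are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sentinel for a failed mapping; the value is an assumption of this sketch. */
#define MODEL_DMA_ERROR UINT64_MAX

/* True when [addr, addr + size) is fully addressable under the device mask. */
static bool model_dma_capable(uint64_t mask, uint64_t addr, uint64_t size)
{
    return size != 0 && addr <= mask && size - 1 <= mask - addr;
}

/* Return the bounce address if the device can reach it, else an error. */
static uint64_t model_map_page(uint64_t bounce_addr, uint64_t size, uint64_t dev_mask)
{
    if (!model_dma_capable(dev_mask, bounce_addr, size))
        return MODEL_DMA_ERROR;   /* the caller decides how to recover */
    return bounce_addr;
}
```

A driver for a device with, say, a 30-bit mask would test the result against MODEL_DMA_ERROR and switch to its own bounce buffers when the mapping fails.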
     
  • I have translated some kernel documentation so I wish to maintain the
    Chinese documentation in our kernel directories.

    Signed-off-by: Harry Wei
    Cc: Joe Perches
    Cc: Greg KH
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harry Wei
     
  • The move_pages() usage of find_task_by_vpid() requires rcu_read_lock() to
    prevent free_pid() from reclaiming the pid.

    Without this patch, RCU warnings are printed in v2.6.38-rc4 move_pages()
    with:

    CONFIG_LOCKUP_DETECTOR=y
    CONFIG_PREEMPT=y
    CONFIG_LOCKDEP=y
    CONFIG_PROVE_LOCKING=y
    CONFIG_PROVE_RCU=y

    Previously, migrate_pages() went through a similar transformation
    replacing usage of tasklist_lock with rcu read lock:

    commit 55cfaa3cbdd29c4919ecb5fb8965c310f357e48c
    Author: Zeng Zhaoming
    Date: Thu Dec 2 14:31:13 2010 -0800

    mm/mempolicy.c: add rcu read lock to protect pid structure

    commit 1e50df39f6e2c3a4a3394df62baa8a213df16c54
    Author: KOSAKI Motohiro
    Date: Thu Jan 13 15:46:14 2011 -0800

    mempolicy: remove tasklist_lock from migrate_pages

    Signed-off-by: Greg Thelen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: "Paul E. McKenney"
    Cc: Tetsuo Handa
    Cc: Sergey Senozhatsky
    Cc: Oleg Nesterov
    Cc: Zeng Zhaoming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • In several places, an epoll fd can call another file's ->f_op->poll()
    method with ep->mtx held. This is in general unsafe, because that other
    file could itself be an epoll fd that contains the original epoll fd.

    The code defends against this possibility in its own ->poll() method using
    ep_call_nested, but there are several other unsafe calls to ->poll
    elsewhere that can be made to deadlock. For example, the following simple
    program causes the call in ep_insert to recursively call the original fd's
    ->poll, leading to deadlock:

    #include <unistd.h>
    #include <sys/epoll.h>

    int main(void) {
            int e1, e2, p[2];
            struct epoll_event evt = {
                    .events = EPOLLIN
            };

            e1 = epoll_create(1);
            e2 = epoll_create(2);
            pipe(p);

            epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt);
            epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt);
            write(p[1], p, sizeof p);
            epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);

            return 0;
    }

    On insertion, check whether the inserted file is itself a struct epoll,
    and if so, do a recursive walk to detect whether inserting this file would
    create a loop of epoll structures, which could lead to deadlock.

    [nelhage@ksplice.com: Use epmutex to serialize concurrent inserts]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Nelson Elhage
    Reported-by: Nelson Elhage
    Tested-by: Nelson Elhage
    Cc: [2.6.34+, possibly earlier]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
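
The insertion-time loop check described in the epoll commit above is, in essence, a reachability test on a directed graph. The sketch below models it in user space with a small adjacency matrix; the data structure, the names, and the -1 return value are hypothetical, and the serialization of concurrent inserts mentioned in the commit is not modelled.

```c
#include <stdbool.h>

#define MODEL_MAX_EP 8

/* contains[a][b] is true when epoll instance a watches epoll instance b. */
struct model_ep_graph {
    bool contains[MODEL_MAX_EP][MODEL_MAX_EP];
};

/* Depth-first walk: can `target` be reached from `from`? */
static bool model_reaches(const struct model_ep_graph *g, int from, int target,
                          bool *visited)
{
    if (from == target)
        return true;
    visited[from] = true;
    for (int next = 0; next < MODEL_MAX_EP; next++) {
        if (g->contains[from][next] && !visited[next] &&
            model_reaches(g, next, target, visited))
            return true;
    }
    return false;
}

/* Add `child` to `parent` unless that would close a loop. Returns 0 or -1. */
static int model_ep_insert(struct model_ep_graph *g, int parent, int child)
{
    bool visited[MODEL_MAX_EP] = { false };

    if (model_reaches(g, child, parent, visited))
        return -1;
    g->contains[parent][child] = true;
    return 0;
}
```

Applied to the reproducer in the commit, adding e1 to e2 succeeds, and the later attempt to add e2 to e1 is refused because e1 is already reachable from e2.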