17 Feb, 2011

1 commit


16 Feb, 2011

1 commit


15 Feb, 2011

6 commits


14 Feb, 2011

1 commit

  • Commit c0e69a5bbc6f ("klist.c: bit 0 in pointer can't be used as flag")
    intended to make sure that all klist objects were at least pointer-size
    aligned, but used the constant "4", which only works on 32-bit.

    Use "sizeof(void *)" which is correct in all cases.

    Signed-off-by: David S. Miller
    Acked-by: Jesper Nilsson
    Cc: stable
    Cc: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    David Miller
     

13 Feb, 2011

8 commits


12 Feb, 2011

23 commits

  • On an SMP ARM system running ext4, I've received a report that the
    first J_ASSERT in jbd2_journal_commit_transaction has been triggering:

    J_ASSERT(journal->j_running_transaction != NULL);

    While investigating possible causes for this problem, I noticed that
    __jbd2_log_start_commit() is getting called with j_state_lock only
    read-locked, in spite of the fact that it's possible for it to modify
    j_commit_request. Fix this by grabbing the necessary information so
    we can test to see if we need to start a new transaction before
    dropping the read lock, and then calling jbd2_log_start_commit() which
    will grab the write lock.
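
    A sketch of that pattern (simplified, not the exact upstream code; the
    jbd2 helper names are real, the surrounding logic is illustrative):

    read_lock(&journal->j_state_lock);
    if (journal->j_running_transaction) {
            tid_t tid = journal->j_running_transaction->t_tid;

            read_unlock(&journal->j_state_lock);
            /* jbd2_log_start_commit() retakes j_state_lock as a write
             * lock before it modifies j_commit_request. */
            jbd2_log_start_commit(journal, tid);
    } else {
            read_unlock(&journal->j_state_lock);
    }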

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • ext4 has a data corruption case when doing non-block-aligned
    asynchronous direct IO into a sparse file, as demonstrated
    by xfstest 240.

    The root cause is that while ext4 preallocates space in the
    hole, mappings of that space still look "new" and
    dio_zero_block() will zero out the unwritten portions. When
    more than one AIO thread is going, they both find this "new"
    block and race to zero out their portion; this is uncoordinated
    and causes data corruption.

    Dave Chinner fixed this for xfs by simply serializing all
    unaligned asynchronous direct IO. I've done the same here.
    The difference is that we only wait on conversions, not all IO.
    This is a very big hammer, and I'm not very pleased with
    stuffing this into ext4_file_write(). But since ext4 is
    DIO_LOCKING, we need to serialize it at this high level.

    I tried to move this into ext4_ext_direct_IO, but by then
    we have the i_mutex already, and we will wait on the
    work queue to do conversions - which must also take the
    i_mutex. So that won't work.

    This was originally exposed by qemu-kvm installing to
    a raw disk image with a normal sector-63 alignment. I've
    tested a backport of this patch with qemu, and it does
    avoid the corruption. It is also quite a lot slower
    (14 min for package installs, vs. 8 min for well-aligned)
    but I'll take slow correctness over fast corruption any day.

    Mingming suggested that we can track outstanding
    conversions, and wait on those so that non-sparse
    files won't be affected, and I've implemented that here;
    unaligned AIO to nonsparse files won't take a perf hit.

    [tytso@mit.edu: Keep the mutex as a hashed array instead
    of bloating the ext4 inode]

    [tytso@mit.edu: Fix up namespace issues so that global
    variables are protected with an "ext4_" prefix.]
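
    A sketch of the hashed-mutex idea (names and table size here are
    hypothetical, not the identifiers used in the actual patch):

    #define UNALIGNED_AIO_HASH_BITS 4       /* 16 buckets */

    static struct mutex unaligned_aio_mutex[1 << UNALIGNED_AIO_HASH_BITS];

    /* Pick the mutex that serializes unaligned AIO for this inode, so
     * per-file serialization needs no extra field in the ext4 inode. */
    static struct mutex *unaligned_aio_lock(struct inode *inode)
    {
            return &unaligned_aio_mutex[hash_ptr(inode,
                                                 UNALIGNED_AIO_HASH_BITS)];
    }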

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     
  • In 2.6.37 I was running into oopses with repeated module
    loads & unloads. I tracked this down to:

    fb1813f4 ext4: use dedicated slab caches for group_info structures

    (this was in addition to the features advert unload problem)

    The kstrdup & subsequent kfree of the cache name was causing
    a double free. In slub, at least, if I read it right, it allocates
    & frees the name itself; slab seems to do something different...
    so in slub I think we were leaking -our- cachep->name, and double
    freeing the one allocated by slub.

    After getting lost in slab/slub/slob a bit, I just looked at other
    sized-caches that get allocated. jbd2, biovec, sgpool all do it
    more or less the way jbd2 does. The patch below follows the jbd2
    method of dynamically allocating a cache at mount time from
    a list of static names.
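
    A rough sketch of that scheme (identifiers are illustrative, not the
    exact ones in the patch):

    /* Cache names live in a static table, so nothing is kstrdup()ed and
     * there is no name to free (or double-free) when the module unloads. */
    static const char * const groupinfo_slab_names[] = {
            "ext4_groupinfo_1k", "ext4_groupinfo_2k", "ext4_groupinfo_4k",
    };

    static struct kmem_cache *groupinfo_caches[3];

    static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
    {
            int idx = blocksize_bits - 10;  /* 1k block size -> index 0 */

            if (!groupinfo_caches[idx])
                    groupinfo_caches[idx] = kmem_cache_create(
                            groupinfo_slab_names[idx],  /* static name */
                            sizeof(struct ext4_group_info), 0, 0, NULL);
            return groupinfo_caches[idx];
    }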

    (This might also possibly fix a race creating the caches with
    parallel mounts running).

    [Folded in a fix from Dan Carpenter which fixed an off-by-one error in
    the original patch]

    Cc: stable@kernel.org
    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     
  • I'll probably regret this....

    Signed-off-by: Grant Likely

    Grant Likely
     
  • * 'kvm-updates/2.6.38' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: SVM: Make sure KERNEL_GS_BASE is valid when loading gs_index

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
    amd64_edac: Fix DIMMs per DCTs output

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
    dlm: use single thread workqueues

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
    cifs: don't always drop malformed replies on the floor (try #3)
    cifs: clean up checks in cifs_echo_request
    [CIFS] Do not send SMBEcho requests on new sockets until SMBNegotiate

    Linus Torvalds
     
  • * 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging:
    hwmon: (emc1403) Fix I2C address range
    hwmon: (lm63) Consider LM64 temperature offset

    Linus Torvalds
     
  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
    pci: use security_capable() when checking capablities during config space read
    security: add cred argument to security_capable()
    tpm_tis: Use timeouts returned from TPM

    Linus Torvalds
     
  • …git/kgene/linux-samsung

    * 's5p-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung:
    ARM: SAMSUNG: Ensure struct sys_device is declared in plat/pm.h
    ARM: S5PV310: Cleanup System MMU
    ARM: S5PV310: Add support System MMU on SMDKV310

    Linus Torvalds
     
  • * 'next' of git://git.monstr.eu/linux-2.6-microblaze:
    microblaze: Fix msr instruction detection
    microblaze: Fix pte_update function
    microblaze: Fix asm compilation warning
    microblaze: Fix IRQ flag handling for MSR=0

    Linus Torvalds
     
  • This code makes two calls to clk_get, then tests both return values and
    fails if either failed.

    The problem is that in the first inner if, where the first call to
    clk_get has failed, it doesn't know whether the second call has failed
    as well. So it doesn't know whether clk_put should be called on the
    result of the second call. Of course, it would be possible to test that
    value again. A simpler solution is just to test the result of calling
    clk_get directly after each call, as in the sketch following the
    semantic patch below.

    The semantic match that finds this problem is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @r@
    position p1,p2;
    expression e;
    statement S;
    @@

    e = clk_get@p1(...)
    ...
    if@p2 (IS_ERR(e)) S

    @@
    expression e;
    statement S;
    identifier l;
    position r.p1, p2 != r.p2;
    @@

    *e = clk_get@p1(...)
    ... when != clk_put(e)
    *if@p2 (...)
    {
    ... when != clk_put(e)
    * return ...;
    }
    //
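
    A sketch of the corrected shape in a hypothetical driver (the clock
    names and variables are made up for illustration):

    clk1 = clk_get(dev, "fck");
    if (IS_ERR(clk1))
            return PTR_ERR(clk1);

    clk2 = clk_get(dev, "ick");
    if (IS_ERR(clk2)) {
            clk_put(clk1);          /* clk1 is known to be valid here */
            return PTR_ERR(clk2);
    }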

    Signed-off-by: Julia Lawall
    Cc: Evgeniy Polyakov
    Acked-by: Tony Lindgren
    Acked-by: Amit Kucheria
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • mem_cgroup_uncharge_page() should be called in all failure cases after
    mem_cgroup_newpage_charge() is called in huge_memory.c::collapse_huge_page(),
    as sketched after the trace below.

    [ 4209.076861] BUG: Bad page state in process khugepaged pfn:1e9800
    [ 4209.077601] page:ffffea0006b14000 count:0 mapcount:0 mapping: (null) index:0x2800
    [ 4209.078674] page flags: 0x40000000004000(head)
    [ 4209.079294] pc:ffff880214a30000 pc->flags:2146246697418756 pc->mem_cgroup:ffffc9000177a000
    [ 4209.082177] (/A)
    [ 4209.082500] Pid: 31, comm: khugepaged Not tainted 2.6.38-rc3-mm1 #1
    [ 4209.083412] Call Trace:
    [ 4209.083678] [] ? bad_page+0xe4/0x140
    [ 4209.084240] [] ? free_pages_prepare+0xd6/0x120
    [ 4209.084837] [] ? rwsem_down_failed_common+0xbd/0x150
    [ 4209.085509] [] ? __free_pages_ok+0x32/0xe0
    [ 4209.086110] [] ? free_compound_page+0x1b/0x20
    [ 4209.086699] [] ? __put_compound_page+0x1c/0x30
    [ 4209.087333] [] ? put_compound_page+0x4d/0x200
    [ 4209.087935] [] ? put_page+0x45/0x50
    [ 4209.097361] [] ? khugepaged+0x9e9/0x1430
    [ 4209.098364] [] ? autoremove_wake_function+0x0/0x40
    [ 4209.099121] [] ? khugepaged+0x0/0x1430
    [ 4209.099780] [] ? kthread+0x96/0xa0
    [ 4209.100452] [] ? kernel_thread_helper+0x4/0x10
    [ 4209.101214] [] ? kthread+0x0/0xa0
    [ 4209.101842] [] ? kernel_thread_helper+0x0/0x10
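
    The pattern being enforced, as a simplified sketch (the later check is
    a hypothetical stand-in for the various bail-out points in
    collapse_huge_page()):

    if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
            put_page(new_page);
            return;                         /* charge failed, nothing to undo */
    }
    if (!later_step_succeeds(mm, address)) {        /* hypothetical check */
            mem_cgroup_uncharge_page(new_page);     /* undo the charge */
            put_page(new_page);
            return;
    }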

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 3e7d34497067 ("mm: vmscan: reclaim order-0 and use compaction
    instead of lumpy reclaim") introduced an indefinite loop in
    shrink_zone().

    It meant to break out of this loop when no pages had been reclaimed and
    not a single page was even scanned. The way it would detect the latter
    is by taking a snapshot of sc->nr_scanned at the beginning of the
    function and comparing it against the new sc->nr_scanned after the scan
    loop. But it would re-iterate without updating that snapshot, looping
    forever if sc->nr_scanned changed at least once since shrink_zone() was
    invoked.

    This is not the sole exit condition of that loop, but the other
    conditions require other processes to change the zone state, which the
    stuck reclaimer obviously cannot do anymore.

    This is only happening for higher-order allocations, where reclaim is
    run back to back with compaction.
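
    A simplified sketch of the corrected loop structure (not the actual
    shrink_zone(); scan_zone_lru_pages() is a hypothetical stand-in for
    the real scan work):

    for (;;) {
            unsigned long nr_scanned = sc->nr_scanned;      /* fresh snapshot */
            unsigned long nr_reclaimed = sc->nr_reclaimed;

            scan_zone_lru_pages(zone, sc);

            if (sc->nr_reclaimed == nr_reclaimed &&
                sc->nr_scanned == nr_scanned)
                    break;          /* nothing reclaimed, nothing scanned */
    }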

    Signed-off-by: Johannes Weiner
    Reported-by: Michal Hocko
    Tested-by: Kent Overstreet
    Reported-by: Kent Overstreet
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If the page is going to be written to, __do_fault() needs to break COW.

    However, the old page (before breaking COW) was never mapped into
    the current pte (__do_fault is only called when the pte is not present),
    so vmscan can't have marked the old page as PageMlocked due to being
    mapped in __do_fault's VMA. Therefore, __do_fault() does not need to
    worry about clearing PageMlocked() on the old page.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • vmscan can lazily find pages that are mapped within VM_LOCKED vmas, and
    set the PageMlocked bit on these pages, transferring them onto the
    unevictable list. When do_wp_page() breaks COW within a VM_LOCKED vma,
    it may need to clear PageMlocked on the old page and set it on the new
    page instead.

    This change fixes an issue where do_wp_page() was clearing PageMlocked
    on the old page while the pte was still pointing to it (as well as
    rmap). Therefore, we were not protected against vmscan immediately
    transferring the old page back onto the unevictable list. This could
    cause pages to get stranded there forever.

    I propose to move the corresponding code to the end of do_wp_page(),
    after the pte (and rmap) have been pointed to the new page.
    Additionally, we can use munlock_vma_page() instead of
    clear_page_mlock(), so that the old page stays mlocked if there are
    still other VM_LOCKED vmas mapping it.
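
    A simplified sketch of the proposed ordering (the surrounding
    do_wp_page() context is elided):

    /* Only after the pte and rmap already reference new_page: */
    if (page_copied && old_page && (vma->vm_flags & VM_LOCKED)) {
            lock_page(old_page);            /* munlock needs the page lock */
            munlock_vma_page(old_page);     /* stays mlocked if another
                                             * VM_LOCKED vma still maps it */
            unlock_page(old_page);
    }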

    Signed-off-by: Michel Lespinasse
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • While applying a patch to use memblock to find the aperture for 64-bit
    x86, Ingo found a system with 1g + force_iommu:

    > No AGP bridge found
    > Node 0: aperture @ 38000000 size 32 MB
    > Aperture pointing to e820 RAM. Ignoring.
    > Your BIOS doesn't leave a aperture memory hole
    > Please enable the IOMMU option in the BIOS setup
    > This costs you 64 MB of RAM
    > Cannot allocate aperture memory hole (0,65536K)

    the corresponding code:

    addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
    if (addr == MEMBLOCK_ERROR || addr + aper_size > 0xffffffff) {
            printk(KERN_ERR
                    "Cannot allocate aperture memory hole (%lx,%uK)\n",
                            addr, aper_size>>10);
            return 0;
    }
    memblock_x86_reserve_range(addr, addr + aper_size, "aperture64")

    fails because the memblock core code aligns the size to 512M, which can
    make the size way too big.

    So don't align the size in that case.

    Actually __memblock_alloc_base, the other caller, already aligns the
    size before calling that function.

    BTW, x86 does not use __memblock_alloc_base...
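
    Conceptually, the change amounts to the following (a sketch, not the
    actual diff; find_region() is a hypothetical stand-in for the core
    search):

    static u64 find_aligned_hole(u64 start, u64 end, u64 size, u64 align)
    {
            u64 base = ALIGN(start, align); /* use 'align' for placement... */

            /* ...but search for just 'size' bytes: rounding a 32 MB
             * aperture up to 512 MB cannot fit below 4 GB on a 1 GB box. */
            return find_region(base, end, size);
    }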

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: "H. Peter Anvin"
    Cc: Benjamin Herrenschmidt
    Cc: Dave Airlie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Commit 2a48fc0ab242417 ("block: autoconvert trivial BKL users to private
    mutex") replaced uses of the BKL in the nbd driver with mutex
    operations. Since then, I've been seeing these lockups:

    INFO: task qemu-nbd:16115 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    qemu-nbd D 0000000000000001 0 16115 16114 0x00000004
    ffff88007d775d98 0000000000000082 ffff88007d775fd8 ffff88007d774000
    0000000000013a80 ffff8800020347e0 ffff88007d775fd8 0000000000013a80
    ffff880133730000 ffff880002034440 ffffea0004333db8 ffffffffa071c020
    Call Trace:
    [] __mutex_lock_slowpath+0xf7/0x180
    [] mutex_lock+0x2b/0x50
    [] nbd_ioctl+0x6c/0x1c0 [nbd]
    [] blkdev_ioctl+0x230/0x730
    [] block_ioctl+0x41/0x50
    [] do_vfs_ioctl+0x93/0x370
    [] sys_ioctl+0x81/0xa0
    [] system_call_fastpath+0x16/0x1b

    Instrumenting the nbd module's ioctl handler with some extra logging
    clearly shows the NBD_DO_IT ioctl being invoked which is a long-lived
    ioctl in the sense that it doesn't return until another ioctl asks the
    driver to disconnect. However, that other ioctl blocks, waiting for the
    module-level mutex that replaced the BKL, and then we're stuck.

    This patch removes the module-level mutex altogether. It's clearly
    wrong, and as far as I can see, it's entirely unnecessary, since the nbd
    driver maintains per-device mutexes, and I don't see anything that would
    require a module-level (or kernel-level, for that matter) mutex.
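
    A simplified sketch of the code being removed (close to, but not
    exactly, the driver source) shows why it deadlocks:

    static DEFINE_MUTEX(nbd_mutex);         /* module-wide BKL replacement */

    static int nbd_ioctl(struct block_device *bdev, fmode_t mode,
                         unsigned int cmd, unsigned long arg)
    {
            struct nbd_device *lo = bdev->bd_disk->private_data;
            int error;

            /* NBD_DO_IT holds this mutex for the whole lifetime of the
             * connection, so the NBD_DISCONNECT ioctl that should end it
             * blocks right here and never gets through. */
            mutex_lock(&nbd_mutex);
            error = __nbd_ioctl(bdev, lo, cmd, arg);
            mutex_unlock(&nbd_mutex);

            return error;
    }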

    Signed-off-by: Soren Hansen
    Acked-by: Serge Hallyn
    Acked-by: Paul Clements
    Cc: Arnd Bergmann
    Cc: Jens Axboe
    Cc: [2.6.37.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Soren Hansen
     
  • In drivers/rtc/rtc-proc.c, seq_open() (called via single_open()) can
    return -ENOMEM.

    86 if (!try_module_get(THIS_MODULE))
    87 return -ENODEV;
    88
    89 return single_open(file, rtc_proc_show, rtc);

    In this case, module_put(THIS_MODULE) must be called before returning
    from rtc_proc_open (line 89).

    Found by Linux Device Drivers Verification Project
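
    A sketch of the needed fix, following the description above (treat it
    as illustrative rather than the exact merged patch):

    static int rtc_proc_open(struct inode *inode, struct file *file)
    {
            int ret;
            struct rtc_device *rtc = PDE(inode)->data;

            if (!try_module_get(THIS_MODULE))
                    return -ENODEV;

            ret = single_open(file, rtc_proc_show, rtc);
            if (ret)
                    module_put(THIS_MODULE);        /* the missing release */
            return ret;
    }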

    Signed-off-by: Alexander Strakh
    Cc: Alessandro Zummo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Strakh
     
  • Add a mutex around register communication and handling. Without the
    mutex, GPIOs didn't switch as expected when multiple outputs were
    toggled in a fast sequence of status changes.
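
    A minimal sketch of the serialization (field and helper names are
    hypothetical):

    mutex_lock(&chip->lock);
    reg = chip->reg_output;                 /* read-modify-write under the lock */
    if (value)
            reg |= BIT(offset);
    else
            reg &= ~BIT(offset);
    write_output_reg(chip, reg);            /* hypothetical register write */
    chip->reg_output = reg;
    mutex_unlock(&chip->lock);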

    Signed-off-by: Roland Stigge
    Acked-by: Eric Miao
    Cc: Grant Likely
    Cc: Marc Zyngier
    Cc: Ben Gardner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland Stigge
     
  • The wake_up_process() call in ptrace_detach() is spurious and not
    interlocked with the tracee state. IOW, the tracee could be running or
    sleeping in any place in the kernel by the time wake_up_process() is
    called. This can lead to the tracee waking up unexpectedly which can be
    dangerous.

    The wake_up is spurious and should be removed but for now reduce its
    toxicity by only waking up if the tracee is in TRACED or STOPPED state.
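
    The mitigation likely looks something like this (a sketch;
    wake_up_state() only wakes a task whose current state matches the
    given mask):

    /* was: wake_up_process(child); */
    wake_up_state(child, TASK_TRACED | TASK_STOPPED);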

    This bug can possibly be used as an attack vector. I don't think it
    will take too much effort to come up with an attack which triggers oops
    somewhere. Most sleeps are wrapped in condition test loops and should
    be safe but we have quite a number of places where sleep and wakeup
    conditions are expected to be interlocked. Although the window of
    opportunity is tiny, ptrace can be used by non-privileged users and with
    some loading the window can definitely be extended and exploited.

    Signed-off-by: Tejun Heo
    Acked-by: Roland McGrath
    Acked-by: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • In commit fa0d7e3de6d6 ("fs: icache RCU free inodes"), inodes are freed
    via RCU instead of being freed directly. This causes a crash when we
    rmmod immediately after we umount the volume [1].

    So we need to call rcu_barrier() after kill_sb() so that the inodes are
    freed before we rmmod. The idea is inspired by Aneesh Kumar.
    rcu_barrier() will wait for all callbacks to finish before proceeding.
    The original patch was done by Tao Ma, but synchronize_rcu() is not
    enough here.

    1. http://marc.info/?l=linux-fsdevel&m=129680863330185&w=2
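
    The typical shape of such a fix, sketched for a generic filesystem
    module (names here are hypothetical):

    static void __exit exit_example_fs(void)
    {
            unregister_filesystem(&example_fs_type);
            rcu_barrier();          /* drain pending call_rcu() inode frees */
            kmem_cache_destroy(example_inode_cachep);
    }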

    Tested-by: Tao Ma
    Signed-off-by: Boaz Harrosh
    Cc: Nick Piggin
    Cc: Al Viro
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh