03 Apr, 2012

3 commits

  • commit ce880cb860f36694d2cdebfac9e6ae18176fe4c4 upstream.

    The USB graphics card driver delays the unregistering of the framebuffer
    device to a workqueue, which breaks the userspace visible remove uevent
    sequence. Recent userspace tools started to support USB graphics card
    hotplug out-of-the-box and rely on proper events sent by the kernel.

    The framebuffer device is a direct child of the USB interface which is
    removed immediately after the USB .disconnect() callback. But the fb device
    in /sys stays around until its final cleanup, at a time where all the parent
    devices have been removed already.

    To work around that, we remove the sysfs fb device directly in the USB
    .disconnect() callback and leave only the cleanup of the internal fb
    data to the delayed work.

    Before:
    add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2 (usb)
    add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0 (usb)
    add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0/graphics/fb0 (graphics)
    remove /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0 (usb)
    remove /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2 (usb)
    remove /2-1.2:1.0/graphics/fb0 (graphics)

    After:
    add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2 (usb)
    add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0 (usb)
    add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0/graphics/fb1 (graphics)
    remove /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0/graphics/fb1 (graphics)
    remove /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0 (usb)
    remove /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2 (usb)
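
    A hedged sketch of what such a fix looks like in the driver's disconnect
    path (struct and helper names follow the udlfb driver and the
    unlink_framebuffer() helper this change introduces; treat the details as
    assumptions):

    static void dlfb_usb_disconnect(struct usb_interface *intf)
    {
            struct dlfb_data *dev = usb_get_intfdata(intf);
            struct fb_info *info = dev->info;

            /* Remove the sysfs fb device now, while its USB parent
             * devices still exist, so the remove uevents nest properly. */
            unlink_framebuffer(info);

            /* Only the cleanup of the internal fb data stays deferred. */
            schedule_delayed_work(&dev->free_framebuffer_work, HZ);
    }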

    Tested-by: Bernie Thompson
    Acked-by: Bernie Thompson
    Signed-off-by: Kay Sievers
    Signed-off-by: Florian Tobias Schandinat
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     
  • commit 1a5a9906d4e8d1976b701f889d8f35d54b928f25 upstream.

    In some cases it may happen that pmd_none_or_clear_bad() is called with
    the mmap_sem held in read mode. In those cases the huge page faults can
    allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
    false positive from pmd_bad(), which does not expect to see a pmd
    materializing as trans huge.

    It's not khugepaged causing the problem; khugepaged holds the mmap_sem
    in write mode (and all those sites must hold the mmap_sem in read mode
    to prevent pagetables from going away under them; during code review it
    seems vm86 mode on 32bit kernels requires that too, unless it's
    restricted to 1 thread per process or UP builds). The race is only with
    the huge pagefaults that can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be enough
    to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd became a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).

    All we need is to enforce that there is no way anymore that, in a code
    path like the one below, pmd_trans_huge can be false but
    pmd_none_or_clear_bad can still run into a hugepmd. The overhead of a
    barrier() is just a compiler tweak and should not be measurable (I only
    added it for THP builds). I don't exclude that different compiler
    versions may have prevented the race too by caching the value of *pmd
    on the stack (that hasn't been verified, but it wouldn't be impossible
    considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge and pmd_none
    are all inlines and there's no external function called in between
    pmd_trans_huge and pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
            if (next-addr != HPAGE_PMD_SIZE) {
                    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                    split_huge_page_pmd(vma->vm_mm, pmd);
            } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                    continue;
            /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))
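
    A minimal sketch of the resulting helper (upstream calls it
    pmd_none_or_trans_huge_or_clear_bad(); simplified here):

    static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
    {
            pmd_t pmdval = *pmd;    /* snapshot into a local */

            /* Compiler barrier only; no CPU barrier is needed, any stale
             * value is handled as explained above (THP builds only). */
            barrier();
            if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                    return 1;       /* skip: none, or became huge under us */
            if (unlikely(pmd_bad(pmdval))) {
                    pmd_clear_bad(pmd);
                    return 1;
            }
            return 0;
    }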

    Because this race condition could be exercised without special
    privileges, it was reported as CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

    143 void pmd_clear_bad(pmd_t *pmd)
    144 {
 -> 145         pmd_ERROR(*pmd);
    146         pmd_clear(pmd);
    147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

    1381         if (mapcount != page_mapcount(page))
    1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
    1383                        mapcount, page_mapcount(page));
 -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

             virtual address space
            .---------------------.
            |                     |
            |                     |
          .-|---------------------|
          | |                     |
          | |    -- B(fault) --   |
          | |                     |
     huge | |/////////////////////|-.
     page < |/////////////////////|  > A(range)
          | |/////////////////////|-'
          | |                     |
          | |                     |
          '-|---------------------|
            |                     |
            |                     |
            '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
          madvise_dontneed
            zap_page_range
              unmap_vmas
                unmap_page_range
                  zap_pud_range
                    zap_pmd_range
                      //
                      // Assume that this huge page has never been accessed.
                      // I.e. content of the PMD entry is zero (not mapped).
                      //
                      if (pmd_trans_huge(*pmd)) {
                          // We don't get here due to the above assumption.
                      }
                      //
                      // Assume that Thread B incurred a page fault and
          .---------> // sneaks in here as shown below.
          |           //
          |           if (pmd_none_or_clear_bad(pmd))
          |           {
          |               if (unlikely(pmd_bad(*pmd)))
          |                   pmd_clear_bad
          |                   {
          |                       pmd_ERROR
          |                           // Log "bad pmd ..." message here.
          |                       pmd_clear
          |                           // Clear the page's PMD entry.
          |                           // Thread B incremented the map count
          |                           // in page_add_new_anon_rmap(), but
          |                           // now the page is no longer mapped
          |                           // by a PMD entry (-> inconsistency).
          |                   }
          |           }
          |
          v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
            // We get here due to the above assumption (PMD entry is zero).
            do_huge_pmd_anonymous_page
              alloc_hugepage_vma
                // Allocate a new transparent huge page here.
              ...
              __do_huge_pmd_anonymous_page
                ...
                spin_lock(&mm->page_table_lock)
                ...
                page_add_new_anon_rmap
                  // Here we increment the page's map count (starts at -1).
                  atomic_set(&page->_mapcount, 0)
                set_pmd_at
                  // Here we set the page's PMD entry which will be cleared
                  // when Thread A calls pmd_clear_bad().
                ...
                spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit f910381a55cdaa097030291f272f6e6e4380c39a upstream.

    Add a div64_long macro which is used to divide a 64bit number by a long
    (which can be 4 bytes on 32bit systems and 8 bytes on 64bit systems).
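
    The addition itself is tiny; roughly (in include/linux/math64.h,
    reusing the existing signed division helpers):

    #if BITS_PER_LONG == 64
    #define div64_long(x, y) div64_s64((x), (y))   /* long is 8 bytes */
    #else
    #define div64_long(x, y) div_s64((x), (y))     /* long is 4 bytes */
    #endif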

    Suggested-by: Thomas Gleixner
    Signed-off-by: Sasha Levin
    Cc: johnstul@us.ibm.com
    Link: http://lkml.kernel.org/r/1331829374-31543-1-git-send-email-levinsasha928@gmail.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Sasha Levin
     

20 Mar, 2012

3 commits

  • commit 62d3c5439c534b0e6c653fc63e6d8c67be3a57b1 upstream.

    This patch (as1519) fixes a bug in the block layer's disk-events
    polling. The polling is done by a work routine queued on the
    system_nrt_wq workqueue. Since that workqueue isn't freezable, the
    polling continues even in the middle of a system sleep transition.

    Obviously, polling a suspended drive for media changes and such isn't
    a good thing to do; in the case of USB mass-storage devices it can
    lead to real problems requiring device resets and even re-enumeration.

    The patch fixes things by creating a new system-wide, non-reentrant,
    freezable workqueue and using it for disk-events polling.
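
    A hedged sketch of the two halves of the change (workqueue name and
    call sites assumed):

    /* kernel/workqueue.c: one new system-wide workqueue */
    system_nrt_freezable_wq = alloc_workqueue("events_nrt_freezable",
                                              WQ_NON_REENTRANT |
                                              WQ_FREEZABLE, 0);

    /* block/genhd.c: poll on it instead of system_nrt_wq, so the work
     * is frozen (and polling stops) across system sleep transitions */
    queue_delayed_work(system_nrt_freezable_wq, &ev->dwork, intv);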

    Signed-off-by: Alan Stern
    Acked-by: Tejun Heo
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Alan Stern
     
  • commit fe316bf2d5847bc5dd975668671a7b1067603bc7 upstream.

    Since 2.6.39 (1196f8b), when a driver returns -ENOMEDIUM for open(),
    __blkdev_get() calls rescan_partitions() to remove
    in-kernel partition structures and raise a KOBJ_CHANGE uevent.

    However, it ends up calling the driver's revalidate_disk without an
    open, which can cause an oops.

    In the case of SCSI:

    process A                          process B
    ----------------------------------------------
    sys_open
      __blkdev_get
        sd_open
          returns -ENOMEDIUM
                                       scsi_remove_device
        rescan_partitions
          sd_revalidate_disk

    Oopses are reported here:
    http://marc.info/?l=linux-scsi&m=132388619710052

    This patch separates the partition invalidation from rescan_partitions()
    and uses it for the -ENOMEDIUM case.
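
    In __blkdev_get() the result is roughly (invalidate_partitions() being
    the newly split-out helper; exact shape is an assumption):

    if (ret == -ENOMEDIUM)
            /* no media: drop stale in-kernel partitions and send the
             * KOBJ_CHANGE uevent, without calling revalidate_disk() on
             * a device that was never opened */
            invalidate_partitions(disk, bdev);
    else
            rescan_partitions(disk, bdev);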

    Reported-by: Huajun Li
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jun'ichi Nomura
     
  • [ Upstream commit 03606895cd98c0a628b17324fd7b5ff15db7e3cd ]

    Niccolo Belli reported ipsec crashes when we handle a frame without
    a mac header (ATM in his case).

    Before copying the mac header, better make sure it is present.

    Bugzilla reference: https://bugzilla.kernel.org/show_bug.cgi?id=42809
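
    The pattern of the guard, as a hedged sketch (the exact call site is in
    the ipsec path and the destination buffer here is hypothetical;
    skb_mac_header_was_set() is the standard helper):

    /* ATM frames, among others, carry no mac header at all */
    if (skb_mac_header_was_set(skb))
            memcpy(old_mac, skb_mac_header(skb), skb->mac_len);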

    Reported-by: Niccolò Belli
    Tested-by: Niccolò Belli
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

13 Mar, 2012

4 commits

  • commit c49d005b6cc8491fad5b24f82805be2d6bcbd3dd upstream.

    A hardware bug in the OMAP4 HDMI PHY causes physical damage to the board
    if the HDMI PHY is kept powered on when the cable is not connected.

    This patch solves the problem by adding hot-plug-detection into the HDMI
    IP driver. This is not real HPD support, in the sense that nobody other
    than the IP driver gets to know about the HPD events; it is only meant
    to fix the HW bug.

    The strategy is simple: If the display device is turned off by the user,
    the PHY power is set to OFF. When the display device is turned on by the
    user, the PHY power is set either to LDOON or TXON, depending on whether
    the HDMI cable is connected.

    The reason to avoid PHY OFF when the display device is on, but the cable
    is disconnected, is that when the PHY is turned OFF, the HDMI IP is not
    "ticking" and thus the DISPC does not receive pixel clock from the HDMI
    IP. This would, for example, prevent any VSYNCs from happening, and
    would thus affect the users of omapdss. By using LDOON when the cable is
    disconnected we'll avoid the HW bug, but keep the HDMI working as usual
    from the user's point of view.
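
    The decision logic reduces to something like the sketch below
    (identifiers approximate the omapdss driver's and should be treated as
    assumptions):

    /* display enabled by the user: */
    if (hdmi_cable_connected())
            hdmi_set_phy_pwr(HDMI_PHYPWRCMD_TXON);  /* full operation */
    else
            hdmi_set_phy_pwr(HDMI_PHYPWRCMD_LDOON); /* safe, keeps the
                                                       pixel clock ticking */

    /* display disabled by the user: */
    hdmi_set_phy_pwr(HDMI_PHYPWRCMD_OFF);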

    Signed-off-by: Tomi Valkeinen
    Signed-off-by: Greg Kroah-Hartman

    Tomi Valkeinen
     
  • commit 5189fa19a4b2b4c3bec37c3a019d446148827717 upstream.

    There is only one error code to return for a bad user-space buffer
    pointer passed to a system call in the same address space as the
    system call is executed, and that is EFAULT. Furthermore, the
    low-level access routines, which catch most of the faults, return
    EFAULT already.

    Signed-off-by: H. Peter Anvin
    Reviewed-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    H. Peter Anvin
     
  • commit c8e252586f8d5de906385d8cf6385fee289a825e upstream.

    The regset common infrastructure assumed that regsets would always
    have .get and .set methods, but not necessarily .active methods.
    Unfortunately people have since written regsets without .set methods.

    Rather than putting in stub functions everywhere, handle regsets with
    null .get or .set methods explicitly.
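
    The explicit handling lands in the generic accessors; a hedged sketch
    modeled on copy_regset_to_user():

    static inline int copy_regset_to_user(struct task_struct *target,
                                          const struct user_regset_view *view,
                                          unsigned int setno,
                                          unsigned int offset,
                                          unsigned int size,
                                          void __user *data)
    {
            const struct user_regset *regset = &view->regsets[setno];

            if (!regset->get)
                    return -EOPNOTSUPP;  /* regset has no .get method */

            ...
    }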

    Signed-off-by: H. Peter Anvin
    Reviewed-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    H. Peter Anvin
     
  • commit 3c761ea05a8900a907f32b628611873f6bef24b2 upstream.

    The autofs compat handling fix caused a compile failure when
    CONFIG_COMPAT isn't defined.

    Instead of adding random #ifdef'fery in autofs, let's just make the
    compat helpers easier to use: without CONFIG_COMPAT, is_compat_task()
    just hardcodes to zero.

    We could probably do something similar for a number of other cases where
    we have #ifdef's in code, but this is the low-hanging fruit.
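
    The helper change itself is small; roughly (in include/linux/compat.h):

    #else /* !CONFIG_COMPAT */

    static inline int is_compat_task(void)
    {
            return 0;   /* no compat tasks without CONFIG_COMPAT */
    }

    #endif /* CONFIG_COMPAT */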

    Reported-and-tested-by: Andreas Schwab
    Signed-off-by: Linus Torvalds
    Cc: Jonathan Nieder
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

01 Mar, 2012

7 commits

  • commit 28d82dc1c4edbc352129f97f4ca22624d1fe61de upstream.

    The current epoll code can be tickled to run basically indefinitely in
    both loop detection path check (on ep_insert()), and in the wakeup paths.
    The programs that tickle this behavior set up deeply linked networks of
    epoll file descriptors that cause the epoll algorithms to traverse them
    indefinitely. A couple of these sample programs have been previously
    posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop detection path check algorithms, I simply keep track of
    the epoll nodes that have already been visited. Thus, the loop detection
    becomes proportional to the number of epoll file descriptors and links.
    This dramatically decreases the run-time of the loop check algorithm. In
    one diabolical case I tried, it reduced the run-time from 15 minutes
    (all in kernel time) to .3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar manner
    by keeping track of nodes that have already been visited, but the
    complexity is harder, since there can be multiple wakeups on different
    CPUs. Thus, I've opted to limit the number of possible wakeup paths when
    the paths are created.

    This is accomplished by noting that the file descriptor end points that
    are found during the loop detection pass (from the newly added link) are
    actually the sources for wakeup events. I keep a list of these file
    descriptors and limit the number and length of the paths that emanate
    from these 'source file descriptors'. In the current implementation I
    allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
    length 4 and 10 of length 5. Note that it is sufficient to check the
    'source file descriptors' reachable from the newly added link, since no
    other 'source file descriptors' will have newly added links. This allows
    us to check only the wakeup paths that may have gotten too long, and not
    re-check all possible wakeup paths on the system.
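
    The quoted limits map onto a small table in fs/eventpoll.c; a hedged
    sketch (upstream's handling of the array indices may differ):

    /* allowed wakeup paths from a 'source fd', indexed by path length */
    static const int path_limits[5] = { 1000, 500, 100, 50, 10 };
    static int path_count[5];

    static int path_count_inc(int nests)
    {
            if (++path_count[nests] > path_limits[nests])
                    return -1;      /* too many paths of this length */
            return 0;
    }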

    In terms of the path limit selection, I think it's first worth noting
    that the most common case for epoll is probably the model where you
    have one epoll file descriptor that is monitoring n 'source file
    descriptors'. In this case, each 'source file descriptor' has one path
    of length 1. Thus, I believe that the limits I'm proposing are quite
    reasonable and in fact may be too generous. I'm hoping that the
    proposed limits will not cause any workloads that currently work to
    fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently it's only used in a
    subset of the add paths. I need to hold the epmutex so that we can
    correctly traverse a coherent graph to check the number of paths. I
    believe that this additional locking is probably ok, since it's in the
    setup/teardown paths and doesn't affect the running paths, but it
    certainly is going to add some extra overhead. Also worth noting is
    that the epmutex was recently added to the epoll_ctl add operations in
    the initial path loop detection code, using the argument that it was
    not on a critical path.

    Another thing to note here, is the length of epoll chains that is allowed.
    Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced during the loop
    check detection code, and only when the epoll file descriptors are added
    in a certain order. Thus, this limit is currently easily bypassed. The
    newly added check for wakeup paths strictly limits the wakeup paths to a
    length of 5, regardless of the order in which ep's are linked together.
    Thus, a side-effect of the new code is a more consistent enforcement of
    the graph depth.

    Thus far, I've tested this using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    tested using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expected.

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jason Baron
     
  • commit d80e731ecab420ddcb79ee9d0ac427acbc187b4b upstream.

    This patch is intentionally incomplete to simplify the review.
    It ignores ep_unregister_pollwait() which plays with the same wqh.
    See the next change.

    epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
    f_op->poll() needs. In particular it assumes that the wait queue
    can't go away until eventpoll_release(). This is not true in case
    of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
    which is not connected to the file.

    This patch adds the special event, POLLFREE, currently only for
    epoll. It expects that init_poll_funcptr()'ed hook should do the
    necessary cleanup. Perhaps it should be defined as EPOLLFREE in
    eventpoll.

    __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
    ->signalfd_wqh is not empty, we add the new signalfd_cleanup()
    helper.
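
    The new helper, roughly as described (compare fs/signalfd.c in the
    upstream patch):

    void signalfd_cleanup(struct sighand_struct *sighand)
    {
            wait_queue_head_t *wqh = &sighand->signalfd_wqh;

            /* nothing to do unless a signalfd is attached to this sighand */
            if (likely(!waitqueue_active(wqh)))
                    return;

            /* POLLFREE tells epoll to drop its poll entry for this wqh;
             * POLLHUP wakes any ordinary poll/select sleepers. */
            wake_up_poll(wqh, POLLHUP | POLLFREE);
    }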

    ep_poll_callback(POLLFREE) simply does list_del_init(task_list). This
    makes this poll entry inconsistent, but we don't care. If you share an
    epoll fd which contains our sigfd with another process you should blame
    yourself. signalfd is "really special". I simply do not know how we
    can define the "right" semantics if it is used with epoll.

    The main problem is that epoll calls signalfd_poll() once to establish
    the connection with the wait queue; after that, signalfd_poll(NULL)
    returns different/inconsistent results depending on who does
    EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
    has nothing to do with the file, it works with the current thread.

    In short: this patch is the hack which tries to fix the symptoms.
    It also assumes that nobody can take tasklist_lock under epoll
    locks, this seems to be true.

    Note:

    - we do not have wake_up_all_poll(), but wake_up_poll()
      is fine; poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

    - signalfd_cleanup() uses POLLHUP along with POLLFREE;
      we need a couple of simple changes in eventpoll.c to
      make sure it can't be "lost".

    Reported-by: Maxime Bizon
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 4949314c7283ea4f9ade182ca599583b89f7edd6 upstream.

    We need to handle >1 page control cdbs, so extend the code to do a vmap
    if bigger than 1 page. It seems like kmap() is still preferable if just
    a page, fewer TLB shootdowns(?), so keep using that when possible.

    Rename the function pair for its new scope.
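
    A hedged sketch of the renamed helper's shape (names follow the
    transport_kmap_data_sg() pair; details simplified):

    void *transport_kmap_data_sg(struct se_cmd *cmd)
    {
            struct scatterlist *sg = cmd->t_data_sg;
            struct page **pages;
            int i;

            /* single page: kmap() is cheaper (fewer TLB shootdowns?) */
            if (cmd->t_data_nents == 1)
                    return kmap(sg_page(sg)) + sg->offset;

            /* more than one page: collect the pages and vmap them */
            pages = kmalloc(sizeof(*pages) * cmd->t_data_nents, GFP_KERNEL);
            if (!pages)
                    return NULL;
            for_each_sg(cmd->t_data_sg, sg, cmd->t_data_nents, i)
                    pages[i] = sg_page(sg);
            cmd->t_data_vmap = vmap(pages, cmd->t_data_nents, VM_MAP,
                                    PAGE_KERNEL);
            kfree(pages);
            return cmd->t_data_vmap;
    }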

    Signed-off-by: Andy Grover
    Signed-off-by: Nicholas Bellinger
    Signed-off-by: Greg Kroah-Hartman

    Andy Grover
     
  • commit d9f5343e35d9138432657202afa8e3ddb2ade360 upstream.

    Somehow we ended up with duplicate hub feature #defines in ch11.h.
    Tatyana Brokhman first created the USB 3.0 hub feature macros in 2.6.38
    with commit 0eadcc09203349b11ca477ec367079b23d32ab91 "usb: USB3.0 ch11
    definitions". In 2.6.39, I modified a patch from John Youn that added
    similar macros in a different place in the same file, and committed
    dbe79bbe9dcb22cb3651c46f18943477141ca452 "USB 3.0 Hub Changes".

    Some of the #defines used different names for the same values. Others
    used exactly the same names with the same values, like these gems:

    #define USB_PORT_FEAT_BH_PORT_RESET 28
    ...
    #define USB_PORT_FEAT_BH_PORT_RESET 28

    According to my very geeky husband (who looked it up in the C99 spec),
    it is allowed to have object-like macros with duplicate names as long as
    the replacement list is exactly the same. However, he recalled that
    some compilers will give warnings when they find duplicate macros. It's
    probably best to remove the duplicates in the stable tree, so that the
    code compiles for everyone.

    The macros are now fixed to move the feature requests that are specific
    to USB 3.0 hubs into a new section (out of the USB 2.0 hub feature
    section), and use the most common macro name.

    This patch should be backported to 2.6.39.

    Signed-off-by: Sarah Sharp
    Cc: Tatyana Brokhman
    Cc: John Youn
    Cc: Jamey Sharp
    Signed-off-by: Greg Kroah-Hartman

    Sarah Sharp
     
  • [ Upstream commit 16bda13d90c8d5da243e2cfa1677e62ecce26860 ]

    Just like skb->cb[], so that qdisc_skb_cb can be encapsulated inside
    of other data structures.

    This is intended to be used by IPoIB so that it can remember
    addressing information stored at hard_header_ops->create() time that
    it can fetch when the packet gets to the transmit routine.
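
    After the change the cb gets an explicit upper bound, so it can be
    embedded; roughly:

    struct qdisc_skb_cb {
            unsigned int    pkt_len;
            unsigned char   data[24];  /* explicit bound, like skb->cb[] */
    };

    /* which lets IPoIB wrap it with its own addressing info, e.g.: */
    struct ipoib_cb {
            struct qdisc_skb_cb     qdisc_cb;
            u8                      hwaddr[INFINIBAND_ALEN];
    };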

    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit e6b45241c57a83197e5de9166b3b0d32ac562609 ]

    Eric Dumazet found that commit 813b3b5db83
    (ipv4: Use caller's on-stack flowi as-is in output
    route lookups.), which went into 3.0, added a regression.
    The problem appears to be that the resulting flowi4_oif is
    used incorrectly as an input parameter to some routing lookups.
    The result is that when connecting to a local port without a
    listener, if the IP address used is not on a loopback
    interface, we incorrectly assign RTN_UNICAST to the output
    route because no route is matched by oif=lo. The RST packet
    can not be sent immediately by tcp_v4_send_reset because
    it expects RTN_LOCAL.

    So, change ip_route_connect and ip_route_newports to
    update the flowi4 fields that are input parameters, because
    we do not want unnecessary binding to oif.

    To make clear which input parameters can be modified
    during lookup, and to show which fields of flowi4 are
    reused, add a new function to update the flowi4
    structure: flowi4_update_output.
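
    Per the patch, the new helper is a straightforward setter for exactly
    the fields a lookup may rewrite:

    static inline void flowi4_update_output(struct flowi4 *fl4, int oif,
                                            __u8 tos, __be32 daddr,
                                            __be32 saddr)
    {
            fl4->flowi4_oif = oif;
            fl4->flowi4_tos = tos;
            fl4->daddr = daddr;
            fl4->saddr = saddr;
    }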

    Thanks to Yurij M. Plotnikov for providing a bug report including a
    program to reproduce the problem.

    Thanks to Eric Dumazet for tracking the problem down to
    tcp_v4_send_reset and providing initial fix.

    Reported-by: Yurij M. Plotnikov
    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Julian Anastasov
     
  • commit 331818f1c468a24e581aedcbe52af799366a9dfe upstream.

    Commit bf118a342f10dafe44b14451a1392c3254629a1f (NFSv4: include bitmap
    in nfsv4 get acl data) introduces the 'acl_scratch' page for the case
    where we may need to decode multi-page data. However it fails to take
    into account the fact that the variable may be NULL (for the case where
    we're not doing multi-page decode), and it also attaches it to the
    encoding xdr_stream rather than the decoding one.

    The immediate result is an Oops in nfs4_xdr_enc_getacl due to the
    call to page_address() with a NULL page pointer.

    Signed-off-by: Trond Myklebust
    Cc: Andy Adamson
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     

21 Feb, 2012

5 commits

  • commit f2ea0f5f04c97b48c88edccba52b0682fbe45087 upstream.

    Use a standard ror64() instead of a hand-written one.
    There is no standard ror64, so create it.

    The difference is the shift value being "unsigned int" instead of
    uint64_t (for which there is no reason). gcc then starts to emit native
    ROR instructions, which it doesn't do for some reason currently. This
    should make the code faster.

    Patch survives in-tree crypto test and ping flood with hmac(sha512) on.
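
    The new helper mirrors the existing ror32() in include/linux/bitops.h:

    static inline __u64 ror64(__u64 word, unsigned int shift)
    {
            /* the idiom gcc recognizes and turns into a native ROR */
            return (word >> shift) | (word << (64 - shift));
    }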

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Herbert Xu
    Signed-off-by: Greg Kroah-Hartman

    Alexey Dobriyan
     
  • commit f9c2a0dc42a6938ff2a80e55ca2bbd1d5581c72e upstream.

    The current PIO mode causes a kernel crash with CONFIG_HIGHMEM,
    since highmem pages get a NULL from sg_virt(sg).
    This patch fixes the following problem.

    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    pgd = c0004000
    [00000000] *pgd=00000000
    Internal error: Oops: 817 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 0 Not tainted (3.0.15-01423-gdbf465f #589)
    PC is at dw_mci_pull_data32+0x4c/0x9c
    LR is at dw_mci_read_data_pio+0x54/0x1f0
    pc : [] lr : [] psr: 20000193
    sp : c0619d48 ip : c0619d70 fp : c0619d6c
    r10: 00000000 r9 : 00000002 r8 : 00001000
    r7 : 00000200 r6 : 00000000 r5 : e1dd3100 r4 : 00000000
    r3 : 65622023 r2 : 0000007f r1 : eeb96000 r0 : e1dd3100
    Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment
    xkernel
    Control: 10c5387d Table: 61e2004a DAC: 00000015
    Process swapper (pid: 0, stack limit = 0xc06182f0)
    Stack: (0xc0619d48 to 0xc061a000)
    9d40: e1dd3100 e1a4f000 00000000 e1dd3100 e1a4f000 00000200
    9d60: c0619da4 c0619d70 c035988c c03587e4 c0619d9c e18158f4 e1dd3100 e1dd3100
    9d80: 00000020 00000000 00000000 00000020 c06e8a84 00000000 c0619e04 c0619da8
    9da0: c0359b24 c0359844 e18158f4 e1dd3164 e1dd3168 e1dd3150 3d02fc79 e1dd3154
    9dc0: e1dd3178 00000000 00000020 00000000 e1dd3150 00000000 c10dd7e8 e1a84900
    9de0: c061e7cc 00000000 00000000 0000008d c06e8a84 c061e780 c0619e4c c0619e08
    9e00: c00c4738 c0359a34 3d02fc79 00000000 c0619e4c c05a1698 c05a1670 c05a165c
    9e20: c04de8b0 c061e780 c061e7cc e1a84900 ffffed68 0000008d c0618000 00000000
    9e40: c0619e6c c0619e50 c00c48b4 c00c46c8 c061e780 c00423ac c061e7cc ffffed68
    9e60: c0619e8c c0619e70 c00c7358 c00c487c 0000008d ffffee38 c0618000 ffffed68
    9e80: c0619ea4 c0619e90 c00c4258 c00c72b0 c00423ac ffffee38 c0619ecc c0619ea8
    9ea0: c004241c c00c4234 ffffffff f8810000 0000006d 00000002 00000001 7fffffff
    9ec0: c0619f44 c0619ed0 c0048bc0 c00423c4 220ae7a9 00000000 386f0d30 0005d3a4
    9ee0: c00423ac c10dd0b8 c06f2cd8 c0618000 c0594778 c003a674 7fffffff c0619f44
    9f00: 386f0d30 c0619f18 c00a6f94 c005be3c 80000013 ffffffff 386f0d30 0005d3a4
    9f20: 386f0d30 0005d2d1 c10dd0a8 c10dd0b8 c06f2cd8 c0618000 c0619f74 c0619f48
    9f40: c0345858 c005be00 c00a2440 c0618000 c0618000 c00410d8 c06c1944 c00410fc
    9f60: c0594778 c003a674 c0619f9c c0619f78 c004a7e8 c03457b4 c0618000 c06c18f8
    9f80: 00000000 c0039c70 c06c18d4 c003a674 c0619fb4 c0619fa0 c04ceafc c004a714
    9fa0: c06287b4 c06c18f8 c0619ff4 c0619fb8 c0008b68 c04cea68 c0008578 00000000
    9fc0: 00000000 c003a674 00000000 10c5387d c0628658 c003aa78 c062f1c4 4000406a
    9fe0: 413fc090 00000000 00000000 c0619ff8 40008044 c0008858 00000000 00000000
    Backtrace:
    [] (dw_mci_pull_data32+0x0/0x9c) from [] (dw_mci_read_data_pio+0x54/0x1f0)
    r6:00000200 r5:e1a4f000 r4:e1dd3100
    [] (dw_mci_read_data_pio+0x0/0x1f0) from [] (dw_mci_interrupt+0xfc/0x4a4)
    [] (dw_mci_interrupt+0x0/0x4a4) from [] (handle_irq_event_percpu+0x7c/0x1b4)
    [] (handle_irq_event_percpu+0x0/0x1b4) from [] (handle_irq_event+0x44/0x64)
    [] (handle_irq_event+0x0/0x64) from [] (handle_fasteoi_irq+0xb4/0x124)
    r7:ffffed68 r6:c061e7cc r5:c00423ac r4:c061e780
    [] (handle_fasteoi_irq+0x0/0x124) from [] (generic_handle_irq+0x30/0x38)
    r7:ffffed68 r6:c0618000 r5:ffffee38 r4:0000008d
    [] (generic_handle_irq+0x0/0x38) from [] (asm_do_IRQ+0x64/0xe0)
    r5:ffffee38 r4:c00423ac
    [] (asm_do_IRQ+0x0/0xe0) from [] (__irq_svc+0x80/0x14c)
    Exception stack(0xc0619ed0 to 0xc0619f18)
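
    I believe the fix moves the PIO loop to the sg_miter API, which kmaps
    each scatterlist page as it is walked instead of assuming a lowmem
    address from sg_virt(); a rough sketch (direction flag and the
    dw_mci_pull_data() callee are assumptions):

    struct sg_mapping_iter *sgm = &host->sg_miter;

    sg_miter_start(sgm, data->sg, data->sg_len,
                   SG_MITER_ATOMIC | SG_MITER_TO_SG);
    while (sg_miter_next(sgm)) {
            /* sgm->addr is a kmapped address, valid even for highmem */
            dw_mci_pull_data(host, sgm->addr, sgm->length);
    }
    sg_miter_stop(sgm);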

    Signed-off-by: Seungwon Jeon
    Acked-by: Will Newton
    Signed-off-by: Chris Ball
    Signed-off-by: Greg Kroah-Hartman

    Seungwon Jeon
     
  • commit 977b7e3a52a7421ad33a393a38ece59f3d41c2fa upstream.

    When an SD card is hot removed without umount, del_gendisk() will call
    bdi_unregister() without destroying/freeing it. This leaves the bdi with
    bdi->dev = NULL, bdi->wb.task = NULL, and removed from bdi->bdi_list.

    When sync(2) gets the bdi before bdi_unregister() and calls
    bdi_queue_work() after the unregister, trace_writeback_queue will be
    dereferencing the NULL bdi->dev. Fix it with a simple test for NULL.
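
    The guard is in the tracepoint's assignment; a hedged sketch of the
    pattern:

    /* bdi->dev may already have been torn down by bdi_unregister() */
    strncpy(__entry->name,
            bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);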

    LKML-reference: http://lkml.org/lkml/2012/1/18/346
    Reported-by: Rabin Vincent
    Tested-by: Namjae Jeon
    Signed-off-by: Wu Fengguang
    Signed-off-by: Greg Kroah-Hartman

    Wu Fengguang
     
  • commit 15eb77a07c714ac80201abd0a9568888bcee6276 upstream.

    bdi_prune_sb() resets sb->s_bdi to default_backing_dev_info when
    tearing down the original bdi. Fix trace_writeback_single_inode to
    test sb->s_bdi == default_backing_dev_info rather than bdi->dev == NULL
    for a torn-down bdi.

    Reported-by: Rabin Vincent
    Tested-by: Rabin Vincent
    Signed-off-by: Wu Fengguang
    Signed-off-by: Greg Kroah-Hartman

    Wu Fengguang
     
  • commit 3310225dfc71a35a2cc9340c15c0e08b14b3c754 upstream.

    PROP_MAX_SHIFT should be set to 32: otherwise (bdi_dirty * numerator)
    could easily overflow if the numerator used up to 48 bits, leaving
    only 16 bits for bdi_dirty.

    Cc: Peter Zijlstra
    Reported-by: Ilya Tumaykin
    Tested-by: Ilya Tumaykin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Greg Kroah-Hartman

    Wu Fengguang
     

14 Feb, 2012

3 commits

  • commit 9c0a835a9d9aed41bcf9c287f5069133a6e2a87b upstream.

    usb/ch9.h will be installed to /usr/include/linux
    and be used from user space.
    But le16_to_cpu() is only defined for kernel code.
    Without this patch, user space compilation will be broken.
    Special thanks to Stefan Becker.
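
    The fix is to switch to the double-underscore variant, which the
    exported headers do provide; e.g. for the maxp helper (helper name is
    an assumption):

    static inline int
    usb_endpoint_maxp(const struct usb_endpoint_descriptor *epd)
    {
            /* __le16_to_cpu() exists in user space; le16_to_cpu() does not */
            return __le16_to_cpu(epd->wMaxPacketSize);
    }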

    Reported-by: Stefan Becker
    Signed-off-by: Kuninori Morimoto
    Signed-off-by: Felipe Balbi
    Signed-off-by: Greg Kroah-Hartman

    Kuninori Morimoto
     
  • commit d020283dc694c9ec31b410f522252f7a8397e67d upstream.

    Looks like the change "PM QoS: Move and rename the implementation
    files", merged during the 3.2 development cycle, made PM QoS depend on
    CONFIG_PM, which depends on (PM_SLEEP || PM_RUNTIME).

    That breaks CPU C-states with kernels not having these CONFIGs, causing
    CPUs to spend time in the Polling loop idle instead of going into deep
    C-states, consuming way more power. This is with either acpi idle or
    intel idle enabled.

    Either CONFIG_PM should be enabled with any pm_qos users, or the
    !CONFIG_PM pm_qos_request() should return sane defaults so as not to
    break the existing users. Here is the patch for the latter option.
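
    A hedged sketch of the latter option (constant names assumed from
    include/linux/pm_qos.h):

    #else /* !CONFIG_PM */

    static inline s32 pm_qos_request(int pm_qos_class)
    {
            /* same defaults a request-less CONFIG_PM kernel would see,
             * so C-state selection keeps working */
            switch (pm_qos_class) {
            case PM_QOS_CPU_DMA_LATENCY:
                    return PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE;
            case PM_QOS_NETWORK_LATENCY:
                    return PM_QOS_NETWORK_LAT_DEFAULT_VALUE;
            case PM_QOS_NETWORK_THROUGHPUT:
                    return PM_QOS_NETWORK_THROUGHPUT_DEFAULT_VALUE;
            default:
                    return PM_QOS_DEFAULT_VALUE;
            }
    }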

    [rjw: Modified the changelog slightly.]

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Venkatesh Pallipadi
     
  • commit 181e9bdef37bfcaa41f3ab6c948a2a0d60a268b5 upstream.

    Commit 2aede851ddf08666f68ffc17be446420e9d2a056

    PM / Hibernate: Freeze kernel threads after preallocating memory

    introduced a mechanism by which kernel threads were frozen after
    the preallocation of hibernate image memory to avoid problems with
    frozen kernel threads not responding to memory freeing requests.
    However, it overlooked the s2disk code path, in which the
    SNAPSHOT_CREATE_IMAGE ioctl was run directly after SNAPSHOT_FREE,
    which caused freeze_workqueues_begin() to BUG(), because it saw
    that workqueues had already been frozen.

    Although in principle this issue might be addressed by removing
    the relevant BUG_ON() from freeze_workqueues_begin(), that would
    reintroduce the very problem that commit 2aede851ddf08666f68ffc17be4
    attempted to avoid into that particular code path. For this reason,
    to fix the issue at hand, introduce thaw_kernel_threads() and make
    the SNAPSHOT_FREE ioctl execute it.
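
    In the snapshot ioctl this amounts to roughly (see kernel/power/user.c;
    surrounding lines simplified):

    case SNAPSHOT_FREE:
            swsusp_free();
            memset(&data->handle, 0, sizeof(struct snapshot_handle));
            data->ready = 0;
            /* kernel threads were frozen by SNAPSHOT_CREATE_IMAGE's
             * preallocation; thaw them so a repeated CREATE_IMAGE does
             * not trip the BUG_ON in freeze_workqueues_begin() */
            thaw_kernel_threads();
            break;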

    Special thanks to Srivatsa S. Bhat for detailed analysis of the
    problem.

    Reported-and-tested-by: Jiri Slaby
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Srivatsa S. Bhat
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     

07 Feb, 2012

1 commit

  • commit 3c076351c4027a56d5005a39a0b518a4ba393ce2 upstream.

    Right now we forcibly clear ASPM state on all devices if the BIOS indicates
    that the feature isn't supported. Based on the Microsoft presentation
    "PCI Express In Depth for Windows Vista and Beyond", I'm starting to think
    that this may be an error. The implication is that unless the platform
    grants full control via _OSC, Windows will not touch any PCIe features -
    including ASPM. In that case clearing ASPM state would be an error unless
    the platform has granted us that control.

    This patch reworks the ASPM disabling code such that the actual clearing
    of state is triggered by a successful handoff of PCIe control to the OS.
    The general ASPM code undergoes some changes in order to ensure that the
    ability to clear the bits isn't overridden by ASPM having already been
    disabled. Further, this theoretically now allows for situations where
    only a subset of PCIe roots hand over control, leaving the others in the
    BIOS state.

    It's difficult to know for sure that this is the right thing to do -
    there's zero public documentation on the interaction between all of these
    components. But enough vendors enable ASPM on platforms and then set this
    bit that it seems likely that they're expecting the OS to leave them alone.

    Measured to save around 5W on an idle Thinkpad X220.

    Signed-off-by: Matthew Garrett
    Signed-off-by: Jesse Barnes
    Signed-off-by: Greg Kroah-Hartman

    Matthew Garrett
     

04 Feb, 2012

2 commits

  • [ Upstream commit 5ee4433efe99b9f39f6eff5052a177bbcfe72cea ]

    By definition net_generic should never be called when it can return
    NULL. Fail conspicuously with a BUG_ON to make it clear when people
    mess up that a NULL return should never happen.

    Recently there was a bug in the CAIF subsystem where it was registered
    with register_pernet_device instead of register_pernet_subsys. It was
    erroneously concluded that net_generic could validly return NULL and
    that net_assign_generic was buggy (when it was just inefficient).
    Hopefully this BUG_ON will prevent people from coming to similar
    erroneous conclusions in the future.
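
    The check sits in the lookup itself; roughly
    (include/net/netns/generic.h, RCU details abbreviated):

    static inline void *net_generic(const struct net *net, int id)
    {
            struct net_generic *ng;
            void *ptr;

            rcu_read_lock();
            ng = rcu_dereference(net->gen);
            ptr = ng->ptr[id - 1];
            rcu_read_unlock();

            /* by definition this must never be NULL; fail loudly if a
             * subsystem registered itself the wrong way */
            BUG_ON(!ptr);
            return ptr;
    }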

    Signed-off-by: Eric W. Biederman
    Tested-by: Sasha Levin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 598781d71119827b454fd75d46f84755bca6f0c6 upstream.

    If the master tries to authenticate a client using drm_authmagic and
    that client has already closed its drm file descriptor, either wilfully
    or because it was terminated, the call to drm_authmagic will dereference
    a stale pointer into kmalloc'ed memory and corrupt it.

    Typically this results in a hard system hang.

    This patch fixes that problem by removing any authentication tokens
    (struct drm_magic_entry) open for a file descriptor when that file
    descriptor is closed.

    Signed-off-by: Thomas Hellstrom
    Reviewed-by: Daniel Vetter
    Signed-off-by: Dave Airlie
    Signed-off-by: Greg Kroah-Hartman

    Thomas Hellstrom
     

26 Jan, 2012

12 commits

  • commit 245132643e1cfcd145bbc86a716c1818371fcb93 upstream.

    Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
    find_get_pages") correctly fixed an infinite loop; but left a problem
    that find_get_pages() on shmem would return 0 (appearing to callers to
    mean end of tree) when it meets a run of nr_pages swap entries.

    The only uses of find_get_pages() on shmem are via pagevec_lookup(),
    called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
    scan_mapping_unevictable_pages(). The first is already commented, and
    not worth worrying about; but the second can leave pages on the
    Unevictable list after an unusual sequence of swapping and locking.

    Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
    swap) instead of pagevec_lookup().

    But I don't want to contaminate vmscan.c with shmem internals, nor
    shmem.c with LRU locking. So move scan_mapping_unevictable_pages() into
    shmem.c, renaming it shmem_unlock_mapping(); and rename
    check_move_unevictable_page() to check_move_unevictable_pages(), looping
    down an array of pages, oftentimes under the same lock.

    Leave out the "rotate unevictable list" block: that's a leftover from
    when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
    handling involved looking at pages at tail of LRU.

    Was there significance to the sequence first ClearPageUnevictable, then
    test page_evictable, then SetPageUnevictable here? I think not, we're
    under LRU lock, and have no barriers between those.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit cd4ca7afc61d3b18fcd635002459fb6b1d701099 upstream.

    Update the xc4000 tuner definition; number 81 is already in use by
    TUNER_PARTSNIC_PTI_5NF05.

    Signed-off-by: Miroslav Slugen
    Signed-off-by: Mauro Carvalho Chehab
    Signed-off-by: Greg Kroah-Hartman

    Miroslav Slugen
     
  • commit 895f3022523361e9b383cf48f51feb1f7d5e7e53 upstream.

    The target code was not setting the additional sense length field in the
    sense data it returned, which meant that at least the Linux stack
    ignored the ASC/ASCQ fields. For example, without this patch, on a
    tcm_loop device:

    # sg_raw -v /dev/sda 2 0 0 0 0 0

    gives

    cdb to send: 02 00 00 00 00 00
    SCSI Status: Check Condition

    Sense Information:
    Fixed format, current; Sense key: Illegal Request
    Raw sense data (in hex):
    70 00 05 00 00 00 00 00

    while after the patch we correctly get the following (which matches what
    a regular disk returns):

    cdb to send: 02 00 00 00 00 00
    SCSI Status: Check Condition

    Sense Information:
    Fixed format, current; Sense key: Illegal Request
    Additional sense: Invalid command operation code
    Raw sense data (in hex):
    70 00 05 00 00 00 00 0a 00 00 00 00 20 00 00 00
    00 00
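
    The missing piece is a single byte of the fixed-format sense buffer;
    a hedged sketch:

    /* Fixed-format sense data: byte 7 is the "additional sense length",
     * i.e. how many valid bytes follow the first 8. Leaving it zero makes
     * initiators ignore the ASC/ASCQ bytes at offsets 12 and 13. */
    buffer[7] = 18 - 8;     /* 0x0a, as in the raw dump above */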

    Signed-off-by: Roland Dreier
    Signed-off-by: Nicholas Bellinger
    Signed-off-by: Greg Kroah-Hartman

    Roland Dreier
     
  • commit 8df0eb7c9d96f9e82f233ee8b74e0f0c8471f868 upstream.

    In SRAT v1, we had 8bit proximity domain (PXM) fields; SRAT v2 provides
    32bits for these. The new fields were reserved before.
    According to the ACPI spec, the OS must disregard reserved fields.
    In order to know whether or not to use them, we must know what version
    the SRAT table has.

    This patch stores the SRAT table revision for later consumption
    by arch specific __init functions.
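
    A hedged sketch of the storing side (the variable name follows the
    upstream patch; parsing details simplified):

    int acpi_srat_revision;         /* read later by arch __init code */

    static int __init acpi_parse_srat(struct acpi_table_header *table)
    {
            struct acpi_table_srat *srat;

            if (!table)
                    return -EINVAL;
            srat = (struct acpi_table_srat *)table;
            acpi_srat_revision = srat->header.revision;
            return 0;
    }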

    Signed-off-by: Kurt Garloff
    Signed-off-by: Len Brown
    Signed-off-by: Greg Kroah-Hartman

    Kurt Garloff
     
  • commit 0bfc96cb77224736dfa35c3c555d37b3646ef35e upstream.

    [ Changes with respect to 3.3: return -ENOTTY from scsi_verify_blk_ioctl
    and -ENOIOCTLCMD from sd_compat_ioctl. ]

    Linux allows executing the SG_IO ioctl on a partition or LVM volume, and
    will pass the command to the underlying block device. This is
    well-known, but it is also a large security problem when (via Unix
    permissions, ACLs, SELinux or a combination thereof) a program or user
    needs to be granted access only to part of the disk.

    This patch lets partitions forward a small set of harmless ioctls;
    others are logged with printk so that we can see which ioctls are
    actually sent. In my tests only CDROM_GET_CAPABILITY actually occurred.
    Of course it was being sent to a (partition on a) hard disk, so it would
    have failed with ENOTTY and the patch isn't changing anything in
    practice. Still, I'm treating it specially to avoid spamming the logs.

    In principle, this restriction should include programs running with
    CAP_SYS_RAWIO. If for example I let a program access /dev/sda2 and
    /dev/sdb, it still should not be able to read/write outside the
    boundaries of /dev/sda2 independent of the capabilities. However, for
    now programs with CAP_SYS_RAWIO will still be allowed to send the
    ioctls. Their actions will still be logged.

    This patch does not affect the non-libata IDE driver. That driver
    however already tests for bd != bd->bd_contains before issuing some
    ioctl; it could be restricted further to forbid these ioctls even for
    programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO.
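
    A hedged sketch of the whitelist check (the real list of permitted
    ioctls is longer):

    int scsi_verify_blk_ioctl(struct block_device *bd, unsigned int cmd)
    {
            if (bd && bd == bd->bd_contains)
                    return 0;               /* whole device: allow all */

            switch (cmd) {
            /* harmless ioctls stay permitted on partitions */
            case SCSI_IOCTL_GET_IDLUN:
            case SCSI_IOCTL_GET_BUS_NUMBER:
            case CDROM_GET_CAPABILITY:      /* logged specially upstream */
                    return 0;
            default:
                    /* for now, CAP_SYS_RAWIO still bypasses the check */
                    if (capable(CAP_SYS_RAWIO))
                            return 0;
                    return -ENOTTY;
            }
    }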

    Cc: linux-scsi@vger.kernel.org
    Cc: Jens Axboe
    Cc: James Bottomley
    Signed-off-by: Paolo Bonzini
    [ Make it also print the command name when warning - Linus ]
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     
  • commit 577ebb374c78314ac4617242f509e2f5e7156649 upstream.

    Introduce a wrapper around scsi_cmd_ioctl that takes a block device.

    The function will then be enhanced to detect partition block devices
    and, in that case, subject the ioctls to whitelisting.
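
    The wrapper funnels everything through the verification first; roughly:

    int scsi_cmd_blk_ioctl(struct block_device *bd, fmode_t mode,
                           unsigned int cmd, void __user *arg)
    {
            int ret;

            ret = scsi_verify_blk_ioctl(bd, cmd);
            if (ret < 0)
                    return ret;

            return scsi_cmd_ioctl(bd->bd_disk->queue, bd->bd_disk,
                                  mode, cmd, arg);
    }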

    Cc: linux-scsi@vger.kernel.org
    Cc: Jens Axboe
    Cc: James Bottomley
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     
  • commit eaf5f9073533cde21c7121c136f1c3f072d9cf59 upstream.

    Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may
    cause shrink_dcache_parent() to loop forever.

    Here's what appears to happen:

    1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1

    2 - CPU1: select_parent(P) locks P->d_lock

    3 - CPU0: shrink_dentry_list() locks C->d_lock;
              dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock

    4 - CPU1: select_parent(P) locks C->d_lock,
              moves C from the dispose list being processed on CPU0 to the
              new dispose list, returns 1

    5 - CPU0: shrink_dentry_list() finds dispose list empty, returns

    6 - Goto 2 with CPU0 and CPU1 switched

    Basically select_parent() steals the dentry from shrink_dentry_list() and thinks
    it found a new one, causing shrink_dentry_list() to think it's making progress
    and loop over and over.

    One way to trigger this is to make udev call stat() on a sysfs file
    while it is going away.

    Having a file in /lib/udev/rules.d/ with only this one rule seems to do
    the trick:

    ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true"

    Then execute the following loop:

    while true; do
            echo -bond0 > /sys/class/net/bonding_masters
            echo +bond0 > /sys/class/net/bonding_masters
            echo -bond1 > /sys/class/net/bonding_masters
            echo +bond1 > /sys/class/net/bonding_masters
    done

    One fix would be to check all callers and prevent concurrent calls to
    shrink_dcache_parent(). But I think a better solution is to stop the
    stealing behavior.

    This patch adds a new dentry flag that is set when the dentry is added to the
    dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a
    new reference just before being pruned.

    If the dentry has this flag, select_parent() will skip it and let
    shrink_dentry_list() retry pruning it. With select_parent() skipping those
    dentries there will not be the appearance of progress (new dentries found) when
    there is none, hence shrink_dcache_parent() will not loop forever.

    The flag is also set in prune_dcache_sb() for consistency, as suggested
    by Linus.
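
    A hedged sketch of the skip (the upstream flag is DCACHE_SHRINK_LIST;
    locking details omitted):

    /* in select_parent(): */
    if (dentry->d_flags & DCACHE_SHRINK_LIST) {
            /* already on another CPU's dispose list: don't steal it and
             * don't count it as progress; shrink_dentry_list() will
             * retry pruning it */
            continue;
    }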

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit 2fefb8a09e7ed251ae8996e0c69066e74c5aa560 upstream.

    There's no reason I can see that we need to call sv_shutdown between
    closing the two lists of sockets.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    J. Bruce Fields
     
  • commit 6c06108be53ca5e94d8b0e93883d534dd9079646 upstream.

    If ctrls->count is too high the multiplication could overflow and
    array_size would be lower than expected. Mauro and Hans Verkuil
    suggested that we cap it at 1024. That comes from the maximum
    number of controls, with lots of room for expansion.

    $ grep V4L2_CID include/linux/videodev2.h | wc -l
    211
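
    The cap becomes a simple bounds check ahead of the allocation; roughly:

    #define V4L2_CID_MAX_CTRLS 1024

    /* prevent ctrls->count * sizeof(...) from overflowing */
    if (ctrls->count > V4L2_CID_MAX_CTRLS)
            return -EINVAL;
    array_size = sizeof(struct v4l2_ext_control) * ctrls->count;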

    Signed-off-by: Dan Carpenter
    Signed-off-by: Mauro Carvalho Chehab
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • commit ab936cbcd02072a34b60d268f94440fd5cf1970b upstream.

    Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
    function replace_page_cache_page(). This function replaces a page in the
    radix-tree with a new page. When doing this, memory cgroup needs to fix
    up the accounting information: memcg needs to check the PCG_USED bit etc.

    In some (many?) cases, 'newpage' is on the LRU before calling
    replace_page_cache(). So, memcg's LRU accounting information should be
    fixed, too.

    This patch adds mem_cgroup_replace_page_cache() and removes the old
    hooks. In that function, old pages will be unaccounted without touching
    res_counter and the new page will be accounted to the memcg (of the old
    page). When overwriting pc->mem_cgroup of newpage, take zone->lru_lock
    and avoid races with LRU handling.

    Background:
    replace_page_cache_page() is called by FUSE code in its splice()
    handling. Here, 'newpage' is replacing oldpage, but this newpage is not
    a newly allocated page and may be on the LRU. LRU mis-accounting will
    be critical for memory cgroup because rmdir() checks that the whole LRU
    is empty and there is no account leak. If a page is on the other LRU
    than it should be, rmdir() will fail.

    This bug was added in March 2011, but there has been no bug report yet.
    I guess there are not many people who use memcg and FUSE at the same
    time with upstream kernels.

    The result of this bug is that admin cannot destroy a memcg because of
    an account leak. So, no panic, no deadlock. And, even if an active
    cgroup exists, umount can succeed. So no problem at shutdown.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    KAMEZAWA Hiroyuki
     
  • commit 1f536b9e9f85456df93614b3c2f6a1a2b7d7cb9b upstream.

    Building an ARM target we get the following warnings:

    CC arch/arm/kernel/setup.o
    In file included from arch/arm/kernel/setup.c:39:
    arch/arm/include/asm/elf.h:102:1: warning: "vmcore_elf64_check_arch" redefined
    In file included from arch/arm/kernel/setup.c:24:
    include/linux/crash_dump.h:30:1: warning: this is the location of the previous definition

    Quoting Russell King:

    "linux/crash_dump.h makes no attempt to include asm/elf.h, but it depends
    on stuff in asm/elf.h to determine how stuff inside this file is defined
    at parse time.

    So, if asm/elf.h is included after linux/crash_dump.h or not at all, you
    get a different result from the situation where asm/elf.h is included
    before."

    So add elf.h header to crash_dump.h to avoid this problem.

    The original discussion about this can be found at:
    http://www.spinics.net/lists/arm-kernel/msg154113.html

    Signed-off-by: Fabio Estevam
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Fabio Estevam
     
  • commit 9e7860cee18241633eddb36a4c34c7b61d8cecbc upstream.

    Haogang Chen found out that:

    There is a potential integer overflow in process_msg() that could result
    in cross-domain attack.

    body = kmalloc(msg->hdr.len + 1, GFP_NOIO | __GFP_HIGH);

    When a malicious guest passes 0xffffffff in msg->hdr.len, the subsequent
    call to xb_read() would write to a zero-length buffer.

    The other end of this connection is always the xenstore backend daemon
    so there is no guest (malicious or otherwise) which can do this. The
    xenstore daemon is a trusted component in the system.

    However this seems like a reasonable robustness improvement, so we
    should have it.

    And Ian, when reading the API docs, found that:
    The payload length (len field of the header) is limited to 4096
    (XENSTORE_PAYLOAD_MAX) in both directions. If a client exceeds the
    limit, its xenstored connection will be immediately killed by
    xenstored, which is usually catastrophic from the client's point of
    view. Clients (particularly domains, which cannot just reconnect)
    should avoid this.

    so this patch checks against that instead.

    This also avoids a potential integer overflow pointed out by Haogang Chen.
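
    In process_msg() the check lands before the allocation; roughly:

    if (msg->hdr.len > XENSTORE_PAYLOAD_MAX) {
            /* reply longer than the protocol allows: drop it */
            err = -EINVAL;
            goto out;
    }

    body = kmalloc(msg->hdr.len + 1, GFP_NOIO | __GFP_HIGH);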

    Signed-off-by: Ian Campbell
    Cc: Haogang Chen
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Greg Kroah-Hartman

    Ian Campbell