23 Aug, 2012

4 commits

  • Pull pwm fixes from Thierry Reding:
    "These patches fix the Samsung PWM driver and perform some minor
    cleanups like fixing checkpatch and sparse warnings.

    Two redundant error messages are removed and the Kconfig help text for
    the PWM subsystem is made more descriptive."

    * tag 'for-3.6-rc3' of git://gitorious.org/linux-pwm/linux-pwm:
    pwm: Improve Kconfig help text
    pwm: core: Fix coding style issues
    pwm: vt8500: Fix coding style issue
    pwm: Remove a redundant error message when devm_request_and_ioremap fails
    pwm: samsung: add missing device pointer to struct pwm_chip
    pwm: Add missing static storage class specifiers in core.c file

    Linus Torvalds
     
  • Pull ceph fixes from Sage Weil:
    "Jim's fix closes a narrow race introduced with the msgr changes. One
    fix resolves problems with debugfs initialization that Yan found when
    multiple client instances are created (e.g., two clusters mounted, or
    rbd + cephfs), another one fixes problems with mounting a nonexistent
    server subdirectory, and the last one fixes a divide by zero error
    from unsanitized ioctl input that Dan Carpenter found."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: avoid divide by zero in __validate_layout()
    libceph: avoid truncation due to racing banners
    ceph: tolerate (and warn on) extraneous dentry from mds
    libceph: delay debugfs initialization until we learn global_id

    Linus Torvalds
     
  • Pull NFS client bugfixes from Trond Myklebust:
    - NFSv3 mounts need to fail if the FSINFO rpc call fails
    - Ensure that the NFS commit cache gets torn down when we unload the
    NFS module.
    - Fix memory scribble issues when interrupting a LAYOUTGET rpc call
    - Fix NFSv4 legacy idmapper regressions
    - Fix issues with the NFSv4 getacl command
    - Fix a regression when using the legacy "mount -t nfs4"

    * tag 'nfs-for-3.6-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv3: Ensure that do_proc_get_root() reports errors correctly
    NFSv4: Ensure that nfs4_alloc_client cleans up on error.
    NFS: return -ENOKEY when the upcall fails to map the name
    NFS: Clear key construction data if the idmap upcall fails
    NFSv4: Don't use private xdr_stream fields in decode_getacl
    NFSv4: Fix the acl cache size calculation
    NFSv4: Fix pointer arithmetic in decode_getacl
    NFS: Alias the nfs module to nfs4
    NFS: Fix a regression when loading the NFS v4 module
    NFSv4.1: Remove a bogus BUG_ON() in nfs4_layoutreturn_done
    pnfs-obj: Better IO pattern in case of unaligned offset
    NFS41: add pg_layout_private to nfs_pageio_descriptor
    pnfs: nfs4_proc_layoutget returns void
    pnfs: defer release of pages in layoutget
    nfs: tear down caches in nfs_init_writepagecache when allocation fails

    Linus Torvalds
     
  • Pull assorted fixes - mostly vfs - from Al Viro:
    "Assorted fixes, with an unexpected detour into vfio refcounting logics
    (fell out when digging in an analog of eventpoll race in there)."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    task_work: add a scheduling point in task_work_run()
    fs: fix fs/namei.c kernel-doc warnings
    eventpoll: use-after-possible-free in epoll_create1()
    vfio: grab vfio_device reference *before* exposing the sucker via fd_install()
    vfio: get rid of vfio_device_put()/vfio_group_get_device* races
    vfio: get rid of open-coding kref_put_mutex
    introduce kref_put_mutex()
    vfio: don't dereference after kfree...
    mqueue: lift mnt_want_write() outside ->i_mutex, clean up a bit

    Linus Torvalds
     

22 Aug, 2012

36 commits

  • It seems commit 4a9d4b02 (switch fput to task_work_add) reintroduced
    the problem addressed in commit 944be0b2 (close_files(): add scheduling
    point)

    If a server process with a lot of files (say 2 million tcp sockets)
    is killed, we can spend a lot of time in task_work_run() and trigger
    a soft lockup.
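
    A userspace sketch of the idea, not the kernel code: drain a list of
    pending work items and offer the CPU to other tasks after each one.
    The names work_sketch, count_work and run_pending_sketch are
    hypothetical, and sched_yield() only stands in for the kernel's
    scheduling point.

```c
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued work item. */
struct work_sketch {
    struct work_sketch *next;
    void (*func)(struct work_sketch *);
};

static unsigned long works_run;

/* Example callback: only counts how many items were processed. */
static void count_work(struct work_sketch *w)
{
    (void)w;
    works_run++;
}

/* Run every pending item, yielding between items so that a very long
 * list (millions of entries) cannot monopolise the CPU. */
static unsigned long run_pending_sketch(struct work_sketch *head)
{
    unsigned long n = 0;

    while (head != NULL) {
        /* Read next first: the callback may free the item. */
        struct work_sketch *next = head->next;

        head->func(head);
        sched_yield();  /* stand-in for the added scheduling point */
        head = next;
        n++;
    }
    return n;
}
```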

    Signed-off-by: Eric Dumazet
    Signed-off-by: Al Viro

    Eric Dumazet
     
  • Fix kernel-doc warnings in fs/namei.c:

    Warning(fs/namei.c:360): No description found for parameter 'inode'
    Warning(fs/namei.c:672): No description found for parameter 'nd'

    Signed-off-by: Randy Dunlap
    Cc: Alexander Viro
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Al Viro

    Randy Dunlap
     
  • As soon as we'd installed the file into descriptor table, it can
    get closed by another thread. Freeing ep in process...

    Signed-off-by: Al Viro

    Al Viro
     
  • It's not critical (anymore) since another thread closing the file will block
    on ->device_lock before it gets to dropping the final reference, but it's
    definitely cleaner that way...

    Acked-by: Alex Williamson
    Signed-off-by: Al Viro

    Al Viro
     
  • we really need to make sure that dropping the last reference happens
    under the group->device_lock; otherwise a loop (under device_lock)
    might find vfio_device instance that is being freed right now, has
    already dropped the last reference and waits on device_lock to exclude
    the sucker from the list.

    Acked-by: Alex Williamson
    Signed-off-by: Al Viro

    Al Viro
     
  • Acked-by: Alex Williamson
    Signed-off-by: Al Viro

    Al Viro
     
  • equivalent of
            mutex_lock(mutex);
            if (!kref_put(kref, release))
                    mutex_unlock(mutex);
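
    A userspace sketch of that pattern, assuming pthreads and C11 atomics
    in place of the kernel's mutex and kref. All names ending in _sketch
    and the demo_* helpers are hypothetical. The assumption here is that
    the release callback runs with the mutex held and is responsible for
    unlocking it.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct kref. */
struct ref_sketch {
    atomic_int count;
};

/* Drop one reference; call release() and return 1 if it was the last. */
static int ref_put_sketch(struct ref_sketch *r,
                          void (*release)(struct ref_sketch *))
{
    if (atomic_fetch_sub(&r->count, 1) == 1) {
        release(r);
        return 1;
    }
    return 0;
}

/* Take the mutex, then drop the reference. If this was not the final
 * put, unlock here. On the final put, release() is entered with the
 * mutex still held, so teardown cannot race with anyone walking a
 * list protected by that mutex. */
static int ref_put_mutex_sketch(struct ref_sketch *r,
                                void (*release)(struct ref_sketch *),
                                pthread_mutex_t *lock)
{
    pthread_mutex_lock(lock);
    if (!ref_put_sketch(r, release)) {
        pthread_mutex_unlock(lock);
        return 0;
    }
    return 1;
}

/* Demo state: the release callback records the call and unlocks. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_released;

static void demo_release(struct ref_sketch *r)
{
    (void)r;
    demo_released = 1;
    pthread_mutex_unlock(&demo_lock);
}
```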

    Signed-off-by: Al Viro

    Al Viro
     
  • Acked-by: Alex Williamson
    Signed-off-by: Al Viro

    Al Viro
     
  • Merge fixes from Andrew Morton.

    Random drivers and some VM fixes.

    * emailed patches from Andrew Morton: (17 commits)
    mm: compaction: Abort async compaction if locks are contended or taking too long
    mm: have order > 0 compaction start near a pageblock with free pages
    rapidio/tsi721: fix unused variable compiler warning
    rapidio/tsi721: fix inbound doorbell interrupt handling
    drivers/rtc/rtc-rs5c348.c: fix hour decoding in 12-hour mode
    mm: correct page->pfmemalloc to fix deactivate_slab regression
    drivers/rtc/rtc-pcf2123.c: initialize dynamic sysfs attributes
    mm/compaction.c: fix deferring compaction mistake
    drivers/misc/sgi-xp/xpc_uv.c: SGI XPC fails to load when cpu 0 is out of IRQ resources
    string: do not export memweight() to userspace
    hugetlb: update hugetlbpage.txt
    checkpatch: add control statement test to SINGLE_STATEMENT_DO_WHILE_MACRO
    mm: hugetlbfs: correctly populate shared pmd
    cciss: fix incorrect scsi status reporting
    Documentation: update mount option in filesystem/vfat.txt
    mm: change nr_ptes BUG_ON to WARN_ON
    cs5535-clockevt: typo, it's MFGPT, not MFPGT

    Linus Torvalds
     
  • Pull media fixes from Mauro Carvalho Chehab:
    "For bug fixes, at soc_camera, si470x, uvcvideo, iguanaworks IR driver,
    radio_shark Kbuild fixes, and at the V4L2 core (radio fixes)."

    * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
    [media] media: soc_camera: don't clear pix->sizeimage in JPEG mode
    [media] media: mx2_camera: Fix clock handling for i.MX27
    [media] video: mx2_camera: Use clk_prepare_enable/clk_disable_unprepare
    [media] video: mx1_camera: Use clk_prepare_enable/clk_disable_unprepare
    [media] media: mx3_camera: buf_init() add buffer state check
    [media] radio-shark2: Only compile led support when CONFIG_LED_CLASS is set
    [media] radio-shark: Only compile led support when CONFIG_LED_CLASS is set
    [media] radio-shark*: Call cancel_work_sync from disconnect rather then release
    [media] radio-shark*: Remove work-around for dangling pointer in usb intfdata
    [media] Add USB dependency for IguanaWorks USB IR Transceiver
    [media] Add missing logging for rangelow/high of hwseek
    [media] VIDIOC_ENUM_FREQ_BANDS fix
    [media] mem2mem_testdev: fix querycap regression
    [media] si470x: v4l2-compliance fixes
    [media] DocBook: Remove a spurious character
    [media] uvcvideo: Reset the bytesused field when recycling an erroneous buffer

    Linus Torvalds
     
  • Pull networking update from David Miller:
    "A couple weeks of bug fixing in there. The largest chunk is all the
    broken crap Amerigo Wang found in the netpoll layer."

    1) netpoll and its users have several serious bugs:
    a) uses GFP_KERNEL with locks held
    b) interfaces requiring interrupts disabled are called with them
    enabled
    c) and vice versa
    d) VLAN tag demuxing, as per all other RX packet input paths, is not
    applied

    All from Amerigo Wang.

    2) Hopefully cure the ipv4 mapped ipv6 address TCP early demux bugs for
    good, from Neal Cardwell.

    3) Unlike AF_UNIX, AF_PACKET sockets don't set default credentials
    when the user doesn't specify them explicitly during sendmsg().
    Instead we attach an empty (zero) SCM credential block which is
    definitely not what we want. Fix from Eric Dumazet.

    4) IPv6 illegally invokes netdevice notifiers with RCU lock held, fix
    from Ben Hutchings.

    5) inet_csk_route_child_sock() checks wrong inet options pointer, fix
    from Christoph Paasch.

    6) When AF_PACKET is used for transmit, packet loopback doesn't behave
    properly when a socket fanout is enabled, from Eric Leblond.

    7) On bluetooth l2cap channel create failure, we leak the socket, from
    Jaganath Kanakkassery.

    8) Fix all the netprio file handling bugs found by Al Viro, from John
    Fastabend.

    9) Several error return and NULL deref bug fixes in networking drivers
    from Julia Lawall.

    10) A large smattering of struct padding et al. kernel memory leaks to
    userspace found by Mathias Krause.

    11) Conntrack expectations in netfilter can access an uninitialized timer,
    fix from Pablo Neira Ayuso.

    12) Several netfilter SIP tracker bug fixes from Patrick McHardy.

    13) IPSEC ipv6 routes are not initialized correctly all the time,
    resulting in an OOPS in inet_putpeer(). Also from Patrick McHardy.

    14) Bridging does rcu_dereference() outside of RCU protected area, from
    Stephen Hemminger.

    15) Fix routing cache removal performance regression when looking up
    output routes that have a local destination. From Zheng Yan.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
    af_netlink: force credentials passing [CVE-2012-3520]
    ipv4: fix ip header ident selection in __ip_make_skb()
    ipv4: Use newinet->inet_opt in inet_csk_route_child_sock()
    tcp: fix possible socket refcount problem
    net: tcp: move sk_rx_dst_set call after tcp_create_openreq_child()
    net/core/dev.c: fix kernel-doc warning
    netconsole: remove a redundant netconsole_target_put()
    net: ipv6: fix oops in inet_putpeer()
    net/stmmac: fix issue of clk_get for Loongson1B.
    caif: Do not dereference NULL in chnl_recv_cb()
    af_packet: don't emit packet on orig fanout group
    drivers/net/irda: fix error return code
    drivers/net/wan/dscc4.c: fix error return code
    drivers/net/wimax/i2400m/fw.c: fix error return code
    smsc75xx: add missing entry to MAINTAINERS
    net: qmi_wwan: new devices: UML290 and K5006-Z
    net: sh_eth: Add eth support for R8A7779 device
    netdev/phy: skip disabled mdio-mux nodes
    dt: introduce for_each_available_child_of_node, of_get_next_available_child
    net: netprio: fix cgrp create and write priomap race
    ...

    Linus Torvalds
     
  • Jim Schutt reported a problem that pointed at compaction contending
    heavily on locks. The workload is straightforward and, in his own words:

    The systems in question have 24 SAS drives spread across 3 HBAs,
    running 24 Ceph OSD instances, one per drive. FWIW these servers
    are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
    Ceph Linux clients doing dd simultaneously to a Ceph file system
    backed by 12 of these servers.

    Early in the test everything looks fine

    -procs- ---------memory---------- -swap -----io----- ---system---- -----cpu------
      r   b swpd   free buff    cache si so   bi      bo     in     cs us sy id wa st
     31  15    0 287216  576 38606628  0  0    2    1158      2     14  1  3 95  0  0
     27  15    0 225288  576 38583384  0  0   18 2222016 203357 134876 11 56 17 15  0
     28  17    0 219256  576 38544736  0  0   11 2305932 203141 146296 11 49 23 17  0
      6  18    0 215596  576 38552872  0  0    7 2363207 215264 166502 12 45 22 20  0
     22  18    0 226984  576 38596404  0  0    3 2445741 223114 179527 12 43 23 22  0

    and then it goes to pot

    -procs- ---------memory---------- -swap -----io----- ---system---- -----cpu------
      r   b swpd   free buff    cache si so   bi      bo     in     cs us sy id wa st
    163   8    0 464308  576 36791368  0  0   11   22210    866    536  3 13 79  4  0
    207  14    0 917752  576 36181928  0  0  712 1345376 134598  47367  7 90  1  2  0
    123  12    0 685516  576 36296148  0  0  429 1386615 158494  60077  8 84  5  3  0
    123  12    0 598572  576 36333728  0  0 1107 1233281 147542  62351  7 84  5  4  0
    622   7    0 660768  576 36118264  0  0  557 1345548 151394  59353  7 85  4  3  0
    223  11    0 283960  576 36463868  0  0   46 1107160 121846  33006  6 93  1  1  0

    Note that system CPU usage is very high and the rate of blocks being
    written out has dropped by 42%. He analysed this with perf and found

    perf record -g -a sleep 10
    perf report --sort symbol --call-graph fractal,5
    34.63% [k] _raw_spin_lock_irqsave
    |
    |--97.30%-- isolate_freepages
    | compaction_alloc
    | unmap_and_move
    | migrate_pages
    | compact_zone
    | compact_zone_order
    | try_to_compact_pages
    | __alloc_pages_direct_compact
    | __alloc_pages_slowpath
    | __alloc_pages_nodemask
    | alloc_pages_vma
    | do_huge_pmd_anonymous_page
    | handle_mm_fault
    | do_page_fault
    | page_fault
    | |
    | |--87.39%-- skb_copy_datagram_iovec
    | | tcp_recvmsg
    | | inet_recvmsg
    | | sock_recvmsg
    | | sys_recvfrom
    | | system_call
    | | __recv
    | | |
    | | --100.00%-- (nil)
    | |
    | --12.61%-- memcpy
    --2.70%-- [...]

    There was other data but primarily it is all showing that compaction is
    contended heavily on the zone->lock and zone->lru_lock.

    commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
    while isolating pages for migration] noted that it was possible for
    migration to hold the lru_lock for an excessive amount of time. Very
    broadly speaking this patch expands the concept.

    This patch introduces compact_checklock_irqsave() to check if a lock
    is contended or the process needs to be scheduled. If either condition
    is true then async compaction is aborted and the caller is informed.
    The page allocator will fail a THP allocation if compaction failed due
    to contention. This patch also introduces compact_trylock_irqsave()
    which will acquire the lock only if it is not contended and the process
    does not need to schedule.
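
    A rough userspace sketch of the trylock behaviour described above,
    assuming a C11 atomic_flag as the lock. The struct and function names
    are hypothetical and the need_resched field is only a stand-in for
    the kernel's need_resched() test; this is not the patch itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, reduced analog of the compaction control state. */
struct compact_ctl_sketch {
    bool sync;          /* synchronous compaction may wait for the lock */
    bool need_resched;  /* stand-in for the kernel's need_resched() */
    bool contended;     /* set when async compaction gives up */
};

/* Acquire the lock unless we should back off.
 * Async callers abort as soon as the lock is busy or a reschedule is
 * due, and report that through cc->contended. Sync callers spin until
 * the lock is free. Returns true with the lock held, false otherwise. */
static bool compact_trylock_sketch(atomic_flag *lock,
                                   struct compact_ctl_sketch *cc)
{
    if (cc->need_resched || atomic_flag_test_and_set(lock)) {
        if (!cc->sync) {
            cc->contended = true;   /* tell the caller why we aborted */
            return false;
        }
        while (atomic_flag_test_and_set(lock))
            ;                       /* sync: wait for the holder */
    }
    return true;
}

static void compact_unlock_sketch(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}
```

    A caller in this sketch would treat a false return with cc->contended
    set as "fail this allocation attempt" rather than retrying.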

    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 7db8889ab05b ("mm: have order > 0 compaction start off where it
    left") introduced a caching mechanism to reduce the amount of work the
    free page scanner does in compaction. However, it has a problem. Consider
    two processes simultaneously scanning free pages

    C
    Process A M S F
    |---------------------------------------|
    Process B M FS

    C is zone->compact_cached_free_pfn
    S is cc->start_free_pfn
    M is cc->migrate_pfn
    F is cc->free_pfn

    In this diagram, Process A has just reached its migrate scanner, wrapped
    around and updated compact_cached_free_pfn accordingly.

    Simultaneously, Process B finishes isolating in a block and updates
    compact_cached_free_pfn again to the location of its free scanner.

    Process A moves to "end_of_zone - one_pageblock" and runs this check

    if (cc->order > 0 && (!cc->wrapped ||
                          zone->compact_cached_free_pfn >
                          cc->start_free_pfn))
            pfn = min(pfn, zone->compact_cached_free_pfn);

    compact_cached_free_pfn is above where it started so the free scanner
    skips almost the entire space it should have scanned. When there are
    multiple processes compacting it can end in a situation where the entire
    zone is not being scanned at all. Further, it is possible for two
    processes to ping-pong update to compact_cached_free_pfn which is just
    random.

    Overall, the end result wrecks allocation success rates.

    There is not an obvious way around this problem without introducing new
    locking and state so this patch takes a different approach.

    First, it gets rid of the skip logic because it's not clear that it
    matters if two free scanners happen to be in the same block but with
    racing updates it's too easy for it to skip over blocks it should not.

    Second, it updates compact_cached_free_pfn in a more limited set of
    circumstances.

    If a scanner has wrapped, it updates compact_cached_free_pfn to the end
    of the zone. When a wrapped scanner isolates a page, it updates
    compact_cached_free_pfn to point to the highest pageblock it
    can isolate pages from.

    If a scanner has not wrapped when it has finished isolating pages it
    checks if compact_cached_free_pfn is pointing to the end of the
    zone. If so, the value is updated to point to the highest
    pageblock that pages were isolated from. This value will not
    be updated again until a free page scanner wraps and resets
    compact_cached_free_pfn.

    This is not optimal and it can still race but the compact_cached_free_pfn
    will be pointing to or very near a pageblock with free pages.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Fix unused variable compiler warning when built with CONFIG_RAPIDIO_DEBUG
    option off.

    This patch is applicable to kernel versions starting from v3.2

    Signed-off-by: Alexandre Bounine
    Cc: Matt Porter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Make sure that there are no doorbell messages left behind due to disabled
    interrupts during inbound doorbell processing.

    The most common case for this bug is loss of rionet JOIN messages in
    systems with three or more rionet participants and MSI or MSI-X enabled.
    As a result, requests for packet transfers may finish with a "destination
    unreachable" error message.

    This patch is applicable to kernel versions starting from v3.2.

    Signed-off-by: Alexandre Bounine
    Cc: Matt Porter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Correct the offset by subtracting 20 from tm_hour before taking the
    modulo 12.

    [ "Why 20?" I hear you ask. Or at least I did.

    Here's the reason why: RS5C348_BIT_PM is 32, and is - stupidly -
    included in the RS5C348_HOURS_MASK define. So it's really subtracting
    out that bit to get "hour+12". But then because it does things modulo
    12, it needs to add the 12 in again afterwards anyway.

    This code is confused. It would be much clearer if RS5C348_HOURS_MASK
    just didn't include the RS5C348_BIT_PM bit at all, then it wouldn't
    need to do the silly subtract either.

    Whatever. It's all just math, the end result is the same. - Linus ]
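
    The arithmetic in the note can be checked with a small standalone
    sketch. RS5C348_BIT_PM being 32 comes from the note above; the mask
    value 0x3f and the helper names are assumptions made for illustration,
    not taken from the driver.

```c
/* Assumed values: the PM bit is 32 (0x20) per the note; the mask is
 * assumed to be 0x3f, i.e. it includes the PM bit. */
#define HOURS_MASK_SKETCH 0x3f
#define BIT_PM_SKETCH     0x20

/* Plain BCD decode: high nibble is tens, low nibble is units. */
static int bcd2bin_sketch(unsigned char v)
{
    return (v & 0x0f) + (v >> 4) * 10;
}

/* Decode a 12-hour-mode hours register into 0..23.
 * Because the mask keeps the PM bit, BCD decoding reads that bit as
 * the tens digit 2, so a PM value comes out as hour + 20. Subtracting
 * 20 first, then taking modulo 12 and adding 12 gives 12..23. */
static int decode_hour_12h_sketch(unsigned char reg)
{
    int hour = bcd2bin_sketch(reg & HOURS_MASK_SKETCH);

    if (reg & BIT_PM_SKETCH) {
        hour -= 20;
        hour %= 12;
        hour += 12;
    } else {
        hour %= 12;
    }
    return hour;
}
```

    For example, 1 PM is stored as 0x21, which decodes to 21, and
    21 - 20 = 1, 1 % 12 = 1, 1 + 12 = 13.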

    Reported-by: James Nute
    Tested-by: James Nute
    Signed-off-by: Atsushi Nemoto
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
     
  • Commit cfd19c5a9ecf ("mm: only set page->pfmemalloc when
    ALLOC_NO_WATERMARKS was used") tried to narrow down page->pfmemalloc
    setting, but it missed some places the pfmemalloc should be set.

    So, in __slab_alloc, the mismatch between pfmemalloc and
    ALLOC_NO_WATERMARKS causes an incorrect deactivate_slab() on our core2
    server:

    64.73% fio [kernel.kallsyms] [k] _raw_spin_lock
    |
    --- _raw_spin_lock
    |
    |---0.34%-- deactivate_slab
    | __slab_alloc
    | kmem_cache_alloc
    | |

    That causes our fio sync write performance to have a 40% regression.

    Move the check into get_page_from_freelist(), which resolves this issue.

    Signed-off-by: Alex Shi
    Acked-by: Mel Gorman
    Cc: David Miller
    Tested-by: Eric Dumazet
    Tested-by: Sage Weil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Shi
     
  • Dynamically allocated sysfs attributes must be initialized using
    sysfs_attr_init(), otherwise lockdep complains: BUG: key not in .data!

    Found by Linux Driver Verification project (linuxtesting.org).

    Signed-off-by: Ilya Shchepetkov
    Cc: Chris Verges
    Cc: Christian Pellegrin
    Cc: Alessandro Zummo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ilya Shchepetkov
     
  • Commit aff622495c9a ("vmscan: only defer compaction for failed order and
    higher") fixed a bad deferring policy but made a mistake when checking
    compact_order_failed in __compact_pgdat(). As a result it can't update
    compact_order_failed with the new order, which prevents the deferral
    policy from operating correctly. This patch fixes it.

    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • On many of our larger systems, CPU 0 has had all of its IRQ resources
    consumed before XPC loads. Worst cases on machines with multiple 10
    GigE cards and multiple IB cards have depleted the entire first socket
    of IRQs.

    This patch makes selecting the node upon which IRQs are allocated (as
    well as all the other GRU Message Queue structures) specifiable as a
    module load param and has a default behavior of searching all nodes/cpus
    for available resources.

    [akpm@linux-foundation.org: fix build: include cpu.h and module.h]
    Signed-off-by: Robin Holt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     
  • Fix the following warning:

    usr/include/linux/string.h:8: userspace cannot reference function or variable defined in the kernel

    Signed-off-by: WANG Cong
    Acked-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • Commit f0f57b2b1488 ("mm: move hugepage test examples to
    tools/testing/selftests/vm") moved map_hugetlb.c, hugepage-shm.c and
    hugepage-mmap.c tests into tools/testing/selftests/vm/ directory, but it
    didn't update hugetlbpage.txt

    Signed-off-by: Zhouping Liu
    Acked-by: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhouping Liu
     
  • Commit b13edf7ff2dd ("checkpatch: add checks for do {} while (0) macro
    misuses") added a test that is overly simplistic for single statement
    macros.

    Macros that start with control tests should be enclosed in a do {} while
    (0) loop.

    Add the necessary control tests to the check.
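
    A minimal illustration of why such macros need the wrapper, using two
    hypothetical macros. The unwrapped form expands to a bare "if", so an
    "else" written after the macro call silently binds to the hidden "if"
    instead of the outer one.

```c
/* Hypothetical macros for illustration only. */
#define BUMP_IF_UNSAFE(cond, n) if (cond) (n)++
#define BUMP_IF_SAFE(cond, n)   do { if (cond) (n)++; } while (0)

/* Returns how often the else branch ran. The author intends the else
 * to belong to "if (outer)", but it attaches to the macro's "if". */
static int else_taken_unsafe(int outer, int inner)
{
    int hits = 0, other = 0;

    if (outer)
        BUMP_IF_UNSAFE(inner, hits);
    else
        other++;
    (void)hits;
    return other;
}

/* Same code with the do {} while (0) wrapper: the macro is a single
 * statement, so the else belongs to "if (outer)" as intended. */
static int else_taken_safe(int outer, int inner)
{
    int hits = 0, other = 0;

    if (outer)
        BUMP_IF_SAFE(inner, hits);
    else
        other++;
    (void)hits;
    return other;
}
```

    With outer = 1 and inner = 0 the unsafe version runs the else branch
    even though the outer condition was true; the safe version does not.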

    Signed-off-by: Joe Perches
    Acked-by: Andy Whitcroft
    Tested-by: Franz Schrober
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Each page mapped in a process's address space must be correctly
    accounted for in _mapcount. Normally the rules for this are
    straightforward but hugetlbfs page table sharing is different. The page
    table pages at the PMD level are reference counted while the mapcount
    remains the same.

    If this accounting is wrong, it causes bugs like this one reported by
    Larry Woodman:

    kernel BUG at mm/filemap.c:135!
    invalid opcode: 0000 [#1] SMP
    CPU 22
    Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
    Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
    RIP: 0010:[] [] __delete_from_page_cache+0x15d/0x170
    Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
    Call Trace:
    delete_from_page_cache+0x40/0x80
    truncate_hugepages+0x115/0x1f0
    hugetlbfs_evict_inode+0x18/0x30
    evict+0x9f/0x1b0
    iput_final+0xe3/0x1e0
    iput+0x3e/0x50
    d_kill+0xf8/0x110
    dput+0xe2/0x1b0
    __fput+0x162/0x240

    During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
    shared page tables with the check dst_pte == src_pte. The logic is if
    the PMD page is the same, they must be shared. This assumes that the
    sharing is between the parent and child. However, if the sharing is
    with a different process entirely then this check fails as in this
    diagram:

    parent
      |
      ------------>pmd
                   src_pte----------> data page
                                          ^
    other--------->pmd--------------------|
                    ^
    child-----------|
                    dst_pte

    For this situation to occur, it must be possible for Parent and Other to
    have faulted and failed to share page tables with each other. This is
    possible due to the following style of race.

    PROC A                                  PROC B
    copy_hugetlb_page_range                 copy_hugetlb_page_range
      src_pte == huge_pte_offset              src_pte == huge_pte_offset
      !src_pte so no sharing                  !src_pte so no sharing

    (time passes)

    hugetlb_fault                           hugetlb_fault
      huge_pte_alloc                          huge_pte_alloc
        huge_pmd_share                          huge_pmd_share
          LOCK(i_mmap_mutex)
          find nothing, no sharing
          UNLOCK(i_mmap_mutex)
                                                  LOCK(i_mmap_mutex)
                                                  find nothing, no sharing
                                                  UNLOCK(i_mmap_mutex)
        pmd_alloc                               pmd_alloc
      LOCK(instantiation_mutex)
      fault
      UNLOCK(instantiation_mutex)
                                              LOCK(instantiation_mutex)
                                              fault
                                              UNLOCK(instantiation_mutex)

    These two processes are now pointing to the same data page but are not
    sharing page tables because the opportunity was missed. When either
    process later forks, the src_pte == dst_pte check is potentially
    insufficient.
    As the check falls through, the wrong PTE information is copied in
    (harmless but wrong) and the mapcount is bumped for a page mapped by a
    shared page table leading to the BUG_ON.

    This patch addresses the issue by moving pmd_alloc into huge_pmd_share
    which guarantees that the shared pud is populated in the same critical
    section as pmd. This also means that huge_pte_offset test in
    huge_pmd_share is serialized correctly now which in turn means that the
    success of the sharing will be higher as the racing tasks see the pud
    and pmd populated together.

    Race identified and changelog written mostly by Mel Gorman.

    [akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
    Reported-by: Larry Woodman
    Tested-by: Larry Woodman
    Reviewed-by: Mel Gorman
    Signed-off-by: Michal Hocko
    Reviewed-by: Rik van Riel
    Cc: David Gibson
    Cc: Ken Chen
    Cc: Cong Wang
    Cc: Hillf Danton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Delete code which sets SCSI status incorrectly as it's already been set
    correctly above this incorrect code. The bug was introduced in 2009 by
    commit b0e15f6db111 ("cciss: fix typo that causes scsi status to be
    lost.")

    Signed-off-by: Stephen M. Cameron
    Reported-by: Roel van Meer
    Tested-by: Roel van Meer
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen M. Cameron
     
  • Update two mount options (discard, nfs) in vfat.txt.

    Signed-off-by: Namjae Jeon
    Acked-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namjae Jeon
     
  • Occasionally an isolated BUG_ON(mm->nr_ptes) gets reported, indicating
    that not all the page tables allocated could be found and freed when
    exit_mmap() tore down the user address space.

    There's usually nothing we can say about it, beyond that it's probably a
    sign of some bad memory or memory corruption; though it might still
    indicate a bug in vma or page table management (and did recently reveal a
    race in THP, fixed a few months ago).

    But one overdue change we can make is from BUG_ON to WARN_ON.

    It's fairly likely that the system will crash shortly afterwards in some
    other way (for example, the BUG_ON(page_mapped(page)) in
    __delete_from_page_cache(), once an inode mapped into the lost page tables
    gets evicted); but might tell us more before that.

    Change the BUG_ON(page_mapped) to WARN_ON too? Later perhaps: I'm less
    eager, since that one has several times led to fixes.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Signed-off-by: Jens Rottmann
    Cc: Thomas Gleixner
    Cc: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Rottmann
     
  • If "l->stripe_unit" is zero, the mod on the next line will cause a
    divide by zero bug. This comes from the copy_from_user() in
    ceph_ioctl_set_layout_policy(). Passing 0 is valid, though (it means
    "do not change") so avoid the % check in that case.

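    A minimal sketch of the guarded check, with a hypothetical struct and
    function name. Which field is divided by stripe_unit is an assumption
    made here for illustration (object_size is used); only the "skip the
    modulo when stripe_unit is 0" rule comes from the description above.

```c
#include <errno.h>

/* Hypothetical, reduced layout description. */
struct layout_sketch {
    unsigned int stripe_unit;   /* 0 means "do not change" */
    unsigned int object_size;
};

/* Reject layouts whose object size is not a multiple of the stripe
 * unit, but never evaluate the modulo with a zero divisor. */
static int validate_layout_sketch(const struct layout_sketch *l)
{
    if (l->stripe_unit != 0 && l->object_size % l->stripe_unit != 0)
        return -EINVAL;
    return 0;
}
```
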
    Reported-by: Dan Carpenter
    Signed-off-by: Sage Weil
    Reviewed-by: Alex Elder

    Sage Weil
     
  • Because the Ceph client messenger uses a non-blocking connect, it is
    possible for the sending of the client banner to race with the
    arrival of the banner sent by the peer.

    When ceph_sock_state_change() notices the connect has completed, it
    schedules work to process the socket via con_work(). During this
    time the peer is writing its banner, and arrival of the peer banner
    races with con_work().

    If con_work() calls try_read() before the peer banner arrives, there
    is nothing for it to do, after which con_work() calls try_write() to
    send the client's banner. In this case Ceph's protocol negotiation
    can complete successfully.

    The server-side messenger immediately sends its banner and addresses
    after accepting a connect request, *before* actually attempting to
    read or verify the banner from the client. As a result, it is
    possible for the banner from the server to arrive before con_work()
    calls try_read(). If that happens, try_read() will read the banner
    and prepare protocol negotiation info via prepare_write_connect().
    prepare_write_connect() calls con_out_kvec_reset(), which discards
    the as-yet-unsent client banner. Next, con_work() calls
    try_write(), which sends the protocol negotiation info rather than
    the banner that the peer is expecting.

    The result is that the peer sees an invalid banner, and the client
    reports "negotiation failed".

    Fix this by moving con_out_kvec_reset() out of
    prepare_write_connect() to its callers at all locations except the
    one where the banner might still need to be sent.

    [elder@inktak.com: added note about server-side behavior]

    Signed-off-by: Jim Schutt
    Reviewed-by: Alex Elder

    Jim Schutt
     
  • If the MDS gives us a dentry and we weren't prepared to handle it,
    WARN_ON_ONCE instead of crashing.

    Reported-by: Yan, Zheng
    Signed-off-by: Sage Weil
    Reviewed-by: Alex Elder

    Sage Weil
     
  • Pablo Neira Ayuso discovered that avahi and
    potentially NetworkManager accept spoofed Netlink messages because of a
    kernel bug. The kernel passes all-zero SCM_CREDENTIALS ancillary data
    to the receiver if the sender did not provide such data, instead of not
    including any such data at all or including the correct data from the
    peer (as it is the case with AF_UNIX).

    This bug was introduced in commit 16e572626961
    (af_unix: dont send SCM_CREDENTIALS by default)

    This patch forces passing credentials for netlink, as
    before the regression.

    Another fix would be to not add SCM_CREDENTIALS in
    netlink messages if not provided by the sender, but it
    might break some programs.

    With help from Florian Weimer & Petr Matousek

    This issue is designated as CVE-2012-3520

    Signed-off-by: Eric Dumazet
    Cc: Petr Matousek
    Cc: Florian Weimer
    Cc: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric Dumazet
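
To see why all-zero credentials are dangerous, consider a receiver that trusts a message only when its credentials claim uid 0. The sketch below is a simplified model and not kernel or avahi code: `toy_ucred`, the receiver policy, and the `fixed` switch are all assumptions made for illustration. It shows that zero-filled credentials look like root to such a receiver, while passing the sender's real credentials does not.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the credentials delivered as ancillary data. */
struct toy_ucred {
    int pid;
    unsigned int uid;
    unsigned int gid;
};

/* Hypothetical receiver policy: accept only if the sender claims uid 0. */
static int toy_receiver_accepts(const struct toy_ucred *creds)
{
    return creds != NULL && creds->uid == 0;
}

/*
 * Credentials seen by the receiver when the sender attached none.
 * fixed == 0: zero-filled credentials are delivered (the bug).
 * fixed != 0: the sender's real credentials are delivered.
 */
static struct toy_ucred toy_delivered_creds(const struct toy_ucred *real,
                                            int fixed)
{
    struct toy_ucred zero = { 0, 0, 0 };

    assert(real != NULL);
    return fixed ? *real : zero;
}

/* Returns 1 if a message from an unprivileged sender is accepted. */
int toy_spoof_accepted(int fixed)
{
    struct toy_ucred attacker = { 4242, 1000, 1000 }; /* made-up values */
    struct toy_ucred seen = toy_delivered_creds(&attacker, fixed);

    return toy_receiver_accepts(&seen);
}
```

With `fixed` set to 0 the unprivileged sender passes the check; with `fixed` set to 1 the receiver sees uid 1000 and rejects the message.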
     
  • Christian Casteyde reported a kmemcheck 32-bit read from uninitialized
    memory in __ip_select_ident().

    It turns out that __ip_make_skb() called ip_select_ident() before
    properly initializing iph->daddr.

    This is a bug uncovered by commit 1d861aa4b3fb (inet: Minimize use of
    cached route inetpeer.)

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131

    Reported-by: Christian Casteyde
    Signed-off-by: Eric Dumazet
    Cc: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Since 0e734419923bd ("ipv4: Use inet_csk_route_child_sock() in DCCP and
    TCP."), inet_csk_route_child_sock() is called instead of
    inet_csk_route_req().

    However, after creating the child-sock in tcp/dccp_v4_syn_recv_sock(),
    ireq->opt is set to NULL, before calling inet_csk_route_child_sock().
    Thus, inside inet_csk_route_child_sock() opt is always NULL and the
    SRR-options are not respected anymore.
    Packets sent by the server won't have the correct destination-IP.

    This patch fixes it by accessing newinet->inet_opt instead of ireq->opt
    inside inet_csk_route_child_sock().

    Reported-by: Luca Boccassi
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
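
The ordering problem in the commit message above can be illustrated with a small model. This is a sketch under stated assumptions, not the kernel code: the `toy_*` types, the idea that creating the child hands the options pointer over to the child, and the example addresses are all made up for illustration. It shows that a routing step which reads the request's options after they were cleared falls back to the peer address, while reading the child's options still honors the source route.

```c
#include <assert.h>
#include <stddef.h>

/* Toy IP options: a non-zero first hop stands for a source route. */
struct toy_opts {
    unsigned int srr_first_hop;
};

struct toy_req {
    struct toy_opts *opt;       /* plays the role of ireq->opt */
};

struct toy_child {
    struct toy_opts *inet_opt;  /* plays the role of newinet->inet_opt */
};

/* Creating the child moves the options and clears the request's pointer. */
static void toy_create_child(struct toy_req *req, struct toy_child *child)
{
    assert(req != NULL && child != NULL);
    child->inet_opt = req->opt;
    req->opt = NULL;
}

/* Route toward the first source-route hop if present, else the peer. */
static unsigned int toy_pick_daddr(const struct toy_opts *opt,
                                   unsigned int peer)
{
    return (opt != NULL && opt->srr_first_hop != 0) ? opt->srr_first_hop
                                                    : peer;
}

/* fixed == 0 reads the request's options; fixed != 0 reads the child's. */
unsigned int toy_route_child(int fixed)
{
    struct toy_opts opts = { 0x0a000001u };     /* 10.0.0.1, made up */
    struct toy_req req = { &opts };
    struct toy_child child = { NULL };
    const unsigned int peer = 0xc0a80002u;      /* 192.168.0.2, made up */

    toy_create_child(&req, &child);
    return toy_pick_daddr(fixed ? child.inet_opt : req.opt, peer);
}
```

Here `toy_route_child(0)` ignores the source route because the request's pointer is already NULL, and `toy_route_child(1)` picks the first hop from the child's options.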
     
  • Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
    introduced a bug leading to the following trace:

    [ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
    [ 2866.131726]
    [ 2866.132188] =========================
    [ 2866.132281] [ BUG: held lock freed! ]
    [ 2866.132281] 3.6.0-rc1+ #622 Not tainted
    [ 2866.132281] -------------------------
    [ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
    [ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [] tcp_sendmsg+0x29/0xcc6
    [ 2866.132281] 4 locks held by kworker/0:1/652:
    [ 2866.132281] #0: (rpciod){.+.+.+}, at: [] process_one_work+0x1de/0x47f
    [ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [] process_one_work+0x1de/0x47f
    [ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [] tcp_sendmsg+0x29/0xcc6
    [ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [] run_timer_softirq+0x1ad/0x35f
    [ 2866.132281]
    [ 2866.132281] stack backtrace:
    [ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
    [ 2866.132281] Call Trace:
    [ 2866.132281] [] debug_check_no_locks_freed+0x112/0x159
    [ 2866.132281] [] ? __sk_free+0xfd/0x114
    [ 2866.132281] [] kmem_cache_free+0x6b/0x13a
    [ 2866.132281] [] __sk_free+0xfd/0x114
    [ 2866.132281] [] sk_free+0x1c/0x1e
    [ 2866.132281] [] tcp_write_timer+0x51/0x56
    [ 2866.132281] [] run_timer_softirq+0x218/0x35f
    [ 2866.132281] [] ? run_timer_softirq+0x1ad/0x35f
    [ 2866.132281] [] ? rb_commit+0x58/0x85
    [ 2866.132281] [] ? tcp_write_timer_handler+0x148/0x148
    [ 2866.132281] [] __do_softirq+0xcb/0x1f9
    [ 2866.132281] [] ? _raw_spin_unlock+0x29/0x2e
    [ 2866.132281] [] call_softirq+0x1c/0x30
    [ 2866.132281] [] do_softirq+0x4a/0xa6
    [ 2866.132281] [] irq_exit+0x51/0xad
    [ 2866.132281] [] do_IRQ+0x9d/0xb4
    [ 2866.132281] [] common_interrupt+0x6f/0x6f
    [ 2866.132281] [] ? sched_clock_cpu+0x58/0xd1
    [ 2866.132281] [] ? _raw_spin_unlock_irqrestore+0x4c/0x56
    [ 2866.132281] [] mod_timer+0x178/0x1a9
    [ 2866.132281] [] sk_reset_timer+0x19/0x26
    [ 2866.132281] [] tcp_rearm_rto+0x99/0xa4
    [ 2866.132281] [] tcp_event_new_data_sent+0x6e/0x70
    [ 2866.132281] [] tcp_write_xmit+0x7de/0x8e4
    [ 2866.132281] [] ? __alloc_skb+0xa0/0x1a1
    [ 2866.132281] [] __tcp_push_pending_frames+0x2e/0x8a
    [ 2866.132281] [] tcp_sendmsg+0xb32/0xcc6
    [ 2866.132281] [] inet_sendmsg+0xaa/0xd5
    [ 2866.132281] [] ? inet_autobind+0x5f/0x5f
    [ 2866.132281] [] ? trace_clock_local+0x9/0xb
    [ 2866.132281] [] sock_sendmsg+0xa3/0xc4
    [ 2866.132281] [] ? rb_reserve_next_event+0x26f/0x2d5
    [ 2866.132281] [] ? native_sched_clock+0x29/0x6f
    [ 2866.132281] [] ? sched_clock+0x9/0xd
    [ 2866.132281] [] ? trace_clock_local+0x9/0xb
    [ 2866.132281] [] kernel_sendmsg+0x37/0x43
    [ 2866.132281] [] xs_send_kvec+0x77/0x80
    [ 2866.132281] [] xs_sendpages+0x6f/0x1a0
    [ 2866.132281] [] ? try_to_del_timer_sync+0x55/0x61
    [ 2866.132281] [] xs_tcp_send_request+0x55/0xf1
    [ 2866.132281] [] xprt_transmit+0x89/0x1db
    [ 2866.132281] [] ? call_connect+0x3c/0x3c
    [ 2866.132281] [] call_transmit+0x1c5/0x20e
    [ 2866.132281] [] __rpc_execute+0x6f/0x225
    [ 2866.132281] [] ? call_connect+0x3c/0x3c
    [ 2866.132281] [] rpc_async_schedule+0x28/0x34
    [ 2866.132281] [] process_one_work+0x24d/0x47f
    [ 2866.132281] [] ? process_one_work+0x1de/0x47f
    [ 2866.132281] [] ? __rpc_execute+0x225/0x225
    [ 2866.132281] [] worker_thread+0x236/0x317
    [ 2866.132281] [] ? process_scheduled_works+0x2f/0x2f
    [ 2866.132281] [] kthread+0x9a/0xa2
    [ 2866.132281] [] kernel_thread_helper+0x4/0x10
    [ 2866.132281] [] ? retint_restore_args+0x13/0x13
    [ 2866.132281] [] ? __init_kthread_worker+0x5a/0x5a
    [ 2866.132281] [] ? gs_change+0x13/0x13
    [ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
    [ 2866.309689] =============================================================================
    [ 2866.310254] BUG TCP (Not tainted): Object already free
    [ 2866.310254] -----------------------------------------------------------------------------
    [ 2866.310254]

    The bug occurs because the timer set in sk_reset_timer() can run
    before we actually do the sock_hold(). The socket refcount reaches zero
    and we free the socket too soon.

    The timer handler must not reduce the socket refcount if the socket is
    owned by the user; otherwise we would need to change the
    sk_reset_timer() implementation.

    We should take a reference on the socket when the TCP_WRITE_TIMER_DEFERRED
    or TCP_DELACK_TIMER_DEFERRED bit is set in tsq_flags.

    Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
    was used instead of TCP_DELACK_TIMER_DEFERRED.

    For consistency, use the same socket refcount change for
    TCP_MTU_REDUCED_DEFERRED, even if it is not fired from a timer.

    Reported-by: Fengguang Wu
    Tested-by: Fengguang Wu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
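
The refcount race described above can be replayed with a toy model. This is a sketch, not the TCP code: the `toy_*` names, the single deferred flag, and the single-threaded replay of the interleaving are assumptions made for illustration. The idea it demonstrates is that a handler which defers its work must take its own reference when it sets the deferred bit, so that dropping the timer's reference cannot bring the count to zero while the user still owns the socket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_sock {
    int refcnt;
    bool owned_by_user;
    bool write_timer_deferred;
    bool freed;
};

static void toy_sock_hold(struct toy_sock *sk)
{
    assert(sk != NULL);
    sk->refcnt++;
}

static void toy_sock_put(struct toy_sock *sk)
{
    assert(sk != NULL);
    if (--sk->refcnt == 0)
        sk->freed = true;   /* a real socket would be released here */
}

/* Timer handler: defers work if the user owns the socket. */
static void toy_write_timer(struct toy_sock *sk, bool fixed)
{
    if (sk->owned_by_user && !sk->write_timer_deferred) {
        sk->write_timer_deferred = true;
        if (fixed)
            toy_sock_hold(sk);  /* reference owned by the deferred work */
    }
    toy_sock_put(sk);           /* drops the reference held for the timer */
}

/* Runs when the user releases the socket: performs the deferred work. */
static void toy_release_cb(struct toy_sock *sk, bool fixed)
{
    if (sk->write_timer_deferred) {
        sk->write_timer_deferred = false;
        if (fixed)
            toy_sock_put(sk);   /* drops the deferred work's reference */
    }
}

/* Returns true if the socket is freed while the user still owns it. */
bool toy_race_frees_early(bool fixed)
{
    struct toy_sock sk = { .refcnt = 1, .owned_by_user = true };
    bool freed_early;

    toy_write_timer(&sk, fixed);    /* timer fires before the hold below */
    freed_early = sk.freed;
    toy_sock_hold(&sk);             /* the late hold taken for the timer */
    sk.owned_by_user = false;
    toy_release_cb(&sk, fixed);
    return freed_early;
}
```

In the unfixed replay the count goes from 1 to 0 inside the handler; in the fixed replay it goes 1, 2, 1, 2, 1 and never reaches zero.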
     
  • Pull audit-tree fixes from Miklos Szeredi:
    "The audit subsystem maintainers (Al and Eric) are not responding to
    repeated resends. Eric did ack them a while ago, but no response
    since then. So I'm sending these directly to you."

    * 'audit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    audit: clean up refcounting in audit-tree
    audit: fix refcounting in audit-tree
    audit: don't free_chunk() after fsnotify_add_mark()

    Linus Torvalds