15 Nov, 2014

3 commits

  • Pull ACPI and power management fixes from Rafael Wysocki:
    "These are three regression fixes, two recent (generic power domains,
    suspend-to-idle) and one older (cpufreq), an ACPI blacklist entry for
    one more machine having problems with Windows 8 compatibility, a minor
    cpufreq driver fix (cpufreq-dt) and a fixup for new callback
    definitions (generic power domains).

    Specifics:

    - Fix a crash in the suspend-to-idle code path introduced by a recent
    commit that forgot to check a pointer against NULL before
    dereferencing it (Dmitry Eremin-Solenikov).

    - Fix a boot crash on Exynos5 introduced by a recent commit making
    that platform use generic Device Tree bindings for power domains
    which exposed a weakness in the generic power domains framework
    leading to that crash (Ulf Hansson).

    - Fix a crash during system resume on systems where cpufreq depends
    on Operating Performance Points (OPP) for functionality, but
    CONFIG_OPP is not set. This leads the cpufreq driver registration
    to fail, but the resume code attempts to restore the pre-suspend
    cpufreq configuration (which does not exist) nevertheless and
    crashes. From Geert Uytterhoeven.

    - Add a new ACPI blacklist entry for Dell Vostro 3546 that has
    problems if it is reported as Windows 8 compatible to the BIOS
    (Adam Lee).

    - Fix swapped arguments in an error message in the cpufreq-dt driver
    (Abhilash Kesavan).

    - Fix up the prototypes of new callbacks in struct generic_pm_domain
    to make them more useful. Users of those callbacks will be added
    in 3.19 and it's better for them to be based on the correct struct
    definition in mainline from the start. From Ulf Hansson and Kevin
    Hilman"

    * tag 'pm+acpi-3.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / Domains: Fix initial default state of the need_restore flag
    PM / sleep: Fix entering suspend-to-IDLE if no freeze_oops is set
    PM / Domains: Change prototype for the attach and detach callbacks
    cpufreq: Avoid crash in resume on SMP without OPP
    cpufreq: cpufreq-dt: Fix arguments in clock failure error message
    ACPI / blacklist: blacklist Win8 OSI for Dell Vostro 3546

    Linus Torvalds
     
  • Pull firewire fix from Stefan Richter:
    "IEEE 1394 (FireWire) subsystem fix: The character device file
    interface for raw 1394 I/O took uninitialized kernel stack as
    substitute for missing ioctl() argument data. This could partially
    show up in subsequent read() output"

    * tag 'firewire-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
    firewire: cdev: prevent kernel stack leaking into ioctl arguments

    Linus Torvalds
     
  • Pull vfs fix from Al Viro:
    "Fix for a really embarrassing braino in iov_iter. Kudos to paulus..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    Fix thinko in iov_iter_single_seg_count

    Linus Torvalds
     

14 Nov, 2014

37 commits

  • * pm-domains:
    PM / Domains: Fix initial default state of the need_restore flag
    PM / Domains: Change prototype for the attach and detach callbacks

    * pm-sleep:
    PM / sleep: Fix entering suspend-to-IDLE if no freeze_oops is set

    * pm-cpufreq:
    cpufreq: Avoid crash in resume on SMP without OPP
    cpufreq: cpufreq-dt: Fix arguments in clock failure error message

    Rafael J. Wysocki
     
  • * acpi-blacklist:
    ACPI / blacklist: blacklist Win8 OSI for Dell Vostro 3546

    Rafael J. Wysocki
     
  • Found by the UC-KLEE tool: A user could supply less input to
    firewire-cdev ioctls than write- or write/read-type ioctl handlers
    expect. The handlers used data from uninitialized kernel stack then.

    This could partially leak back to the user if the kernel subsequently
    generated fw_cdev_event_*'s (to be read from the firewire-cdev fd)
    which notably would contain the __u64 closure field which many of the
    ioctl argument structures contain.

    The fact that the handlers would act on random garbage input is a
    lesser issue since all handlers must check their input anyway.

    The fix simply always null-initializes the entire ioctl argument buffer
    regardless of the actual length of expected user input. That is, a
    runtime overhead of memset(..., 40) is added to each firewire-cdev
    ioctl() call. [Comment from Clemens Ladisch: This part of the stack is
    most likely to be already in the cache.]
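
    A minimal sketch of the zero-then-copy idea described above, in plain
    user-space C. The union layout and the helper name fetch_ioctl_arg()
    are assumptions made for illustration; only the 40-byte buffer size
    comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the union of all ioctl argument structs.
 * 40 bytes is the largest expected input named in the commit message. */
union ioctl_arg {
	unsigned char raw[40];
	struct {
		unsigned long long closure; /* value echoed back in later events */
		unsigned int handle;
	} example;
};

/*
 * Copy at most sizeof(*buf) bytes of caller input into buf.
 * The whole buffer is zeroed first, so any field the caller did not
 * supply reads as 0 instead of whatever was on the stack before.
 */
void fetch_ioctl_arg(union ioctl_arg *buf, const void *user, size_t len)
{
	assert(buf != NULL && (user != NULL || len == 0));

	memset(buf, 0, sizeof(*buf));
	if (len > sizeof(*buf))
		len = sizeof(*buf);
	if (len > 0)
		memcpy(buf, user, len);
}
```

    With a 4-byte input, bytes 4..39 of the buffer are guaranteed to be
    zero, so a closure value taken from the buffer cannot carry stale data.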

    Remarks:
    - There was never any leak from kernel stack to the ioctl output
    buffer itself. IOW, it was not possible to read kernel stack by a
    read-type or write/read-type ioctl alone; the leak could at most
    happen in combination with read()ing subsequent event data.
    - The actual expected minimum user input of each ioctl from
    include/uapi/linux/firewire-cdev.h is, in bytes:
    [0x00] = 32, [0x05] = 4, [0x0a] = 16, [0x0f] = 20, [0x14] = 16,
    [0x01] = 36, [0x06] = 20, [0x0b] = 4, [0x10] = 20, [0x15] = 20,
    [0x02] = 20, [0x07] = 4, [0x0c] = 0, [0x11] = 0, [0x16] = 8,
    [0x03] = 4, [0x08] = 24, [0x0d] = 20, [0x12] = 36, [0x17] = 12,
    [0x04] = 20, [0x09] = 24, [0x0e] = 4, [0x13] = 40, [0x18] = 4.

    Reported-by: David Ramos
    Cc:
    Signed-off-by: Stefan Richter

    Stefan Richter
     
  • Pull virtio bugfix from Michael S Tsirkin:
    "This fixes a crash in virtio console multi-channel mode that got
    introduced in -rc1"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    virtio_console: move early VQ enablement

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) sunhme driver lacks DMA mapping error checks, based upon a report by
    Meelis Roos.

    2) Fix memory leak in mvpp2 driver, from Sudip Mukherjee.

    3) DMA memory allocation sizes are wrong in systemport ethernet driver,
    fix from Florian Fainelli.

    4) Fix use after free in mac80211 defragmentation code, from Johannes
    Berg.

    5) Some networking uapi headers missing from Kbuild file, from Stephen
    Hemminger.

    6) TUN driver gets csum_start offset wrong when VLAN accel is enabled,
    and macvtap has a similar bug, from Herbert Xu.

    7) Adjust several tunneling drivers to set dev->iflink after registry,
    because registry sets that to -1 overwriting whatever we did. From
    Steffen Klassert.

    8) Geneve forgets to set inner tunneling type, causing GSO segmentation
    to fail on some NICs. From Jesse Gross.

    9) Fix several locking bugs in stmmac driver, from Fabrice Gasnier and
    Giuseppe CAVALLARO.

    10) Fix spurious timeouts with NewReno on low traffic connections, from
    Marcelo Leitner.

    11) Fix descriptor updates in enic driver, from Govindarajulu
    Varadarajan.

    12) PPP calls bpf_prog_create() with locks held, which isn't kosher.
    Fix from Takashi Iwai.

    13) Fix NULL deref in SCTP with malformed INIT packets, from Daniel
    Borkmann.

    14) psock_fanout selftest accesses past the end of the mmap ring, fix
    from Shuah Khan.

    15) Fix PTP timestamping for VLAN packets, from Richard Cochran.

    16) netlink_unbind() calls in netlink pass wrong initial argument, from
    Hiroaki SHIMODA.

    17) vxlan socket reuse accidentally reuses a socket when the address
    family is different, so we have to explicitly check this, from
    Marcelo Leitner.

    18) Fix missing include in nft_reject_bridge.c breaking the build on ppc
    and other architectures, from Guenter Roeck.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (75 commits)
    vxlan: Do not reuse sockets for a different address family
    smsc911x: power-up phydev before doing a software reset.
    lib: rhashtable - Remove weird non-ASCII characters from comments
    net/smsc911x: Fix delays in the PHY enable/disable routines
    net/smsc911x: Fix rare soft reset timeout issue due to PHY power-down mode
    netlink: Properly unbind in error conditions.
    net: ptp: fix time stamp matching logic for VLAN packets.
    cxgb4 : dcb open-lldp interop fixes
    selftests/net: psock_fanout seg faults in sock_fanout_read_ring()
    net: bcmgenet: apply MII configuration in bcmgenet_open()
    net: bcmgenet: connect and disconnect from the PHY state machine
    net: qualcomm: Fix dependency
    ixgbe: phy: fix uninitialized status in ixgbe_setup_phy_link_tnx
    net: phy: Correctly handle MII ioctl which changes autonegotiation.
    ipv6: fix IPV6_PKTINFO with v4 mapped
    net: sctp: fix memory leak in auth key management
    net: sctp: fix NULL pointer dereference in af->from_addr_param on malformed packet
    net: ppp: Don't call bpf_prog_create() in ppp_lock
    net/mlx4_en: Advertize encapsulation offloads features only when VXLAN tunnel is set
    cxgb4 : Fix bug in DCB app deletion
    ...

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton:
    "15 fixes"

    * emailed patches from Andrew Morton:
    MAINTAINERS: add IIO include files
    kernel/panic.c: update comments for print_tainted
    mem-hotplug: reset node present pages when hot-adding a new pgdat
    mem-hotplug: reset node managed pages when hot-adding a new pgdat
    mm/debug-pagealloc: correct freepage accounting and order resetting
    fanotify: fix notification of groups with inode & mount marks
    mm, compaction: prevent infinite loop in compact_zone
    mm: alloc_contig_range: demote pages busy message from warn to info
    mm/slab: fix unalignment problem on Malta with EVA due to slab merge
    mm/page_alloc: restrict max order of merging on isolated pageblock
    mm/page_alloc: move freepage counting logic to __free_one_page()
    mm/page_alloc: add freepage on isolate pageblock to correct buddy list
    mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype
    mm/compaction: skip the range until proper target pageblock is met
    zram: avoid kunmap_atomic() of a NULL pointer

    Linus Torvalds
     
  • Pull Ceph fixes from Sage Weil:
    "There is an overflow bug fix for cephfs from Zheng, a fix for handling
    large authentication ticket buffers in libceph from Ilya, and a few
    fixes for the request handling code from Ilya that affect RBD volumes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    libceph: change from BUG to WARN for __remove_osd() asserts
    libceph: clear r_req_lru_item in __unregister_linger_request()
    libceph: unlink from o_linger_requests when clearing r_osd
    libceph: do not crash on large auth tickets
    ceph: fix flush tid comparision

    Linus Torvalds
     
  • Pull HID fixes from Jiri Kosina:

    - fix for an oops in HID core upon repeated subdriver insertion/removal
    under certain circumstances, by Benjamin Tissoires

    - quirk for another Elan Touchscreen device, by Adel Gadllah

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
    HID: core: cleanup .claimed field on disconnect
    HID: usbhid: enable always-poll quirk for Elan Touchscreen 0103

    Linus Torvalds
     
  • Files under include/linux/iio were not reported as part of the IIO
    subsystem.

    Signed-off-by: Daniel Baluta
    Reported-by: Cristina Ciocan
    Reviewed-by: Jingoo Han
    Cc: Hartmut Knaack
    Cc: Lars-Peter Clausen
    Cc: Peter Meerwald
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Baluta
     
  • Commit 69361eef9056 ("panic: add TAINT_SOFTLOCKUP") added the 'L' flag,
    but failed to update the comments for print_tainted(). So, update the
    comments.

    Signed-off-by: Xie XiuQi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • When memory is hot-added, all of it is in the offline state. So clear
    all zones' present_pages, because they will be updated in online_pages()
    and offline_pages(). Otherwise, /proc/zoneinfo reports wrong values:

    When the memory of node2 is offline:

    # cat /proc/zoneinfo
    ......
    Node 2, zone Movable
    ......
    spanned 8388608
    present 8388608
    managed 0

    When we online memory on node2:

    # cat /proc/zoneinfo
    ......
    Node 2, zone Movable
    ......
    spanned 8388608
    present 16777216
    managed 8388608
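
    The double counting above can be reproduced with a small model. This
    is only a sketch: struct zone_counts, hot_add() and
    online_pages_model() are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced zone counters (in pages), as listed in /proc/zoneinfo. */
struct zone_counts {
	unsigned long spanned;
	unsigned long present;
	unsigned long managed;
};

/* Hot-add: the range is spanned, but no page is online yet.
 * reset_present selects the fixed behaviour (present starts at 0). */
void hot_add(struct zone_counts *z, unsigned long pages, bool reset_present)
{
	assert(z != NULL);
	z->spanned = pages;
	z->present = reset_present ? 0 : pages;
	z->managed = 0;
}

/* Onlining accounts the pages as present and managed. */
void online_pages_model(struct zone_counts *z, unsigned long pages)
{
	assert(z != NULL);
	z->present += pages;
	z->managed += pages;
}
```

    Without the reset, onlining 8388608 pages on top of an already
    "present" 8388608 gives 16777216; with the reset, present ends up
    equal to spanned.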

    Signed-off-by: Tang Chen
    Reviewed-by: Yasuaki Ishimatsu
    Cc: [3.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • In free_area_init_core(), zone->managed_pages is set to an approximate
    value for lowmem, and will be adjusted when the bootmem allocator frees
    pages into the buddy system.

    But free_area_init_core() is also called by hotadd_new_pgdat() when
    hot-adding memory. As a result, zone->managed_pages of the newly added
    node's pgdat is set to an approximate value in the very beginning.

    Even if the memory on that node has not been onlined,
    /sys/devices/system/node/nodeXXX/meminfo reports a wrong value:

    hot-add node2 (memory not onlined)
    cat /sys/devices/system/node/node2/meminfo
    Node 2 MemTotal: 33554432 kB
    Node 2 MemFree: 0 kB
    Node 2 MemUsed: 33554432 kB
    Node 2 Active: 0 kB

    This patch fixes the problem by resetting the node's managed pages to 0
    after hot-adding a new node.

    1. Move reset_managed_pages_done from reset_node_managed_pages() to
    reset_all_zones_managed_pages()
    2. Make reset_node_managed_pages() non-static
    3. Call reset_node_managed_pages() in hotadd_new_pgdat() after pgdat
    is initialized

    Signed-off-by: Tang Chen
    Signed-off-by: Yasuaki Ishimatsu
    Cc: [3.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • This patch fixes two things. The first is freepage accounting. If we
    clear a guard page and link it onto the isolate buddy list, we should
    not increase the freepage count. This patch adds a conditional branch
    to skip counting in this case. Without it, overcounting happens
    frequently if the guard order is set and CMA is used.

    The second is the target of the order reset. In __free_one_page(), we
    check whether the buddy page is a guard page. If so, we should clear
    the guard attribute on the buddy page and reset its order to 0. But the
    current code resets the original page's order rather than the buddy's.
    This may not cause any problem, because the whole merged page's order
    is re-assigned soon afterwards, but it is better to correct the code.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Gioh Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • fsnotify() needs to merge inode and mount marks lists when notifying
    groups about events so that ignore masks from inode marks are reflected
    in mount mark notifications and groups are notified in proper order
    (according to priorities).

    Currently the sorting of the lists done by fsnotify_add_inode_mark() /
    fsnotify_add_vfsmount_mark() and by fsnotify() differs, which results
    in ignore masks not being used in some cases.

    Fix the problem by always using the same comparison function when
    sorting / merging the mark lists.
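
    A rough sketch of why one shared comparison function matters: if both
    lists are sorted with compare_groups(), a merge that uses the same
    function meets a group's inode mark and mount mark at the same step.
    The struct and both function names are hypothetical simplifications,
    not the fsnotify code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified group: higher priority is notified first. */
struct group {
	int priority;
	int id;
};

/* The single ordering rule, used for sorting and for merging.
 * Returns <0 if a sorts before b, 0 if same group, >0 otherwise. */
int compare_groups(const struct group *a, const struct group *b)
{
	if (a->priority != b->priority)
		return a->priority > b->priority ? -1 : 1;
	if (a->id != b->id)
		return a->id < b->id ? -1 : 1;
	return 0;
}

/* Merge two lists that are each sorted by compare_groups().
 * A group present in both lists is emitted once. Returns the count;
 * out must have room for ni + nm entries. */
size_t merge_marks(const struct group *inode, size_t ni,
		   const struct group *mount, size_t nm,
		   struct group *out)
{
	size_t i = 0, m = 0, n = 0;

	assert(out != NULL);
	while (i < ni || m < nm) {
		int cmp;

		if (i == ni)
			cmp = 1;	/* only mount marks left */
		else if (m == nm)
			cmp = -1;	/* only inode marks left */
		else
			cmp = compare_groups(&inode[i], &mount[m]);

		if (cmp < 0) {
			out[n++] = inode[i++];
		} else if (cmp > 0) {
			out[n++] = mount[m++];
		} else {
			/* Same group on both lists: handle both marks together. */
			out[n++] = inode[i++];
			m++;
		}
	}
	return n;
}
```

    If the lists were sorted by different rules, the equal case could be
    missed and the two marks of one group would be handled separately,
    which matches the symptom of ignore masks not being applied.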

    Thanks to Heinrich Schuchardt for improvements of my patch.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=87721
    Signed-off-by: Jan Kara
    Reported-by: Heinrich Schuchardt
    Tested-by: Heinrich Schuchardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Several people have reported occasionally seeing processes stuck in
    compact_zone(), even triggering soft lockups, in 3.18-rc2+.

    Testing a revert of commit e14c720efdd7 ("mm, compaction: remember
    position within pageblock in free pages scanner") fixed the issue,
    although the stuck processes do not appear to involve the free scanner.

    Finally, by code inspection, the bug was found in isolate_migratepages()
    which uses a slightly different condition to detect if the migration and
    free scanners have met, than compact_finished(). That has not been a
    problem until commit e14c720efdd7 allowed the free scanner position
    between individual invocations to be in the middle of a pageblock.

    In a relatively rare case, the migration scanner position can end up at
    the beginning of a pageblock, with the free scanner position in the
    middle of the same pageblock. If it's the migration scanner's turn,
    isolate_migratepages() exits immediately (without updating the
    position), while compact_finished() decides to continue compaction,
    resulting in a potentially infinite loop. The system can recover only
    if another process creates enough high-order pages to make the watermark
    checks in compact_finished() pass.

    This patch fixes the immediate problem by bumping the migration
    scanner's position to meet the free scanner in isolate_migratepages(),
    when both are within the same pageblock. This causes compact_finished()
    to terminate properly. A more robust check in compact_finished() is
    planned as a cleanup for better future maintainability.
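
    The stuck state and the fix can be modelled in a few lines. This is a
    simplified sketch, not the compaction code: the struct, the function
    names and the pageblock size of 512 pages are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGEBLOCK_NR_PAGES 512UL	/* assumed pageblock size in pages */

struct scanners {
	unsigned long migrate_pfn;	/* moves upward */
	unsigned long free_pfn;		/* moves downward */
};

/* Termination test in the style described for compact_finished(). */
bool scanners_met(const struct scanners *s)
{
	return s->free_pfn <= s->migrate_pfn;
}

/* First pfn after the pageblock that contains pfn. */
unsigned long block_end(unsigned long pfn)
{
	return (pfn / PAGEBLOCK_NR_PAGES + 1) * PAGEBLOCK_NR_PAGES;
}

/*
 * One turn of the migration scanner. It only scans a pageblock that
 * lies entirely below the free scanner. With with_fix set, a scanner
 * that shares its pageblock with the free scanner is bumped to
 * free_pfn so that scanners_met() becomes true.
 */
void migrate_scanner_turn(struct scanners *s, bool with_fix)
{
	unsigned long end;

	assert(s != NULL);
	end = block_end(s->migrate_pfn);
	if (end <= s->free_pfn) {
		s->migrate_pfn = end;	/* scanned one whole pageblock */
		return;
	}
	if (with_fix)
		s->migrate_pfn = s->free_pfn;
}
```

    Without the bump, a turn that starts at the beginning of a pageblock
    while free_pfn is in the middle of it changes nothing, so the caller
    would loop on the same state.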

    Fixes: e14c720efdd73 ("mm, compaction: remember position within pageblock in free pages scanner")
    Signed-off-by: Vlastimil Babka
    Reported-by: P. Christeas
    Tested-by: P. Christeas
    Link: http://marc.info/?l=linux-mm&m=141508604232522&w=2
    Reported-by: Norbert Preining
    Tested-by: Norbert Preining
    Link: https://lkml.org/lkml/2014/11/4/904
    Reported-by: Pavel Machek
    Link: https://lkml.org/lkml/2014/11/7/164
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Having test_pages_isolated failure message as a warning confuses users
    into thinking that it is more serious than it really is. In reality, if
    called via CMA, allocation will be retried so a single
    test_pages_isolated failure does not prevent allocation from succeeding.

    Demote the warning message to an info message and reformat it such that
    the text "failed" does not appear and instead a less worrying "PFNS
    busy" is used.

    This message is trivially reproducible on a 10GB x86 machine on 3.16.y
    kernels configured with CONFIG_DMA_CMA.

    Signed-off-by: Michal Nazarewicz
    Cc: Laurent Pinchart
    Cc: Peter Hurley
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Nazarewicz
     
  • Unlike in SLUB, in SLAB an object sometimes does not start at the
    beginning of the slab. This causes an alignment problem now that slab
    merging is supported by commit 12220dea07f1 ("mm/slab: support slab merge").

    Following is the report from Markos of a boot failure on Malta with EVA.

    Calibrating delay loop... 19.86 BogoMIPS (lpj=99328)
    pid_max: default: 32768 minimum: 301
    Mount-cache hash table entries: 4096 (order: 0, 16384 bytes)
    Mountpoint-cache hash table entries: 4096 (order: 0, 16384 bytes)
    Kernel bug detected[#1]:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.17.0-05639-g12220dea07f1 #1631
    task: 1f04f5d8 ti: 1f050000 task.ti: 1f050000
    epc : 80141190 alloc_unbound_pwq+0x234/0x304
    Not tainted
    ra : 80141184 alloc_unbound_pwq+0x228/0x304
    Process swapper/0 (pid: 1, threadinfo=1f050000, task=1f04f5d8, tls=00000000)
    Call Trace:
    alloc_unbound_pwq+0x234/0x304
    apply_workqueue_attrs+0x11c/0x294
    __alloc_workqueue_key+0x23c/0x470
    init_workqueues+0x320/0x400
    do_one_initcall+0xe8/0x23c
    kernel_init_freeable+0x9c/0x224
    kernel_init+0x10/0x100
    ret_from_kernel_thread+0x14/0x1c
    [ end trace cb88537fdc8fa200 ]
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

    alloc_unbound_pwq() allocates a slab object from the pool_workqueue
    kmem_cache. This kmem_cache requires 256-byte alignment, but the current
    merging code doesn't honor that and merges it with kmalloc-256.
    kmalloc-256 requires only cacheline-size alignment, so the above failure
    occurs. On x86, however, kmalloc-256 happens to be aligned to 256 bytes,
    so the problem didn't show up there.

    To fix this problem, this patch introduces an alignment mismatch check in
    find_mergeable().
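
    One way such an alignment mismatch check could look, as a sketch
    only. The struct and the function alignment_compatible() are
    hypothetical; the 64-byte cacheline in the usage note is an assumed
    value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a cache: only what the check needs. */
struct cache {
	size_t size;	/* object size in bytes */
	size_t align;	/* alignment guaranteed for every object */
};

/*
 * May a new cache that needs align-byte alignment be merged into
 * existing? Only if existing guarantees an alignment that is at least
 * as strict and is a multiple of the requested one.
 * align == 0 means "no particular requirement".
 */
bool alignment_compatible(const struct cache *existing, size_t align)
{
	assert(existing != NULL);
	if (align == 0)
		return true;
	return existing->align >= align && existing->align % align == 0;
}
```

    With a 64-byte cacheline, a 256-byte-aligned request is rejected for
    a kmalloc-256-like cache { 256, 64 } and accepted for { 256, 256 }.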

    Signed-off-by: Joonsoo Kim
    Reported-by: Markos Chandras
    Tested-by: Markos Chandras
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The current pageblock isolation logic can isolate each pageblock
    individually. This causes a freepage accounting problem if a freepage of
    pageblock order on an isolated pageblock is merged with another freepage
    on a normal pageblock. We can prevent the merge by restricting the max
    order of merging to pageblock order if the freepage is on an isolated
    pageblock.

    A side effect of this change is that non-merged buddy freepages may
    remain even after pageblock isolation finishes, because undoing
    pageblock isolation just moves freepages from the isolate buddy list to
    the normal buddy list and does not consider merging. So, the patch also
    makes undoing pageblock isolation consider freepage merging. On
    un-isolation, a freepage of more than pageblock order and its buddy are
    checked. If they are on a normal pageblock, instead of just moving the
    freepage, we isolate it and free it so that it gets merged.
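
    The order restriction can be illustrated with a toy merge loop. This
    is a sketch under assumed constants (MAX_ORDER 11, pageblock order
    9); the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER	11	/* assumed: valid orders are 0..10 */
#define PAGEBLOCK_ORDER	9	/* assumed pageblock order */

/*
 * Exclusive upper bound for the order a freed page may reach by
 * merging. On an isolated pageblock merging stops at pageblock order,
 * so the page never absorbs a buddy from a neighbouring pageblock.
 */
unsigned int merge_order_limit(bool on_isolated_pageblock)
{
	if (on_isolated_pageblock && PAGEBLOCK_ORDER + 1 < MAX_ORDER)
		return PAGEBLOCK_ORDER + 1;
	return MAX_ORDER;
}

/* Order reached after merging, given how many consecutive buddies
 * (starting at order) happen to be free. */
unsigned int merged_order(unsigned int order, unsigned int free_buddies,
			  bool on_isolated_pageblock)
{
	unsigned int limit = merge_order_limit(on_isolated_pageblock);

	assert(order < MAX_ORDER);
	while (free_buddies > 0 && order + 1 < limit) {
		order++;
		free_buddies--;
	}
	return order;
}
```

    Starting at order 8 with two free buddies, the page reaches order 10
    on a normal pageblock but stops at order 9 on an isolated one.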

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • All the callers of __free_one_page() have similar freepage counting logic,
    so we can move it into __free_one_page(). This reduces lines of code and
    helps future maintenance.

    This is also a preparation step for "mm/page_alloc: restrict max order of
    merging on isolated pageblock", which fixes the freepage counting problem
    on freepages of more than pageblock order.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • In free_pcppages_bulk(), we use the cached migratetype of a freepage to
    determine the type of buddy list the freepage will be added to. This
    information is stored when the freepage is added to the pcp list, so if
    isolation of this freepage's pageblock begins after that, the cached
    information can be stale. In other words, it holds the original
    migratetype rather than MIGRATE_ISOLATE.

    This stale information causes two problems.

    One is that we can't keep these freepages from being allocated.
    Although the pageblock is isolated, the freepage is added to the normal
    buddy list, so it can be allocated without any restriction. The other
    problem is incorrect freepage accounting. Freepages on an isolated
    pageblock should not be counted in the number of freepages.

    Following is the code snippet in free_pcppages_bulk().

    /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
    __free_one_page(page, page_to_pfn(page), zone, 0, mt);
    trace_mm_page_pcpu_drain(page, 0, mt);
    if (likely(!is_migrate_isolate_page(page))) {
            __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
            if (is_migrate_cma(mt))
                    __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
    }

    As the snippet shows, the current code already handles the second
    problem, incorrect freepage accounting, by re-fetching the pageblock
    migratetype through is_migrate_isolate_page(page).

    But because this re-fetched information isn't used for
    __free_one_page(), the first problem is not solved. This patch solves
    it by re-fetching the pageblock migratetype before __free_one_page() and
    passing it to __free_one_page().

    In addition to moving the re-fetch up, this patch applies an
    optimization: the migratetype is re-fetched only if there is an isolated
    pageblock. Pageblock isolation is a rare event, so the re-fetch is
    avoided in the common case.

    This patch also corrects the migratetype in the tracepoint output.

    Signed-off-by: Joonsoo Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Before describing the bugs themselves, I first explain the definition of
    freepage.

    1. pages on buddy list are counted as freepage.
    2. pages on isolate migratetype buddy list are *not* counted as freepage.
    3. pages on cma buddy list are counted as CMA freepage, too.

    Now, I describe the problems and the related patches.

    Patch 1: There are race conditions in getting the pageblock migratetype
    that result in misplacement of freepages on the buddy list, incorrect
    freepage count and unavailability of freepages.

    Patch 2: Freepages on the pcp list could have stale cached information for
    determining the migratetype of the buddy list to go to. This causes
    misplacement of freepages on the buddy list and incorrect freepage count.

    Patch 4: Merging between freepages on pageblocks of different
    migratetypes causes a freepage accounting problem. This patch fixes it.

    Without patchset [3], the above problems don't happen in my CMA allocation
    test, because CMA reserved pages aren't used at all. So there is no
    chance for the above race.

    With patchset [3], I ran a simple CMA allocation test and got the result
    below:

    - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
    - run kernel build (make -j16) in the background
    - 30 CMA allocation attempts (8MB * 30 = 240MB) at 5 sec intervals
    - Result: more than 5000 freepages are missing from the count

    With patchset [3] and this patchset, I found that no freepages are
    missing from the count, so I conclude that the problems are solved.

    These problems also occur in my simple memory offlining test.

    This patch (of 4):

    There are two paths to reach the core free function of the buddy allocator,
    __free_one_page(): one is free_one_page()->__free_one_page() and the
    other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().
    Each path has a race condition causing serious problems. This patch
    focuses on the first free path; the following patch solves the
    problem in the second.

    In the first free path, we get the migratetype of the page being freed
    without holding the zone lock, so it can be racy. There are two cases
    of this race.

    1. pages are added to isolate buddy list after restoring original
    migratetype

    CPU1                                            CPU2

    get migratetype => return MIGRATE_ISOLATE
    call free_one_page() with MIGRATE_ISOLATE

                                                    grab the zone lock
                                                    unisolate pageblock
                                                    release the zone lock

    grab the zone lock
    call __free_one_page() with MIGRATE_ISOLATE
    freepage goes into isolate buddy list,
    although pageblock is already unisolated

    This may cause two problems. One is that we can't use this page anymore
    until the next isolation attempt on this pageblock, because the freepage
    is on the isolate buddy list. The other is that freepage accounting could
    be wrong due to merging between different buddy lists. Freepages on the
    isolate buddy list aren't counted as freepages, but ones on the normal
    buddy list are. If a merge happens, a buddy freepage on the normal buddy
    list is inevitably moved to the isolate buddy list without any adjustment
    of freepage accounting, so the count could become incorrect.

    2. pages are added to normal buddy list while pageblock is isolated.
    It is similar to the above case.

    This also may cause two problems. One is that we can't keep these
    freepages from being allocated. Although the pageblock is isolated,
    the freepage would be added to the normal buddy list so that it could be
    allocated without any restriction. The other problem is the same as in
    case 1, that is, incorrect freepage accounting.

    This race condition can be prevented by checking the migratetype again
    while holding the zone lock. Because that is a somewhat heavy operation
    and isn't needed in the common case, we want to avoid rechecking as much
    as possible. So this patch introduces a new variable, nr_isolate_pageblock,
    in struct zone to check whether there is an isolated pageblock. With this,
    we can avoid re-checking the migratetype in the common case and do it only
    if there is an isolated pageblock or the migratetype is MIGRATE_ISOLATE.
    This solves the above-mentioned problems.
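
    The recheck condition can be sketched as follows. This is a reduced
    model, not the page allocator: apart from nr_isolate_pageblock and
    MIGRATE_ISOLATE, every name is hypothetical, and the zone lock is
    only implied by the comment.

```c
#include <assert.h>
#include <stddef.h>

enum migratetype { MIGRATE_MOVABLE, MIGRATE_ISOLATE };

/* Hypothetical, heavily reduced zone that models a single pageblock. */
struct zone {
	unsigned long nr_isolate_pageblock;	/* counter added by the patch */
	enum migratetype pageblock_type;	/* current type of the pageblock */
	unsigned long rechecks;			/* how often the slow path ran */
};

/*
 * Decide which buddy list a page goes to. cached was read without the
 * zone lock and may be stale; this function models the part that runs
 * with the lock held. The authoritative type is looked up again only
 * if some pageblock in the zone is isolated or the cached value itself
 * says MIGRATE_ISOLATE.
 */
enum migratetype resolve_migratetype(struct zone *z, enum migratetype cached)
{
	assert(z != NULL);
	if (z->nr_isolate_pageblock > 0 || cached == MIGRATE_ISOLATE) {
		z->rechecks++;
		return z->pageblock_type;
	}
	return cached;	/* common case: no isolation in this zone */
}
```

    The second half of the condition covers case 1: the pageblock was
    unisolated in the meantime, the counter is already back to zero, and
    only the stale MIGRATE_ISOLATE value reveals that a recheck is needed.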

    Changes from v3:
    Add one more check in free_one_page() that checks whether migratetype is
    MIGRATE_ISOLATE or not. Without this, the above-mentioned case 1 could happen.

    Signed-off-by: Joonsoo Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Commit 7d49d8868336 ("mm, compaction: reduce zone checking frequency in
    the migration scanner") has a side-effect that changes the iteration
    range calculation. Before the change, block_end_pfn is calculated using
    start_pfn, but now it blindly adds pageblock_nr_pages to the previous
    value.

    This causes the problem that isolation_start_pfn is larger than
    block_end_pfn when we isolate the page with more than pageblock order.
    In this case, isolation would fail due to an invalid range parameter.

    To prevent this, this patch implements skipping the range until a proper
    target pageblock is met. Without this patch, CMA with more than
    pageblock order always fails but with this patch it will succeed.
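
    A sketch of the range adjustment described above, assuming a
    pageblock of 512 pages. The helper names are hypothetical; the idea
    is only that the block end is recomputed from the current pfn once
    the pfn has jumped past it.

```c
#include <assert.h>

#define PAGEBLOCK_NR_PAGES 512UL	/* assumed pageblock size in pages */

/* Round x up to the next multiple of PAGEBLOCK_NR_PAGES. */
unsigned long pageblock_align_up(unsigned long x)
{
	return (x + PAGEBLOCK_NR_PAGES - 1) / PAGEBLOCK_NR_PAGES
		* PAGEBLOCK_NR_PAGES;
}

/*
 * Return a block end that is valid for pfn. If a free page larger
 * than a pageblock moved pfn to or beyond block_end_pfn, skip ahead to
 * the end of the pageblock that contains pfn, clamped to end_pfn.
 */
unsigned long adjust_block_end(unsigned long pfn, unsigned long block_end_pfn,
			       unsigned long end_pfn)
{
	assert(pfn < end_pfn);
	if (pfn >= block_end_pfn) {
		block_end_pfn = pageblock_align_up(pfn + 1);
		if (block_end_pfn > end_pfn)
			block_end_pfn = end_pfn;
	}
	return block_end_pfn;
}
```

    For pfn 2048 and a stale block end of 1024, the adjusted block end is
    2560, so the start of the range is below its end again.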

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • zram could kunmap_atomic() a NULL pointer in a rare situation: a zram
    page becomes a full-zeroed page after a partial write io. The current
    code doesn't handle this case and performs kunmap_atomic() on a NULL
    pointer, which panics the kernel.

    This patch fixes this issue.

    Signed-off-by: Weijie Yang
    Cc: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Acked-by: Jerome Marchand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Currently, we only match against the local port number in order to reuse
    a socket. But if this new vxlan wants an IPv6 socket and an IPv4 one is
    bound to that port, vxlan will reuse the IPv4 socket as IPv6 and a panic
    will follow. The following steps reproduce it:

    # ip link add vxlan6 type vxlan id 42 group 229.10.10.10 \
    srcport 5000 6000 dev eth0
    # ip link add vxlan7 type vxlan id 43 group ff0e::110 \
    srcport 5000 6000 dev eth0
    # ip link set vxlan6 up
    # ip link set vxlan7 up

    [ 4.187481] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
    ...
    [ 4.188076] Call Trace:
    [ 4.188085] [] ? ipv6_sock_mc_join+0x3a/0x630
    [ 4.188098] [] vxlan_igmp_join+0x66/0xd0 [vxlan]
    [ 4.188113] [] process_one_work+0x220/0x710
    [ 4.188125] [] ? process_one_work+0x1b4/0x710
    [ 4.188138] [] worker_thread+0x11b/0x3a0
    [ 4.188149] [] ? process_one_work+0x710/0x710

    So address family must also match in order to reuse a socket.
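
    The matching rule stated above can be sketched in plain C as follows.
    This is a user-space model, not the vxlan driver code: struct
    tunnel_sock and find_sock() are hypothetical names, and the port number
    used in the usage note is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h> /* AF_INET, AF_INET6, sa_family_t */

/* Hypothetical stand-in for a tunnel socket entry. */
struct tunnel_sock {
    sa_family_t family; /* AF_INET or AF_INET6 */
    uint16_t port;      /* local port the socket is bound to */
};

/*
 * Return an existing socket only if both the port and the address
 * family match; matching on the port alone would hand an IPv4 socket
 * to an IPv6 user.
 */
const struct tunnel_sock *find_sock(const struct tunnel_sock *socks, size_t n,
                                    sa_family_t family, uint16_t port)
{
    for (size_t i = 0; i < n; i++) {
        if (socks[i].port == port && socks[i].family == family)
            return &socks[i];
    }
    return NULL;
}
```

    With a single AF_INET entry on port 4789 in the table, a lookup for
    AF_INET6 on the same port returns NULL, so the caller would create a new
    socket instead of reusing the IPv4 one.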

    Reported-by: Jean-Tsung Hsiao
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Leitner
     
  • With commit be9dad1f9f26604fb ("net: phy: suspend phydev when going
    to HALTED"), the PHY device will be put in a low-power mode using
    BMCR_PDOWN if the interface is set down. The smsc911x driver does
    a software_reset when opening the device driver (ndo_open). In such a
    case, the PHY must be powered up before accessing any register and
    before calling the software_reset function. Otherwise, as the PHY is
    powered down, the software reset fails and the interface cannot be
    enabled again.

    This patch fixes this scenario, which is easy to reproduce by setting
    the network interface down and then up again.

    $ ifconfig eth0 down
    $ ifconfig eth0 up
    ifconfig: SIOCSIFFLAGS: Input/output error
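
    The ordering requirement described above (power up, then reset) can be
    modelled in a few lines of user-space C. This is only a sketch: the
    model_* helpers are hypothetical and stand in for real register
    accesses. The bit value of BMCR_PDOWN is the standard MII one.

```c
#include <assert.h>
#include <stdint.h>

#define BMCR_PDOWN 0x0800 /* power-down bit of the MII basic mode control register */

/* Return the BMCR value with the power-down bit cleared. */
uint16_t bmcr_power_up(uint16_t bmcr)
{
    return (uint16_t)(bmcr & ~BMCR_PDOWN);
}

/* Hypothetical model: a reset only succeeds while the PHY is powered. */
int model_soft_reset(uint16_t bmcr)
{
    return (bmcr & BMCR_PDOWN) ? -1 : 0;
}

/* Hypothetical model of the open path: power up first, then reset. */
int model_open(uint16_t *bmcr)
{
    *bmcr = bmcr_power_up(*bmcr);
    return model_soft_reset(*bmcr);
}
```

    Starting from a register value of 0x1800 (power-down set), the model
    reset fails when called directly and succeeds when reached through
    model_open(), which leaves the register at 0x1000.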

    Signed-off-by: Enric Balletbo i Serra
    Signed-off-by: David S. Miller

    Enric Balletbo i Serra
     
  • My editor spewed garbage that looked like memory corruption on
    my screen. It turns out that a number of occurrences of "fi" got
    turned into a ligature.

    This patch replaces these ligatures with the ASCII letters "fi".
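
    A replacement of this kind could be scripted along the following lines.
    The sketch assumes the ligature is U+FB01 encoded as UTF-8 (bytes EF AC
    81); the commit message does not state the encoding, and
    replace_fi_ligature() is a name invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src to dst, replacing every UTF-8 encoded U+FB01 ("fi" ligature)
 * with the two ASCII letters "fi". dst must hold at least
 * strlen(src) + 1 bytes; the output never grows, since three bytes
 * become two. Returns the number of replacements made.
 */
size_t replace_fi_ligature(const char *src, char *dst)
{
    static const char lig[] = "\xEF\xAC\x81";
    size_t count = 0;

    while (*src) {
        if (strncmp(src, lig, 3) == 0) {
            *dst++ = 'f';
            *dst++ = 'i';
            src += 3;
            count++;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
    return count;
}
```

    Text without the ligature passes through unchanged and the function
    returns 0.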

    Signed-off-by: Herbert Xu

    Cheers,
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Increased the delay in smsc911x_phy_disable_energy_detect (from 1ms to 2ms).
    Dropped the delays in smsc911x_phy_enable_energy_detect (100ms and 1ms).

    The patch affects SMSC LAN generation 4 chips with integrated PHY (LAN9221).

    I saw problems with soft reset due to wrong udelay timings.
    After I fixed udelay, I measured the time needed to bring the integrated PHY
    from power-down to operational mode (the time between clearing the EDPWRDOWN
    bit and the soft reset complete event). I got 1ms (measured using ktime_get).
    That equals the current value (1ms) used in
    smsc911x_phy_disable_energy_detect. It is near the upper bound, so in order
    to avoid rare soft reset faults it is doubled (2ms).

    I don't know the official timing for bringing up the integrated PHY, as the
    specs don't clarify this (or maybe I didn't find it).

    It looks safe to drop the delays before and after setting the EDPWRDOWN bit
    (enable PHY power-down mode). I didn't see any regressions with the patch.

    The patch was reviewed by Steve Glendinning and Microchip Team.

    Signed-off-by: Alexander Kochetkov
    Acked-by: Steve Glendinning
    Signed-off-by: David S. Miller

    Alexander Kochetkov
     
  • The patch affects SMSC LAN generation 4 chips with integrated PHY (LAN9221).

    It is possible for the PHY to enter power-down mode (ENERGYON clear)
    between the ENERGYON bit check in smsc911x_phy_disable_energy_detect and
    the SRST bit set in smsc911x_soft_reset. This could happen, for example, if
    someone disconnects the ethernet cable between the checks. A PHY in
    power-down mode would prevent the MAC portion of the chip from being
    software reset.

    Initially found by code review, confirmed later using a test case.

    This is a low-probability issue, and in order to reproduce it you have to
    run the script:

    while true; do
    ifconfig eth0 down
    ifconfig eth0 up || break
    done

    While the script is running you have to plug/unplug the ethernet cable many
    times (using a gpio-controlled ethernet switch, for example) until you get:

    [ 4516.477783] ADDRCONF(NETDEV_UP): eth0: link is not ready
    [ 4516.512207] smsc911x smsc911x.0: eth0: SMSC911x/921x identified at 0xce006000, IRQ: 336
    [ 4516.524658] ADDRCONF(NETDEV_UP): eth0: link is not ready
    [ 4516.559082] smsc911x smsc911x.0: eth0: SMSC911x/921x identified at 0xce006000, IRQ: 336
    [ 4516.571990] ADDRCONF(NETDEV_UP): eth0: link is not ready
    ifconfig: SIOCSIFFLAGS: Input/output error

    The patch was reviewed by Steve Glendinning and Microchip Team.

    Signed-off-by: Alexander Kochetkov
    Acked-by: Steve Glendinning
    Signed-off-by: David S. Miller

    Alexander Kochetkov
     
  • No reason to use BUG_ON for osd request list assertions.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • kick_requests() can put linger requests on the notarget list. This
    means we need to clear the much-overloaded req->r_req_lru_item in
    __unregister_linger_request() as well, or we get an assertion failure
    in ceph_osdc_release_request() - !list_empty(&req->r_req_lru_item).

    AFAICT the assumption was that registered linger requests cannot be on
    any of req->r_req_lru_item lists, but that's clearly not the case.
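
    To illustrate why the list item has to be cleared, here is a small
    user-space model of an intrusive list. It is written in the spirit of
    the kernel list helpers but is not <linux/list.h>; all names are
    hypothetical.

```c
#include <assert.h>

/* Minimal circular doubly linked list node; an unlinked node points at itself. */
struct list_node {
    struct list_node *next, *prev;
};

void node_init(struct list_node *n)
{
    n->next = n;
    n->prev = n;
}

/* True if the node is not linked into any list (or the list head is empty). */
int node_unlinked(const struct list_node *n)
{
    return n->next == n;
}

void node_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink the node and reinitialise it so a later emptiness check passes. */
void node_del_init(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
}
```

    In this model, a request item that was queued on a "notarget" list
    fails an emptiness check at release time unless the unregister path
    calls node_del_init() on it first.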

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Requests have to be unlinked from both osd->o_requests (normal
    requests) and osd->o_linger_requests (linger requests) lists when
    clearing req->r_osd. Otherwise __unregister_linger_request() gets
    confused and we trip over a !list_empty(&osd->o_linger_requests)
    assert in __remove_osd().

    MON=1 OSD=1:

    # cat remove-osd.sh
    #!/bin/bash
    rbd create --size 1 test
    DEV=$(rbd map test)
    ceph osd out 0
    sleep 3
    rbd map dne/dne # obtain a new osdmap as a side effect
    rbd unmap $DEV & # will block
    sleep 3
    ceph osd in 0

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • Large (greater than 32k, the value of PAGE_ALLOC_COSTLY_ORDER) auth
    tickets will have their buffers vmalloc'ed, which leads to the
    following crash in crypto:

    [ 28.685082] BUG: unable to handle kernel paging request at ffffeb04000032c0
    [ 28.686032] IP: [] scatterwalk_pagedone+0x22/0x80
    [ 28.686032] PGD 0
    [ 28.688088] Oops: 0000 [#1] PREEMPT SMP
    [ 28.688088] Modules linked in:
    [ 28.688088] CPU: 0 PID: 878 Comm: kworker/0:2 Not tainted 3.17.0-vm+ #305
    [ 28.688088] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
    [ 28.688088] Workqueue: ceph-msgr con_work
    [ 28.688088] task: ffff88011a7f9030 ti: ffff8800d903c000 task.ti: ffff8800d903c000
    [ 28.688088] RIP: 0010:[] [] scatterwalk_pagedone+0x22/0x80
    [ 28.688088] RSP: 0018:ffff8800d903f688 EFLAGS: 00010286
    [ 28.688088] RAX: ffffeb04000032c0 RBX: ffff8800d903f718 RCX: ffffeb04000032c0
    [ 28.688088] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800d903f750
    [ 28.688088] RBP: ffff8800d903f688 R08: 00000000000007de R09: ffff8800d903f880
    [ 28.688088] R10: 18df467c72d6257b R11: 0000000000000000 R12: 0000000000000010
    [ 28.688088] R13: ffff8800d903f750 R14: ffff8800d903f8a0 R15: 0000000000000000
    [ 28.688088] FS: 00007f50a41c7700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
    [ 28.688088] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 28.688088] CR2: ffffeb04000032c0 CR3: 00000000da3f3000 CR4: 00000000000006b0
    [ 28.688088] Stack:
    [ 28.688088] ffff8800d903f698 ffffffff81392ca8 ffff8800d903f6e8 ffffffff81395d32
    [ 28.688088] ffff8800dac96000 ffff880000000000 ffff8800d903f980 ffff880119b7e020
    [ 28.688088] ffff880119b7e010 0000000000000000 0000000000000010 0000000000000010
    [ 28.688088] Call Trace:
    [ 28.688088] [] scatterwalk_done+0x38/0x40
    [ 28.688088] [] scatterwalk_done+0x38/0x40
    [ 28.688088] [] blkcipher_walk_done+0x182/0x220
    [ 28.688088] [] crypto_cbc_encrypt+0x15f/0x180
    [ 28.688088] [] ? crypto_aes_set_key+0x30/0x30
    [ 28.688088] [] ceph_aes_encrypt2+0x29c/0x2e0
    [ 28.688088] [] ceph_encrypt2+0x93/0xb0
    [ 28.688088] [] ceph_x_encrypt+0x4a/0x60
    [ 28.688088] [] ? ceph_buffer_new+0x5d/0xf0
    [ 28.688088] [] ceph_x_build_authorizer.isra.6+0x297/0x360
    [ 28.688088] [] ? kmem_cache_alloc_trace+0x11b/0x1c0
    [ 28.688088] [] ? ceph_auth_create_authorizer+0x36/0x80
    [ 28.688088] [] ceph_x_create_authorizer+0x63/0xd0
    [ 28.688088] [] ceph_auth_create_authorizer+0x54/0x80
    [ 28.688088] [] get_authorizer+0x80/0xd0
    [ 28.688088] [] prepare_write_connect+0x18b/0x2b0
    [ 28.688088] [] try_read+0x1e59/0x1f10

    This is because we set up crypto scatterlists as if all buffers were
    kmalloc'ed. Fix it.

    Cc: stable@vger.kernel.org
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • TID of cap flush ack is 64 bits, but ceph_inode_info::flushing_cap_tid
    is only 16 bits. 16 bits should be plenty to let the cap flush updates
    pipeline appropriately, but we need to cast in the proper direction when
    comparing these differently-sized versions. So downcast the 64-bit one
    to 16 bits.
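
    The direction of the cast matters once the 64-bit counter exceeds 16
    bits. A minimal sketch, with hypothetical function names and plain
    stdint types standing in for the ceph structures:

```c
#include <assert.h>
#include <stdint.h>

/* Compare by truncating the 64-bit TID from the ack to 16 bits. */
int flush_tid_matches(uint64_t ack_tid, uint16_t flushing_tid)
{
    return (uint16_t)ack_tid == flushing_tid;
}

/*
 * Comparing in the other direction widens the stored value instead,
 * which can no longer match once ack_tid has grown past 0xffff.
 */
int flush_tid_matches_widened(uint64_t ack_tid, uint16_t flushing_tid)
{
    return ack_tid == (uint64_t)flushing_tid;
}
```

    For an ack TID of 0x10005 and a stored value of 0x0005, the truncating
    comparison matches while the widening one does not.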

    Reflects ceph.git commit a5184cf46a6e867287e24aeb731634828467cd98.

    Signed-off-by: Yan, Zheng
    Reviewed-by: Ilya Dryomov

    Yan, Zheng
     
  • The branches of the if (i->type & ITER_BVEC) statement in
    iov_iter_single_seg_count() are the wrong way around; if ITER_BVEC is
    clear then we use i->bvec, when we should be using i->iov. This fixes
    it.

    In my case, the symptom that this caused was that a KVM guest doing
    filesystem operations on a virtual disk would result in one of qemu's
    threads on the host going into an infinite loop in
    generic_perform_write(). The loop would hit the copied == 0 case and
    call iov_iter_single_seg_count() to reduce the number of bytes to try
    to process, but because of the error, iov_iter_single_seg_count()
    would just return i->count and the loop made no progress and continued
    forever.
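
    The corrected branch order can be sketched with a simplified user-space
    model. The structures below are stand-ins invented for this example, not
    the kernel's struct iov_iter, and the value of MODEL_ITER_BVEC is
    arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ITER_BVEC 4 /* arbitrary flag value for this model */

struct model_iovec { void *base; size_t len; };
struct model_bvec  { void *page; unsigned int len; unsigned int offset; };

struct model_iter {
    int type;          /* MODEL_ITER_BVEC set: use bvec, otherwise iov */
    size_t iov_offset; /* bytes already consumed in the first segment */
    size_t count;      /* bytes remaining in total */
    union {
        const struct model_iovec *iov;
        const struct model_bvec *bvec;
    };
    unsigned long nr_segs;
};

/* Bytes left in the first segment, never more than the total remaining. */
size_t single_seg_count(const struct model_iter *i)
{
    size_t seg;

    if (i->nr_segs == 1)
        return i->count;
    /* The flag selects bvec; when it is clear, the iov array is the valid one. */
    if (i->type & MODEL_ITER_BVEC)
        seg = i->bvec->len - i->iov_offset;
    else
        seg = i->iov->len - i->iov_offset;
    return i->count < seg ? i->count : seg;
}
```

    With two iovec segments of 100 and 50 bytes, an offset of 30 and 120
    bytes remaining, the function returns 70, so a caller that retries with
    this smaller length makes progress.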

    Cc: stable@vger.kernel.org # 3.16+
    Signed-off-by: Paul Mackerras
    Signed-off-by: Al Viro

    Paul Mackerras
     
  • Pull sound fixes from Takashi Iwai:
    "Things are calming down; now we have only a few fix patches: a trivial
    fix for memory leak in usb-audio, a patch for the new HD-audio PCI id,
    a device-specific mute-LED fix, and a slightly big patch to cover the
    missing COEF inits of various Realtek codecs"

    * tag 'sound-3.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
    ALSA: hda - Add mute LED control for Lenovo Ideapad Z560
    ALSA: hda/realtek - Change EAPD to verb control
    ALSA: usb-audio: Fix memory leak in FTU quirk
    ALSA: hda_intel: Add DeviceIDs for Sunrise Point-LP

    Linus Torvalds
     
  • Pull SELinux fixlet from James Morris:
    "WARN_ONCE() here will unnecessarily terrify users"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
    selinux: convert WARN_ONCE() to printk() in selinux_nlmsg_perm()

    Linus Torvalds
     
  • Pull audit fixes from Paul Moore:
    "After he sent the initial audit pull request for 3.18, Eric asked me
    to take over the management of the audit tree, hence this pull request
    to fix a couple of problems with audit.

    As you can see below, the changes are minimal: adding some whitespace
    to a string so userspace parses it correctly, and fixing a problem
    with audit's usage of fsnotify that was causing audit watch rules to
    be lost. Neither of these patches was very controversial on the
    mailing lists and they fix real problems, so getting them into 3.18
    would be a good thing"

    * 'stable-3.18' of git://git.infradead.org/users/pcmoore/audit:
    audit: keep inode pinned
    audit: AUDIT_FEATURE_CHANGE message format missing delimiting space

    Linus Torvalds