01 Oct, 2013

24 commits

  • This fixes a race in both msgrcv() and msgsnd() between finding the msg
    and actually dealing with the queue, as another thread can delete msqid
    underneath us if we are preempted before acquiring the
    kern_ipc_perm.lock.

    Manfred illustrates this nicely:

    Assume a preemptible kernel that is preempted just after

    msq = msq_obtain_object_check(ns, msqid)

    in do_msgrcv(). The only lock that is held is rcu_read_lock().

    Now the other thread processes IPC_RMID. When the first task is
    resumed, then it will happily wait for messages on a deleted queue.

    Fix this by checking whether the queue has been deleted after taking
    the lock.

    Signed-off-by: Davidlohr Bueso
    Reported-by: Manfred Spraul
    Cc: Rik van Riel
    Cc: Mike Galbraith
    Cc: [3.11]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
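
    The check-after-lock pattern described above can be sketched in user
    space. This is only an illustration under stated assumptions: struct
    toy_queue and the toy_* helpers are hypothetical names, the lock is a
    C11 atomic_flag standing in for kern_ipc_perm.lock, and none of this
    is the kernel's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a message queue, not the kernel structure. */
struct toy_queue {
    atomic_flag lock;  /* plays the role of kern_ipc_perm.lock */
    bool deleted;      /* set under the lock by the removal path */
    int nmsgs;         /* number of queued messages */
};

static void toy_lock(struct toy_queue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Removal path (IPC_RMID in the text): mark the queue dead under the lock. */
static void toy_remove(struct toy_queue *q)
{
    toy_lock(q);
    q->deleted = true;
    toy_unlock(q);
}

/*
 * Receive path. The caller found the queue earlier without holding the
 * lock, so the queue may have been removed in the meantime.
 */
static int toy_receive(struct toy_queue *q)
{
    int ret;

    assert(q != NULL);
    toy_lock(q);
    if (q->deleted) {
        ret = -EIDRM;          /* re-check after taking the lock */
    } else if (q->nmsgs == 0) {
        ret = -ENOMSG;
    } else {
        q->nmsgs--;
        ret = 0;
    }
    toy_unlock(q);
    return ret;
}
```

    Without the deleted check, a receiver resumed after toy_remove() would
    go on to use the dead queue, which is the situation Manfred describes.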
     
  • In commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process out of the
    spinlock section"), the update of semaphore's sem_otime(last semop time)
    was moved to one central position (do_smart_update).

    But since do_smart_update() is only called for operations that modify
    the array, this means that wait-for-zero semops do not update sem_otime
    anymore.

    The fix is simple:
    Non-alter operations must update sem_otime.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Manfred Spraul
    Reported-by: Jia He
    Tested-by: Jia He
    Cc: Davidlohr Bueso
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
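
    A minimal user-space sketch of the rule "non-alter operations must
    update sem_otime". The toy_sem structure and toy_semop() are
    hypothetical; in the kernel the altering path is stamped centrally
    (do_smart_update in the text), which this toy folds into one function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical single semaphore with a "last semop time" stamp. */
struct toy_sem {
    int val;
    time_t otime;
};

/*
 * Try one operation. Returns true if it completed, false if the caller
 * would have to sleep. op == 0 is a wait-for-zero, which never alters val.
 */
static bool toy_semop(struct toy_sem *s, int op, time_t now)
{
    assert(s != NULL);

    if (op == 0) {
        if (s->val != 0)
            return false;
        s->otime = now;   /* the fix: stamp otime on the non-alter path too */
        return true;
    }

    if (s->val + op < 0)
        return false;
    s->val += op;
    s->otime = now;       /* altering path, stamped in one central place */
    return true;
}
```

    Dropping the assignment in the op == 0 branch reproduces the reported
    behaviour: a successful wait-for-zero leaves otime unchanged.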
     
  • hwpoison_inject without hwpoison_filter enabled fails to take a
    reference on the poisoned page. As a result, hwpoison reports that
    the page is still referenced by -1 users, whereas the count should be
    0, apart from the one reference the poison handler holds after a
    successful unmap. This patch fixes it by holding one reference on the
    poisoned page for hwpoison_inject both with and without
    hwpoison_filter enabled.

    Before patch:

    [ 71.902112] Injecting memory failure at pfn 224706
    [ 71.902137] MCE 0x224706: dirty LRU page recovery: Failed
    [ 71.902138] MCE 0x224706: dirty LRU page still referenced by -1 users

    After patch:

    [ 94.710860] Injecting memory failure at pfn 215b68
    [ 94.710885] MCE 0x215b68: dirty LRU page recovery: Recovered

    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • If the page is poisoned by software injection with the
    MF_COUNT_INCREASED flag set, recovery of a free buddy page is
    misreported as a 2nd attempt at page recovery.

    This patch fixes it by reporting the first attempt at free buddy
    page recovery if MF_COUNT_INCREASED is set.

    Before patch:

    [ 346.332041] Injecting memory failure at pfn 200010
    [ 346.332189] MCE 0x200010: free buddy, 2nd try page recovery: Delayed

    After patch:

    [ 297.742600] Injecting memory failure at pfn 200010
    [ 297.742941] MCE 0x200010: free buddy page recovery: Delayed

    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • PageTransHuge() can't guarantee the page is a transparent huge page
    since it returns true for both transparent huge and hugetlbfs pages.

    This patch fixes it by also checking that the page is not a hugetlbfs
    page.

    Before patch:

    [ 121.571128] Injecting memory failure at pfn 23a200
    [ 121.571141] MCE 0x23a200: huge page recovery: Delayed
    [ 140.355100] MCE: Memory failure is now running on 0x23a200

    After patch:

    [ 94.290793] Injecting memory failure at pfn 23a000
    [ 94.290800] MCE 0x23a000: huge page recovery: Delayed
    [ 105.722303] MCE: Software-unpoisoned page 0x23a000

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
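
    The combined test can be illustrated with a toy model. The struct and
    the toy_* predicates below are hypothetical; they only mirror the
    behaviour stated in the message, namely that the PageTransHuge()-style
    test is true for both kinds of huge page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page descriptor with just the two properties we need. */
struct toy_page {
    bool compound;   /* part of a huge (compound) page */
    bool hugetlbfs;  /* the huge page belongs to hugetlbfs */
};

/* True for transparent huge AND hugetlbfs pages, as the text describes. */
static bool toy_page_trans_huge(const struct toy_page *p)
{
    assert(p != NULL);
    return p->compound;
}

/* True only for hugetlbfs pages. */
static bool toy_page_huge(const struct toy_page *p)
{
    assert(p != NULL);
    return p->compound && p->hugetlbfs;
}

/* The stricter test: huge, and not a hugetlbfs page. */
static bool toy_is_transparent_huge(const struct toy_page *p)
{
    return toy_page_trans_huge(p) && !toy_page_huge(p);
}
```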
     
  • madvise_hwpoison does not check whether the page is a small page or a
    huge page and unconditionally traverses the range in small page
    granularity, which results in a printk flood of "MCE xxx: already
    hardware poisoned" if the page is a huge page.

    This patch fixes it by using compound_order(compound_head(page)) to
    step the iterator over huge pages.

    Testcase:

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    #define PAGES_TO_TEST 3
    /* huge page size assumed by the test: 4096 * 512 bytes */
    #define PAGE_SIZE (4096 * 512)

    int main(void)
    {
        char *mem;

        /* anonymous mappings take fd -1; Linux ignores it either way */
        mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (mem == MAP_FAILED)
            return -1;

        if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
            return -1;

        munmap(mem, PAGES_TO_TEST * PAGE_SIZE);

        return 0;
    }

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
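
    The effect of the iterator change can be worked through with plain
    arithmetic. In this sketch the toy_* names and the 4096-byte base page
    are assumptions, and the compound order is passed in as a parameter,
    whereas the kernel looks it up per page.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL  /* assumed base page size */

/* Old behaviour: one injection attempt per base page in the range. */
static size_t toy_steps_small(unsigned long start, unsigned long end)
{
    size_t calls = 0;

    for (; start < end; start += TOY_PAGE_SIZE)
        calls++;
    return calls;
}

/* New behaviour: advance by the size of the whole compound page. */
static size_t toy_steps_compound(unsigned long start, unsigned long end,
                                 unsigned int order)
{
    size_t calls = 0;

    assert(order < 20);  /* keep the shift well inside unsigned long */
    for (; start < end; start += TOY_PAGE_SIZE << order)
        calls++;
    return calls;
}
```

    For the testcase above (three huge pages of 4096 * 512 bytes, order 9)
    the first loop makes 1536 attempts, of which all but three hit an
    already poisoned huge page, while the second makes exactly three.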
     
  • Recently commit bab55417b10c ("block: support embedded device command
    line partition") introduced CONFIG_CMDLINE_PARSER. However, that name
    is too generic and sounds like it enables/disables generic kernel boot
    arg processing, when it really is block specific.

    Before this option becomes a part of a full/final release, add the BLK_
    prefix to it so that it is clear in absence of any other context that it
    is block specific.

    In addition, fix up the following less critical items:
    - help text was not really at all helpful.
    - index file for Documentation was not updated
    - add the new arg to Documentation/kernel-parameters.txt
    - clarify wording in source comments

    Signed-off-by: Paul Gortmaker
    Cc: Jens Axboe
    Cc: Cai Zhiyong
    Cc: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     
  • The function __munlock_pagevec_fill() introduced in commit 7a8010cd3627
    ("mm: munlock: manual pte walk in fast path instead of
    follow_page_mask()") uses pmd_addr_end() for restricting its operation
    within current page table.

    This is insufficient on architectures/configurations where pmd is folded
    and pmd_addr_end() just returns the end of the full range to be walked.
    In this case, it allows pte++ to walk off the end of a page table
    resulting in unpredictable behaviour.

    This patch fixes the function by using pgd_addr_end() and pud_addr_end()
    before pmd_addr_end(), which will yield correct page table boundary on
    all configurations. This is similar to what existing page walkers do
    when walking each level of the page table.

    Additionally, the patch clarifies a comment for get_locked_pte() call in
    the function.

    Signed-off-by: Vlastimil Babka
    Reported-by: Fengguang Wu
    Reviewed-by: Bob Liu
    Cc: Jörn Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
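
    A worked example of why clamping at every level matters. The helper
    below follows the usual shape of the *_addr_end() boundary
    computation; the toy_* names and the two-level layout (4 MiB per
    top-level entry, pud and pmd folded) are assumptions made for this
    illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp end to the next size-aligned boundary above addr. */
static unsigned long toy_addr_end(unsigned long addr, unsigned long end,
                                  unsigned long size)
{
    unsigned long boundary = (addr + size) & ~(size - 1);

    assert(size != 0 && (size & (size - 1)) == 0);  /* power of two */
    /* comparing "- 1" values stays correct if boundary wraps to 0 */
    return (boundary - 1 < end - 1) ? boundary : end;
}

/* A folded level has nothing to clamp and simply returns end. */
static unsigned long toy_folded_addr_end(unsigned long addr,
                                         unsigned long end)
{
    (void)addr;
    return end;
}

#define TOY_PGDIR_SIZE (1UL << 22)  /* assumed: 4 MiB per top-level entry */

/* Clamp at every level, top down, as the fix does. */
static unsigned long toy_page_table_end(unsigned long addr,
                                        unsigned long end)
{
    end = toy_addr_end(addr, end, TOY_PGDIR_SIZE);  /* pgd level */
    end = toy_folded_addr_end(addr, end);           /* pud level, folded */
    end = toy_folded_addr_end(addr, end);           /* pmd level, folded */
    return end;
}
```

    With addr = 0x3ff000 and end = 0x800000, the folded pmd helper alone
    returns 0x800000, so a pte walk would run past the page table that
    ends at 0x400000. Clamping from the top level down yields 0x400000.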
     
  • Isolated balloon pages can wrongly end up in LRU lists when
    migrate_pages() finishes its round without draining all the isolated
    page list.

    The same issue can happen when reclaim_clean_pages_from_list() tries to
    reclaim pages from an isolated page list, before migration, in the CMA
    path. Such balloon page leak opens a race window against LRU lists
    shrinkers that leads us to the following kernel panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: [] shrink_page_list+0x24e/0x897
    PGD 3cda2067 PUD 3d713067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    RIP: shrink_page_list+0x24e/0x897
    RSP: 0000:ffff88003da499b8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
    RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
    RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
    R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
    R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
    FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
    Call Trace:
    shrink_inactive_list+0x240/0x3de
    shrink_lruvec+0x3e0/0x566
    __shrink_zone+0x94/0x178
    shrink_zone+0x3a/0x82
    balance_pgdat+0x32a/0x4c2
    kswapd+0x2f0/0x372
    kthread+0xa2/0xaa
    ret_from_fork+0x7c/0xb0
    Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
    RIP [] shrink_page_list+0x24e/0x897
    RSP
    CR2: 0000000000000028
    ---[ end trace 703d2451af6ffbfd ]---
    Kernel panic - not syncing: Fatal exception

    This patch fixes the issue by ensuring that the proper tests are made in
    putback_movable_pages() and reclaim_clean_pages_from_list(), so that
    isolated balloon pages are not wrongly reinserted in LRU lists.

    [akpm@linux-foundation.org: clarify awkward comment text]
    Signed-off-by: Rafael Aquini
    Reported-by: Luiz Capitulino
    Tested-by: Luiz Capitulino
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • The FAULT_FLAG_WRITE flag was being set based on an uninitialized
    variable.

    Fixes a regression added by commit 759496ba6407 ("arch: mm: pass
    userspace fault flag to generic fault handler")

    Signed-off-by: Felipe Pena
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Felipe Pena
     
  • patch(1) can't handle zero-length files - it appears to simply not create
    the file, so my powerpc build fails.

    Put something in here to make life easier.

    Cc: Hugh Dickins
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Many NILFS2 users have reported strange file system corruption
    (for example):

    NILFS: bad btree node (blocknr=185027): level = 0, flags = 0x0, nchildren = 768
    NILFS error (device sda4): nilfs_bmap_last_key: broken bmap (inode number=11540)

    But such error messages are the consequence of a file system issue that
    takes place much earlier. Fortunately, Jerome Poulin and Anton Eliasson
    reported another issue some time ago. Their reports describe a crash
    of the segctor thread:

    BUG: unable to handle kernel paging request at 0000000000004c83
    IP: nilfs_end_page_io+0x12/0xd0 [nilfs2]

    Call Trace:
    nilfs_segctor_do_construct+0xf25/0x1b20 [nilfs2]
    nilfs_segctor_construct+0x17b/0x290 [nilfs2]
    nilfs_segctor_thread+0x122/0x3b0 [nilfs2]
    kthread+0xc0/0xd0
    ret_from_fork+0x7c/0xb0

    These two issues have a single cause, which can also trigger a third
    issue: the segctor thread hangs while consuming 100% CPU.

    REPRODUCING PATH:

    One possible way of reproducing the issue was described by
    Jerome Poulin:

    1. init S to get to single user mode.
    2. sysrq+E to make sure only my shell is running
    3. start network-manager to get my wifi connection up
    4. login as root and launch "screen"
    5. cd /boot/log/nilfs which is a ext3 mount point and can log when NILFS dies.
    6. lscp | xz -9e > lscp.txt.xz
    7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs
    8. start a screen to dump /proc/kmsg to text file since rsyslog is killed
    9. start a screen and launch strace -f -o find-cat.log -t find
    /mnt/nilfs -type f -exec cat {} > /dev/null \;
    10. start a screen and launch strace -f -o apt-get.log -t apt-get update
    11. launch the last command again as it did not crash the first time
    12. apt-get crashes
    13. ps aux > ps-aux-crashed.log
    14. sysrq+W
    15. sysrq+E wait for everything to terminate
    16. sysrq+SUSB

    A simplified way of reproducing the issue is to start a kernel
    compilation task and "apt-get update" in parallel.

    REPRODUCIBILITY:

    The issue does not reproduce reliably [60% - 80%]. It is very
    important to have a proper environment for reproducing it. The
    critical conditions for successful reproduction are:

    (1) There should be a big file modified by way of mmap().

    (2) From time to time during processing, this file should have a
    count of dirty blocks greater than several segments in size (for
    example, two or three).

    (3) There should be intensive background file modification activity
    in another thread.

    INVESTIGATION:

    First of all, it is possible to see that the reason for the crash is an
    invalid page address:

    NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
    NILFS [nilfs_segctor_complete_write]:2101 segbuf->sb_segnum 6783

    Moreover, the value of b_page (0x1a82) is 6786, which looks like a
    segment number, and the b_blocknr and b_size values look like block
    numbers. So the buffer_head pointer points to an improper address.

    Detailed investigation of the issue revealed the following picture:

    [-----------------------------SEGMENT 6783-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111149024, segbuf->sb_segnum 6783

    [-----------------------------SEGMENT 6784-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff8802174a6798, bh->b_assoc_buffers.prev ffff880221cffee8
    NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6784
    NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111150080, segbuf->sb_segnum 6784, segbuf->sb_nbio 0
    [----------] ditto
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111164416, segbuf->sb_segnum 6784, segbuf->sb_nbio 15

    [-----------------------------SEGMENT 6785-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff880219277e80, bh->b_assoc_buffers.prev ffff880221cffc88
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6785
    NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111165440, segbuf->sb_segnum 6785, segbuf->sb_nbio 0
    [----------] ditto
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111177728, segbuf->sb_segnum 6785, segbuf->sb_nbio 12

    NILFS [nilfs_segctor_do_construct]:2399 nilfs_segctor_wait
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6783
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6784
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6785

    NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82

    BUG: unable to handle kernel paging request at 0000000000001a82
    IP: [] nilfs_end_page_io+0x12/0xd0 [nilfs2]

    Usually, for every segment we collect dirty files in a list. Then, dirty
    blocks are gathered for every dirty file, prepared for write and
    submitted by means of a nilfs_segbuf_submit_bh() call. Finally, the
    complete write phase takes place after nilfs_end_bio_write() is called
    on the block layer. In this final phase, buffers/pages are marked as
    not dirty and processed files are removed from the list of dirty files.

    It is possible to see that we had three prepare_write and submit_bio
    phases before the segbuf_wait and complete_write phases. Moreover,
    segments compete with each other for dirty blocks because on every
    iteration of segment processing dirty buffer_heads are added to several
    lists of payload_buffers:

    [SEGMENT 6784]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
    [SEGMENT 6785]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8

    The next pointer is the same but the prev pointer has changed. It means
    that the buffer_head has a next pointer from one list but a prev pointer
    from another. Such modification can be made several times. And, finally,
    it can result in various issues: (1) segctor hanging, (2) segctor
    crashing, (3) file system metadata corruption.

    FIX:
    This patch adds:

    (1) setting of BH_Async_Write flag in nilfs_segctor_prepare_write()
    for every processed dirty block;

    (2) checking of BH_Async_Write flag in
    nilfs_lookup_dirty_data_buffers() and
    nilfs_lookup_dirty_node_buffers();

    (3) clearing of BH_Async_Write flag in nilfs_segctor_complete_write(),
    nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page().

    Reported-by: Jerome Poulin
    Reported-by: Anton Eliasson
    Cc: Paul Fertser
    Cc: ARAI Shun-ichi
    Cc: Piotr Szymaniak
    Cc: Juan Barry Manuel Canham
    Cc: Zahid Chowdhury
    Cc: Elmer Zhang
    Cc: Kenneth Langga
    Signed-off-by: Vyacheslav Dubeyko
    Acked-by: Ryusuke Konishi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
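
    The three steps listed under FIX can be sketched as a small state
    machine. The toy_bh structure and the toy_* functions are hypothetical
    user-space stand-ins; async_write plays the role of BH_Async_Write,
    and the array scan replaces the real buffer lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer with only the two state bits that matter here. */
struct toy_bh {
    bool dirty;
    bool async_write;  /* already claimed by a segment being written */
};

/* Step (2): collect dirty buffers, skipping those already claimed. */
static size_t toy_lookup_dirty(struct toy_bh *bufs, size_t n,
                               struct toy_bh **out)
{
    size_t found = 0;

    assert(bufs != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!bufs[i].dirty || bufs[i].async_write)
            continue;
        out[found++] = &bufs[i];
    }
    return found;
}

/* Step (1): claim every buffer that goes into this segment. */
static void toy_prepare_write(struct toy_bh **list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        list[i]->async_write = true;
}

/* Step (3): the write finished, so release the claim and clean the buffer. */
static void toy_complete_write(struct toy_bh **list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        list[i]->async_write = false;
        list[i]->dirty = false;
    }
}
```

    With the flag in place, a second segment constructed before the first
    one completes finds no buffers to collect, so a buffer can no longer
    sit on two payload lists at once.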
     
  • Han Pingtian found a typo in Documentation/kernel-parameters.txt about
    "kernelcore=", that "kernelcore" should be replaced with "Movable" here.

    Signed-off-by: Weiping Pan
    Acked-by: Mel Gorman
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weiping Pan
     
  • The "force" parameter in __blk_queue_bounce was being ignored, which
    means that stable page snapshots are not always happening (on ext3).
    This of course leads to DIF disks reporting checksum errors, so fix this
    regression.

    The regression was introduced in commit 6bc454d15004 ("bounce: Refactor
    __blk_queue_bounce to not use bi_io_vec")

    Reported-by: Mel Gorman
    Signed-off-by: Darrick J. Wong
    Cc: Kent Overstreet
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • If /proc/sys/kernel/core_pattern contains only "|", a NULL pointer
    dereference happens upon core dump because argv_split("") returns
    argv[0] == NULL.

    This bug was once fixed by commit 264b83c07a84 ("usermodehelper: check
    subprocess_info->path != NULL") but was mistakenly reintroduced by commit
    7f57cfa4e2aa ("usermodehelper: kill the sub_info->path[0] check").

    This bug seems to have existed since 2.6.19 (the version in which core
    dump to pipe was added). Depending on kernel version and config, some
    side effect might happen immediately after this oops (e.g. kernel panic
    with 2.6.32-358.18.1.el6).

    Signed-off-by: Tetsuo Handa
    Acked-by: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
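
    The failure mode can be modelled in user space. toy_first_arg() is a
    hypothetical helper that mimics only the property quoted above (an
    empty command line yields no first argument), and toy_exec_helper()
    shows the kind of NULL guard the message refers to. Neither is the
    kernel's implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>

/*
 * Return the start of the first word in the helper command line (the
 * part of core_pattern after the '|'), or NULL if there is none.
 */
static const char *toy_first_arg(const char *cmdline)
{
    assert(cmdline != NULL);
    while (isspace((unsigned char)*cmdline))
        cmdline++;
    return *cmdline != '\0' ? cmdline : NULL;
}

/* Refuse to run a helper that has no path instead of dereferencing it. */
static int toy_exec_helper(const char *path)
{
    if (path == NULL)
        return -EINVAL;
    /* a real implementation would start the helper here */
    return 0;
}
```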
     
  • The proc interface is not aware of sem_lock(), it instead calls
    ipc_lock_object() directly. This means that simple semop() operations
    can run in parallel with the proc interface. Right now, this is not
    critical, because the implementation doesn't do anything that requires
    proper synchronization.

    But it is dangerous and therefore should be fixed.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Operations that need access to the whole array must guarantee that there
    are no simple operations ongoing. Right now this is achieved by
    spin_unlock_wait(sem->lock) on all semaphores.

    If complex_count is nonzero, then this spin_unlock_wait() is not
    necessary, because it was already performed in the past by the thread
    that increased complex_count and even though sem_perm.lock was dropped
    in between, no simple operation could have started, because simple
    operations cannot start when complex_count is non-zero.

    Signed-off-by: Manfred Spraul
    Cc: Mike Galbraith
    Cc: Rik van Riel
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The exclusion of complex operations in sem_lock() is insufficient: after
    acquiring the per-semaphore lock, a simple op must first check that
    sem_perm.lock is not locked and only after that test check
    complex_count. The current code does it the other way around - and that
    creates a race. Details are below.

    The patch is a complete rewrite of sem_lock(), based in part on the code
    from Mike Galbraith. It removes all gotos and all loops and thus the
    risk of livelocks.

    I have tested the patch (together with the next one) on my i3 laptop and
    it didn't cause any problems.

    The bug is probably also present in 3.10 and 3.11, but for these kernels
    it might be simpler just to move the test of sma->complex_count after
    the spin_is_locked() test.

    Details of the bug:

    Assume:
    - sma->complex_count = 0.
    - Thread 1: semtimedop(complex op that must sleep)
    - Thread 2: semtimedop(simple op).

    Pseudo-Trace:

    Thread 1: sem_lock(): acquire sem_perm.lock
    Thread 1: sem_lock(): check for ongoing simple ops
    Nothing ongoing, thread 2 is still before sem_lock().
    Thread 1: try_atomic_semop()
    <<< preempted.

    Thread 2: sem_lock():
    static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
    int nsops)
    {
    int locknum;
    again:
    if (nsops == 1 && !sma->complex_count) {
    struct sem *sem = sma->sem_base + sops->sem_num;

    /* Lock just the semaphore we are interested in. */
    spin_lock(&sem->lock);

    /*
    * If sma->complex_count was set while we were spinning,
    * we may need to look at things we did not lock here.
    */
    if (unlikely(sma->complex_count)) {
    spin_unlock(&sem->lock);
    goto lock_array;
    }
    <<<<<<<<<
    <<< complex_count is still 0.
    <<<
    <<< Here it is preempted
    <<<<<<<<<

    Thread 1: try_atomic_semop() returns, notices that it must sleep.
    Thread 1: increases sma->complex_count.
    Thread 1: drops sem_perm.lock
    Thread 2:
    /*
    * Another process is holding the global lock on the
    * sem_array; we cannot enter our critical section,
    * but have to wait for the global lock to be released.
    */
    if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
    spin_unlock(&sem->lock);
    spin_unlock_wait(&sma->sem_perm.lock);
    goto again;
    }
    <<< sem_perm.lock already dropped, thus no "goto again;"

    locknum = sops->sem_num;

    Signed-off-by: Manfred Spraul
    Cc: Mike Galbraith
    Cc: Rik van Riel
    Cc: Davidlohr Bueso
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
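
    The corrected ordering for a simple operation (take the per-semaphore
    lock, then check that the array-wide lock is free, and only then check
    complex_count) can be sketched with C11 atomics. All toy_* names are
    hypothetical, and the slow path is simplified: it only takes the
    array-wide lock and does not wait for per-semaphore holders as a real
    implementation would have to.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NSEMS 4
#define TOY_GLOBAL_LOCK (-1)

struct toy_sem_array {
    atomic_bool global_lock;          /* plays the role of sem_perm.lock */
    atomic_bool sem_lock[TOY_NSEMS];  /* per-semaphore locks */
    atomic_int complex_count;
};

static void toy_sem_array_init(struct toy_sem_array *sma)
{
    atomic_init(&sma->global_lock, false);
    for (size_t i = 0; i < TOY_NSEMS; i++)
        atomic_init(&sma->sem_lock[i], false);
    atomic_init(&sma->complex_count, 0);
}

static void toy_spin_lock(atomic_bool *l)
{
    while (atomic_exchange(l, true))
        ; /* spin until we flipped it from false to true */
}

static void toy_spin_unlock(atomic_bool *l)
{
    atomic_store(l, false);
}

/* Returns the per-semaphore lock index taken, or TOY_GLOBAL_LOCK. */
static int toy_sem_lock(struct toy_sem_array *sma, int sem_num, int nsops)
{
    assert(sem_num >= 0 && sem_num < TOY_NSEMS);

    if (nsops == 1 && atomic_load(&sma->complex_count) == 0) {
        toy_spin_lock(&sma->sem_lock[sem_num]);
        /* first the array-wide lock, only afterwards complex_count */
        if (!atomic_load(&sma->global_lock) &&
            atomic_load(&sma->complex_count) == 0)
            return sem_num;
        toy_spin_unlock(&sma->sem_lock[sem_num]);
    }

    toy_spin_lock(&sma->global_lock);
    return TOY_GLOBAL_LOCK;
}

static void toy_sem_unlock(struct toy_sem_array *sma, int locknum)
{
    if (locknum == TOY_GLOBAL_LOCK)
        toy_spin_unlock(&sma->global_lock);
    else
        toy_spin_unlock(&sma->sem_lock[locknum]);
}
```

    In the pseudo-trace above, Thread 2 would read complex_count only
    after seeing sem_perm.lock free, so it would observe the increment
    that Thread 1 made before dropping that lock.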
     
  • We've been getting warnings about an excessive amount of time spent
    allocating pages for migration during memory compaction without
    scheduling. isolate_freepages_block() already periodically checks for
    contended locks or the need to schedule, but isolate_freepages() never
    does.

    When a zone is massively long and no suitable targets can be found, this
    iteration can be quite expensive without ever doing cond_resched().

    Check periodically for the need to reschedule while the compaction free
    scanner iterates.

    Signed-off-by: David Rientjes
    Reviewed-by: Rik van Riel
    Reviewed-by: Wanpeng Li
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • A high setting of max_map_count, and a process core-dumping with a large
    enough vm_map_count could result in an NT_FILE note not being written,
    and the kernel crashing immediately later because it has assumed
    otherwise.

    Reproduction of the oops-causing bug described here:

    https://lkml.org/lkml/2013/8/30/50

    The issue originated in commit 2aa362c49c31 ("coredump: extend core dump
    note section to contain file names of mapped file") from Oct 4, 2012.

    This patch makes that section optional in that case. fill_files_note()
    should signify the error, and also let the info struct in
    elf_core_dump() be zero-initialized so that we can check for the
    optionally written note.

    [akpm@linux-foundation.org: avoid abusing E2BIG, remove a couple of not-really-needed local variables]
    [akpm@linux-foundation.org: fix sparse warning]
    Signed-off-by: Dan Aloni
    Cc: Al Viro
    Cc: Denys Vlasenko
    Reported-by: Martin MOKREJS
    Tested-by: Martin MOKREJS
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Aloni
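
    The "optional note" idea can be sketched as follows. Everything here
    is a hypothetical user-space model: the toy_* names, the size limit
    and the error codes are assumptions chosen for illustration, not the
    values used by the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_NOTE_SIZE 4096  /* assumed upper bound for this sketch */

struct toy_note {
    void *data;   /* NULL means "note was not filled" */
    size_t size;
};

struct toy_info {
    struct toy_note files;  /* the optional note */
};

/* Fill the note, or report an error and leave it untouched. */
static int toy_fill_files_note(struct toy_note *note, size_t wanted)
{
    assert(note != NULL);
    if (wanted == 0 || wanted > TOY_MAX_NOTE_SIZE)
        return -EINVAL;

    note->data = calloc(1, wanted);
    if (note->data == NULL)
        return -ENOMEM;
    note->size = wanted;
    return 0;
}

/* Account for the note only if it was actually filled. */
static size_t toy_write_notes(const struct toy_info *info)
{
    size_t written = 0;

    if (info->files.data != NULL)
        written += info->files.size;
    return written;
}
```

    Because the info structure starts out zeroed, a failed fill leaves
    data as NULL and the writer can skip the note instead of assuming it
    exists.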
     
  • This reverts commit cea27eb2a202 ("mm/memory-hotplug: fix lowmem count
    overflow when offline pages").

    The bug fixed by commit cea27eb2a202 was fixed in another way by commit
    3dcc0571cd64 ("mm: correctly update zone->managed_pages"). That commit
    enhances memory_hotplug.c to adjust totalhigh_pages when hot-removing
    memory, for details please refer to:

    http://marc.info/?l=linux-mm&m=136957578620221&w=2

    As a result, commit cea27eb2a202 currently causes duplicated decreasing
    of totalhigh_pages, thus the revert.

    Signed-off-by: Joonyoung Shim
    Reviewed-by: Wanpeng Li
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Bartlomiej Zolnierkiewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonyoung Shim
     
  • Pull AVR32 fixes from Hans-Christian Egtvedt.

    Fix build warnings and use the Kbuild infrastructure for generic headers
    rather than doing it by hand.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32:
    avr32: cast syscall_return to silence compiler warning
    avr32: fix clockevents kernel warning
    avr32: use Kbuild infrastructure to handle the asm-generic headers

    Linus Torvalds
     
  • Pull S+core fixes from Lennox Wu:
    "These updates include updating information of maintainers, fix some
    trivial errors, and add a necessary function for supporting ipv6"

    * tag 'for-linus-20130929' of git://github.com/sctscore/official-linux:
    Score: Update the information of Score maintaners
    Score: Modify the Makefile of Score, remove -mlong-calls for compiling
    Score: Implement the function csum_ipv6_magic
    Score: The commit is for compiling successfully

    Linus Torvalds
     
  • Pull ARC Fixes from Vineet Gupta:
    - Handle unaligned access in zero delay loops
    - spinlock livelock fix for SMP systemC model
    - fix 32bit overflow in access_ok
    - better setup of clockevents

    * tag 'arc-fixes-for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
    ARC: Use clockevents_config_and_register over clockevents_register_device
    ARC: Workaround spinlock livelock in SMP SystemC simulation
    ARC: Fix 32-bit wrap around in access_ok()
    ARC: Handle zero-overhead-loop in unaligned access handler

    Linus Torvalds
     

30 Sep, 2013

11 commits

  • The patch fixes the following compiler warning:
    CC arch/avr32/kernel/process.o
    arch/avr32/kernel/process.c: In function 'copy_thread':
    arch/avr32/kernel/process.c:292: warning: assignment makes integer \
    from pointer without a cast

    Signed-off-by: Gabor Juhos
    Acked-by: Hans-Christian Egtvedt

    Gabor Juhos
     
  • Since commit 01426478df3a8791ff5c8b6b82d409e699cfaf38
    (avr32: Use generic idle loop) the kernel throws the
    following warning on avr32:

    WARNING: at 900322e4 [verbose debug info unavailable]
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.0-rc2 #117
    task: 901c3ecc ti: 901c0000 task.ti: 901c0000
    PC is at cpu_idle_poll_ctrl+0x1c/0x38
    LR is at comparator_mode+0x3e/0x40
    pc : [] lr : [] Not tainted
    sp : 901c1f74 r12: 00000000 r11: 901c74a0
    r10: 901d2510 r9 : 00000001 r8 : 901db4de
    r7 : 901c74a0 r6 : 00000001 r5 : 00410020 r4 : 901db574
    r3 : 00410024 r2 : 90206fe0 r1 : 00000000 r0 : 007f0000
    Flags: qvnzc
    Mode bits: hjmde....G
    CPU Mode: Supervisor
    Call trace:
    [] clockevents_set_mode+0x16/0x2e
    [] clockevents_shutdown+0xa/0x1e
    [] clockevents_exchange_device+0x58/0x70
    [] tick_check_new_device+0x38/0x54
    [] clockevents_register_device+0x32/0x90
    [] time_init+0xa8/0x108
    [] start_kernel+0x128/0x23c

    When the 'avr32_comparator' clockevent device is registered,
    the clockevent core sets the mode of that clockevent device
    to CLOCK_EVT_MODE_SHUTDOWN. Due to this, the 'comparator_mode'
    function calls 'cpu_idle_poll_ctrl' to disable idle poll.
    This results in the aforementioned warning because the polling
    is not enabled yet.

    Change the code to only disable idle poll if it is enabled by
    the same function to avoid the warning.

    Cc: stable@vger.kernel.org
    Signed-off-by: Gabor Juhos
    Acked-by: Hans-Christian Egtvedt

    Gabor Juhos
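
    The "only disable what this function enabled" rule can be modelled
    with a counter and a flag. The toy_* names are hypothetical, and the
    assert stands in for the kernel warning shown above. Whether the real
    driver also guards the enable side is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_mode { TOY_MODE_SHUTDOWN, TOY_MODE_ONESHOT };

static int toy_poll_depth;           /* must never drop below zero */
static bool toy_poll_enabled_by_us;  /* did this function enable polling? */

/* Counterpart of an enable/disable control with an underflow check. */
static void toy_idle_poll_ctrl(bool enable)
{
    if (enable) {
        toy_poll_depth++;
    } else {
        toy_poll_depth--;
        assert(toy_poll_depth >= 0);  /* the kernel would warn here */
    }
}

/* Mode callback: disable idle poll only if we enabled it earlier. */
static void toy_comparator_mode(enum toy_mode mode)
{
    switch (mode) {
    case TOY_MODE_ONESHOT:
        if (!toy_poll_enabled_by_us) {
            toy_idle_poll_ctrl(true);
            toy_poll_enabled_by_us = true;
        }
        break;
    case TOY_MODE_SHUTDOWN:
        if (toy_poll_enabled_by_us) {
            toy_idle_poll_ctrl(false);
            toy_poll_enabled_by_us = false;
        }
        break;
    }
}
```

    An initial shutdown call, as happens when the device is registered,
    is then a no-op rather than an unbalanced disable.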
     
  • Use kbuild to add asm-generic headers that do nothing, also remove the arch
    specific wrapper headers.

    This only affects headers that do nothing but include the generic
    equivalent. It does not touch any header that does a little more.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Hans-Christian Egtvedt

    Steven Rostedt
     
  • Linus Torvalds
     
  • Pull USB fixes from Greg KH:
    "Here are a number of USB driver fixes for 3.12-rc3.

    These are all for host controller issues that have been reported, and
    there's a fix for an annoying error message that gets printed every
    time you remove a USB 3 device from the system that's been bugging me
    for a while"

    * tag 'usb-3.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
    usb: dwc3: add support for Merrifield
    USB: fsl/ehci: fix failure of checking PHY_CLK_VALID during reinitialization
    USB: Fix breakage in ffs_fs_mount()
    fsl/usb: Resolve PHY_CLK_VLD instability issue for ULPI phy
    usb/core/devio.c: Don't reject control message to endpoint with wrong direction bit
    usb: chipidea: USB_CHIPIDEA should depend on HAS_DMA
    usb: chipidea: udc: free pending TD at removal procedure
    usb: chipidea: imx: Add usb_phy_shutdown at probe's error path
    usb: chipidea: Fix memleak for ci->hw_bank.regmap when removal
    usb: chipidea: udc: fix the oops after rmmod gadget
    USB: fix PM config symbol in uhci-hcd, ehci-hcd, and xhci-hcd
    USB: OHCI: accept very late isochronous URBs
    USB: UHCI: accept very late isochronous URBs
    USB: iMX21: accept very late isochronous URBs
    usbcore: check usb device's state before sending a Set SEL control transfer
    xhci: Fix race between ep halt and URB cancellation
    usb: Fix xHCI host issues on remote wakeup.
    xhci: Ensure a command structure points to the correct trb on the command ring
    xhci: Fix oops happening after address device timeout

    Linus Torvalds
     
  • Pull tty/serial fixes from Greg KH:
    "Here are some serial and tty driver fixes for 3.12-rc3

    The serial driver fixes some kref leaks, documentation is moved to the
    proper places, and the tty and n_tty fixes resolve some reported
    regressions. There is still one outstanding tty regression fix that
    isn't in here yet, as I want to test it out some more, it will be sent
    for 3.12-rc4 if it checks out"

    * tag 'tty-3.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
    tty: ar933x_uart: move devicetree binding documentation
    tty: Fix SIGTTOU not sent with tcflush()
    n_tty: Fix EOF push index when termios changes
    serial: pch_uart: remove unnecessary tty_port_tty_get
    serial: pch_uart: fix tty-kref leak in dma-rx path
    serial: pch_uart: fix tty-kref leak in rx-error path
    serial: tegra: fix tty-kref leak

    Linus Torvalds
     
  • Pull staging fixes from Greg KH:
    "Here are some staging driver fixes, MAINTAINER updates, and a new
    device id. All of these have been in the linux-next tree, and are
    pretty simple patches"

    * tag 'staging-3.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
    staging: r8188eu: Add new device ID
    staging: imx-drm: Fix probe failure
    staging: vt6656: [BUG] iwctl_siwencodeext return if device not open
    staging: vt6656: [BUG] main_usb.c oops on device_close move flag earlier.
    staging: vt6656: rxtx.c [BUG] s_vGetFreeContext dead lock on null apTD.
    Staging: rtl8192u: r819xU_cmdpkt: checking NULL value after doing dev_alloc_skb
    staging: usbip: Orphan usbip
    staging: r8188eu: Add files for new drive: Cocci spatch "noderef"
    staging: r8188eu: Cocci spatch "noderef"
    staging: octeon-usb: Cocci spatch "noderef"
    staging: r8188eu: Add files for new drive: Cocci spatch "noderef"
    MAINTAINERS: staging: dgnc and dgap drivers: add maintainer
    staging: lustre: Cocci spatch "noderef"

    Linus Torvalds
     
  • Pull driver core / sysfs fixes from Greg KH:
    "Here are 2 fixes for 3.12-rc3. One fixes a sysfs problem with
    mounting caused by 3.12-rc1, and the other is a bug reported by the
    chromeos developers with the driver core.

    Both have been in linux-next for a bit"

    * tag 'driver-core-3.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    driver core : Fix use after free of dev->parent in device_shutdown
    sysfs: Allow mounting without CONFIG_NET

    Linus Torvalds
     
  • Pull char/misc driver fixes from Greg KH:
    "Here are some HyperV and MEI driver fixes for 3.12-rc3. They resolve
    some issues that people have been reporting for them"

    * tag 'char-misc-3.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
    Drivers: hv: vmbus: Terminate vmbus version negotiation on timeout
    Drivers: hv: util: Correctly support ws2008R2 and earlier
    mei: cancel stall timers in mei_reset
    mei: bus: stop wait for read during cl state transition
    mei: make me client counters less error prone

    Linus Torvalds
     
  • Pull perf revert from Ingo Molnar:
    "This fixes the 'perf top' regression Markus Trippelsdorf reported"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "perf symbols: Demangle cloned functions"

    Linus Torvalds
     
  • Pull drm fixes from Dave Airlie:
    "Nothing too major, radeon still has some dpm changes for off by
    default.

    Radeon, intel, msm:
    - radeon: a few more dpm fixes (still off by default), uvd fixes
    - i915: runtime warn backtrace and regression fix
    - msm: iommu changes fallout"

    * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (27 commits)
    drm/msm: use drm_gem_dumb_destroy helper
    drm/msm: deal with mach/iommu.h removal
    drm/msm: Remove iommu include from mdp4_kms.c
    drm/msm: Odd PTR_ERR usage
    drm/i915: Fix up usage of SHRINK_STOP
    drm/radeon: fix hdmi audio on DCE3.0/3.1 asics
    drm/i915: preserve pipe A quirk in i9xx_set_pipeconf
    drm/i915/tv: clear adjusted_mode.flags
    drm/i915/dp: increase i2c-over-aux retry interval on AUX DEFER
    drm/radeon/cik: fix overflow in vram fetch
    drm/radeon: add missing hdmi callbacks for rv6xx
    drm/i915: Use a temporary va_list for two-pass string handling
    drm/radeon/uvd: lower msg&fb buffer requirements on UVD3
    drm/radeon: disable tests/benchmarks if accel is disabled
    drm/radeon: don't set default clocks for SI when DPM is disabled
    drm/radeon/dpm/ci: filter clocks based on voltage/clk dep tables
    drm/radeon/dpm/si: filter clocks based on voltage/clk dep tables
    drm/radeon/dpm/ni: filter clocks based on voltage/clk dep tables
    drm/radeon/dpm/btc: filter clocks based on voltage/clk dep tables
    drm/radeon/dpm: fetch the max clk from voltage dep tables helper
    ...

    Linus Torvalds
     

29 Sep, 2013

5 commits

  • This reverts commit de95ab53645a2f0015e0f68ee723f18dce2b8b51.

    Markus Trippelsdorf reported that this commit broke 'perf top':

    > I just see a gray screen with no text at all. Sometimes the
    > following error messages are printed:
    >
    > *** Error in `perf': invalid fastbin entry (free): 0x00000000029b18c0
    > ***
    > *** Error in `perf': malloc(): memory corruption (fast): 0x0000000000ee0b10 ***

    While this code is fixable, the commit itself fails on several levels:

    - it should have been a separate helper function
    - why the heck does it do strchr() twice
    - it casts a const char * over into char *
    - sloppy style
    - it's not even a regression fix!

    So let's revert it and re-try the patch in v3.13.

    Reported-by: Markus Trippelsdorf
    Cc: Andi Kleen
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • A small fix + deal with fallout of iommu changes + use new
    drm_gem_dumb_destroy helper.

    * 'msm-fixes-3.12-rc2' of git://people.freedesktop.org/~robclark/linux:
    drm/msm: use drm_gem_dumb_destroy helper
    drm/msm: deal with mach/iommu.h removal
    drm/msm: Remove iommu include from mdp4_kms.c
    drm/msm: Odd PTR_ERR usage

    Dave Airlie
     
  • …nt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull scheduler, timer and x86 fixes from Ingo Molnar:
    - A context tracking ARM build and functional fix
    - A handful of ARM clocksource/clockevent driver fixes
    - An AMD microcode patch level sysfs reporting fixlet

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    arm: Fix build error with context tracking calls

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
    clocksource: of: Respect device tree node status
    clocksource: exynos_mct: Set IRQ affinity when the CPU goes online
    arm: clocksource: mvebu: Use the main timer as clock source from DT

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/microcode/AMD: Fix patch level reporting for family 15h

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "A couple of tooling fixlets and a PMU detection printout fix"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Fix PMU detection printout when no PMU is detected
    perf symbols: Demangle cloned functions
    perf machine: Fix path unpopulated in machine__create_modules()
    perf tools: Explicitly add libdl dependency
    perf probe: Fix probing symbols with optimization suffix
    perf trace: Add mmap2 handler
    perf kmem: Make it work again on non NUMA machines

    Linus Torvalds
     
  • Pull xfs bugfixes from Ben Myers:
    - fix for directory node collapse regression
    - fix for recovery over stale on disk structures
    - fix for eofblocks ioctl
    - fix asserts in xfs_inode_free
    - lock the ail before removing an item from it

    * tag 'xfs-for-linus-v3.12-rc3' of git://oss.sgi.com/xfs/xfs:
    xfs: fix node forward in xfs_node_toosmall
    xfs: log recovery lsn ordering needs uuid check
    xfs: fix XFS_IOC_FREE_EOFBLOCKS definition
    xfs: asserting lock not held during freeing not valid
    xfs: lock the AIL before removing the buffer item

    Linus Torvalds