04 Feb, 2020

2 commits

  • 'PTR_ERR(p) == -E*' is a stronger condition than IS_ERR(p).
    Hence, IS_ERR(p) is unneeded.

    The semantic patch that generates this commit is as follows:

    //
    @@
    expression ptr;
    constant error_code;
    @@
    -IS_ERR(ptr) && (PTR_ERR(ptr) == - error_code)
    +PTR_ERR(ptr) == - error_code
    //
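
    Why the simplification is safe can be checked with a small user-space
    sketch. The helpers below are simplified stand-ins modeled on
    include/linux/err.h (they assume MAX_ERRNO is 4095 and that error
    pointers occupy the top of the address space); is_enoent_old() and
    is_enoent_new() are hypothetical names used only for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space stand-ins for the helpers in include/linux/err.h. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	/* Error pointers live in the last MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Form removed by the semantic patch (hypothetical helper name). */
static inline bool is_enoent_old(const void *p)
{
	return IS_ERR(p) && PTR_ERR(p) == -ENOENT;
}

/* Form left behind: the comparison alone already implies IS_ERR(p). */
static inline bool is_enoent_new(const void *p)
{
	return PTR_ERR(p) == -ENOENT;
}
```

    Both helpers agree for error pointers, NULL and ordinary pointers,
    because a pointer whose value equals a small negative errno always
    falls in the range that IS_ERR() tests for.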

    Link: http://lkml.kernel.org/r/20200106045833.1725-1-masahiroy@kernel.org
    Signed-off-by: Masahiro Yamada
    Cc: Julia Lawall
    Acked-by: Stephen Boyd [drivers/clk/clk.c]
    Acked-by: Bartosz Golaszewski [GPIO]
    Acked-by: Wolfram Sang [drivers/i2c]
    Acked-by: Rafael J. Wysocki [acpi/scan.c]
    Acked-by: Rob Herring
    Cc: Eric Biggers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Writing to a cloned file triggers a kernel oops, and the user-space
    process issuing the write is killed. The bug can be reproduced
    reliably as follows:

    1) create a file under ocfs2 file system directory.

    journalctl -b > aa.txt

    2) create a cloned file for this file.

    reflink aa.txt bb.txt

    3) write the cloned file with dd command.

    dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc

    The kernel kills the dd command, and the oops message is then visible
    via dmesg:

    [ 463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028
    [ 463.875413] #PF: supervisor read access in kernel mode
    [ 463.875416] #PF: error_code(0x0000) - not-present page
    [ 463.875418] PGD 0 P4D 0
    [ 463.875425] Oops: 0000 [#1] SMP PTI
    [ 463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G OE 5.3.16-2-default
    [ 463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    [ 463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2]
    [ 463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28
    [ 463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202
    [ 463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0
    [ 463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000
    [ 463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000
    [ 463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
    [ 463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048
    [ 463.875526] FS: 00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000
    [ 463.875529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0
    [ 463.875546] Call Trace:
    [ 463.875596] ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2]
    [ 463.875648] ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2]
    [ 463.875672] new_sync_write+0x12d/0x1d0
    [ 463.875688] vfs_write+0xad/0x1a0
    [ 463.875697] ksys_write+0xa1/0xe0
    [ 463.875710] do_syscall_64+0x60/0x1f0
    [ 463.875743] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 463.875758] RIP: 0033:0x7f8f4482ed44
    [ 463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00
    [ 463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44
    [ 463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001
    [ 463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003
    [ 463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000
    [ 463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000

    This regression was introduced by commit e74540b28556 ("ocfs2:
    protect extent tree in ocfs2_prepare_inode_for_write()").

    Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com
    Fixes: e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()").
    Signed-off-by: Gang He
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Jun Piao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     

01 Feb, 2020

7 commits

  • For consistency, use ocfs2_update_inode_fsync_trans() to access t_tid
    in handle->h_transaction.

    Link: http://lkml.kernel.org/r/6ff9a312-5f7d-0e27-fb51-bc4e062fcd97@huawei.com
    Signed-off-by: Yan Wang
    Reviewed-by: Jun Piao
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Cc: Gang He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    wangyan
     
  • I found a NULL pointer dereference in ocfs2_update_inode_fsync_trans().
    handle->h_transaction may be NULL in the following situation:

    ocfs2_file_write_iter
      ->__generic_file_write_iter
        ->generic_perform_write
          ->ocfs2_write_begin
            ->ocfs2_write_begin_nolock
              ->ocfs2_write_cluster_by_desc
                ->ocfs2_write_cluster
                  ->ocfs2_mark_extent_written
                    ->ocfs2_change_extent_flag
                      ->ocfs2_split_extent
                        ->ocfs2_try_to_merge_extent
                          ->ocfs2_extend_rotate_transaction
                            ->ocfs2_extend_trans
                              ->jbd2_journal_restart
                                ->jbd2__journal_restart
                                  // handle->h_transaction is NULL here
                                  ->handle->h_transaction = NULL;
                                  ->start_this_handle
                                    /* journal aborted due to storage
                                       network disconnection, return error */
                                    ->return -EROFS;
                          /* line 3806 in ocfs2_try_to_merge_extent (),
                             it will ignore ret error. */
                          ->ret = 0;
          ->...
          ->ocfs2_write_end
            ->ocfs2_write_end_nolock
              ->ocfs2_update_inode_fsync_trans
                // NULL pointer dereference
                ->oi->i_sync_tid = handle->h_transaction->t_tid;

    The NULL pointer dereference is reported as follows:
    JBD2: Detected IO errors while flushing file data on dm-11-45
    Aborting journal on device dm-11-45.
    JBD2: Error -5 detected when updating journal superblock for dm-11-45.
    (dd,22081,3):ocfs2_extend_trans:474 ERROR: status = -30
    (dd,22081,3):ocfs2_try_to_merge_extent:3877 ERROR: status = -30
    Unable to handle kernel NULL pointer dereference at
    virtual address 0000000000000008
    Mem abort info:
    ESR = 0x96000004
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000004
    CM = 0, WnR = 0
    user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000e74e1338
    [0000000000000008] pgd=0000000000000000
    Internal error: Oops: 96000004 [#1] SMP
    Process dd (pid: 22081, stack limit = 0x00000000584f35a9)
    CPU: 3 PID: 22081 Comm: dd Kdump: loaded
    Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
    pstate: 60400009 (nZCv daif +PAN -UAO)
    pc : ocfs2_write_end_nolock+0x2b8/0x550 [ocfs2]
    lr : ocfs2_write_end_nolock+0x2a0/0x550 [ocfs2]
    sp : ffff0000459fba70
    x29: ffff0000459fba70 x28: 0000000000000000
    x27: ffff807ccf7f1000 x26: 0000000000000001
    x25: ffff807bdff57970 x24: ffff807caf1d4000
    x23: ffff807cc79e9000 x22: 0000000000001000
    x21: 000000006c6cd000 x20: ffff0000091d9000
    x19: ffff807ccb239db0 x18: ffffffffffffffff
    x17: 000000000000000e x16: 0000000000000007
    x15: ffff807c5e15bd78 x14: 0000000000000000
    x13: 0000000000000000 x12: 0000000000000000
    x11: 0000000000000000 x10: 0000000000000001
    x9 : 0000000000000228 x8 : 000000000000000c
    x7 : 0000000000000fff x6 : ffff807a308ed6b0
    x5 : ffff7e01f10967c0 x4 : 0000000000000018
    x3 : d0bc661572445600 x2 : 0000000000000000
    x1 : 000000001b2e0200 x0 : 0000000000000000
    Call trace:
    ocfs2_write_end_nolock+0x2b8/0x550 [ocfs2]
    ocfs2_write_end+0x4c/0x80 [ocfs2]
    generic_perform_write+0x108/0x1a8
    __generic_file_write_iter+0x158/0x1c8
    ocfs2_file_write_iter+0x668/0x950 [ocfs2]
    __vfs_write+0x11c/0x190
    vfs_write+0xac/0x1c0
    ksys_write+0x6c/0xd8
    __arm64_sys_write+0x24/0x30
    el0_svc_common+0x78/0x130
    el0_svc_handler+0x38/0x78
    el0_svc+0x8/0xc

    To prevent the NULL pointer dereference in this situation, check
    is_handle_aborted() before using handle->h_transaction->t_tid.
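
    The guard described above can be sketched in user space as follows.
    All mock_* structures and functions are hypothetical, pared-down
    stand-ins for the jbd2 and ocfs2 types, not the kernel definitions;
    in particular the mock abort check only looks at the handle itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, pared-down mock-ups of the jbd2/ocfs2 structures. */
struct mock_transaction {
	unsigned int t_tid;
};

struct mock_handle {
	struct mock_transaction *h_transaction; /* NULL after a failed restart */
	bool h_aborted;
};

struct mock_inode_info {
	unsigned int i_sync_tid;
	unsigned int i_datasync_tid;
};

/* Simplified stand-in: treats a handle without a transaction as aborted. */
static inline bool mock_is_handle_aborted(const struct mock_handle *handle)
{
	return handle->h_aborted || handle->h_transaction == NULL;
}

static inline void mock_update_inode_fsync_trans(const struct mock_handle *handle,
						 struct mock_inode_info *oi,
						 bool datasync)
{
	/* Skip the update rather than dereference a NULL transaction. */
	if (mock_is_handle_aborted(handle))
		return;

	oi->i_sync_tid = handle->h_transaction->t_tid;
	if (datasync)
		oi->i_datasync_tid = handle->h_transaction->t_tid;
}
```

    With an aborted handle the inode's recorded transaction IDs are left
    untouched; with a live handle they are updated as before.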

    Link: http://lkml.kernel.org/r/03e750ab-9ade-83aa-b000-b9e81e34e539@huawei.com
    Signed-off-by: Yan Wang
    Reviewed-by: Jun Piao
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Cc: Gang He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    wangyan
     
  • The BITS_TO_BYTES() macro already has users, and there will be more.
    Move it to bitops.h for wider use.

    In the case of ocfs2 the replacement is identical.

    As for bnx2x, there are two places where a floor version is used. The
    first calculates the number of structures that fit in one memory
    page. Here the ceiling variant is clearly correct, and the original
    code might have a potential bug if the number of bits % 8 is not 0.
    The second uses the macro to calculate the bytes transmitted in one
    microsecond. This works without any change for all speeds that are
    multiples of 1Gbps; for the rest, the new code gives the ceiling
    value. For instance, 100Mbps gives 13 bytes, while the old code gives
    12 bytes and the arithmetically correct value is 12.5 bytes. The value
    is then used to set up a timer threshold, which in any case has its
    own margins due to its limited resolution. I don't see an issue with
    slightly shifting the thresholds for low-speed connections; the card
    is supposed to use the highest available rate, which is usually 10Gbps.
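
    The rounding difference can be illustrated with a short sketch. The
    two macro names below are hypothetical and only model the floor and
    ceiling behaviours described above; mock_bits_per_usec() is likewise
    an illustrative helper (1 Mbps is one bit per microsecond).

```c
/* Floor division, modelling the old driver-local behaviour. */
#define MOCK_BITS_TO_BYTES_FLOOR(nr) ((nr) / 8)

/* Ceiling division, modelling the behaviour of the shared macro. */
#define MOCK_BITS_TO_BYTES_CEIL(nr) (((nr) + 7) / 8)

/* A link running at N Mbps transmits N bits per microsecond. */
static inline unsigned int mock_bits_per_usec(unsigned int mbps)
{
	return mbps;
}
```

    For 100Mbps (100 bits per microsecond) the floor form yields 12 and
    the ceiling form yields 13; for 1Gbps and 10Gbps both forms agree
    (125 and 1250 bytes) because the bit counts are multiples of 8.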

    Link: http://lkml.kernel.org/r/20200108121316.22411-1-andriy.shevchenko@linux.intel.com
    Signed-off-by: Andy Shevchenko
    Reviewed-by: Joseph Qi
    Acked-by: Sudarsana Reddy Kalluru
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • The variable ret is being initialized with a value that is never read
    and it is being updated later with a new value. The initialization is
    redundant and can be removed.

    Addresses Coverity ("Unused value")

    Link: http://lkml.kernel.org/r/20191202164833.62865-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • Gang He reports the failure of building fs/ocfs2/ as an external module
    of the kernel installed on the system:

    $ cd fs/ocfs2
    $ make -C /lib/modules/`uname -r`/build M=`pwd` modules

    If you want to make it work reliably, I'd recommend removing ccflags-y
    from the Makefiles and making header paths relative to the C files. I
    think this is the correct usage of the #include "..." directive.

    Link: http://lkml.kernel.org/r/20191227022950.14804-1-ghe@suse.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Gang He
    Reported-by: Gang He
    Reviewed-by: Gang He
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Fixes coccicheck warnings:

    fs/ocfs2/cluster/quorum.c:76:2-3: Unneeded semicolon
    fs/ocfs2/dlmglue.c:573:2-3: Unneeded semicolon

    Link: http://lkml.kernel.org/r/6ee3aa16-9078-30b1-df3f-22064950bd98@linux.alibaba.com
    Signed-off-by: zhengbin
    Reported-by: Hulk Robot
    Acked-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhengbin
     
  • In the only caller of dlm_migrate_lockres() - dlm_empty_lockres(),
    target is checked for O2NM_MAX_NODES. Thus, the assertion in
    dlm_migrate_lockres() is unnecessary and can be removed. The patch
    eliminates such a check.

    Link: http://lkml.kernel.org/r/20191218194111.26041-1-pakki001@umn.edu
    Signed-off-by: Aditya Pakki
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aditya Pakki
     

05 Jan, 2020

2 commits

  • Because ocfs2_get_dlm_debug() is called one time fewer than needed
    here, the ocfs2 file system triggers a system crash, usually after
    the file system is unmounted.

    The crash is caused by generic memory corruption, so the crash
    backtraces are not always the same. For example:

    ocfs2: Unmounting device (253,16) on (node 172167785)
    general protection fault: 0000 [#1] SMP PTI
    CPU: 3 PID: 14107 Comm: fence_legacy Kdump:
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
    RIP: 0010:__kmalloc+0xa5/0x2a0
    Code: 00 00 4d 8b 07 65 4d 8b
    RSP: 0018:ffffaa1fc094bbe8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: d310a8800d7a3faf RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000dc0 RDI: ffff96e68fc036c0
    RBP: d310a8800d7a3faf R08: ffff96e6ffdb10a0 R09: 00000000752e7079
    R10: 000000000001c513 R11: 0000000004091041 R12: 0000000000000dc0
    R13: 0000000000000039 R14: ffff96e68fc036c0 R15: ffff96e68fc036c0
    FS: 00007f699dfba540(0000) GS:ffff96e6ffd80000(0000) knlGS:00000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055f3a9d9b768 CR3: 000000002cd1c000 CR4: 00000000000006e0
    Call Trace:
    ext4_htree_store_dirent+0x35/0x100 [ext4]
    htree_dirblock_to_tree+0xea/0x290 [ext4]
    ext4_htree_fill_tree+0x1c1/0x2d0 [ext4]
    ext4_readdir+0x67c/0x9d0 [ext4]
    iterate_dir+0x8d/0x1a0
    __x64_sys_getdents+0xab/0x130
    do_syscall_64+0x60/0x1f0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f699d33a9fb

    This regression was introduced by commit e581595ea29c ("ocfs: no
    need to check return value of debugfs_create functions").

    Link: http://lkml.kernel.org/r/20191225061501.13587-1-ghe@suse.com
    Fixes: e581595ea29c ("ocfs: no need to check return value of debugfs_create functions")
    Signed-off-by: Gang He
    Acked-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Cc: [5.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     
  • If the journal is dirty at mount time, it is replayed, but the jbd2
    superblock log tail cannot be updated to mark a new start because
    journal->j_flags has already been set with JBD2_ABORT in
    journal_init_common.

    When a new transaction is committed, it is recorded in block 1 first
    (journal->j_tail is set to 1 in journal_reset). If an emergency
    restart happens again before the journal superblock is updated, the
    newly recorded transaction will not be replayed on the next mount.

    The following steps describe this procedure in detail.
    1. mount and touch some files
    2. these transactions are committed to journal area but not checkpointed
    3. emergency restart
    4. mount again and its journals are replayed
    5. journal super block's first s_start is 1, but its s_seq is not updated
    6. touch a new file and its trans is committed but not checkpointed
    7. emergency restart again
    8. mount and journal is dirty, but trans committed in 6 will not be
    replayed.

    This problem occurs easily when the LUN is used by only one node. If
    it is used by multiple nodes, another node will replay its journal,
    and the journal superblock will be updated after recovery, as this
    patch does.

    ocfs2_recover_node->ocfs2_replay_journal.

    The following jbd2 journal can be generated by touching a new file
    after the journal is replayed. Seq 15 is the first valid commit, but
    the first seq in the journal superblock is 13.

    logdump:
    Block 0: Journal Superblock
    Seq: 0 Type: 4 (JBD2_SUPERBLOCK_V2)
    Blocksize: 4096 Total Blocks: 32768 First Block: 1
    First Commit ID: 13 Start Log Blknum: 1
    Error: 0
    Feature Compat: 0
    Feature Incompat: 2 block64
    Feature RO compat: 0
    Journal UUID: 4ED3822C54294467A4F8E87D2BA4BC36
    FS Share Cnt: 1 Dynamic Superblk Blknum: 0
    Per Txn Block Limit Journal: 0 Data: 0

    Block 1: Journal Commit Block
    Seq: 14 Type: 2 (JBD2_COMMIT_BLOCK)

    Block 2: Journal Descriptor
    Seq: 15 Type: 1 (JBD2_DESCRIPTOR_BLOCK)
    No. Blocknum Flags
    0. 587 none
    UUID: 00000000000000000000000000000000
    1. 8257792 JBD2_FLAG_SAME_UUID
    2. 619 JBD2_FLAG_SAME_UUID
    3. 24772864 JBD2_FLAG_SAME_UUID
    4. 8257802 JBD2_FLAG_SAME_UUID
    5. 513 JBD2_FLAG_SAME_UUID JBD2_FLAG_LAST_TAG
    ...
    Block 7: Inode
    Inode: 8257802 Mode: 0640 Generation: 57157641 (0x3682809)
    FS Generation: 2839773110 (0xa9437fb6)
    CRC32: 00000000 ECC: 0000
    Type: Regular Attr: 0x0 Flags: Valid
    Dynamic Features: (0x1) InlineData
    User: 0 (root) Group: 0 (root) Size: 7
    Links: 1 Clusters: 0
    ctime: 0x5de5d870 0x11104c61 -- Tue Dec 3 11:37:20.286280801 2019
    atime: 0x5de5d870 0x113181a1 -- Tue Dec 3 11:37:20.288457121 2019
    mtime: 0x5de5d870 0x11104c61 -- Tue Dec 3 11:37:20.286280801 2019
    dtime: 0x0 -- Thu Jan 1 08:00:00 1970
    ...
    Block 9: Journal Commit Block
    Seq: 15 Type: 2 (JBD2_COMMIT_BLOCK)

    The following is the journal recovery log produced when the above
    jbd2 journal is recovered on the next mount.

    syslog:
    ocfs2: File system on device (252,1) was not unmounted cleanly, recovering it.
    fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 0
    fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 1
    fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 2
    fs/jbd2/recovery.c:(jbd2_journal_recover, 278): JBD2: recovery, exit status 0, recovered transactions 13 to 13

    Because the first commit seq 13 recorded in the journal superblock is
    not consistent with the value recorded in block 1 (seq 14), journal
    recovery terminates before seq 15 even though it is an intact commit.
    Inode 8257802 is a new file, and it will be lost.
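
    The mismatch can be modelled with a minimal sketch of a recovery
    scan. It assumes that the scan starts at the first log block,
    expects the superblock's first commit ID, and stops at the first
    block whose sequence number differs. The mock_* names are
    hypothetical, and data blocks are left out of the model.

```c
#include <stddef.h>

/* Hypothetical, minimal model of a journal block header. */
struct mock_block {
	unsigned int seq; /* transaction sequence number stored in the block */
	int is_commit;    /* non-zero for a commit block */
};

/*
 * Walk the log from its first block and count the transactions that would
 * be replayed. The scan stops at the first block whose sequence number
 * differs from the one expected next.
 */
static inline unsigned int mock_count_replayable(const struct mock_block *log,
						 size_t nr_blocks,
						 unsigned int first_commit_id)
{
	unsigned int expected = first_commit_id;
	unsigned int replayed = 0;

	for (size_t i = 0; i < nr_blocks; i++) {
		if (log[i].seq != expected)
			break;
		if (log[i].is_commit) {
			replayed++;
			expected++;
		}
	}
	return replayed;
}
```

    With a log of (commit 14, descriptor 15, commit 15) and a first
    commit ID of 13, the scan stops at the very first block and replays
    nothing, so transaction 15 is never reached. Started at the
    descriptor with an expected ID of 15, the same scan replays one
    transaction.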

    Link: http://lkml.kernel.org/r/20191217020140.2197-1-li.kai4@h3c.com
    Signed-off-by: Kai Li
    Reviewed-by: Joseph Qi
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kai Li
     

02 Dec, 2019

2 commits

  • Merge updates from Andrew Morton:
    "Incoming:

    - a small number of updates to scripts/, ocfs2 and fs/buffer.c

    - most of MM

    I still have quite a lot of material (mostly not MM) staged after
    linux-next due to -next dependencies. I'll send those across next week
    as the prerequisites get merged up"

    * emailed patches from Andrew Morton : (135 commits)
    mm/page_io.c: annotate refault stalls from swap_readpage
    mm/Kconfig: fix trivial help text punctuation
    mm/Kconfig: fix indentation
    mm/memory_hotplug.c: remove __online_page_set_limits()
    mm: fix typos in comments when calling __SetPageUptodate()
    mm: fix struct member name in function comments
    mm/shmem.c: cast the type of unmap_start to u64
    mm: shmem: use proper gfp flags for shmem_writepage()
    mm/shmem.c: make array 'values' static const, makes object smaller
    userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
    fs/userfaultfd.c: wp: clear VM_UFFD_MISSING or VM_UFFD_WP during userfaultfd_register()
    userfaultfd: wrap the common dst_vma check into an inlined function
    userfaultfd: remove unnecessary WARN_ON() in __mcopy_atomic_hugetlb()
    userfaultfd: use vma_pagesize for all huge page size calculation
    mm/madvise.c: use PAGE_ALIGN[ED] for range checking
    mm/madvise.c: replace with page_size() in madvise_inject_error()
    mm/mmap.c: make vma_merge() comment more easy to understand
    mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops
    autonuma: reduce cache footprint when scanning page tables
    autonuma: fix watermark checking in migrate_balanced_pgdat()
    ...

    Linus Torvalds
     
  • Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
    "As part of the cleanup of some remaining y2038 issues, I came to
    fs/compat_ioctl.c, which still has a couple of commands that need
    support for time64_t.

    In completely unrelated work, I spent time on cleaning up parts of
    this file in the past, moving things out into drivers instead.

    After Al Viro reviewed an earlier version of this series and did a lot
    more of that cleanup, I decided to try to completely eliminate the
    rest of it and move it all into drivers.

    This series incorporates some of Al's work and many patches of my own,
    but in the end stops short of actually removing the last part, which
    is the scsi ioctl handlers. I have patches for those as well, but they
    need more testing or possibly a rewrite"

    * tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
    scsi: sd: enable compat ioctls for sed-opal
    pktcdvd: add compat_ioctl handler
    compat_ioctl: move SG_GET_REQUEST_TABLE handling
    compat_ioctl: ppp: move simple commands into ppp_generic.c
    compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
    compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
    compat_ioctl: unify copy-in of ppp filters
    tty: handle compat PPP ioctls
    compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
    compat_ioctl: handle SIOCOUTQNSD
    af_unix: add compat_ioctl support
    compat_ioctl: reimplement SG_IO handling
    compat_ioctl: move WDIOC handling into wdt drivers
    fs: compat_ioctl: move FITRIM emulation into file systems
    gfs2: add compat_ioctl support
    compat_ioctl: remove unused convert_in_user macro
    compat_ioctl: remove last RAID handling code
    compat_ioctl: remove /dev/raw ioctl translation
    compat_ioctl: remove PCI ioctl translation
    compat_ioctl: remove joystick ioctl translation
    ...

    Linus Torvalds
     

01 Dec, 2019

4 commits

  • Fix a static code checker warning:
    fs/ocfs2/acl.c:331
    ocfs2_acl_chmod() warn: passing zero to 'PTR_ERR'

    Link: http://lkml.kernel.org/r/1dee278b-6c96-eec2-ce76-fe6e07c6e20f@linux.alibaba.com
    Fixes: 5ee0fbd50fd ("ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang")
    Signed-off-by: Ding Xiang
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ding Xiang
     
  • …ux/kernel/git/dhowells/linux-fs

    Pull pipe rework from David Howells:
    "This is my set of preparatory patches for building a general
    notification queue on top of pipes. It makes a number of significant
    changes:

    - It removes the nr_exclusive argument from __wake_up_sync_key() as
    this is always 1. This prepares for the next step:

    - Adds wake_up_interruptible_sync_poll_locked() so that poll can be
    woken up from a function that's holding the poll waitqueue
    spinlock.

    - Change the pipe buffer ring to be managed in terms of unbounded
    head and tail indices rather than bounded index and length. This
    means that reading the pipe only needs to modify one index, not
    two.

    - A selection of helper functions are provided to query the state of
    the pipe buffer, plus a couple to apply updates to the pipe
    indices.

    - The pipe ring is allowed to have kernel-reserved slots. This allows
    many notification messages to be spliced in by the kernel without
    allowing userspace to pin too many pages if it writes to the same
    pipe.

    - Advance the head and tail indices inside the pipe waitqueue lock
    and use wake_up_interruptible_sync_poll_locked() to poke poll
    without having to take the lock twice.

    - Rearrange pipe_write() to preallocate the buffer it is going to
    write into and then drop the spinlock. This allows kernel
    notifications to then be added the ring whilst it is filling the
    buffer it allocated. The read side is stalled because the pipe
    mutex is still held.

    - Don't wake up readers on a pipe if there was already data in it
    when we added more.

    - Don't wake up writers on a pipe if the ring wasn't full before we
    removed a buffer"
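
    The head/tail scheme described in the list above can be sketched as
    a small user-space ring. This is only an illustration under stated
    assumptions: the mock_ring type, its helpers and the fixed
    power-of-two size are hypothetical and are not the pipe code itself.

```c
#include <stdbool.h>

#define MOCK_RING_SIZE 8u /* must be a power of two */

/* Hypothetical ring: head and tail only ever increase and wrap naturally. */
struct mock_ring {
	unsigned int head; /* next slot to write, advanced by the producer */
	unsigned int tail; /* next slot to read, advanced by the consumer */
	int slots[MOCK_RING_SIZE];
};

static inline unsigned int mock_ring_occupancy(const struct mock_ring *r)
{
	return r->head - r->tail; /* unsigned wraparound keeps this correct */
}

static inline bool mock_ring_empty(const struct mock_ring *r)
{
	return r->head == r->tail;
}

static inline bool mock_ring_full(const struct mock_ring *r)
{
	return mock_ring_occupancy(r) >= MOCK_RING_SIZE;
}

static inline bool mock_ring_push(struct mock_ring *r, int value)
{
	if (mock_ring_full(r))
		return false;
	r->slots[r->head & (MOCK_RING_SIZE - 1)] = value;
	r->head++; /* a write modifies only the head index */
	return true;
}

static inline bool mock_ring_pop(struct mock_ring *r, int *value)
{
	if (mock_ring_empty(r))
		return false;
	*value = r->slots[r->tail & (MOCK_RING_SIZE - 1)];
	r->tail++; /* a read modifies only the tail index */
	return true;
}
```

    Because the indices are never reduced modulo the ring size, a read
    touches only the tail and a write only the head; the slot is found
    by masking the index at the point of use.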

    * tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    pipe: Remove sync on wake_ups
    pipe: Increase the writer-wakeup threshold to reduce context-switch count
    pipe: Check for ring full inside of the spinlock in pipe_write()
    pipe: Remove redundant wakeup from pipe_write()
    pipe: Rearrange sequence in pipe_write() to preallocate slot
    pipe: Conditionalise wakeup in pipe_read()
    pipe: Advance tail pointer inside of wait spinlock in pipe_read()
    pipe: Allow pipes to have kernel-reserved slots
    pipe: Use head and tail pointers for the ring, not cursor and length
    Add wake_up_interruptible_sync_poll_locked()
    Remove the nr_exclusive argument from __wake_up_sync_key()
    pipe: Reduce #inclusion of pipe_fs_i.h

    Linus Torvalds
     
  • Pull ext2, quota, reiserfs cleanups and fixes from Jan Kara:

    - Refactor the quota on/off kernel internal interfaces (mostly for
    ubifs quota support as ubifs does not want to have inodes holding
    quota information)

    - A few other small quota fixes and cleanups

    - Various small ext2 fixes and cleanups

    - Reiserfs xattr fix and one cleanup

    * tag 'for_v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
    ext2: code cleanup for descriptor_loc()
    fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
    ext2: fix improper function comment
    ext2: code cleanup for ext2_try_to_allocate()
    ext2: skip unnecessary operations in ext2_try_to_allocate()
    ext2: Simplify initialization in ext2_try_to_allocate()
    ext2: code cleanup by calling ext2_group_last_block_no()
    ext2: introduce new helper ext2_group_last_block_no()
    reiserfs: replace open-coded atomic_dec_and_mutex_lock()
    ext2: check err when partial != NULL
    quota: Handle quotas without quota inodes in dquot_get_state()
    quota: Make dquot_disable() work without quota inodes
    quota: Drop dquot_enable()
    fs: Use dquot_load_quota_inode() from filesystems
    quota: Rename vfs_load_quota_inode() to dquot_load_quota_inode()
    quota: Simplify dquot_resume()
    quota: Factor out setup of quota inode
    quota: Check that quota is not dirty before release
    quota: fix livelock in dquot_writeback_dquots
    ext2: don't set *count in the case of failure in ext2_try_to_allocate()
    ...

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "This merge window saw the following new features added to ext4:

    - Direct I/O via iomap (required the iomap-for-next branch from
    Darrick as a prereq).

    - Support for using dioread-nolock where the block size < page size.

    - Support for encryption for file systems where the block size < page
    size.

    - Rework of journal credits handling so a revoke-heavy workload will
    not cause the journal to run out of space.

    - Replace bit-spinlocks with spinlocks in jbd2

    Also included were some bug fixes and cleanups, mostly to clean up
    corner cases from fuzzed file systems and error path handling"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (59 commits)
    ext4: work around deleting a file with i_nlink == 0 safely
    ext4: add more paranoia checking in ext4_expand_extra_isize handling
    jbd2: make jbd2_handle_buffer_credits() handle reserved handles
    ext4: fix a bug in ext4_wait_for_tail_page_commit
    ext4: bio_alloc with __GFP_DIRECT_RECLAIM never fails
    ext4: code cleanup for get_next_id
    ext4: fix leak of quota reservations
    ext4: remove unused variable warning in parse_options()
    ext4: Enable encryption for subpage-sized blocks
    fs/buffer.c: support fscrypt in block_read_full_page()
    ext4: Add error handling for io_end_vec struct allocation
    jbd2: Fine tune estimate of necessary descriptor blocks
    jbd2: Provide trace event for handle restarts
    ext4: Reserve revoke credits for freed blocks
    jbd2: Make credit checking more strict
    jbd2: Rename h_buffer_credits to h_total_credits
    jbd2: Reserve space for revoke descriptor blocks
    jbd2: Drop jbd2_space_needed()
    jbd2: Account descriptor blocks into t_outstanding_credits
    jbd2: Factor out common parts of stopping and restarting a handle
    ...

    Linus Torvalds
     

27 Nov, 2019

1 commit

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - A comprehensive rewrite of the robust/PI futex code's exit handling
    to fix various exit races. (Thomas Gleixner et al)

    - Rework the generic REFCOUNT_FULL implementation using
    atomic_fetch_* operations so that the performance impact of the
    cmpxchg() loops is mitigated for common refcount operations.

    With these performance improvements the generic implementation of
    refcount_t should be good enough for everybody - and this got
    confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
    REFCOUNT_FULL entirely, leaving the generic implementation enabled
    unconditionally. (Will Deacon)

    - Other misc changes, fixes, cleanups"
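
    The idea of replacing a cmpxchg() loop with a single fetch-and-add
    can be sketched with C11 atomics. This is a simplified user-space
    illustration, not the kernel implementation: the mock_refcount type,
    its helpers and the saturation value chosen here are assumptions.

```c
#include <limits.h>
#include <stdatomic.h>

/* Assumed saturation value for this sketch: far from both 0 and INT_MAX. */
#define MOCK_REFCOUNT_SATURATED (INT_MIN / 2)

struct mock_refcount {
	atomic_int refs;
};

static inline void mock_refcount_set(struct mock_refcount *r, int n)
{
	atomic_store_explicit(&r->refs, n, memory_order_relaxed);
}

static inline int mock_refcount_read(struct mock_refcount *r)
{
	return atomic_load_explicit(&r->refs, memory_order_relaxed);
}

/* Increment with one fetch-and-add instead of a compare-and-swap retry loop. */
static inline void mock_refcount_inc(struct mock_refcount *r)
{
	int old = atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);

	/*
	 * old == 0: increment on an object that was already released.
	 * old < 0 or old == INT_MAX: already saturated or about to overflow.
	 * In all of these cases pin the counter at the saturation value.
	 */
	if (old <= 0 || old == INT_MAX)
		atomic_store_explicit(&r->refs, MOCK_REFCOUNT_SATURATED,
				      memory_order_relaxed);
}
```

    The common case costs one atomic read-modify-write with no retry;
    the checks run afterwards on the returned old value and only matter
    on the rare error paths.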

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    lkdtm: Remove references to CONFIG_REFCOUNT_FULL
    locking/refcount: Remove unused 'refcount_error_report()' function
    locking/refcount: Consolidate implementations of refcount_t
    locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
    locking/refcount: Move saturation warnings out of line
    locking/refcount: Improve performance of generic REFCOUNT_FULL code
    locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the header
    locking/refcount: Remove unused refcount_*_checked() variants
    locking/refcount: Ensure integer operands are treated as signed
    locking/refcount: Define constants for saturation and max refcount values
    futex: Prevent exit livelock
    futex: Provide distinct return value when owner is exiting
    futex: Add mutex around futex exit
    futex: Provide state handling for exec() as well
    futex: Sanitize exit state handling
    futex: Mark the begin of futex exit explicitly
    futex: Set task::futex_state to DEAD right after handling futex exit
    futex: Split futex_mm_release() for exit/exec
    exit/exec: Seperate mm_release()
    futex: Replace PF_EXITPIDONE with a state
    ...

    Linus Torvalds
     

23 Nov, 2019

1 commit

  • This reverts commit 56e94ea132bb5c2c1d0b60a6aeb34dcb7d71a53d.

    Commit 56e94ea132bb ("fs: ocfs2: fix possible null-pointer dereferences
    in ocfs2_xa_prepare_entry()") introduces a regression: creating a
    directory fails with the mount options user_xattr and acl. The
    reported NULL pointer dereference case is in fact handled correctly
    by loc->xl_ops->xlo_add_entry(), so revert it.

    Link: http://lkml.kernel.org/r/1573624916-83825-1-git-send-email-joseph.qi@linux.alibaba.com
    Fixes: 56e94ea132bb ("fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()")
    Signed-off-by: Joseph Qi
    Reported-by: Thomas Voegtle
    Acked-by: Changwei Ge
    Cc: Jia-Ju Bai
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Gang He
    Cc: Jun Piao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

07 Nov, 2019

1 commit

  • When the extent tree is modified, it should be protected by inode
    cluster lock and ip_alloc_sem.

    The extent tree is accessed and modified in the
    ocfs2_prepare_inode_for_write, but isn't protected by ip_alloc_sem.

    The following is a case. The function ocfs2_fiemap is accessing the
    extent tree, which is modified at the same time.

    kernel BUG at fs/ocfs2/extent_map.c:475!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...]
    CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2
    Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018
    task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000
    RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
    Call Trace:
    ocfs2_fiemap+0x1e3/0x430 [ocfs2]
    do_vfs_ioctl+0x155/0x510
    SyS_ioctl+0x81/0xa0
    system_call_fastpath+0x18/0xd8
    Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f
    RIP ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
    ---[ end trace c8aa0c8180e869dc ]---
    Kernel panic - not syncing: Fatal exception
    Kernel Offset: disabled

    This issue can be reproduced every week in a production environment.

    This issue is related to the usage mode. If others use ocfs2 in this
    mode, the kernel will panic frequently.

    [akpm@linux-foundation.org: coding style fixes]
    [Fix new warning due to unused function by removing said function - Linus ]
    Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.com
    Signed-off-by: Shuning Zhang
    Reviewed-by: Junxiao Bi
    Reviewed-by: Gang He
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Cc: Changwei Ge
    Cc: Jun Piao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shuning Zhang
     

06 Nov, 2019

4 commits

  • Jan Kara
     
  • Theodore Ts'o
     
  • Extend functions for starting, extending, and restarting transaction
    handles to take number of revoke records handle must be able to
    accommodate. These functions then make sure transaction has enough
    credits to be able to store resulting revoke descriptor blocks. Also
    revoke code tracks number of revoke records created by a handle to catch
    situation where some place didn't reserve enough space for revoke
    records. Similarly to standard transaction credits, space for unused
    reserved revoke records is released when the handle is stopped.

    On the ext4 side we currently take a simplistic approach of reserving
    space for 1024 revoke records for any transaction. This grows amount of
    credits reserved for each handle only by a few and is enough for any
    normal workload so that we don't hit warnings in jbd2. We will refine
    the logic in following commits.
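
    As a rough illustration of why reserving 1024 revoke records costs only
    a few credits, the sketch below rounds a record count up to whole
    revoke descriptor blocks. The helper name, the 16-byte per-block header
    and the 4- or 8-byte record size are assumptions made for this example,
    not values taken from this commit.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a jbd2 function): number of revoke descriptor
 * blocks needed to hold `records` revoke records.
 *
 * Assumed inputs, chosen by the caller:
 *   block_bytes  - journal block size
 *   header_bytes - space per block that cannot hold records
 *   record_bytes - size of one revoked block number
 */
static unsigned int toy_revoke_descriptor_blocks(unsigned int records,
                                                 unsigned int block_bytes,
                                                 unsigned int header_bytes,
                                                 unsigned int record_bytes)
{
    unsigned int per_block;

    /* Reject geometry that cannot hold even one record. */
    if (record_bytes == 0 || block_bytes < header_bytes + record_bytes)
        return 0;

    per_block = (block_bytes - header_bytes) / record_bytes;

    /* Round up: a partially filled descriptor block still costs a credit. */
    return (records + per_block - 1) / per_block;
}
```

    Under these assumptions, 1024 records in 4096-byte blocks need 3
    descriptor blocks with 8-byte records and 2 with 4-byte records.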

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-20-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Use the jbd2 accessor function for h_buffer_credits.

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-12-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

04 Nov, 2019

1 commit


01 Nov, 2019

1 commit

  • There is a race window where the quota can be redirtied after we drop
    dq_list_lock inside dqput(), but before we grab dquot->dq_lock inside
    dquot_release():

    TASK1                                               TASK2 (chowner)
    ->dqput()
      we_slept:
        spin_lock(&dq_list_lock)
        if (dquot_dirty(dquot)) {
              spin_unlock(&dq_list_lock);
              dquot->dq_sb->dq_op->write_dquot(dquot);
              goto we_slept
        if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
              spin_unlock(&dq_list_lock);
              dquot->dq_sb->dq_op->release_dquot(dquot);
                                                        dqget()
                                                        mark_dquot_dirty()
                                                        dqput()
              goto we_slept;
        }
    So a dirty dquot will be released by TASK1, but on the next we_slept loop
    we detect this and call ->write_dquot() for it.
    XFSTEST: https://github.com/dmonakhov/xfstests/commit/440a80d4cbb39e9234df4d7240aee1d551c36107

    Link: https://lore.kernel.org/r/20191031103920.3919-2-dmonakhov@openvz.org
    CC: stable@vger.kernel.org
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jan Kara

    Dmitry Monakhov
     

24 Oct, 2019

1 commit


23 Oct, 2019

1 commit


21 Oct, 2019

1 commit

  • Bit-spinlocks are problematic on PREEMPT_RT if functions which might sleep
    on RT, e.g. spin_lock(), alloc/free(), are invoked inside the lock held
    region because bit spinlocks disable preemption even on RT.

    A first attempt was to replace state lock with a spinlock placed in struct
    buffer_head and make the locking conditional on PREEMPT_RT and
    DEBUG_BIT_SPINLOCKS.

    Jan pointed out that there is a 4 byte hole in struct journal_head where a
    regular spinlock fits in and he would not object to convert the state lock
    to a spinlock unconditionally.
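
    The point about the 4 byte hole can be illustrated with a toy layout.
    The structs below are hypothetical stand-ins, not struct journal_head,
    and the example assumes an LP64 target and a 4-byte lock word.

```c
#include <stddef.h>

/* Toy layout: an int followed by a pointer leaves padding on LP64. */
struct toy_head_before {
    void *owner;
    int   count;        /* 4 bytes, then 4 bytes of padding on LP64 */
    void *next;
};

/* Same layout with a 4-byte lock word placed into the padding. */
struct toy_head_after {
    void *owner;
    int   count;
    int   state_lock;   /* stand-in for a 4-byte spinlock */
    void *next;
};

/* Bytes of padding between `count` and `next` in the first layout. */
static size_t toy_hole_bytes(void)
{
    return offsetof(struct toy_head_before, next)
         - (offsetof(struct toy_head_before, count) + sizeof(int));
}
```

    On a target with 8-byte pointers the hole is 4 bytes and both structs
    have the same size, so the extra field does not grow the structure.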

    Aside from solving the RT problem, this also gains lockdep coverage for the
    journal head state lock (bit-spinlocks are not covered by lockdep as it's
    hard to fit a lockdep map into a single bit).

    The trivial change would have been to convert the jbd_*lock_bh_state()
    inlines, but that comes with the downside that these functions take a
    buffer head pointer which needs to be converted to a journal head pointer
    which adds another level of indirection.

    As almost all functions which use this lock have a journal head pointer
    readily available, it makes more sense to remove the lock helper inlines
    and write out spin_*lock() at all call sites.

    Fix up all locking comments as well.

    Suggested-by: Jan Kara
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jan Kara
    Cc: "Theodore Ts'o"
    Cc: Mark Fasheh
    Cc: Joseph Qi
    Cc: Joel Becker
    Cc: Jan Kara
    Cc: linux-ext4@vger.kernel.org
    Link: https://lore.kernel.org/r/20190809124233.13277-7-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Thomas Gleixner
     

19 Oct, 2019

2 commits

  • mount.ocfs2 fails when reading the ocfs2 filesystem superblock
    encounters an error. In that case ocfs2_initialize_super() returns
    before allocating ocfs2_wq, and ocfs2_dismount_volume() then triggers
    the following panic.

    Oct 15 16:09:27 cnwarekv-205120 kernel: On-disk corruption discovered.Please run fsck.ocfs2 once the filesystem is unmounted.
    Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_read_locked_inode:537 ERROR: status = -30
    Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:458 ERROR: status = -30
    Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:491 ERROR: status = -30
    Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_initialize_super:2313 ERROR: status = -30
    Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_fill_super:1033 ERROR: status = -30
    ------------[ cut here ]------------
    Oops: 0002 [#1] SMP NOPTI
    CPU: 1 PID: 11753 Comm: mount.ocfs2 Tainted: G E
    4.14.148-200.ckv.x86_64 #1
    Hardware name: Sugon H320-G30/35N16-US, BIOS 0SSDX017 12/21/2018
    task: ffff967af0520000 task.stack: ffffa5f05484000
    RIP: 0010:mutex_lock+0x19/0x20
    Call Trace:
    flush_workqueue+0x81/0x460
    ocfs2_shutdown_local_alloc+0x47/0x440 [ocfs2]
    ocfs2_dismount_volume+0x84/0x400 [ocfs2]
    ocfs2_fill_super+0xa4/0x1270 [ocfs2]
    ? ocfs2_initialize_super.isa.211+0xf20/0xf20 [ocfs2]
    mount_bdev+0x17f/0x1c0
    mount_fs+0x3a/0x160

    Link: http://lkml.kernel.org/r/1571139611-24107-1-git-send-email-yili@winhong.com
    Signed-off-by: Yi Li
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yi Li
     
  • transfer_to[USRQUOTA/GRPQUOTA] should be set to NULL in the error case
    before jumping to the dqput() calls.
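
    The pattern can be sketched in userspace as follows. Every name
    prefixed with toy_ is hypothetical: the error-pointer helpers imitate
    the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(), and toy_dqput() mirrors only
    the fact that dqput() ignores a NULL argument. It is a sketch of the
    idea, not the ocfs2 code.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAXQUOTAS 2        /* stand-ins for the USRQUOTA/GRPQUOTA slots */
#define TOY_MAX_ERRNO 4095

struct toy_dquot { int id; };  /* toy type, not the kernel's struct dquot */

static int toy_puts;           /* counts releases of real objects */

/* Userspace imitations of the kernel's error-pointer helpers. */
static void *toy_err_ptr(long err) { return (void *)err; }
static long toy_ptr_err(const void *p) { return (long)p; }
static int toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Returns a new object, or an error pointer when asked to fail. */
static struct toy_dquot *toy_dqget(int id, int fail)
{
    struct toy_dquot *dq;

    if (fail)
        return toy_err_ptr(-ENOMEM);
    dq = malloc(sizeof(*dq));
    if (!dq)
        return toy_err_ptr(-ENOMEM);
    dq->id = id;
    return dq;
}

/* Ignores NULL, so cleared slots are safe to pass in. */
static void toy_dqput(struct toy_dquot *dq)
{
    if (!dq)
        return;
    toy_puts++;
    free(dq);
}

static int toy_transfer(int fail_slot)
{
    struct toy_dquot *transfer_to[TOY_MAXQUOTAS] = { NULL };
    int status = 0;
    int i;

    for (i = 0; i < TOY_MAXQUOTAS; i++) {
        transfer_to[i] = toy_dqget(i, i == fail_slot);
        if (toy_is_err(transfer_to[i])) {
            status = (int)toy_ptr_err(transfer_to[i]);
            /* Clear the slot so the exit path never sees an error pointer. */
            transfer_to[i] = NULL;
            break;
        }
    }

    /* Common exit path: releases every slot unconditionally. */
    for (i = 0; i < TOY_MAXQUOTAS; i++)
        toy_dqput(transfer_to[i]);
    return status;
}
```

    Without the assignment to NULL, the exit path would pass an error
    pointer to the release function, which would then dereference it.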

    Link: http://lkml.kernel.org/r/20191010082349.1134-1-cgxu519@mykernel.net
    Signed-off-by: Chengguang Xu
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chengguang Xu
     

09 Oct, 2019

1 commit

  • Since the following commit:

    b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release")

    @nested is no longer used in lock_release(), so remove it from all
    lock_release() calls and friends.

    Signed-off-by: Qian Cai
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Acked-by: Daniel Vetter
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: airlied@linux.ie
    Cc: akpm@linux-foundation.org
    Cc: alexander.levin@microsoft.com
    Cc: daniel@iogearbox.net
    Cc: davem@davemloft.net
    Cc: dri-devel@lists.freedesktop.org
    Cc: duyuyang@gmail.com
    Cc: gregkh@linuxfoundation.org
    Cc: hannes@cmpxchg.org
    Cc: intel-gfx@lists.freedesktop.org
    Cc: jack@suse.com
    Cc: jlbec@evilplan.or
    Cc: joonas.lahtinen@linux.intel.com
    Cc: joseph.qi@linux.alibaba.com
    Cc: jslaby@suse.com
    Cc: juri.lelli@redhat.com
    Cc: maarten.lankhorst@linux.intel.com
    Cc: mark@fasheh.com
    Cc: mhocko@kernel.org
    Cc: mripard@kernel.org
    Cc: ocfs2-devel@oss.oracle.com
    Cc: rodrigo.vivi@intel.com
    Cc: sean@poorly.run
    Cc: st@kernel.org
    Cc: tj@kernel.org
    Cc: tytso@mit.edu
    Cc: vdavydov.dev@gmail.com
    Cc: vincent.guittot@linaro.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
    Signed-off-by: Ingo Molnar

    Qian Cai
     

08 Oct, 2019

4 commits

  • In ocfs2_info_scan_inode_alloc(), there is an if statement on line 283
    to check whether inode_alloc is NULL:

    if (inode_alloc)

    When inode_alloc is NULL, it is used on line 287:

    ocfs2_inode_lock(inode_alloc, &bh, 0);
    ocfs2_inode_lock_full_nested(inode, ...)
    struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);

    Thus, a possible null-pointer dereference may occur.

    To fix this bug, inode_alloc is checked on line 286.

    This bug was found by STCheck, a static analysis tool written by us.

    Link: http://lkml.kernel.org/r/20190726033717.32359-1-baijiaju1990@gmail.com
    Signed-off-by: Jia-Ju Bai
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jia-Ju Bai
     
  • In ocfs2_write_end_nolock(), there are if statements on lines 1976,
    2047 and 2058 that check whether handle is NULL:

    if (handle)

    When handle is NULL, it is used on line 2045:

    ocfs2_update_inode_fsync_trans(handle, inode, 1);
    oi->i_sync_tid = handle->h_transaction->t_tid;

    Thus, a possible null-pointer dereference may occur.

    To fix this bug, handle is checked before calling
    ocfs2_update_inode_fsync_trans().
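
    A minimal sketch of that guard, using hypothetical toy_ types in place
    of the jbd2 handle and the ocfs2 inode info:

```c
#include <stddef.h>

/* Toy stand-ins; only the fields touched by the example exist. */
struct toy_transaction { unsigned int t_tid; };
struct toy_handle { struct toy_transaction *h_transaction; };
struct toy_inode_info { unsigned int i_sync_tid; };

/* Dereferences the handle unconditionally, like the helper quoted above. */
static void toy_update_fsync_trans(struct toy_handle *handle,
                                   struct toy_inode_info *oi)
{
    oi->i_sync_tid = handle->h_transaction->t_tid;
}

/* Caller-side guard: only use the handle if a transaction was started. */
static void toy_write_end(struct toy_handle *handle,
                          struct toy_inode_info *oi)
{
    if (handle)
        toy_update_fsync_trans(handle, oi);
}
```

    With a NULL handle, toy_write_end() leaves i_sync_tid untouched instead
    of faulting inside the helper.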

    This bug was found by STCheck, a static analysis tool written by us.

    Link: http://lkml.kernel.org/r/20190726033705.32307-1-baijiaju1990@gmail.com
    Signed-off-by: Jia-Ju Bai
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jia-Ju Bai
     
  • In ocfs2_xa_prepare_entry(), there is an if statement on line 2136 to
    check whether loc->xl_entry is NULL:

    if (loc->xl_entry)

    When loc->xl_entry is NULL, it is used on line 2158:

    ocfs2_xa_add_entry(loc, name_hash);
    loc->xl_entry->xe_name_hash = cpu_to_le32(name_hash);
    loc->xl_entry->xe_name_offset = cpu_to_le16(loc->xl_size);

    and line 2164:

    ocfs2_xa_add_namevalue(loc, xi);
    loc->xl_entry->xe_value_size = cpu_to_le64(xi->xi_value_len);
    loc->xl_entry->xe_name_len = xi->xi_name_len;

    Thus, possible null-pointer dereferences may occur.

    To fix these bugs, if loc->xl_entry is NULL, ocfs2_xa_prepare_entry()
    returns early with -EINVAL.

    These bugs were found by STCheck, a static analysis tool written by us.

    [akpm@linux-foundation.org: remove now-unused ocfs2_xa_add_entry()]
    Link: http://lkml.kernel.org/r/20190726101447.9153-1-baijiaju1990@gmail.com
    Signed-off-by: Jia-Ju Bai
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jia-Ju Bai
     
  • The unused portion of a partially written fs-block-sized block is not
    zeroed in an unaligned append direct write. This can lead to serious
    data inconsistencies.

    Ocfs2 manages the disk in units of the cluster size (for example, 1M).
    A partial write within one cluster changes the cluster state from
    UN-WRITTEN to WRITTEN. The VFS (function dio_zero_block) then doesn't do
    the cleaning, because the bh's state is not set to NEW in
    ocfs2_dio_wr_get_block when we write a WRITTEN cluster. For example, if
    the cluster size is 1M, the file size is 8k and we direct write from 14k
    to 15k, then 12k~14k and 15k~16k will contain dirty data.

    We have to deal with two cases:
    1. The starting position of the direct write is outside the file.
    2. The starting position of the direct write is located in the file.

    We need to set the bh's state to NEW in the first case. In the second
    case, we need to map twice, because the bh's state should be set to NEW
    for the area outside the file but not for the area inside it.
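
    The byte ranges in the example can be worked out with a small helper.
    This only computes which bytes would be left stale. It is not how the
    fix works (the fix relies on marking the bh NEW so that the VFS does
    the zeroing), and the 4k fs block size is an assumption that the
    example above implies but does not state.

```c
#include <stdint.h>

struct toy_range { uint64_t start, end; };  /* half-open [start, end) */

/*
 * Hypothetical helper: for a direct write of `len` bytes at `pos`, compute
 * the parts of the touched fs blocks that lie beyond the old file size and
 * are not covered by the write itself. Empty ranges have start == end.
 */
static void toy_stale_ranges(uint64_t block_size, uint64_t i_size,
                             uint64_t pos, uint64_t len,
                             struct toy_range *head, struct toy_range *tail)
{
    uint64_t end = pos + len;
    uint64_t blk_start = pos - pos % block_size;
    uint64_t blk_end = (end + block_size - 1) / block_size * block_size;

    /* Head: from the block start (or old EOF, if later) up to the write. */
    head->start = blk_start > i_size ? blk_start : i_size;
    head->end = pos;
    if (head->start > head->end)
        head->start = head->end;

    /* Tail: from the end of the write (or old EOF) to the block end. */
    tail->start = end > i_size ? end : i_size;
    tail->end = blk_end;
    if (tail->start > tail->end)
        tail->start = tail->end;
}
```

    For a 4k block, an 8k file and a write of 14k~15k this yields 12k~14k
    and 15k~16k. A write that lies entirely inside the file yields two
    empty ranges.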

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/5292e287-8f1a-fd4a-1a14-661e555e0bed@huawei.com
    Signed-off-by: Jia Guo
    Reviewed-by: Yiwen Jiang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jia Guo
     

25 Sep, 2019

4 commits

  • There is a spelling mistake in a mlog_bug_on_msg message. Fix it.

    Link: http://lkml.kernel.org/r/831bdff4-064e-038b-f45d-c4d265cbff1e@linux.alibaba.com
    Signed-off-by: Colin Ian King
    Acked-by: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • Appending to the truncate log (TA) and flushing the truncate log (TF)
    are two separate transactions. Both can be committed but not yet
    checkpointed. If a crash occurs at that point, both transactions will be
    replayed, with several clusters already released to the global bitmap.
    The truncate log will then be replayed, resulting in a cluster double
    free.

    To reproduce this issue, just crash the host while punching hole to files.

    Signed-off-by: Changwei Ge
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Changwei Ge
     
  • There is a scenario that causes ocfs2 umount to hang when multiple hosts
    are rebooting at the same time.

    NODE1                                  NODE2     NODE3
    send unlock request to NODE2
                                           dies
                                                     become recovery master
                                                     recover NODE2
    find NODE2 dead
    mark resource RECOVERING
    directly remove lock from grant list
    calculate usage but RECOVERING marked
    **miss the window of purging
    clear RECOVERING

    To reproduce this issue, crash a host and then umount ocfs2
    from another node.

    To solve this, just let the unlock path wait until recovery is done.

    Link: http://lkml.kernel.org/r/1550124866-20367-1-git-send-email-gechangwei@live.cn
    Signed-off-by: Changwei Ge
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Changwei Ge
     
  • brelse() tests whether its argument is NULL and then returns immediately.
    Thus the tests around the shown calls are not needed.

    This issue was detected by using the Coccinelle software.
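
    The redundancy can be seen in a toy analogue. toy_brelse() below is a
    hypothetical stand-in that copies only the NULL check described above.
    It is not the kernel's brelse().

```c
#include <stddef.h>

struct toy_bh { int b_count; };  /* toy type, not struct buffer_head */

/* Toy analogue of brelse(): a NULL argument is a no-op. */
static void toy_brelse(struct toy_bh *bh)
{
    if (bh)
        bh->b_count--;
}

/* Before: the caller repeats a test the callee already performs. */
static void toy_cleanup_checked(struct toy_bh *bh)
{
    if (bh)
        toy_brelse(bh);
}

/* After: same behavior with the redundant test removed. */
static void toy_cleanup_plain(struct toy_bh *bh)
{
    toy_brelse(bh);
}
```

    Both cleanup variants behave identically for NULL and non-NULL
    arguments, so the outer test can be dropped.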

    Link: http://lkml.kernel.org/r/55cde320-394b-f985-56ce-1a2abea782aa@web.de
    Signed-off-by: Markus Elfring
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring