05 Apr, 2016

1 commit

  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    That promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They
    are not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> dropped (the shift is zero);

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> dropped (the shift is zero);

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files; I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)
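
    As a hedged illustration only (the function names below are made up,
    not taken from the patch), this is what the conversion does to a
    typical call site; the "before" helpers only exist in pre-4.6 kernels:

    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* Before: page-cache flavoured names (PAGE_CACHE_SHIFT == PAGE_SHIFT). */
    static unsigned long old_index_for_pos(loff_t pos)
    {
            return pos >> PAGE_CACHE_SHIFT;
    }

    static void old_drop_page(struct page *page)
    {
            page_cache_release(page);       /* alias for put_page() */
    }

    /* After: plain PAGE_* macros and get_page()/put_page(). */
    static unsigned long new_index_for_pos(loff_t pos)
    {
            return pos >> PAGE_SHIFT;
    }

    static void new_drop_page(struct page *page)
    {
            put_page(page);
    }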

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 Mar, 2016

3 commits

  • This patch fixes a deadlock. The scenario involves three nodes:

    Node 1: volumes a and b are mounted.
    Node 2: only vol a is mounted.
    Node 3: only vol b is mounted.

    1) Node 2 starts to mount vol b; Node 3 starts to mount vol a.

    2) While mounting, Node 2 checks the heartbeat of Node 3 in vol a and
       takes a quorum hold (qs_holds++); Node 3 checks the heartbeat of
       Node 2 in vol b and does the same.

    3) The network between all nodes goes down.

    4) On Node 2, the mount of vol b fails and ocfs2_dismount_volume is
       called, but the process hangs because a work item in ocfs2_wq cannot
       complete. That work item belongs to vol a, because ocfs2_wq is a
       global workqueue. The work in question is ocfs2_orphan_scan_work,
       whose context needs to take the inode lock of the orphan_dir; the
       lockres owner is Node 1 and the network between all nodes is down,
       so the lock can never be acquired. Node 3 is in the same situation
       as Node 2.

    5) Why can't these nodes be fenced when the network is disconnected?
       Because the hung mount process keeps qs_holds from reaching 0.

    All the work items in ocfs2_wq are related to a particular super block,
    so the solution is to change ocfs2_wq from a global workqueue to a
    per-mount one; in other words, move it into struct ocfs2_super.
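
    A minimal sketch of that change, with assumed field and function names
    (the real patch touches many more call sites):

    #include <linux/errno.h>
    #include <linux/workqueue.h>

    struct ocfs2_super {
            /* ... existing members ... */
            struct workqueue_struct *ocfs2_wq;      /* was a single global wq */
    };

    static int ocfs2_init_per_sb_wq(struct ocfs2_super *osb)
    {
            osb->ocfs2_wq = create_singlethread_workqueue("ocfs2_wq");
            if (!osb->ocfs2_wq)
                    return -ENOMEM;
            return 0;
    }

    static void ocfs2_exit_per_sb_wq(struct ocfs2_super *osb)
    {
            if (osb->ocfs2_wq)
                    destroy_workqueue(osb->ocfs2_wq);
    }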

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Xue jiufei
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • In the current implementation of unaligned aio+dio, the lock ordering
    behaves as follows:

    in user process context:
      -> call io_submit()
        -> get i_mutex
             <== 'window1'
          -> get ip_unaligned_aio
            -> submit direct io to block device
        -> release i_mutex
      -> io_submit() returns

    in dio work queue context (the work queue is created in
    __blockdev_direct_IO):
      -> release ip_unaligned_aio
           <== 'window2'
        -> get i_mutex
          -> clear unwritten flag & change i_size
        -> release i_mutex

    There is a limit on the number of threads in the dio work queue, 256 by
    default. If all 256 threads are in the 'window2' stage above and a user
    process is in the 'window1' stage, the system deadlocks: the user
    process holds i_mutex while waiting for the ip_unaligned_aio lock, a
    completed direct bio holds the ip_unaligned_aio mutex while waiting for
    a dio work queue thread to be scheduled, and all the dio work queue
    threads are waiting for i_mutex in 'window2'.

    This case only happens in a test which sends a large number (more than
    256) of aios in one io_submit() call.

    The fix is to remove the ip_unaligned_aio lock and make the unaligned
    aio dio synchronous instead. Like the ip_unaligned_aio lock did, this
    serializes unaligned aio dio.

    [akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
    Signed-off-by: Ryan Ding
    Reviewed-by: Junxiao Bi
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryan Ding
     
  • This is to support direct io in ocfs2_write_begin_nolock &
    ocfs2_write_end_nolock.

    There is still one issue in the direct write procedure:

    phase 1: alloc extent with the UNWRITTEN flag
    phase 2: submit direct data to disk, add zero pages to the page cache
    phase 3: clear the UNWRITTEN flag when the data has been written to disk

    Consider two direct writes, A (0~3KB) and B (4~7KB), writing to the
    same cluster 0~7KB (cluster size 8KB). Write request A arrives at
    phase 2 first and zeroes the region 4~7KB. Before request A enters
    phase 3, request B arrives at phase 2 and zeroes the region 0~3KB.
    Request B has effectively stepped on request A.

    To resolve this issue, request B must know that this cluster is already
    being zeroed, to prevent it from stepping on the previous write
    request.

    This patch adds the function ocfs2_unwritten_check() to do this job.
    It records all clusters that are under direct write (in the
    'ip_unwritten_list' member of the inode info) and prevents a later
    direct write to the same cluster from doing the zeroing again.
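
    A minimal sketch of the idea, with assumed structure and function names
    (the real ocfs2 data structures and locking are more involved): keep a
    per-inode list of clusters currently under direct write and let only
    the first writer of a cluster do the zeroing.

    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct unwritten_cluster {
            struct list_head uc_node;
            u32 uc_cpos;            /* cluster position under direct write */
    };

    /* Returns true if @cpos is already tracked, i.e. someone else zeroes it. */
    static bool unwritten_check(struct list_head *ip_unwritten_list,
                                spinlock_t *lock, u32 cpos)
    {
            struct unwritten_cluster *uc, *new_uc;
            bool found = false;

            new_uc = kmalloc(sizeof(*new_uc), GFP_NOFS);
            if (!new_uc)
                    return false;   /* fall back to doing the zeroing ourselves */
            new_uc->uc_cpos = cpos;

            spin_lock(lock);
            list_for_each_entry(uc, ip_unwritten_list, uc_node) {
                    if (uc->uc_cpos == cpos) {
                            found = true;
                            break;
                    }
            }
            if (!found)
                    list_add_tail(&new_uc->uc_node, ip_unwritten_list);
            spin_unlock(lock);

            if (found)
                    kfree(new_uc);
            return found;
    }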

    Signed-off-by: Ryan Ding
    Reviewed-by: Junxiao Bi
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryan Ding
     

16 Mar, 2016

1 commit

  • Commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert lock")
    missed an unmodified place in ocfs2_osb_dump(), so the following
    deadlock scenario still exists:

    ocfs2_wake_downconvert_thread
    ocfs2_rw_unlock
    ocfs2_dio_end_io
    dio_complete
    .....
    bio_endio
    req_bio_endio
    ....
    scsi_io_completion
    blk_done_softirq
    __do_softirq
    do_softirq
    irq_exit
    do_IRQ
    ocfs2_osb_dump
    cat /sys/kernel/debug/ocfs2/${uuid}/fs_state

    This patch again replaces spin_lock() with spin_lock_irqsave() to
    resolve the situation.
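
    The pattern, sketched with made-up names rather than the actual
    ocfs2_osb_dump() hunk: when a spinlock can also be taken from
    softirq/irq context, the process-context user must disable interrupts
    while holding it, otherwise the interrupt can deadlock on the lock the
    interrupted code already owns.

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(dc_task_lock);   /* assumed name, for illustration */
    static unsigned long blocked_count;

    /* process context, e.g. a debugfs "dump state" handler */
    static unsigned long dump_blocked_count(void)
    {
            unsigned long flags, count;

            spin_lock_irqsave(&dc_task_lock, flags);        /* was spin_lock() */
            count = blocked_count;
            spin_unlock_irqrestore(&dc_task_lock, flags);   /* was spin_unlock() */

            return count;
    }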

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     

15 Jan, 2016

3 commits

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems override the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to the "account everything" approach
    and keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not, in fact, account
    everything).
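
    For reference, a small sketch of the two annotation styles the patch
    applies; the cache name and struct here are illustrative, not taken
    from the patch:

    #include <linux/errno.h>
    #include <linux/slab.h>

    struct example_obj {
            int payload;
    };

    static struct kmem_cache *example_cachep;

    static int example_cache_init(void)
    {
            /* Dedicated cache: pass SLAB_ACCOUNT so objects allocated from
             * it are charged to the allocating task's memcg. */
            example_cachep = kmem_cache_create("example_obj",
                                               sizeof(struct example_obj), 0,
                                               SLAB_ACCOUNT, NULL);
            return example_cachep ? 0 : -ENOMEM;
    }

    static struct example_obj *example_alloc(void)
    {
            /* One-off allocation: add __GFP_ACCOUNT to the gfp mask instead. */
            return kmalloc(sizeof(struct example_obj),
                           GFP_KERNEL | __GFP_ACCOUNT);
    }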

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Since iput() takes care of the NULL check itself, a NULL check before
    calling it is redundant. So clean them up.
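
    The pattern being removed, shown as an illustrative helper rather than
    a specific hunk from the patch:

    #include <linux/fs.h>

    static void example_cleanup(struct inode *inode)
    {
            /*
             * Before the cleanup, call sites looked like:
             *
             *      if (inode)
             *              iput(inode);
             *
             * iput() already returns immediately for a NULL inode, so the
             * check can simply go away:
             */
            iput(inode);
    }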

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • In ocfs2_parse_options:

    a) it's better to declare small variables outside of the while loop;

    b) 'option' will be set by match_int(), so 'option = 0;' makes no
    sense; if match_int() fails, the code just goes to bail and returns.
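
    For context, a minimal sketch of the match_token()/match_int() pattern
    that ocfs2_parse_options follows; the token table and option name here
    are illustrative:

    #include <linux/errno.h>
    #include <linux/parser.h>
    #include <linux/string.h>

    enum { Opt_commit_interval, Opt_err };

    static const match_table_t tokens = {
            { Opt_commit_interval, "commit=%u" },
            { Opt_err, NULL },
    };

    static int parse_options(char *options, unsigned int *commit_interval)
    {
            char *p;
            substring_t args[MAX_OPT_ARGS];
            int option;     /* no need to pre-initialise it to 0 */

            while ((p = strsep(&options, ",")) != NULL) {
                    if (!*p)
                            continue;
                    switch (match_token(p, tokens, args)) {
                    case Opt_commit_interval:
                            /* 'option' is only valid if match_int() succeeds */
                            if (match_int(&args[0], &option))
                                    return -EINVAL;
                            *commit_interval = option;
                            break;
                    default:
                            return -EINVAL;
                    }
            }
            return 0;
    }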

    Signed-off-by: Norton.Zhu
    Reviewed-by: Joseph Qi
    Cc: Gang He
    Cc: Mark Fasheh
    Acked-by: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Norton.Zhu
     

05 Sep, 2015

4 commits

  • Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.
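
    A sketch of what an updated handler looks like; the filesystem, option
    name and field are illustrative, only seq_show_option() comes from the
    patch:

    #include <linux/fs.h>
    #include <linux/seq_file.h>

    struct example_sb_info {
            char *label;            /* user-controlled string */
    };

    static int example_show_options(struct seq_file *m, struct dentry *root)
    {
            struct example_sb_info *sbi = root->d_sb->s_fs_info;

            /*
             * Instead of seq_printf(m, ",label=%s", sbi->label), which would
             * copy newlines and other control characters straight into
             * /proc/mounts, let the helper escape the value.
             */
            seq_show_option(m, "label", sbi->label);
            return 0;
    }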

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • These uses sometimes do and sometimes don't have '\n' terminations. Make
    the uses consistently use '\n' terminations and remove the newline from
    the functions.

    Miscellanea:

    o Coalesce formats
    o Realign arguments

    Signed-off-by: Joe Perches
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • OCFS2 is often used in high-availability systems. However, ocfs2
    converts the filesystem to read-only at the drop of a hat. This may not
    be necessary, since turning the filesystem read-only affects other
    running processes as well, decreasing availability.

    This patch adds errors=continue, which returns EIO to the calling
    process and terminates further processing so that the filesystem is not
    corrupted further. The filesystem is not converted to read-only.

    As a future plan, I intend to create a small utility or extend
    fsck.ocfs2 to fix small errors such as those in an inode. The input to
    the utility, such as the inode, can come from the kernel logs, so we
    don't have to schedule downtime for fixing small-enough errors.

    The patch changes ocfs2_error() to return an error. The error returned
    depends on the mount option set; if none is set, the default is to turn
    the filesystem read-only.
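
    A rough sketch of the behaviour change (flag and function names are
    assumed, not from the patch): the error helper now picks its return
    value from the errors= mount option instead of unconditionally forcing
    the filesystem read-only.

    #include <linux/errno.h>
    #include <linux/fs.h>

    #define EXAMPLE_MOUNT_ERRORS_CONT 0x1   /* errors=continue (assumed flag) */

    static int example_handle_error(struct super_block *sb,
                                    unsigned long mount_opt)
    {
            if (mount_opt & EXAMPLE_MOUNT_ERRORS_CONT)
                    return -EIO;    /* report to the caller, keep the fs writable */

            /* default: historical behaviour, force the filesystem read-only */
            sb->s_flags |= MS_RDONLY;
            return -EROFS;
    }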

    Perhaps errors=continue is not the best option name. Historically it is
    used for making an attempt to progress in the current process itself.
    Should we call it errors=eio? or errors=killproc? Suggestions/Comments
    welcome.

    Sources are available at:
    https://github.com/goldwynr/linux/tree/error-cont

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • During direct io the inode is first added to the orphan dir and later
    deleted from it. There is a race window in which the orphan entry can
    be deleted twice, triggering the BUG when validating
    OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.

    ocfs2_direct_IO_write
      ...
      ocfs2_add_inode_to_orphan
      >>>>>>>> race window.
               1) another node may rm the file and then go down; this node
                  takes care of orphan recovery and clears the flag
                  OCFS2_DIO_ORPHANED_FL.
               2) since the rw lock is unlocked, it may race with another
                  orphan recovery and append dio.
      ocfs2_del_inode_from_orphan

    So take the inode mutex lock when recovering orphans, and do the rw
    unlock at the end of aio write in the case of append dio.

    Signed-off-by: Joseph Qi
    Reported-by: Yiwen Jiang
    Cc: Weiwei Wang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

22 Apr, 2015

1 commit

  • This reverts commit e2ac55b6a8e337fac7cc59c6f452caac92ab5ee6.

    Huang Ying reports that this causes a hang at boot with debugfs disabled.

    It is true that the debugfs error checks are kind of confusing, and this
    code certainly merits more cleanup and thinking about it, but there's
    something wrong with the trivial "check not just for NULL, but for error
    pointers too" patch.

    Yes, with debugfs disabled, we will end up setting the o2hb_debug_dir
    pointer variable to an error pointer (-ENODEV), and then continue as if
    everything was fine. But since debugfs is disabled, all the _users_ of
    that pointer end up being compiled away, so even though the pointer can
    not be dereferenced, that's still fine.

    So it's confusing and somewhat questionable, but the "more correct"
    error checks end up causing more trouble than they fix.

    Reported-by: Huang Ying
    Acked-by: Andrew Morton
    Acked-by: Chengyu Song
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Apr, 2015

4 commits

  • Use super_block->s_uuid instead. Every shared filesystem using
    cleancache must now initialize super_block->s_uuid before calling
    cleancache_init_shared_fs. The only one in the tree, ocfs2, already
    meets this requirement.

    Signed-off-by: Vladimir Davydov
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Stefan Hengelein
    Cc: Florian Schmaus
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, the maximal number of cleancache-enabled filesystems is 32,
    which is insufficient nowadays, because a Linux host can have hundreds
    of containers on board, each of which might want its own filesystem.
    This patch set targets removing this limitation; see patch 4 for more
    details. Patches 1-3 prepare the code for this change.

    This patch (of 4):

    This will allow us to remove the uuid argument from
    cleancache_init_shared_fs.

    Signed-off-by: Vladimir Davydov
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Stefan Hengelein
    Cc: Florian Schmaus
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Use the vsprintf %pV extension to avoid using a static buffer and remove
    the now unnecessary buffer.
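
    A sketch of the %pV technique (the wrapper name is illustrative):
    forward a printf-style format plus its va_list to printk without
    formatting into an intermediate static buffer.

    #include <linux/kernel.h>
    #include <linux/printk.h>

    static __printf(1, 2) void example_log(const char *fmt, ...)
    {
            struct va_format vaf;
            va_list args;

            va_start(args, fmt);
            vaf.fmt = fmt;
            vaf.va = &args;
            printk(KERN_ERR "example: %pV\n", &vaf);
            va_end(args);
    }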

    Signed-off-by: Joe Perches
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • debugfs_create_dir and debugfs_create_file may return -ENODEV when
    debugfs is not configured, so the return value should be checked
    against an error pointer as well; otherwise a later dereference of the
    dentry pointer would crash the kernel.

    This patch tries to solve the problem by fixing certain checks.
    However, I have found that other call sites are protected by #ifdef
    CONFIG_DEBUG_FS. In the current implementation, if CONFIG_DEBUG_FS is
    defined, the above two functions will never return an error pointer.
    So another possibility would be to surround all the buggy
    checks/functions with the same #ifdef CONFIG_DEBUG_FS, but I'm not sure
    whether that would break any functionality, as only OCFS2_FS_STATS
    declares a dependency on DEBUG_FS.
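
    The kind of check the patch adds, sketched with a made-up directory
    name: treat both NULL and an ERR_PTR return from debugfs as failure
    before the dentry is ever dereferenced.

    #include <linux/debugfs.h>
    #include <linux/err.h>
    #include <linux/errno.h>

    static struct dentry *example_debug_root;

    static int example_debugfs_init(void)
    {
            example_debug_root = debugfs_create_dir("example", NULL);
            if (IS_ERR_OR_NULL(example_debug_root))
                    return -ENOMEM;
            return 0;
    }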

    Signed-off-by: Chengyu Song
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chengyu Song
     

17 Feb, 2015

1 commit

  • If one node has crashed with an orphan entry left over, another node
    doing an append O_DIRECT write to the same file will override the
    i_dio_orphaned_slot, and the old entry will never be cleaned up. If
    this case happens, we let it wait for orphan recovery first.

    Signed-off-by: Joseph Qi
    Cc: Weiwei Wang
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Xuejiufei
    Cc: alex chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

11 Feb, 2015

2 commits

  • Merge misc updates from Andrew Morton:
    "Bite-sized chunks this time, to avoid the MTA ratelimiting woes.

    - fs/notify updates

    - ocfs2

    - some of MM"

    That laconic "some MM" is mainly the removal of remap_file_pages(),
    which is a big simplification of the VM, and which gets rid of a *lot*
    of random cruft and special cases because we no longer support the
    non-linear mappings that it used.

    From a user interface perspective, nothing has changed, because the
    remap_file_pages() syscall still exists, it's just done by emulating the
    old behavior by creating a lot of individual small mappings instead of
    one non-linear one.

    The emulation is slower than the old "native" non-linear mappings, but
    nobody really uses or cares about remap_file_pages(), and simplifying
    the VM is a big advantage.

    * emailed patches from Andrew Morton : (78 commits)
    memcg: zap memcg_slab_caches and memcg_slab_mutex
    memcg: zap memcg_name argument of memcg_create_kmem_cache
    memcg: zap __memcg_{charge,uncharge}_slab
    mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check
    mm: hugetlb: fix type of hugetlb_treat_as_movable variable
    mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?
    mm: memory: merge shared-writable dirtying branches in do_wp_page()
    mm: memory: remove ->vm_file check on shared writable vmas
    xtensa: drop _PAGE_FILE and pte_file()-related helpers
    x86: drop _PAGE_FILE and pte_file()-related helpers
    unicore32: drop pte_file()-related helpers
    um: drop _PAGE_FILE and pte_file()-related helpers
    tile: drop pte_file()-related helpers
    sparc: drop pte_file()-related helpers
    sh: drop _PAGE_FILE and pte_file()-related helpers
    score: drop _PAGE_FILE and pte_file()-related helpers
    s390: drop pte_file()-related helpers
    parisc: drop _PAGE_FILE and pte_file()-related helpers
    openrisc: drop _PAGE_FILE and pte_file()-related helpers
    nios2: drop _PAGE_FILE and pte_file()-related helpers
    ...

    Linus Torvalds
     
  • Add a mount option to support the JBD2 feature
    JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT. When this feature is enabled, the
    journal commit block can be written to disk without waiting for the
    descriptor blocks, which can improve journal commit performance. This
    option enables 'journal_checksum' internally.

    Using the fs_mark benchmark, journal_async_commit shows about a 50%
    improvement: files per second go up from 215.2 to 317.5.

    test script:
    fs_mark -d /mnt/ocfs2/ -s 10240 -n 1000

    default:
    FSUse%    Count    Size    Files/sec    App Overhead
         0     1000   10240        215.2           17878

    with journal_async_commit option:
    FSUse%    Count    Size    Files/sec    App Overhead
         0     1000   10240        317.5           17881

    Signed-off-by: Alex Chen
    Signed-off-by: Weiwei Wang
    Reviewed-by: Joseph Qi
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    alex chen
     

30 Jan, 2015

1 commit

  • Ocfs2 can just use the generic helpers provided by quota code for
    turning quotas on and off when quota files are stored as system inodes.
    The only difference is the feature test in ocfs2_quota_on() and that is
    covered by dquot_quota_enable() checking whether usage tracking is
    enabled (which can happen only if the filesystem has the quota feature
    set).

    Signed-off-by: Jan Kara

    Jan Kara
     

11 Dec, 2014

2 commits

  • Merge first patchbomb from Andrew Morton:
    - a few minor cifs fixes
    - dma-debug updates
    - ocfs2
    - slab
    - about half of MM
    - procfs
    - kernel/exit.c
    - panic.c tweaks
    - printk updates
    - lib/ updates
    - checkpatch updates
    - fs/binfmt updates
    - the drivers/rtc tree
    - nilfs
    - kmod fixes
    - more kernel/exit.c
    - various other misc tweaks and fixes

    * emailed patches from Andrew Morton : (190 commits)
    exit: pidns: fix/update the comments in zap_pid_ns_processes()
    exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
    exit: exit_notify: re-use "dead" list to autoreap current
    exit: reparent: call forget_original_parent() under tasklist_lock
    exit: reparent: avoid find_new_reaper() if no children
    exit: reparent: introduce find_alive_thread()
    exit: reparent: introduce find_child_reaper()
    exit: reparent: document the ->has_child_subreaper checks
    exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
    exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
    exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
    exit: proc: don't try to flush /proc/tgid/task/tgid
    exit: release_task: fix the comment about group leader accounting
    exit: wait: drop tasklist_lock before psig->c* accounting
    exit: wait: don't use zombie->real_parent
    exit: wait: cleanup the ptrace_reparented() checks
    usermodehelper: kill the kmod_thread_locker logic
    usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
    fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
    nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
    ...

    Linus Torvalds
     
  • Error handling when creation of the debugfs root in ocfs2_init() fails
    is broken. Although the error code is set, we fail to exit ocfs2_init()
    with an error, and thus initialization ends with success. Later, when
    mounting a filesystem, ocfs2 debugfs entries end up being created in
    the root of the debugfs filesystem, which is confusing.

    Fix the error handling to bail out.

    Coverity id: 1227009.

    Signed-off-by: Jan Kara
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

11 Oct, 2014

1 commit

  • Pull UDF and quota updates from Jan Kara:
    "A few UDF fixes and also a few patches which are preparing filesystems
    for support of project quotas in VFS"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    udf: Fix loading of special inodes
    ocfs2: Back out change to use OCFS2_MAXQUOTAS in ocfs2_setattr()
    udf: remove redundant sys_tz declaration
    ocfs2: Don't use MAXQUOTAS value
    reiserfs: Don't use MAXQUOTAS value
    ext3: Don't use MAXQUOTAS value
    udf: Fix race between write(2) and close(2)

    Linus Torvalds
     

26 Sep, 2014

1 commit

  • osb->vol_label is malloced in ocfs2_initialize_super but not freed if
    an error occurs or during umount, thus causing a memory leak.

    Signed-off-by: Joseph Qi
    Reviewed-by: joyce.xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

17 Sep, 2014

1 commit

  • The MAXQUOTAS value defines the maximum number of quota types the VFS
    supports. This isn't necessarily the number of types ocfs2 supports,
    and with the addition of project quotas these two numbers stop
    matching. So make ocfs2 use its own private definition.
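
    The private definition amounts to a one-line constant; sketched here
    from the description above (ocfs2 only supports user and group quotas),
    so treat the exact value as an assumption:

    /* fs-private limit, decoupled from the VFS-wide MAXQUOTAS */
    #define OCFS2_MAXQUOTAS 2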

    CC: Mark Fasheh
    CC: Joel Becker
    CC: ocfs2-devel@oss.oracle.com
    Signed-off-by: Jan Kara

    Jan Kara
     

24 Jun, 2014

1 commit

  • 75f82eaa502c ("ocfs2: fix NULL pointer dereference when dismount and
    ocfs2rec simultaneously") may cause umount to hang while shutting down
    the truncate log.

    The situation is as follows:

    ocfs2_dismount_volume
      -> ocfs2_recovery_exit
           -> free osb->recovery_map
      -> ocfs2_truncate_shutdown
           -> lock global bitmap inode
           -> ocfs2_wait_for_recovery
                -> check whether osb->recovery_map->rm_used is zero

    Because osb->recovery_map has already been freed, rm_used can hold any
    value, so the umount may hang.

    Signed-off-by: joyce.xue
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     

05 Apr, 2014

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Major changes for 3.14 include support for the newly added ZERO_RANGE
    and COLLAPSE_RANGE fallocate operations, and scalability improvements
    in the jbd2 layer and in xattr handling when the extended attributes
    spill over into an external block.

    Other than that, the usual clean ups and minor bug fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
    ext4: fix premature freeing of partial clusters split across leaf blocks
    ext4: remove unneeded test of ret variable
    ext4: fix comment typo
    ext4: make ext4_block_zero_page_range static
    ext4: atomically set inode->i_flags in ext4_set_inode_flags()
    ext4: optimize Hurd tests when reading/writing inodes
    ext4: kill i_version support for Hurd-castrated file systems
    ext4: each filesystem creates and uses its own mb_cache
    fs/mbcache.c: doucple the locking of local from global data
    fs/mbcache.c: change block and index hash chain to hlist_bl_node
    ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
    ext4: refactor ext4_fallocate code
    ext4: Update inode i_size after the preallocation
    ext4: fix partial cluster handling for bigalloc file systems
    ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
    ext4: only call sync_filesystm() when remounting read-only
    fs: push sync_filesystem() down to the file system's remount_fs()
    jbd2: improve error messages for inconsistent journal heads
    jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
    jbd2: minimize region locked by j_list_lock in journal_get_create_access()
    ...

    Linus Torvalds
     

04 Apr, 2014

6 commits

  • The following case may lead to confusion over the same system inode
    ref:

    A thread                              B thread
    ocfs2_get_system_file_inode
      ->get_local_system_inode
        ->_ocfs2_get_system_file_inode
                                          because of *arr == NULL,
                                          ocfs2_get_system_file_inode
                                            ->get_local_system_inode
                                              ->_ocfs2_get_system_file_inode
    gets first ref thru
    _ocfs2_get_system_file_inode,
    gets second ref thru igrab and
    sets *arr = inode
                                          at this moment, B thread also gets
                                          two refs, which leads to one extra
                                          inode ref.

    So add a mutex lock to avoid multiple threads each taking an extra
    inode ref for the same slot at the same time.

    Signed-off-by: jiangyiwen
    Reviewed-by: Joseph Qi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • The following patches are reverted in this patch because they caused a
    performance regression in remote unlink() calls:

    ea455f8ab683 - ocfs2: Push out dropping of dentry lock to ocfs2_wq
    f7b1aa69be13 - ocfs2: Fix deadlock on umount
    5fd131893793 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount

    Previous patches in this series removed the possible deadlocks from the
    downconvert thread, so the above patches shouldn't be needed anymore.

    The regression is caused by these patches delaying the iput() in the
    case of dentry unlocks. This also delays the unlocking of the open
    lockres, which is required to test whether the inode can be wiped from
    disk or not. When the deleting node does not get the open lock, it
    marks the inode as orphaned (even though it is not in use by another
    node/process) and causes a journal checkpoint. This delays operations
    following the inode eviction, moves the inode to the orphan directory,
    and causes more I/O and a lot of unnecessary orphans.

    The following script can be used to generate the load causing issues:

    declare -a create
    declare -a remove
    declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
    unique="`mktemp -u XXXXX`"
    script="/tmp/idontknow-${unique}.sh"
    cat <<EOF > "${script}"
    for n in {1..8}; do mkdir -p test/dir\${n}
    eval touch test/dir\${n}/foo{1.."\$1"}
    done
    EOF
    chmod 700 "${script}"

    function fcreate ()
    {
    exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
    }

    function fremove ()
    {
    exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
    }

    function fcp ()
    {
    exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
    }

    echo -------------------------------------------------
    echo "| # files | create #s | copy #s | remove #s |"
    echo -------------------------------------------------
    for ((x=0; x < ${#iterations[*]} ; x++)) do
    create[$x]="`fcreate ${iterations[$x]}`"
    copy[$x]="`fcp ${iterations[$x]}`"
    remove[$x]="`fremove`"
    printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
    done
    rm "${script}"
    echo "------------------------"

    Signed-off-by: Srinivas Eeda
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jan Kara
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • We cannot drop the last dquot reference from the downconvert thread, as
    that creates the following deadlock:

    NODE 1                                  NODE 2
    holds dentry lock for 'foo'
    holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
                                            dquot_initialize(bar)
                                              ocfs2_dquot_acquire()
                                                ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                                                ...
    downconvert thread (triggered from another
    node or a different process from NODE 2)
      ocfs2_dentry_post_unlock()
        ...
        iput(foo)
          ocfs2_evict_inode(foo)
            ocfs2_clear_inode(foo)
              dquot_drop(inode)
                ...
                ocfs2_dquot_release()
                  ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                   - blocks
                                            finds we need more space in
                                            quota file
                                            ...
                                            ocfs2_extend_no_holes()
                                              ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
                                                - deadlocks waiting for
                                                  downconvert thread

    We solve the problem by postponing dropping of the last dquot reference to
    a workqueue if it happens from the downconvert thread.

    Signed-off-by: Jan Kara
    Reviewed-by: Mark Fasheh
    Reviewed-by: Srinivas Eeda
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently, ocfs2_sync_file grabs i_mutex and forces the current journal
    transaction to complete. This isn't terribly efficient, since sync_file
    really only needs to wait for the last transaction involving that inode
    to complete, and this doesn't require i_mutex.

    Therefore, implement the necessary bits to track the newest tid
    associated with an inode, and teach sync_file to wait for that instead
    of waiting for everything in the journal to commit. Furthermore, only
    issue the flush request to the drive if jbd2 hasn't already done so.

    This also eliminates the deadlock between ocfs2_file_aio_write() and
    ocfs2_sync_file(). aio_write takes i_mutex then calls
    ocfs2_aiodio_wait() to wait for unaligned dio writes to finish.
    However, if that dio completion involves calling fsync, then we can get
    into trouble when some ocfs2_sync_file tries to take i_mutex.
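
    A minimal sketch of the "wait only for the inode's own transaction"
    idea; the i_sync_tid field name and the surrounding structure are
    assumptions for illustration:

    #include <linux/fs.h>
    #include <linux/jbd2.h>

    struct example_inode_info {
            struct inode    vfs_inode;
            tid_t           i_sync_tid;     /* tid of the last transaction
                                               that touched this inode */
    };

    static int example_fsync(struct example_inode_info *ei, journal_t *journal)
    {
            /*
             * Wait for the commit covering this inode (starting it if it
             * hasn't begun), instead of forcing everything currently in
             * the journal out to disk.
             */
            return jbd2_complete_transaction(journal, ei->i_sync_tid);
    }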

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • Variable uuid_net_key in ocfs2_initialize_super() is not used. Clean it
    up.

    Signed-off-by: joyce.xue
    Signed-off-by: Joseph Qi
    Acked-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    joyce.xue
     
  • There is a problem that waitqueue_active() may check stale data and
    thus miss a wakeup of threads waiting on ip_unaligned_aio.

    The only valid values of ip_unaligned_aio are 0 and 1, so we can change
    it to a mutex, and the above problem is avoided. Another benefit is
    that a mutex, which works as a FIFO, is fairer than wake_up_all().

    Signed-off-by: Wengang Wang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wengang Wang
     

13 Mar, 2014

1 commit

  • Previously, the no-op "mount -o remount /dev/xxx" operation when the
    file system is already mounted read-write caused an implied,
    unconditional syncfs(). This seems pretty stupid, and it's certainly
    not documented or guaranteed to do this, nor is it particularly useful,
    except in the case where the file system was mounted rw and is getting
    remounted read-only.

    However, it's possible that there might be some file systems that are
    actually depending on this behavior. In most file systems, it's
    probably fine to only call sync_filesystem() when transitioning from
    read-write to read-only, and there are some file systems where this is
    not needed at all (for example, for a pseudo-filesystem or something
    like romfs).
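
    Once the call is pushed down into each filesystem's ->remount_fs(), a
    filesystem could limit the sync to the rw-to-ro transition as suggested
    above; a hedged sketch (the handler name is illustrative, and this is
    one possible policy, not what every filesystem in the series does):

    #include <linux/fs.h>

    static int example_remount(struct super_block *sb, int *flags, char *data)
    {
            /* The VFS no longer calls sync_filesystem() unconditionally;
             * the filesystem decides when it is actually needed. */
            if (*flags & MS_RDONLY)
                    sync_filesystem(sb);

            /* ... parse options and apply the remount ... */
            return 0;
    }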

    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Artem Bityutskiy
    Cc: Adrian Hunter
    Cc: Evgeniy Dushistov
    Cc: Jan Kara
    Cc: OGAWA Hirofumi
    Cc: Anders Larsen
    Cc: Phillip Lougher
    Cc: Kees Cook
    Cc: Mikulas Patocka
    Cc: Petr Vandrovec
    Cc: xfs@oss.sgi.com
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-cifs@vger.kernel.org
    Cc: samba-technical@lists.samba.org
    Cc: codalist@coda.cs.cmu.edu
    Cc: linux-ext4@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: fuse-devel@lists.sourceforge.net
    Cc: cluster-devel@redhat.com
    Cc: linux-mtd@lists.infradead.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-nilfs@vger.kernel.org
    Cc: linux-ntfs-dev@lists.sourceforge.net
    Cc: ocfs2-devel@oss.oracle.com
    Cc: reiserfs-devel@vger.kernel.org

    Theodore Ts'o
     

22 Jan, 2014

1 commit

  • In a 2-node cluster, say Node A and Node B, both nodes mount the same
    ocfs2 volume and create a file 1.

    Node A                                  Node B
    open 1, get open lock
    rm 1, and then add 1 to orphan_dir
    storage link down,
    o2hb_write_timeout
      ->o2quo_disk_timeout
        ->emergency_restart
                                            at this moment, Node B dismounts
                                            and runs ocfs2rec simultaneously:
                                            1) ocfs2_dismount_volume
                                                 ->ocfs2_recovery_exit
                                                   ->wait_event(osb->recovery_event)
                                                   ->flush_workqueue(ocfs2_wq)
                                            2) ocfs2rec
                                                 ->queue_work(&journal->j_recovery_work)
                                                   ->ocfs2_recover_orphans
                                                     ->ocfs2_commit_truncate
                                                       ->queue_delayed_work(&osb->osb_truncate_log_wq)

    In ocfs2_recovery_exit, the workqueue is flushed and then the system
    inodes are released. When doing ocfs2rec, ocfs2_flush_truncate_log is
    called, which tries to get sys_root_inode, and a NULL pointer
    dereference occurs.

    Signed-off-by: Yiwen Jiang
    Signed-off-by: joyce
    Signed-off-by: Joseph Qi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yiwen Jiang