20 Apr, 2016

5 commits

  • Add registration APIs in the clk divider code to return struct
    clk_hw pointers instead of struct clk pointers. This way we hide
    the struct clk pointer from providers unless they need to use
    consumer facing APIs.

    Signed-off-by: Stephen Boyd

    Stephen Boyd
     
  • Now that we have a clk registration API that doesn't return
    struct clks, we need to have some way to hand out struct clks via
    the clk_get() APIs that doesn't involve associating struct clk
    pointers with a struct clk_lookup. Luckily, clkdev already
    operates on struct clk_hw pointers, except for the registration
    facing APIs where it converts struct clk pointers into struct
    clk_hw pointers almost immediately.

    Let's add clk_hw based registration APIs so that we can skip the
    conversion step and provide a way for clk provider drivers to
    operate exclusively on clk_hw structs. This way we clearly
    split the API between consumers and providers.

    Cc: Russell King
    Signed-off-by: Stephen Boyd

    Stephen Boyd
     
  • Now that we have a clk registration API that doesn't return
    struct clks, we need to have some way to hand out struct clks via
    the clk_get() APIs that doesn't involve associating struct clk
    pointers with an OF node. Currently we ask the OF provider to
    give us a struct clk pointer for some clkspec, turn that struct
    clk into a struct clk_hw and then allocate a new struct clk to
    return to the caller.

    Let's add a clk_hw based OF provider hook that returns a struct
    clk_hw directly, so that we skip the intermediate step of
    converting from struct clk to struct clk_hw. Eventually when
    we've converted all OF clk providers to struct clk_hw based APIs
    we can remove the struct clk based ones.

    It should also be noted that we change the onecell provider to
    have a flex array instead of a pointer for the array of clk_hw
    pointers. This allows providers to allocate one structure of the
    correct length in one step instead of two.
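
    A minimal userspace sketch of the flex-array idea, assuming
    hypothetical names (hw_sketch, onecell_sketch, onecell_alloc) that
    only stand in for the kernel structures; the header and the pointer
    array come from one allocation:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct clk_hw. */
struct hw_sketch {
    int id;
};

/* Provider data with a flexible array member as its last field. */
struct onecell_sketch {
    unsigned int num;
    struct hw_sketch *hws[];    /* storage follows the header directly */
};

/* Allocate the header and the pointer array in a single step. */
struct onecell_sketch *onecell_alloc(unsigned int num)
{
    struct onecell_sketch *data;

    data = calloc(1, sizeof(*data) + (size_t)num * sizeof(data->hws[0]));
    if (!data)
        return NULL;
    data->num = num;
    return data;
}
```

    With a plain pointer member the provider would need a second
    allocation for the array and a second error path to unwind.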

    Signed-off-by: Stephen Boyd

    Stephen Boyd
     
  • We've largely split the clk consumer and provider APIs along
    struct clk and struct clk_hw, but clk_register() still returns a
    struct clk pointer for each struct clk_hw that's registered.
    Eventually we'd like to only allocate struct clks when there's a
    user, because struct clk is per-user now, so clk_register() needs
    to change.

    Let's add new APIs to register struct clk_hws, but this time
    we'll hide the struct clk from the caller by returning an int
    error code. Also add an unregistration API that takes the clk_hw
    structure that was passed to the registration API. This way
    provider drivers never have to deal with a struct clk pointer
    unless they're using the clk consumer APIs.
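
    The calling convention can be sketched in plain userspace C. All
    names here (provider_hw_sketch, hw_register_sketch and friends) are
    hypothetical and this is not the kernel implementation; it only
    shows registration reporting an int error code while the caller
    keeps nothing but its own structure:

```c
#include <errno.h>
#include <stddef.h>

#define MAX_HW_SKETCH 4

/* Hypothetical provider-side object; no consumer handle is exposed. */
struct provider_hw_sketch {
    const char *name;
};

static struct provider_hw_sketch *registry[MAX_HW_SKETCH];

/* Returns 0 on success or a negative errno value. */
int hw_register_sketch(struct provider_hw_sketch *hw)
{
    size_t i;

    if (!hw || !hw->name)
        return -EINVAL;
    for (i = 0; i < MAX_HW_SKETCH; i++) {
        if (!registry[i]) {
            registry[i] = hw;
            return 0;
        }
    }
    return -ENOSPC;
}

/* Takes the same structure that was passed to registration. */
void hw_unregister_sketch(struct provider_hw_sketch *hw)
{
    size_t i;

    if (!hw)
        return;
    for (i = 0; i < MAX_HW_SKETCH; i++) {
        if (registry[i] == hw)
            registry[i] = NULL;
    }
}

/* Helper for the sketch only: reports whether hw is in the registry. */
int hw_is_registered_sketch(const struct provider_hw_sketch *hw)
{
    size_t i;

    for (i = 0; i < MAX_HW_SKETCH; i++) {
        if (hw && registry[i] == hw)
            return 1;
    }
    return 0;
}
```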

    Signed-off-by: Stephen Boyd

    Stephen Boyd
     
  • Now that we've converted the only caller over to another clkdev
    API, remove this one.

    Reviewed-by: Andy Shevchenko
    Cc: Russell King
    Signed-off-by: Stephen Boyd

    Stephen Boyd
     

27 Mar, 2016

6 commits

  • Linus Torvalds
     
  • Pull Ceph updates from Sage Weil:
    "There is quite a bit here, including some overdue refactoring and
    cleanup on the mon_client and osd_client code from Ilya, scattered
    writeback support for CephFS and a pile of bug fixes from Zheng, and a
    few random cleanups and fixes from others"

    [ I already decided not to pull this because of it having been rebased
    recently, but ended up changing my mind after all. Next time I'll
    really hold people to it. Oh well. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
    libceph: use KMEM_CACHE macro
    ceph: use kmem_cache_zalloc
    rbd: use KMEM_CACHE macro
    ceph: use lookup request to revalidate dentry
    ceph: kill ceph_get_dentry_parent_inode()
    ceph: fix security xattr deadlock
    ceph: don't request vxattrs from MDS
    ceph: fix mounting same fs multiple times
    ceph: remove unnecessary NULL check
    ceph: avoid updating directory inode's i_size accidentally
    ceph: fix race during filling readdir cache
    libceph: use sizeof_footer() more
    ceph: kill ceph_empty_snapc
    ceph: fix a wrong comparison
    ceph: replace CURRENT_TIME by current_fs_time()
    ceph: scattered page writeback
    libceph: add helper that duplicates last extent operation
    libceph: enable large, variable-sized OSD requests
    libceph: osdc->req_mempool should be backed by a slab pool
    libceph: make r_request msg_size calculation clearer
    ...

    Linus Torvalds
     
  • Pull orangefs filesystem from Mike Marshall.

    This finally merges the long-pending orangefs filesystem, which has been
    much cleaned up with input from Al Viro over the last six months. From
    the documentation file:

    "OrangeFS is an LGPL userspace scale-out parallel storage system. It
    is ideal for large storage problems faced by HPC, BigData, Streaming
    Video, Genomics, Bioinformatics.

    Orangefs, originally called PVFS, was first developed in 1993 by Walt
    Ligon and Eric Blumer as a parallel file system for Parallel Virtual
    Machine (PVM) as part of a NASA grant to study the I/O patterns of
    parallel programs.

    Orangefs features include:

    - Distributes file data among multiple file servers
    - Supports simultaneous access by multiple clients
    - Stores file data and metadata on servers using local file system
    and access methods
    - Userspace implementation is easy to install and maintain
    - Direct MPI support
    - Stateless"

    see Documentation/filesystems/orangefs.txt for more in-depth details.

    * tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: (174 commits)
    orangefs: fix orangefs_superblock locking
    orangefs: fix do_readv_writev() handling of error halfway through
    orangefs: have ->kill_sb() evict the VFS side of things first
    orangefs: sanitize ->llseek()
    orangefs-bufmap.h: trim unused junk
    orangefs: saner calling conventions for getting a slot
    orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointer
    orangefs: get rid of readdir_handle_s
    ornagefs: ensure that truncate has an up to date inode size
    orangefs: move code which sets i_link to orangefs_inode_getattr
    orangefs: remove needless wrapper around GFP_KERNEL
    orangefs: remove wrapper around mutex_lock(&inode->i_mutex)
    orangefs: refactor inode type or link_target change detection
    orangefs: use new getattr for revalidate and remove old getattr
    orangefs: use new getattr in inode getattr and permission
    orangefs: use new orangefs_inode_getattr to get size in write and llseek
    orangefs: use new orangefs_inode_getattr to create new inodes
    orangefs: rename orangefs_inode_getattr to orangefs_inode_old_getattr
    orangefs: remove inode->i_lock wrapper
    orangefs: put register_chrdev immediately before register_filesystem
    ...

    Linus Torvalds
     
  • Pull NTB bug fixes from Jon Mason:
    "NTB bug fixes for tasklet from spinning forever, link errors,
    translation window setup, NULL ptr dereference, and ntb-perf errors.

    Also, a modification to the driver API that makes _addr functions
    optional"

    * tag 'ntb-4.6' of git://github.com/jonmason/ntb:
    NTB: Remove _addr functions from ntb_hw_amd
    NTB: Make _addr functions optional in the API
    NTB: Fix incorrect clean up routine in ntb_perf
    NTB: Fix incorrect return check in ntb_perf
    ntb: fix possible NULL dereference
    ntb: add missing setup of translation window
    ntb: stop link work when we do not have memory
    ntb: stop tasklet from spinning forever during shutdown.
    ntb: perf test: fix address space confusion

    Linus Torvalds
     
  • Pull more SCSI updates from James Bottomley:
    "The only new stuff which missed the first pull request is an update to
    the UFS driver.

    The rest is an assortment of bug fixes and minor tweaks which appeared
    recently (some are fixes for recent code and some are stuff spotted
    recently by the checkers or the new gcc-6 compiler [most of Arnd's
    stuff])"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (32 commits)
    scsi_common: do not clobber fixed sense information
    scsi: ufs: select CONFIG_NLS
    scsi: fc: use get/put_unaligned64 for wwn access
    fnic: move printk()s outside of the critical code section.
    qla2xxx: avoid maybe_uninitialized warning
    megaraid_sas: add missing curly braces in ioctl handler
    lpfc: fix misleading indentation
    scsi_transport_sas: add 'scsi_target_id' sysfs attribute
    scsi_dh_alua: uninitialized variable in alua_check_vpd()
    scsi: ufs-qcom: add printouts of testbus debug registers
    scsi: ufs-qcom: enable/disable the device ref clock
    scsi: ufs-qcom: set PA_Local_TX_LCC_Enable before link startup
    scsi: ufs: add device quirk delay before putting UFS rails in LPM
    scsi: ufs: fix leakage during link off state
    scsi: ufs: tune UniPro parameters to optimize hibern8 exit time
    scsi: ufs: handle non spec compliant bkops behaviour by device
    scsi: ufs: add retry for query descriptors
    scsi: ufs: add error recovery after DL NAC error
    scsi: ufs: make error handling bit faster
    scsi: ufs: disable vccq if it's not needed by UFS device
    ...

    Linus Torvalds
     
  • Commit 0b81d07790726 ("fs crypto: move per-file encryption from f2fs
    tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and
    renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_"
    to just "FS" for preprocessor symbols).

    Because of the symbol renaming, it's a bit hard to see it as a file
    move: use

    git show -M30 0b81d07790726

    to lower the rename detection to just 30% similarity and make git show
    the files as renamed (the header file won't be shown as a rename even
    then - since all it contains is symbol definitions, it looks almost
    completely different).

    Even with the renames showing as renames, the diffs are not all that
    easy to read, since so much is just the renames. But Eric Biggers
    noticed that it's not just all renames: the initialization of the
    xts_tweak had been broken too, using the inode number rather than the
    page offset.

    That's not right - it makes the xts_tweak the same for all pages of each
    inode. It _might_ make sense to make the xts_tweak contain both the
    offset _and_ the inode number, but not just the inode number.
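
    As a rough illustration of why the tweak has to depend on the page
    offset, here is a userspace sketch. The 16-byte size, the
    little-endian layout and the name make_tweak_sketch() are
    assumptions made for the example, not a copy of the fs/crypto code:

```c
#include <stdint.h>
#include <string.h>

#define TWEAK_SIZE_SKETCH 16

/*
 * Build a per-page tweak: the page index goes into the first 8 bytes
 * (little-endian, an assumption of this sketch) and the rest is zero.
 */
void make_tweak_sketch(uint64_t page_index, uint8_t tweak[TWEAK_SIZE_SKETCH])
{
    size_t i;

    memset(tweak, 0, TWEAK_SIZE_SKETCH);
    for (i = 0; i < sizeof(page_index); i++)
        tweak[i] = (uint8_t)(page_index >> (8 * i));
}
```

    Feeding the inode number into such a helper instead of the page
    index would produce one identical tweak for every page of a file.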

    Reported-by: Eric Biggers
    Cc: Jaegeuk Kim
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Mar, 2016

29 commits

  • Kernel zero day testing warned about address space confusion. A virtual
    iomem address was used where a physical address is expected. The
    offending functions implement an optional part of the API, so they are
    removed. They can be added later, after testing.

    Fixes: a1b3695820aa490e58915d720a1438069813008b

    Signed-off-by: Allen Hubbe
    Acked-by: Xiangliang Yu
    Signed-off-by: Jon Mason

    Allen Hubbe
     
  • * switch orangefs_remount() to taking ORANGEFS_SB(sb) instead of sb
    * remove from the list _before_ orangefs_unmount() - request_mutex
    in the latter will make sure that nothing observed in the loop in
    ORANGEFS_DEV_REMOUNT_ALL handling will get freed until the end
    of loop
    * on removal, keep the forward pointer and zero the back one. That
    way we can drop and regain the spinlock in the loop body (again,
    ORANGEFS_DEV_REMOUNT_ALL one) and still be able to get to the
    rest of the list.
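
    The third point can be sketched with a hand-rolled circular list.
    The node_sketch type and the helpers are hypothetical and are not
    the kernel's list API; the walker's ability to follow ->next still
    relies on something else (request_mutex above) keeping the
    successor alive:

```c
#include <stddef.h>

struct node_sketch {
    struct node_sketch *next, *prev;
};

void list_init_sketch(struct node_sketch *head)
{
    head->next = head->prev = head;
}

void list_add_tail_sketch(struct node_sketch *n, struct node_sketch *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Unlink n but keep its forward pointer, so a walker that dropped the
 * lock while standing on n can still reach the rest of the list.
 * A NULL back pointer marks the node as already removed.
 */
void list_del_keep_next_sketch(struct node_sketch *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = NULL;
}
```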

    Signed-off-by: Al Viro
    Signed-off-by: Mike Marshall

    Al Viro
     
  • Error should only be returned if nothing had been read/written.
    Otherwise we need to report a short read/write instead.
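
    A userspace sketch of that rule, assuming a hypothetical per-chunk
    callback (chunk_fn) that returns the bytes moved or a negative
    errno; the two demo callbacks exist only to exercise it:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical per-chunk transfer: bytes moved, or negative errno. */
typedef ssize_t (*chunk_fn)(size_t offset, size_t len);

ssize_t transfer_all_sketch(chunk_fn do_chunk, size_t total, size_t chunk)
{
    size_t done = 0;

    while (done < total) {
        size_t len = total - done < chunk ? total - done : chunk;
        ssize_t ret = do_chunk(done, len);

        if (ret < 0)    /* error: report it only if nothing moved yet */
            return done ? (ssize_t)done : ret;
        if (ret == 0)
            break;
        done += (size_t)ret;
    }
    return (ssize_t)done;
}

/* Demo callback: fails on the very first chunk. */
ssize_t fail_at_start_sketch(size_t offset, size_t len)
{
    (void)offset;
    (void)len;
    return -EIO;
}

/* Demo callback: succeeds for the first 8 bytes, then fails. */
ssize_t fail_after_8_sketch(size_t offset, size_t len)
{
    return offset >= 8 ? -EIO : (ssize_t)len;
}
```

    With 4-byte chunks and a 16-byte request, the first callback yields
    -EIO and the second yields a short count of 8.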

    Signed-off-by: Al Viro
    Signed-off-by: Mike Marshall

    Al Viro
     
  • Signed-off-by: Al Viro
    Signed-off-by: Mike Marshall

    Al Viro
     
  • a) open files can't have NULL inodes
    b) it's SEEK_END, not ORANGEFS_SEEK_END; no need to get cute.
    c) make_bad_inode() on lseek()?

    Signed-off-by: Al Viro
    Signed-off-by: Mike Marshall

    Al Viro
     
  • Signed-off-by: Al Viro
    Signed-off-by: Mike Marshall

    Al Viro
     
  • just have it return the slot number or -E... - the caller checks
    the sign anyway
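
    A small sketch of that calling convention, with hypothetical names
    and a fixed two-slot table; a non-negative return is the slot
    number and a negative one is an errno value:

```c
#include <errno.h>

#define NSLOTS_SKETCH 2

static int slot_busy[NSLOTS_SKETCH];

/* Returns a slot number >= 0, or -EBUSY when every slot is taken. */
int get_slot_sketch(void)
{
    int i;

    for (i = 0; i < NSLOTS_SKETCH; i++) {
        if (!slot_busy[i]) {
            slot_busy[i] = 1;
            return i;
        }
    }
    return -EBUSY;
}

void put_slot_sketch(int slot)
{
    if (slot >= 0 && slot < NSLOTS_SKETCH)
        slot_busy[slot] = 0;
}
```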

    Signed-off-by: Al Viro
    Signed-off-by: Mike Marshall

    Al Viro
     
  • it's always __orangefs_bufmap

    Signed-off-by: Al Viro
    Signed-off-by: Mike Marshall

    Al Viro
     
  • no point, really - we couldn't keep those across the calls of
    getdents(); it would be too easy to DoS, having all slots exhausted.

    Signed-off-by: Al Viro
    Signed-off-by: Mike Marshall

    Al Viro
     
  • Merge fourth patch-bomb from Andrew Morton:
    "A lot more stuff than expected, sorry. A bunch of ocfs2 reviewing was
    finished off.

    - mhocko's oom-reaper out-of-memory-handler changes

    - ocfs2 fixes and features

    - KASAN feature work

    - various fixes"

    * emailed patches from Andrew Morton: (42 commits)
    thp: fix typo in khugepaged_scan_pmd()
    MAINTAINERS: fill entries for KASAN
    mm/filemap: generic_file_read_iter(): check for zero reads unconditionally
    kasan: test fix: warn if the UAF could not be detected in kmalloc_uaf2
    mm, kasan: stackdepot implementation. Enable stackdepot for SLAB
    arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections
    mm, kasan: add GFP flags to KASAN API
    mm, kasan: SLAB support
    kasan: modify kmalloc_large_oob_right(), add kmalloc_pagealloc_oob_right()
    include/linux/oom.h: remove undefined oom_kills_count()/note_oom_kill()
    mm/page_alloc: prevent merging between isolated and other pageblocks
    drivers/memstick/host/r592.c: avoid gcc-6 warning
    ocfs2: extend enough credits for freeing one truncate record while replaying truncate records
    ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et
    ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert
    ocfs2: solve a problem of crossing the boundary in updating backups
    ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local
    ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list
    ocfs2/dlm: fix race between convert and recovery
    ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write()
    ...

    Linus Torvalds
     
  • Pull power management fixlet from Rafael Wysocki:
    "One of the commits in my previous pull request changed the permissions of
    drivers/power/avs/rockchip-io-domain.c to executable by mistake"

    * tag 'pm+acpi-4.6-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    Fix permissions of drivers/power/avs/rockchip-io-domain.c

    Linus Torvalds
     
  • Pull ia64 update from Tony Luck:
    "Wire up new system calls p{read,write}v2 for ia64"

    * tag 'please-pull-preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
    [IA64] Enable preadv2 and pwritev2 syscalls for ia64

    Linus Torvalds
     
  • Pull more input updates from Dmitry Torokhov:
    "Second round of updates for the input subsystem.

    The BYD PS/2 protocol driver now uses absolute reporting mode and
    should behave more like other touchpads; Synaptics driver needed to
    extend one of its quirks to a newer firmware version, and a few USB
    drivers got tightened up checks for the contents of their descriptors"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
    Input: sur40 - fix DMA on stack
    Input: ati_remote2 - fix crashes on detecting device with invalid descriptor
    Input: synaptics - handle spurious release of trackstick buttons, again
    Input: synaptics-rmi4 - remove check of Non-NULL array
    Input: byd - enable absolute mode
    Input: ims-pcu - sanity check against missing interfaces
    Input: melfas_mip4 - add hw_version sysfs attribute

    Linus Torvalds
     
  • !PageLRU should lead to SCAN_PAGE_LRU, not SCAN_SCAN_ABORT result.

    Signed-off-by: Kirill A. Shutemov
    Cc: Ebru Akagunduz
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Acked-by: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • If
    - generic_file_read_iter() gets called with a zero read length,
    - the read offset is at a page boundary,
    - IOCB_DIRECT is not set
    - and the page in question hasn't made it into the page cache yet,
    then do_generic_file_read() will trigger a readahead with a req_size hint
    of zero.

    Since roundup_pow_of_two(0) is undefined, UBSAN reports

    UBSAN: Undefined behaviour in include/linux/log2.h:63:13
    shift exponent 64 is too large for 64-bit type 'long unsigned int'
    CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
    [...]
    Call Trace:
    [...]
    [] ondemand_readahead+0x3aa/0x3d0
    [] ? ondemand_readahead+0x3aa/0x3d0
    [] ? find_get_entry+0x2d/0x210
    [] page_cache_sync_readahead+0x63/0xa0
    [] do_generic_file_read+0x80d/0xf90
    [] generic_file_read_iter+0x185/0x420
    [...]
    [] __vfs_read+0x256/0x3d0
    [...]

    when get_init_ra_size() gets called from ondemand_readahead().

    The net effect is that the initial readahead size is arch dependent for
    requested read lengths of zero: for example, since

    1UL << (sizeof(unsigned long) * 8)

    evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
    size becomes 4 on the former and 0 on the latter.

    What's more, whether or not the file access timestamp is updated for zero
    length reads is decided differently for the two cases of IOCB_DIRECT
    being set or cleared: in the first case, generic_file_read_iter()
    explicitly skips updating that timestamp while in the latter case, it is
    always updated through the call to do_generic_file_read().

    According to POSIX, zero length reads "do not modify the last data access
    timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.

    Let generic_file_read_iter() unconditionally check the requested read
    length at its entry and return immediately with success if it is zero.
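
    The shape of the fix can be sketched in userspace as follows. The
    function names are hypothetical and the readahead stand-in only
    counts its calls; the point is that a zero-length request returns
    before anything can see a zero size hint:

```c
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

static unsigned int readahead_calls;

/* Stand-in for the readahead path, which must not get a zero hint. */
static void readahead_sketch(size_t req_size)
{
    (void)req_size;
    readahead_calls++;
}

unsigned int readahead_calls_sketch(void)
{
    return readahead_calls;
}

ssize_t read_iter_sketch(size_t count)
{
    if (count == 0)
        return 0;   /* success: no readahead, no timestamp update */

    readahead_sketch(count);
    return (ssize_t)count;   /* pretend the whole request succeeded */
}
```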

    Signed-off-by: Nicolai Stange
    Cc: Al Viro
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolai Stange
     
  • Signed-off-by: Alexander Potapenko
    Acked-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Implement the stack depot and provide CONFIG_STACKDEPOT. Stack depot
    will allow KASAN to store allocation/deallocation stack traces for memory
    chunks. The stack traces are stored in a hash table and referenced by
    handles which reside in the kasan_alloc_meta and kasan_free_meta
    structures in the allocated memory chunks.
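
    To illustrate the handle idea only: the sketch below is a toy
    userspace depot with hypothetical names, a fixed-size
    open-addressing table and a simple multiplicative hash. It is not
    the layout used by the patch, but it shows identical traces
    collapsing to one small handle:

```c
#include <stdint.h>
#include <string.h>

#define DEPOT_SLOTS_SKETCH 64
#define MAX_FRAMES_SKETCH 16

typedef uint32_t depot_handle_sketch;   /* 0 means "no handle" */

struct depot_entry_sketch {
    unsigned int nr;                    /* 0 marks an empty slot */
    uintptr_t frames[MAX_FRAMES_SKETCH];
};

static struct depot_entry_sketch depot[DEPOT_SLOTS_SKETCH];

/* FNV-style mixing over whole frame values (a sketch-only choice). */
static uint32_t hash_frames_sketch(const uintptr_t *frames, unsigned int nr)
{
    uint32_t h = 2166136261u;
    unsigned int i;

    for (i = 0; i < nr; i++) {
        h ^= (uint32_t)frames[i];
        h *= 16777619u;
    }
    return h;
}

/* Store a trace once; return the same handle for identical traces. */
depot_handle_sketch depot_save_sketch(const uintptr_t *frames, unsigned int nr)
{
    uint32_t start;
    unsigned int probe;

    if (nr == 0 || nr > MAX_FRAMES_SKETCH)
        return 0;
    start = hash_frames_sketch(frames, nr) % DEPOT_SLOTS_SKETCH;
    for (probe = 0; probe < DEPOT_SLOTS_SKETCH; probe++) {
        uint32_t slot = (start + probe) % DEPOT_SLOTS_SKETCH;
        struct depot_entry_sketch *e = &depot[slot];

        if (e->nr == 0) {               /* empty: store a new trace */
            e->nr = nr;
            memcpy(e->frames, frames, nr * sizeof(frames[0]));
            return slot + 1;
        }
        if (e->nr == nr &&
            memcmp(e->frames, frames, nr * sizeof(frames[0])) == 0)
            return slot + 1;            /* seen before: reuse handle */
    }
    return 0;                           /* table full */
}
```

    An object's metadata then only needs room for the handle rather
    than for a full array of return addresses.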

    IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
    duplication.

    Right now stackdepot support is only enabled in SLAB allocator. Once
    KASAN features in SLAB are on par with those in SLUB we can switch SLUB
    to stackdepot as well, thus removing the dependency on SLUB stack
    bookkeeping, which wastes a lot of memory.

    This patch is based on the "mm: kasan: stack depots" patch originally
    prepared by Dmitry Chernenkov.

    Joonsoo has said that he plans to reuse the stackdepot code for the
    mm/page_owner.c debugging facility.

    [akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
    [aryabinin@virtuozzo.com: comment style fixes]
    Signed-off-by: Alexander Potapenko
    Signed-off-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • KASAN needs to know whether the allocation happens in an IRQ handler.
    This lets us strip everything below the IRQ entry point to reduce the
    number of unique stack traces needed to be stored.

    Move the definition of __irq_entry to <linux/interrupt.h> so that the
    users don't need to pull in <linux/ftrace.h>. Also introduce the
    __softirq_entry macro which is similar to __irq_entry, but puts the
    corresponding functions to the .softirqentry.text section.

    Signed-off-by: Alexander Potapenko
    Acked-by: Steven Rostedt
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Add GFP flags to KASAN hooks for future patches to use.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Add KASAN hooks to SLAB allocator.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • This patchset implements SLAB support for KASAN

    Unlike SLUB, SLAB doesn't store allocation/deallocation stacks for heap
    objects, therefore we reimplement this feature in mm/kasan/stackdepot.c.
    The intention is to ultimately switch SLUB to use this implementation as
    well, which will save a lot of memory (right now SLUB bloats each object
    by 256 bytes to store the allocation/deallocation stacks).

    Also neither SLUB nor SLAB delay the reuse of freed memory chunks, which
    is necessary for better detection of use-after-free errors. We
    introduce memory quarantine (mm/kasan/quarantine.c), which allows
    delayed reuse of deallocated memory.
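
    The delayed-reuse idea can be sketched as a small FIFO. This toy
    version uses hypothetical names and a fixed number of slots, which
    is an assumption of the example and not how mm/kasan/quarantine.c
    is organised; a freed object is parked and only the oldest parked
    object becomes eligible for real release:

```c
#include <stddef.h>

#define QUARANTINE_SLOTS_SKETCH 4

static void *quarantine[QUARANTINE_SLOTS_SKETCH];
static unsigned int q_head;     /* index of the oldest entry */
static unsigned int q_count;

/*
 * Park ptr in the quarantine. Returns the oldest parked pointer once
 * the quarantine is full (the caller may now really release it), or
 * NULL while there is still room.
 */
void *quarantine_put_sketch(void *ptr)
{
    void *evicted = NULL;

    if (q_count == QUARANTINE_SLOTS_SKETCH) {
        evicted = quarantine[q_head];
        q_head = (q_head + 1) % QUARANTINE_SLOTS_SKETCH;
        q_count--;
    }
    quarantine[(q_head + q_count) % QUARANTINE_SLOTS_SKETCH] = ptr;
    q_count++;
    return evicted;
}
```

    While an object sits in the quarantine its memory stays poisoned,
    so a stale access is still reported instead of silently hitting a
    recycled object.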

    This patch (of 7):

    Rename kmalloc_large_oob_right() to kmalloc_pagealloc_oob_right(), as
    the test only checks the page allocator functionality. Also reimplement
    kmalloc_large_oob_right() so that the test allocates a large enough
    chunk of memory that still does not trigger the page allocator fallback.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • A leftover from commit c32b3cbe0d06 ("oom, PM: make OOM detection in the
    freezer path raceless").

    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Hanjun Guo has reported that a CMA stress test causes broken accounting of
    CMA and free pages:

    > Before the test, I got:
    > -bash-4.3# cat /proc/meminfo | grep Cma
    > CmaTotal: 204800 kB
    > CmaFree: 195044 kB
    >
    >
    > After running the test:
    > -bash-4.3# cat /proc/meminfo | grep Cma
    > CmaTotal: 204800 kB
    > CmaFree: 6602584 kB
    >
    > So the freed CMA memory is more than total..
    >
    > Also the MemFree is more than mem total:
    >
    > -bash-4.3# cat /proc/meminfo
    > MemTotal: 16342016 kB
    > MemFree: 22367268 kB
    > MemAvailable: 22370528 kB

    Laura Abbott has confirmed the issue and suspected the freepage accounting
    rewrite around 3.18/4.0 by Joonsoo Kim. Joonsoo had a theory that this is
    caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA
    pageblocks:

    > CMA isolates MAX_ORDER aligned blocks, but, during the process,
    > partially isolated block exists. If MAX_ORDER is 11 and
    > pageblock_order is 9, two pageblocks make up MAX_ORDER
    > aligned block and I can think following scenario because pageblock
    > (un)isolation would be done one by one.
    >
    > (each character means one pageblock. 'C', 'I' means MIGRATE_CMA,
    > MIGRATE_ISOLATE, respectively.)
    >
    > CC -> IC -> II (Isolation)
    > II -> CI -> CC (Un-isolation)
    >
    > If some pages are freed at this intermediate state such as IC or CI,
    > that page could be merged to the other page that is resident on
    > different type of pageblock and it will cause wrong freepage count.

    This was supposed to be prevented by CMA operating on MAX_ORDER blocks,
    but since it doesn't hold the zone->lock between pageblocks, a race
    window does exist.

    It's also likely that unexpected merging can occur between
    MIGRATE_ISOLATE and non-CMA pageblocks. This should be prevented in
    __free_one_page() since commit 3c605096d315 ("mm/page_alloc: restrict
    max order of merging on isolated pageblock"). However, we only check
    the migratetype of the pageblock where buddy merging has been initiated,
    not the migratetype of the buddy pageblock (or group of pageblocks)
    which can be MIGRATE_ISOLATE.

    Joonsoo has suggested checking for buddy migratetype as part of
    page_is_buddy(), but that would add extra checks in allocator hotpath
    and bloat-o-meter has shown significant code bloat (the function is
    inline).

    This patch reduces the bloat at some expense of more complicated code.
    The buddy-merging while-loop in __free_one_page() is initially bounded
    to pageblock_border and without any migratetype checks. The checks are
    placed outside, bumping the max_order if merging is allowed, and
    returning to the while-loop with a statement which can't be possibly
    considered harmful.

    This fixes the accounting bug and also removes the arguably weird state
    in the original commit 3c605096d315 where buddies could be left
    unmerged.

    Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
    Link: https://lkml.org/lkml/2016/3/2/280
    Signed-off-by: Vlastimil Babka
    Reported-by: Hanjun Guo
    Tested-by: Hanjun Guo
    Acked-by: Joonsoo Kim
    Debugged-by: Laura Abbott
    Debugged-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: [3.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The r592 driver relies on behavior of the DMA mapping API that is
    normally observed but not guaranteed by the API. Instead it uses a
    runtime check to fail transfers if the API ever behaves differently.

    When CONFIG_NEED_SG_DMA_LENGTH is not set, one of the checks turns into a
    comparison of a variable with itself, which gcc-6.0 now warns about:

    drivers/memstick/host/r592.c: In function 'r592_transfer_fifo_dma':
    drivers/memstick/host/r592.c:302:31: error: self-comparison always evaluates to false [-Werror=tautological-compare]
    (sg_dma_len(&dev->req->sg) < dev->req->sg.length)) {
    ^

    The check itself is not a problem, so this patch just rephrases the
    condition in a way that gcc does not consider an indication of a mistake.
    We already know that dev->req->sg.length was initially R592_LFIFO_SIZE, so
    we can compare it to that constant again.
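
    A reduced sketch of the situation, using hypothetical _sketch names
    and a placeholder value for the FIFO size. When the separate DMA
    length field is not configured, the accessor macro aliases the
    plain length, so comparing it against sg.length would be a
    self-comparison; comparing against the constant avoids that:

```c
#include <stdbool.h>

#define LFIFO_SIZE_SKETCH 512   /* placeholder for R592_LFIFO_SIZE */

struct sg_sketch {
    unsigned int length;
#ifdef NEED_SG_DMA_LENGTH_SKETCH
    unsigned int dma_length;
#endif
};

#ifdef NEED_SG_DMA_LENGTH_SKETCH
#define sg_dma_len_sketch(sg) ((sg)->dma_length)
#else
#define sg_dma_len_sketch(sg) ((sg)->length)    /* aliases length */
#endif

/* Compare the mapped length against the known constant instead. */
bool dma_len_too_short_sketch(const struct sg_sketch *sg)
{
    return sg_dma_len_sketch(sg) < LFIFO_SIZE_SKETCH;
}
```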

    Signed-off-by: Arnd Bergmann
    Cc: Maxim Levitsky
    Cc: Quentin Lambert
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Now function ocfs2_replay_truncate_records() first modifies tl_used,
    then calls ocfs2_extend_trans() to extend transactions for gd and alloc
    inode used for freeing clusters. jbd2_journal_restart() may be called
    and it may happen that tl_used in truncate log is decreased but the
    clusters are not freed, which means these clusters are lost. So we
    should avoid extending transactions in these two operations.

    Signed-off-by: joyce.xue
    Reviewed-by: Mark Fasheh
    Acked-by: Joseph Qi
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • …e_lengths() before to avoid inconsistency between inode and et

    I found that jbd2_journal_restart() is called in some places without
    first making sure things are consistent. However, jbd2_journal_restart()
    may commit the handle's transaction and start another one. If the first
    transaction is committed successfully while the second is not, the
    filesystem may become inconsistent or go read-only. This is an effort
    to fix this kind of problem.

    This patch (of 3):

    The following functions will be called while truncating an extent:
    ocfs2_remove_btree_range
    -> ocfs2_start_trans
    -> ocfs2_remove_extent
    -> ocfs2_truncate_rec
    -> ocfs2_extend_rotate_transaction
    -> jbd2_journal_restart if jbd2_journal_extend fail
    -> ocfs2_rotate_tree_left
    -> ocfs2_remove_rightmost_path
    -> ocfs2_extend_rotate_transaction
    -> ocfs2_unlink_subtree
    -> ocfs2_update_edge_lengths
    -> ocfs2_extend_trans
    -> jbd2_journal_restart if jbd2_journal_extend fail
    -> ocfs2_et_update_clusters
    -> ocfs2_commit_trans

    jbd2_journal_restart() may be called, and it may happen that the buffers
    dirtied in ocfs2_truncate_rec() are committed while the buffers dirtied in
    ocfs2_et_update_clusters() are not, so the total clusters in the extent
    tree and i_clusters in ocfs2_dinode become inconsistent. The cluster count
    read from ocfs2_dinode is then incorrect, and it also causes a read-only
    problem when ocfs2_commit_truncate() is called, with the error message:
    "Inode %llu has empty extent block at %llu".

    We should extend enough credits for ocfs2_remove_rightmost_path() and
    ocfs2_update_edge_lengths() to avoid this inconsistency.

    Signed-off-by: joyce.xue <xuejiufei@huawei.com>
    Acked-by: Joseph Qi <joseph.qi@huawei.com>
    Cc: Mark Fasheh <mfasheh@suse.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Xue jiufei
     
    We have found a bug when two nodes umount one after another.

    1) Node 1 migrates a lockres that has 3 locks in the grant queue, such as
    N2(PR)N3(NL)N4(PR), to N2. After migration, the lvbs of the locks
    N3(NL) and N4(PR) are empty on node 2 because the migration target does
    not copy the lvb to these two locks.

    2) Node 3 wants to convert to PR. This can be granted in
    __dlmconvert_master(), and the order of these locks is unchanged. The
    lvb of the lock N3(PR) on node 2 is copied from the lockres in
    dlm_update_lvb(), while the lvb of lock N4(PR) is still empty.

    3) Node 2 wants to leave the domain, so it will migrate this lockres to
    node 3. Node 2 will then trigger the BUG in
    dlm_prepare_lvb_for_migration() when adding the lock N4(PR) to mres,
    with the following message, because the lvb of mres has already been
    copied from lock N3(PR) but the lvb of lock N4(PR) is empty.

    "Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u"

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: xuejiufei
    Acked-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    xuejiufei
     
    In update_backups() there is a boundary-crossing problem, as follows:

    Assume that the LUN is resized to 1TB (cluster_size is 32KB), so it
    contains clusters 0~33554431. update_backups() will then back up the
    superblock at the 1TB location, which is cluster 33554432, so the write
    crosses the volume boundary.
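
    The arithmetic can be checked with a short sketch. The helper names
    are hypothetical and this is not the ocfs2 code, only the bounds
    check implied by the description:

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of whole clusters in a volume of the given size. */
uint64_t clusters_in_volume_sketch(uint64_t volume_bytes,
                                   uint32_t cluster_bytes)
{
    return volume_bytes / cluster_bytes;
}

/*
 * A backup at byte offset backup_offset lands in cluster
 * backup_offset / cluster_bytes; valid clusters are 0 .. count - 1.
 */
bool backup_fits_sketch(uint64_t backup_offset, uint64_t volume_bytes,
                        uint32_t cluster_bytes)
{
    uint64_t cluster = backup_offset / cluster_bytes;

    return cluster < clusters_in_volume_sketch(volume_bytes, cluster_bytes);
}
```

    For a 1TB volume with 32KB clusters the count is 2^40 / 2^15 =
    33554432, so a backup placed at the 1TB offset falls one cluster
    past the last valid index.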

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Xue jiufei
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen