10 Sep, 2015

3 commits

  • Followup to the UFS series - with the way we clear the new blocks (via
    buffer cache, possibly on more than a page worth of file) we really
    should not insert a reference to new block into inode block tree until
    after we'd cleared it.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Pull cifs updates from Steve French:
    "Small cifs fix and a patch for improved debugging"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: Fix use-after-free on mid_q_entry
    Update cifs version number
    Add way to query server fs info for smb3

    Linus Torvalds
     
  • As part of the v4.3 merge window the DAX code was updated by Matthew and
    Kirill to handle PMD pages. Also as part of the v4.3 merge window we
    updated the DAX code to do proper PMEM flushing (commit 2765cfbb342c:
    "dax: update I/O path to do proper PMEM flushing").

    The additional code added by the DAX PMD patches also needs to be
    updated to properly use the PMEM API. This ensures that after a PMD
    fault is handled the zeros written to the newly allocated pages are
    durable on the DIMMs.

    linux/dax.h is included to get rid of a bunch of sparse warnings.

    Signed-off-by: Ross Zwisler
    Cc: Matthew Wilcox
    Cc: Dan Williams
    Cc: Kirill Shutemov
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

09 Sep, 2015

28 commits

  • Merge second patch-bomb from Andrew Morton:
    "Almost all of the rest of MM. There was an unusually large amount of
    MM material this time"

    * emailed patches from Andrew Morton: (141 commits)
    zpool: remove no-op module init/exit
    mm: zbud: constify the zbud_ops
    mm: zpool: constify the zpool_ops
    mm: swap: zswap: maybe_preload & refactoring
    zram: unify error reporting
    zsmalloc: remove null check from destroy_handle_cache()
    zsmalloc: do not take class lock in zs_shrinker_count()
    zsmalloc: use class->pages_per_zspage
    zsmalloc: consider ZS_ALMOST_FULL as migrate source
    zsmalloc: partial page ordering within a fullness_list
    zsmalloc: use shrinker to trigger auto-compaction
    zsmalloc: account the number of compacted pages
    zsmalloc/zram: introduce zs_pool_stats api
    zsmalloc: cosmetic compaction code adjustments
    zsmalloc: introduce zs_can_compact() function
    zsmalloc: always keep per-class stats
    zsmalloc: drop unused variable `nr_to_migrate'
    mm/memblock.c: fix comment in __next_mem_range()
    mm/page_alloc.c: fix type information of memoryless node
    memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
    ...

    Linus Torvalds
     
  • Pull regmap updates from Mark Brown:
    "This has been a busy release for regmap.

    By far the biggest set of changes here are those from Markus Pargmann
    which implement support for block transfers in smbus devices. This
    required quite a bit of refactoring but leaves us better able to
    handle odd restrictions that controllers may have and with better
    performance on smbus.

    Other new features include:

    - Fix interactions with lockdep for nested regmaps (eg, when a device
    using regmap is connected to a bus where the bus controller has a
    separate regmap). Lockdep's default class identification is too
    crude to work without help.

    - Support for must write bitfield operations, useful for operations
    which require writing a bit to trigger them from Kuninori Morimoto.

    - Support for delaying during register patch application from Nariman
    Poushin.

    - Support for overriding cache state via the debugfs implementation
    from Richard Fitzgerald"

    * tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (25 commits)
    regmap: fix a NULL pointer dereference in __regmap_init
    regmap: Support bulk reads for devices without raw formatting
    regmap-i2c: Add smbus i2c block support
    regmap: Add raw_write/read checks for max_raw_write/read sizes
    regmap: regmap max_raw_read/write getter functions
    regmap: Introduce max_raw_read/write for regmap_bulk_read/write
    regmap: Add missing comments about struct regmap_bus
    regmap: No multi_write support if bus->write does not exist
    regmap: Split use_single_rw internally into use_single_read/write
    regmap: Fix regmap_bulk_write for bus writes
    regmap: regmap_raw_read return error on !bus->read
    regulator: core: Print at debug level on debugfs creation failure
    regmap: Fix regmap_can_raw_write check
    regmap: fix typos in regmap.c
    regmap: Fix integertypes for register address and value
    regmap: Move documentation to regmap.h
    regmap: Use different lockdep class for each regmap init call
    thermal: sti: Add parentheses around bridge->ops->regmap_init call
    mfd: vexpress: Add parentheses around bridge->ops->regmap_init call
    regmap: debugfs: Fix misuse of IS_ENABLED
    ...

    Linus Torvalds
     
  • This is based on the shmem version, but it has diverged quite a bit. We
    have no swap to worry about, nor the new file sealing. Add
    synchronization via the fault mutex table to coordinate page faults,
    fallocate allocation and fallocate hole punch.

    What this allows us to do is move physical memory in and out of a
    hugetlbfs file without having it mapped. This also gives us the ability
    to support MADV_REMOVE since it is currently implemented using
    fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of
    a hugetlbfs file, which wasn't possible before.

    hugetlbfs fallocate only operates on whole huge pages.

    Based on code by Dave Hansen.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
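
    Because hugetlbfs fallocate only operates on whole huge pages, a caller
    may want to work out which huge pages a byte range fully covers before
    punching a hole. The sketch below is illustrative only: the helper name,
    the 2 MiB default huge page size, and the rounding direction (start rounded
    up, end rounded down) are assumptions, not taken from the patch.

```python
def whole_huge_pages(offset, length, huge_page_size=2 * 1024 * 1024):
    """Return the (start, end) byte range of huge pages fully inside the request.

    Assumption for illustration: only huge pages lying entirely within
    [offset, offset + length) can be removed, so the start is rounded up
    and the end is rounded down to a huge page boundary.
    """
    start = -(-offset // huge_page_size) * huge_page_size        # round up
    end = (offset + length) // huge_page_size * huge_page_size   # round down
    if end <= start:
        return None  # the request does not cover a complete huge page
    return start, end
```

    An aligned request maps onto itself, while an unaligned one shrinks to
    the huge pages it completely contains, or to nothing at all.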
     
  • Modify truncate_hugepages() to take a range of pages (start, end)
    instead of simply start. If an end value of LLONG_MAX is passed, the
    current "truncate" functionality is maintained. Existing callers are
    modified to pass LLONG_MAX as end of range. By keying off end ==
    LLONG_MAX, the routine behaves differently for truncate and hole punch.
    Page removal is now synchronized with page allocation via faults by
    using the fault mutex table. The hole punch case can experience the
    rare region_del error and must handle accordingly.

    Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
    the case where region_del returns an error.

    Since the routine handles more than just the truncate case, it is
    renamed to remove_inode_hugepages(). To be consistent, the routine
    truncate_huge_page() is renamed remove_huge_page().

    Downstream of remove_inode_hugepages(), the routine
    hugetlb_unreserve_pages() is also modified to take a range of pages.
    hugetlb_unreserve_pages is modified to detect an error from region_del and
    pass it back to the caller.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
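
    The "end == LLONG_MAX means truncate" convention can be modelled in a few
    lines. This is a toy sketch, not the kernel routine: the dict standing in
    for an inode's pages, the function name, and the returned label are all
    hypothetical.

```python
LLONG_MAX = 2**63 - 1  # same value as the C constant used as the sentinel


def remove_inode_pages(pages, start, end=LLONG_MAX):
    """Remove page indexes in [start, end) from a dict modelling an inode.

    With the default end of LLONG_MAX every page from start onwards goes
    away, which is the truncate behaviour; any other end is a hole punch.
    """
    mode = "truncate" if end == LLONG_MAX else "hole punch"
    for index in sorted(pages):  # sorted() copies the keys, so deleting is safe
        if start <= index < end:
            del pages[index]
    return mode
```

    Existing truncate callers simply omit end, while a hole punch passes an
    explicit upper bound and leaves the pages after it in place.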
     
  • fallocate hole punch will want to unmap a specific range of pages.
    Modify the existing hugetlb_vmtruncate_list() routine to take a
    start/end range. If end is 0, this indicates all pages after start
    should be unmapped. This is the same as the existing truncate
    functionality. Modify existing callers to add 0 as end of range.

    Since the routine will be used in hole punch as well as truncate
    operations, it is more appropriately renamed to hugetlb_vmdelete_list().

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • We want to know the per-process working set size for smart memory
    management in userland, and we use swap (e.g. zram) heavily to maximize
    memory efficiency, so the working set includes swap as well as RSS.

    On such a system, it is really hard to figure out exactly how much memory
    each process consumes (i.e. rss + swap) when there is a lot of shared
    anonymous memory (e.g. on Android).

    This patch introduces a SwapPss field in /proc/<pid>/smaps so we can get
    a more exact per-process working set size.

    Bongkyu tested it. Result is below.

    1. 50M used swap
    SwapTotal: 461976 kB
    SwapFree: 411192 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    48236
    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    141184

    2. 240M used swap
    SwapTotal: 461976 kB
    SwapFree: 216808 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    230315
    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    1387744

    [akpm@linux-foundation.org: simplify kunmap_atomic() call]
    Signed-off-by: Minchan Kim
    Reported-by: Bongkyu Kim
    Tested-by: Bongkyu Kim
    Cc: Hugh Dickins
    Cc: Sergey Senozhatsky
    Cc: Jonathan Corbet
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
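
    The grep/awk pipelines above can be expressed as a small parser. This is
    a minimal sketch: the function name and the sample text are made up for
    illustration, and only the "Swap:" and "SwapPss:" field names come from
    the commit message.

```python
def sum_smaps_field(smaps_text, field):
    """Sum one kB-valued smaps field (e.g. "Swap" or "SwapPss") over all mappings."""
    prefix = field + ":"
    total = 0
    for line in smaps_text.splitlines():
        parts = line.split()
        # Match the field name exactly so "Swap" does not also count "SwapPss".
        if len(parts) >= 2 and parts[0] == prefix:
            total += int(parts[1])
    return total


# Hypothetical excerpt with two mappings; real input would be the contents
# of /proc/<pid>/smaps.
sample = """\
Rss:                 120 kB
Swap:                 64 kB
SwapPss:              16 kB
Rss:                  40 kB
Swap:                 32 kB
SwapPss:              32 kB
"""

swap_kb = sum_smaps_field(sample, "Swap")
swap_pss_kb = sum_smaps_field(sample, "SwapPss")
```

    Summed over every process, Swap counts shared swap entries once per
    sharer, which is why it overshoots the used swap in the results above
    while SwapPss stays close to it.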
     
  • This patch sets bit 56 in pagemap if the page is mapped only once. This
    allows detecting exclusively used pages without exposing the PFN:

    present  file  exclusive  state
    0        0     0          non-present
    1        1     0          file page mapped somewhere else
    1        1     1          file page mapped only here
    1        0     0          anon non-CoWed page (shared with parent/child)
    1        0     1          anon CoWed page (or never forked)

    CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context.

    The mmap-exclusive bit doesn't reflect potential page sharing via the
    swap cache: a page could be mapped once but have several swap ptes that
    point to it. An application could detect that via the swap bit in the
    pagemap entry and touch that pte via /proc/pid/mem to get real information.

    See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com

    Requested by Mark Williamson.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
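
    The state table above can be turned into a small decoder for a 64-bit
    pagemap entry. A sketch only: bit 56 is from this commit message, while
    the positions of the "present" (63) and "file" (61) bits are taken from
    the kernel's pagemap documentation, and the function name is hypothetical.

```python
PM_PRESENT = 1 << 63         # page is present in RAM
PM_FILE = 1 << 61            # file-mapped or shared-anonymous page
PM_MMAP_EXCLUSIVE = 1 << 56  # page is mapped only once (bit added by this patch)


def classify(entry):
    """Map a pagemap entry to the state named in the table above."""
    if not entry & PM_PRESENT:
        return "non-present"
    exclusive = bool(entry & PM_MMAP_EXCLUSIVE)
    if entry & PM_FILE:
        if exclusive:
            return "file page mapped only here"
        return "file page mapped somewhere else"
    if exclusive:
        return "anon CoWed page (or never forked)"
    return "anon non-CoWed page (shared with parent/child)"
```

    None of these checks touch the PFN bits, so the classification keeps
    working for readers that are not allowed to see physical addresses.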
     
  • This patch makes pagemap readable for normal users and hides physical
    addresses from them. For some use-cases PFN isn't required at all.

    See http://lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name

    Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
    Signed-off-by: Konstantin Khlebnikov
    Cc: Naoya Horiguchi
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
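
    For readers who have not used pagemap from userspace, a minimal sketch of
    looking up one entry follows. The helper names are hypothetical. The
    layout it relies on (one 64-bit entry per virtual page, PFN in bits 0-54)
    comes from the kernel's pagemap documentation, and the note about
    unprivileged readers seeing a zero PFN is an assumption about how the
    hiding described above appears to the caller.

```python
import os
import struct

PFN_MASK = (1 << 55) - 1  # bits 0-54 carry the PFN for a present page


def pagemap_offset(vaddr, page_size):
    """Byte offset of the 8-byte pagemap entry describing vaddr."""
    return (vaddr // page_size) * 8


def read_pagemap_entry(vaddr, pid="self"):
    """Read the raw 64-bit pagemap entry for one virtual address."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(pagemap_offset(vaddr, page_size))
        data = f.read(8)
    if len(data) != 8:
        raise ValueError("address is not covered by pagemap")
    return struct.unpack("=Q", data)[0]  # entries use native byte order


def pfn_of(entry):
    """Extract the PFN field; assumed to read as 0 when the PFN is hidden."""
    return entry & PFN_MASK
```

    The flag bits in the upper part of the entry remain meaningful even when
    pfn_of() returns 0 for a non-privileged reader.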
     
  • This patch moves pmd dissection out of the reporting loop: huge pages are
    reported as a bunch of normal pages with contiguous PFNs.

    Add missing "FILE" bit in hugetlb vmas.

    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch removes page-shift bits (scheduled for removal since 3.11) and
    completes migration to the new bit layout. It also cleans up a messy macro.

    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Naoya Horiguchi
    Cc: Mark Williamson
    Tested-by: Mark Williamson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patchset makes pagemap usable again in a safe way (after the
    rowhammer bug it was made CAP_SYS_ADMIN-only). It restores access
    for non-privileged users but hides PFNs from them.

    It also adds a 'map-exclusive' bit which is set if the page is mapped only
    here: it helps in estimating the working set without exposing PFNs and
    allows distinguishing CoWed and non-CoWed private anonymous pages.

    Second patch removes page-shift bits and completes migration to the new
    pagemap format: flags soft-dirty and mmap-exclusive are available only in
    the new format.

    This patch (of 5):

    This patch moves permission checks from pagemap_read() into pagemap_open().

    Pointer to mm is saved in file->private_data. This reference pins only
    mm_struct itself. /proc/*/mem, maps, smaps already work in the same way.

    See http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com

    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap.

    __dax_pmd_fault() uses unmap_mapping_range() to shoot out the zero page from
    all mappings. We need to drop i_mmap_lock there to avoid a deadlock.

    Re-acquiring the lock should be fine since we check i_size after that
    point.

    Signed-off-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • I was basically open-coding it (thanks to copying code from do_fault()
    which probably also needs to be fixed).

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • If the first access to a huge page was a store, there would be no existing
    zero pmd in this process's page tables. There could be a zero pmd in
    another process's page tables, if it had done a load. We can detect this
    case by noticing that the buffer_head returned from the filesystem is New,
    and ensure that other processes mapping this huge page have their page
    tables flushed.

    Signed-off-by: Matthew Wilcox
    Reported-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This is another place where DAX assumed that pgtable_t was a pointer.
    Open code the important parts of set_huge_zero_page() in DAX and make
    set_huge_zero_page() static again.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If two threads write-fault on the same hole at the same time, the winner
    of the race will return to userspace and complete their store, only to
    have the loser overwrite their store with zeroes. Fix this for now by
    taking the i_mmap_sem for write instead of read, and do so outside the
    call to get_block(). Now the loser of the race will see the block has
    already been zeroed, and will not zero it again.

    This severely limits our scalability. I have ideas for improving it, but
    those can wait for a later patch.

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
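
    The race and its fix can be modelled in userspace with threads. This is
    a toy sketch under stated assumptions: a single exclusive lock stands in
    for taking the lock for write, a bytearray stands in for the block, and
    all names are hypothetical.

```python
import threading


class Hole:
    """A file hole that is allocated and zeroed on the first write fault."""

    def __init__(self, size):
        self.lock = threading.Lock()  # stands in for the lock taken for write
        self.size = size
        self.data = None

    def write_fault(self):
        # The check and the zeroing happen under one exclusive lock, so the
        # loser of the race sees the block already zeroed and leaves it alone.
        with self.lock:
            if self.data is None:
                self.data = bytearray(self.size)  # allocate and zero once
        return self.data


def store(hole, index, value):
    """Fault the block in, then complete the store as userspace would."""
    buf = hole.write_fault()
    buf[index] = value


hole = Hole(8)
threads = [threading.Thread(target=store, args=(hole, i, i + 1)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

    Every thread's store survives because zeroing can happen only once. The
    cost is that all faults on the hole serialise on the lock, which is the
    scalability limit the commit message mentions.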
     
  • Jan Kara pointed out that in the case where we are writing to a hole, we
    can end up with a lock inversion between the page lock and the journal
    lock. We can avoid this by starting the transaction in ext4 before
    calling into DAX. The journal lock nests inside the superblock
    pagefault lock, so we have to duplicate that code from dax_fault, like
    XFS does.

    Signed-off-by: Matthew Wilcox
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • DAX wants different semantics from any currently-existing ext4 get_block
    callback. Unlike ext4_get_block_write(), it needs to honour the
    'create' flag, and unlike ext4_get_block(), it needs to be able to
    return unwritten extents. So introduce a new ext4_get_block_dax() which
    has those semantics.

    We could also change ext4_get_block_write() to honour the 'create' flag,
    but that might have consequences on other users that I do not currently
    understand.

    Signed-off-by: Matthew Wilcox
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Jan Kara pointed out I should be more explicit here about the perils of
    racing against truncate. The comment is mostly the same as for the PTE
    case.

    Signed-off-by: Matthew Wilcox
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • DAX relies on the get_block function either zeroing newly allocated
    blocks before they're findable by subsequent calls to get_block, or
    marking newly allocated blocks as unwritten. ext4_get_block() cannot
    create unwritten extents, but ext4_get_block_write() can.

    Signed-off-by: Matthew Wilcox
    Reported-by: Andy Rudoff
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Fix typo s/CONFIG_TRANSPARENT_HUGEPAGES/CONFIG_TRANSPARENT_HUGEPAGE/ in
    #endif comment introduced by commit 2b26a9206d6a ("dax: add huge page
    fault support").

    Signed-off-by: Valentin Rothberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valentin Rothberg
     
  • Use DAX to provide support for huge pages.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use DAX to provide support for huge pages.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use DAX to provide support for huge pages.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This is the support code for DAX-enabled filesystems to allow them to
    provide huge pages in response to faults.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • In order to handle the !CONFIG_TRANSPARENT_HUGEPAGE case, we need to
    return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
    defined in linux/mm.h. Given that we don't want to include <linux/mm.h>
    in <linux/fs.h>, the easiest solution is to move the DAX-related
    functions to a new header, <linux/dax.h>. We could also have moved
    VM_FAULT_* definitions to a new header, or a different header that isn't
    quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
    the best option.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
     
  • …kernel/git/tyhicks/ecryptfs

    Pull ecryptfs fixes from Tyler Hicks:
    "Invalidate stale eCryptfs dcache entries caused by unlinked lower
    inodes"

    * tag 'ecryptfs-4.3-rc1-stale-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
    eCryptfs: Delete a check before the function call "key_put"
    eCryptfs: Invalidate dcache entries when lower i_nlink is zero

    Linus Torvalds
     

08 Sep, 2015

4 commits

  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable patches:
    - Fix atomicity of pNFS commit list updates
    - Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY)
    - nfs_set_pgio_error sometimes misses errors
    - Fix a thinko in xs_connect()
    - Fix borkage in _same_data_server_addrs_locked()
    - Fix a NULL pointer dereference of migration recovery ops for v4.2
    client
    - Don't let the ctime override attribute barriers.
    - Revert "NFSv4: Remove incorrect check in can_open_delegated()"
    - Ensure flexfiles pNFS driver updates the inode after write finishes
    - flexfiles must not pollute the attribute cache with attributes from
    the DS
    - Fix a protocol error in layoutreturn
    - Fix a protocol issue with NFSv4.1 CLOSE stateids

    Bugfixes + cleanups
    - pNFS blocks bugfixes from Christoph
    - Various cleanups from Anna
    - More fixes for delegation corner cases
    - Don't fsync twice for O_SYNC/IS_SYNC files
    - Fix pNFS and flexfiles layoutstats bugs
    - pnfs/flexfiles: avoid duplicate tracking of mirror data
    - pnfs: Fix layoutget/layoutreturn/return-on-close serialisation
    issues
    - pnfs/flexfiles: error handling retries a layoutget before fallback
    to MDS

    Features:
    - Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from
    Kinglong
    - More RDMA client transport improvements from Chuck
    - Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr()
    verbs from the SUNRPC, Lustre and core infiniband tree.
    - Optimise away the close-to-open getattr if there is no cached data"

    * tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (108 commits)
    NFSv4: Respect the server imposed limit on how many changes we may cache
    NFSv4: Express delegation limit in units of pages
    Revert "NFS: Make close(2) asynchronous when closing NFS O_DIRECT files"
    NFS: Optimise away the close-to-open getattr if there is no cached data
    NFSv4.1/flexfiles: Clean up ff_layout_write_done_cb/ff_layout_commit_done_cb
    NFSv4.1/flexfiles: Mark the layout for return in ff_layout_io_track_ds_error()
    nfs: Remove unneeded checking of the return value from scnprintf
    nfs: Fix truncated client owner id without proto type
    NFSv4.1/flexfiles: Mark layout for return if the mirrors are invalid
    NFSv4.1/flexfiles: RW layouts are valid only if all mirrors are valid
    NFSv4.1/flexfiles: Fix incorrect usage of pnfs_generic_mark_devid_invalid()
    NFSv4.1/flexfiles: Fix freeing of mirrors
    NFSv4.1/pNFS: Don't request a minimal read layout beyond the end of file
    NFSv4.1/pnfs: Handle LAYOUTGET return values correctly
    NFSv4.1/pnfs: Don't ask for a read layout for an empty file.
    NFSv4.1: Fix a protocol issue with CLOSE stateids
    NFSv4.1/flexfiles: Don't mark the entire deviceid as bad for file errors
    SUNRPC: Prevent SYN+SYNACK+RST storms
    SUNRPC: xs_reset_transport must mark the connection as disconnected
    NFSv4.1/pnfs: Ensure layoutreturn reserves space for the opaque payload
    ...

    Linus Torvalds
     
  • Pull xfs updates from Dave Chinner:
    "There isn't a whole lot to this update - it's mostly bug fixes and
    they are spread pretty much all over XFS. There are some corruption
    fixes, some fixes for log recovery, some fixes that prevent unmount
    from hanging, a lockdep annotation rework for inode locking to prevent
    false positives and the usual random bunch of cleanups and minor
    improvements.

    Details:

    - large rework of EFI/EFD lifecycle handling to fix log recovery
    corruption issues, crashes and unmount hangs

    - separate metadata UUID on disk to enable changing boot label UUID
    for v5 filesystems

    - fixes for gcc miscompilation on certain platforms and optimisation
    levels

    - remote attribute allocation and recovery corruption fixes

    - inode lockdep annotation rework to fix bugs with too many
    subclasses

    - directory inode locking changes to prevent lockdep false positives

    - a handful of minor corruption fixes

    - various other small cleanups and bug fixes"

    * tag 'xfs-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (42 commits)
    xfs: fix error gotos in xfs_setattr_nonsize
    xfs: add mssing inode cache attempts counter increment
    xfs: return errors from partial I/O failures to files
    libxfs: bad magic number should set da block buffer error
    xfs: fix non-debug build warnings
    xfs: collapse allocsize and biosize mount option handling
    xfs: Fix file type directory corruption for btree directories
    xfs: lockdep annotations throw warnings on non-debug builds
    xfs: Fix uninitialized return value in xfs_alloc_fix_freelist()
    xfs: inode lockdep annotations broke non-lockdep build
    xfs: flush entire file on dio read/write to cached file
    xfs: Fix xfs_attr_leafblock definition
    libxfs: readahead of dir3 data blocks should use the read verifier
    xfs: stop holding ILOCK over filldir callbacks
    xfs: clean up inode lockdep annotations
    xfs: swap leaf buffer into path struct atomically during path shift
    xfs: relocate sparse inode mount warning
    xfs: dquots should be stamped with sb_meta_uuid
    xfs: log recovery needs to validate against sb_meta_uuid
    xfs: growfs not aware of sb_meta_uuid
    ...

    Linus Torvalds
     
  • The NFSv4 delegation spec allows the server to tell a client to limit how
    much data it may cache after the file is closed. In return, the server
    guarantees enough free space to avoid ENOSPC situations, etc.
    Prior to this patch, we assumed we could always cache aggressively after
    close. Unfortunately, this causes problems with servers that set the
    limit to 0 and therefore do not offer any ENOSPC guarantees.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Since we're tracking modifications to the page cache on a per-page
    basis, it makes sense to express the limit to how much we may cache
    in units of pages.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
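
    The two delegation-limit patches above can be pictured with a small
    sketch. It is illustrative only: the function names, the 4096-byte
    default page size, and rounding the limit down to whole pages are
    assumptions rather than details taken from the patches.

```python
def limit_in_pages(limit_bytes, page_size=4096):
    """Express a server-imposed byte limit in whole pages, rounding down."""
    return limit_bytes // page_size


def may_cache_after_close(dirty_pages, limit_bytes, page_size=4096):
    """True if the cached modifications fit within the server's limit."""
    return dirty_pages <= limit_in_pages(limit_bytes, page_size)
```

    With a limit of 0 the check only passes when nothing is left cached,
    which matches the case of servers that offer no ENOSPC guarantee.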
     

06 Sep, 2015

4 commits

  • Pull vfs updates from Al Viro:
    "In this one:

    - d_move fixes (Eric Biederman)

    - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
    handling ought to be fixed)

    - switch of sb_writers to percpu rwsem (Oleg Nesterov)

    - superblock scalability (Josef Bacik and Dave Chinner)

    - swapon(2) race fix (Hugh Dickins)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
    vfs: Test for and handle paths that are unreachable from their mnt_root
    dcache: Reduce the scope of i_lock in d_splice_alias
    dcache: Handle escaped paths in prepend_path
    mm: fix potential data race in SyS_swapon
    inode: don't softlockup when evicting inodes
    inode: rename i_wb_list to i_io_list
    sync: serialise per-superblock sync operations
    inode: convert inode_sb_list_lock to per-sb
    inode: add hlist_fake to avoid the inode hash lock in evict
    writeback: plug writeback at a high level
    change sb_writers to use percpu_rw_semaphore
    shift percpu_counter_destroy() into destroy_super_work()
    percpu-rwsem: kill CONFIG_PERCPU_RWSEM
    percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
    percpu-rwsem: introduce percpu_down_read_trylock()
    document rwsem_release() in sb_wait_write()
    fix the broken lockdep logic in __sb_start_write()
    introduce __sb_writers_{acquired,release}() helpers
    ufs_inode_get{frag,block}(): get rid of 'phys' argument
    ufs_getfrag_block(): tidy up a bit
    ...

    Linus Torvalds
     
  • …nux/kernel/git/ericvh/v9fs

    Pull 9p updates from Eric Van Hensbergen:
    "Just a few cleanups for 4.3 merge window for the 9p file system. I've
    gotten several more over the past week, but this group has been in
    for-next for at least a couple of weeks so I figured I'd push them
    first while I test the rest.

    Most of the ones not in this set are bug-fixes anyways so I could hold
    them for rc1"

    * tag 'for-linus-4.3-merge-window-part-1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    9p: fix return code of read() when count is 0
    9p: remove unused option Opt_trans

    Linus Torvalds
     
  • Pull nfsd updates from Bruce Fields:
    "Nothing major, but:

    - Add Jeff Layton as an nfsd co-maintainer: no change to existing
    practice, just an acknowledgement of the status quo.

    - Two patches ("nfsd: ensure that...") for a race overlooked by the
    state locking rewrite, causing a crash noticed by multiple users.

    - Lots of smaller bugfixes all over from Kinglong Mee.

    - From Jeff, some cleanup of server rpc code in preparation for
    possible shift of nfsd threads to workqueues"

    * tag 'nfsd-4.3' of git://linux-nfs.org/~bfields/linux: (52 commits)
    nfsd: deal with DELEGRETURN racing with CB_RECALL
    nfsd: return CLID_INUSE for unexpected SETCLIENTID_CONFIRM case
    nfsd: ensure that delegation stateid hash references are only put once
    nfsd: ensure that the ol stateid hash reference is only put once
    net: sunrpc: fix tracepoint Warning: unknown op '->'
    nfsd: allow more than one laundry job to run at a time
    nfsd: don't WARN/backtrace for invalid container deployment.
    fs: fix fs/locks.c kernel-doc warning
    nfsd: Add Jeff Layton as co-maintainer
    NFSD: Return word2 bitmask if setting security label in OPEN/CREATE
    NFSD: Set the attributes used to store the verifier for EXCLUSIVE4_1
    nfsd: SUPPATTR_EXCLCREAT must be encoded before SECURITY_LABEL.
    nfsd: Fix an FS_LAYOUT_TYPES/LAYOUT_TYPES encode bug
    NFSD: Store parent's stat in a separate value
    nfsd: Fix two typos in comments
    lockd: NLM grace period shouldn't block NFSv4 opens
    nfsd: include linux/nfs4.h in export.h
    sunrpc: Switch to using hash list instead single list
    sunrpc/nfsd: Remove redundant code by exports seq_operations functions
    sunrpc: Store cache_detail in seq_file's private directly
    ...

    Linus Torvalds
     
  • Pull btrfs updates from Chris Mason:
    "This has Jeff Mahoney's long standing trim patch that fixes corners
    where trims were missing. Omar has some raid5/6 fixes, especially for
    using scrub and device replace when devices are missing.

    Zhao Lei continues cleaning and fixing things, this series fixes some
    really hard to hit corners in xfstests. I had to pull it last merge
    window due to some deadlocks, but those are now resolved.

    I added support for Tejun's new blkio controllers. It seems to work
    well for single devices, we'll expand to multi-device as well"

    * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits)
    btrfs: fix compile when block cgroups are not enabled
    Btrfs: fix file read corruption after extent cloning and fsync
    Btrfs: check if previous transaction aborted to avoid fs corruption
    btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
    btrfs: Prevent from early transaction abort
    btrfs: Remove unused arguments in tree-log.c
    btrfs: Remove useless condition in start_log_trans()
    Btrfs: add support for blkio controllers
    Btrfs: remove unused mutex from struct 'btrfs_fs_info'
    Btrfs: fix parity scrub of RAID 5/6 with missing device
    Btrfs: fix device replace of a missing RAID 5/6 device
    Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
    Btrfs: count devices correctly in readahead during RAID 5/6 replace
    Btrfs: remove misleading handling of missing device scrub
    btrfs: fix clone / extent-same deadlocks
    Btrfs: fix defrag to merge tail file extent
    Btrfs: fix warning in backref walking
    btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
    btrfs: Remove root argument in extent_data_ref_count()
    btrfs: Fix wrong comment of btrfs_alloc_tree_block()
    ...

    Linus Torvalds
     

05 Sep, 2015

1 commit

  • vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this
    way ->mremap() can have more users. Say, vdso.

    While at it, s/aio_ring_remap/aio_ring_mremap/.

    Note: this is the minimal change before ->mremap() finds another user in
    file_operations; this method should have more arguments, and it can be
    used to kill arch_remap().

    Signed-off-by: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Acked-by: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Laurent Dufour
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
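
    The ->mremap() dispatch described in the commit message above can be illustrated with a small userspace mock. This is only a sketch under stated assumptions, not the kernel code: `struct vm_area`, `struct vm_ops`, `struct ring_ctx`, `ring_mremap()` and `move_area()` are hypothetical stand-ins for `struct vm_area_struct`, the vma operations table, the aio ring context, `aio_ring_mremap()` and `move_vma()`. It shows only the idea of an optional per-mapping hook that is called once the mapping has been moved, so the owner of the mapping can update its own bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

struct vm_area;

/* Hypothetical, simplified operations table for a mapping. */
struct vm_ops {
    /* Optional hook, invoked after the mapping has been relocated. */
    int (*mremap)(struct vm_area *new_vma);
};

/* Hypothetical, simplified description of one mapping. */
struct vm_area {
    unsigned long start;
    unsigned long end;
    const struct vm_ops *ops;   /* may be NULL */
    void *private_data;         /* owner-specific state */
};

/* Hypothetical owner state that caches the mapping's base address. */
struct ring_ctx {
    unsigned long mmap_base;
};

/* Owner-specific hook: refresh the cached base address after a move. */
static int ring_mremap(struct vm_area *new_vma)
{
    struct ring_ctx *ctx = new_vma->private_data;

    ctx->mmap_base = new_vma->start;
    return 0;
}

static const struct vm_ops ring_vm_ops = {
    .mremap = ring_mremap,
};

/*
 * Generic move path: relocate the mapping, then let the owner react
 * through the hook if it provided one. Mappings without a hook are
 * moved without any owner-specific work.
 */
static int move_area(struct vm_area *vma, unsigned long new_start)
{
    unsigned long len;

    assert(vma != NULL);
    len = vma->end - vma->start;
    vma->start = new_start;
    vma->end = new_start + len;

    if (vma->ops && vma->ops->mremap)
        return vma->ops->mremap(vma);
    return 0;
}
```

    In this mock, calling `move_area()` on a mapping whose `ops` points at `ring_vm_ops` updates `ring_ctx.mmap_base` to the new start address, while a mapping with `ops == NULL` is simply moved. Keeping the hook in the per-mapping operations table means the generic move path does not need to know which subsystem owns the mapping, which is the property the commit message relies on when it suggests other users such as the vdso.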