05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straightforward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
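
    The list above can be illustrated with a minimal user-space sketch. It
    assumes, as the message describes, that the page cache shift equals the
    page shift, so the shift difference is zero and the shift is a no-op.
    The SKETCH_* constants and sketch_* functions are hypothetical stand-ins,
    not kernel definitions, and the 4 KiB page size is only an example.

```c
#include <assert.h>

/* Illustrative values only: a 4 KiB page. Real constants are per-architecture. */
#define SKETCH_PAGE_SHIFT       12
#define SKETCH_PAGE_CACHE_SHIFT SKETCH_PAGE_SHIFT  /* old alias, assumed equal */

/* Old style: scale an index by the (zero) difference between the two shifts. */
static unsigned long sketch_old_scale(unsigned long index)
{
    return index << (SKETCH_PAGE_CACHE_SHIFT - SKETCH_PAGE_SHIFT);
}

/* New style: the shift by zero is simply dropped. */
static unsigned long sketch_new_scale(unsigned long index)
{
    return index;
}

/* Byte offset to page index, written with the plain page shift. */
static unsigned long sketch_offset_to_index(unsigned long long pos)
{
    return (unsigned long)(pos >> SKETCH_PAGE_SHIFT);
}

/* Check that both spellings agree for one index. */
static void sketch_check_scale(unsigned long index)
{
    assert(sketch_old_scale(index) == sketch_new_scale(index));
}
```

    Calling sketch_check_scale() with any index passes, which is why the
    semantic patch can replace the shifted expression with the bare one.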

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is the revert of changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

25 Mar, 2016

1 commit

  • Pull more nfsd updates from Bruce Fields:
    "Apologies for the previous request, which omitted the top 8 commits
    from my for-next branch (including the SCSI layout commits). Thanks
    to Trond for spotting my error!"

    This actually includes the new layout types, so here's that part of
    the pull message repeated:

    "Support for a new pnfs layout type from Christoph Hellwig. The new
    layout type is a variant of the block layout which uses SCSI features
    to offer improved fencing and device identification.

    Note this pull request also includes the client side of SCSI layout,
    with Trond's permission"

    * tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linux:
    nfsd: use short read as well as i_size to set eof
    nfsd: better layoutupdate bounds-checking
    nfsd: block and scsi layout drivers need to depend on CONFIG_BLOCK
    nfsd: add SCSI layout support
    nfsd: move some blocklayout code
    nfsd: add a new config option for the block layout driver
    nfs/blocklayout: add SCSI layout support
    nfs4.h: add SCSI layout definitions

    Linus Torvalds
     

22 Mar, 2016

1 commit


18 Mar, 2016

1 commit

  • This is a trivial extension to the block layout driver to support the
    new SCSI layouts draft. There are three changes:

    - device identification through the SCSI VPD page. This allows us to
    directly use the udev generated persistent device names instead of
    requiring an expensive lookup by crawling every block device node
    in /dev and reading a signature for it.
    - use of SCSI persistent reservations to protect device access and
    allow for robust fencing. On the client side this just means
    registering and unregistering a server supplied key.
    - an optimized LAYOUTCOMMIT payload that doesn't send unnecessary
    fields to the server.

    Signed-off-by: Christoph Hellwig
    Acked-by: Trond Myklebust
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     

18 Feb, 2016

1 commit

  • unreferenced object 0xffffc90000abf000 (size 16900):
    comm "fsync02", pid 15765, jiffies 4297431627 (age 423.772s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 a0 c2 19 00 88 ff ff ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x4e/0xb0
    [] __vmalloc_node_range+0x231/0x280
    [] __vmalloc+0x4a/0x50
    [] ext_tree_prepare_commit+0x231/0x2e0 [blocklayoutdriver]
    [] bl_prepare_layoutcommit+0xe/0x10 [blocklayoutdriver]
    [] pnfs_layoutcommit_inode+0x29c/0x330 [nfsv4]
    [] pnfs_generic_sync+0x13/0x20 [nfsv4]
    [] nfs4_file_fsync+0x58/0x150 [nfsv4]
    [] vfs_fsync_range+0x4b/0xb0
    [] do_fsync+0x3d/0x70
    [] SyS_fsync+0x10/0x20
    [] entry_SYSCALL_64_fastpath+0x12/0x76
    [] 0xffffffffffffffff

    v2, add missing include header

    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust

    Kinglong Mee
     

22 Oct, 2015

1 commit

  • Blocklayout uses the file offset as the in-page offset of the first page
    when reading data back. This is definitely wrong: it writes data to a bad
    address in the page, which causes a segmentation fault in the userspace
    application. It must be the page base stored in header->args.pgbase.

    Also, pg_offset has no influence on isect and the extent length.

    Note: The offset of the non-first page is always zero.
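
    The rule described above can be sketched in user-space C. This is only
    an illustration under stated assumptions: sketch_pg_offset() and
    sketch_pg_len() are hypothetical helpers, the 4096-byte page size is an
    example, and `pgbase` stands for the value the message refers to as
    header->args.pgbase.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/*
 * Offset inside page number `i` of a request at which data starts.
 * Only the first page starts at pgbase; every later page starts at zero.
 * The file offset plays no role here.
 */
static unsigned int sketch_pg_offset(size_t i, unsigned int pgbase)
{
    assert(pgbase < SKETCH_PAGE_SIZE);
    return i == 0 ? pgbase : 0;
}

/* Number of bytes that land in page `i`, given the bytes still to transfer. */
static unsigned int sketch_pg_len(size_t i, unsigned int pgbase, size_t remaining)
{
    unsigned int room = SKETCH_PAGE_SIZE - sketch_pg_offset(i, pgbase);

    return remaining < room ? (unsigned int)remaining : room;
}
```

    With pgbase = 2048, the first page receives at most 2048 bytes starting
    at offset 2048, and each following page is filled from offset 0.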

    PS: A test program that segfaults at read():

    #define _GNU_SOURCE     /* for O_DIRECT */

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        char buf[2049];
        char *filename = NULL;
        int fd = -1;

        if (argc < 2) {
            printf("Usage: %s filename\n", argv[0]);
            return 0;
        }

        filename = argv[1];
        fd = open(filename, O_RDONLY | O_DIRECT);
        if (fd < 0) {
            printf("Open %s fail: %m\n", filename);
            return 1;
        }

        /* Read 2048 bytes starting at file offset 2048. */
        lseek(fd, 2048, SEEK_SET);
        if (read(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1))
            printf("Read 2048 bytes of data from %s fail: %m\n", filename);

        close(fd);
        return 0;
    }

    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust

    Kinglong Mee
     

08 Sep, 2015

1 commit

  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable patches:
    - Fix atomicity of pNFS commit list updates
    - Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY)
    - nfs_set_pgio_error sometimes misses errors
    - Fix a thinko in xs_connect()
    - Fix borkage in _same_data_server_addrs_locked()
    - Fix a NULL pointer dereference of migration recovery ops for v4.2
    client
    - Don't let the ctime override attribute barriers.
    - Revert "NFSv4: Remove incorrect check in can_open_delegated()"
    - Ensure flexfiles pNFS driver updates the inode after write finishes
    - flexfiles must not pollute the attribute cache with attributes from
    the DS
    - Fix a protocol error in layoutreturn
    - Fix a protocol issue with NFSv4.1 CLOSE stateids

    Bugfixes + cleanups
    - pNFS blocks bugfixes from Christoph
    - Various cleanups from Anna
    - More fixes for delegation corner cases
    - Don't fsync twice for O_SYNC/IS_SYNC files
    - Fix pNFS and flexfiles layoutstats bugs
    - pnfs/flexfiles: avoid duplicate tracking of mirror data
    - pnfs: Fix layoutget/layoutreturn/return-on-close serialisation
    issues
    - pnfs/flexfiles: error handling retries a layoutget before fallback
    to MDS

    Features:
    - Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from
    Kinglong
    - More RDMA client transport improvements from Chuck
    - Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr()
    verbs from the SUNRPC, Lustre and core infiniband tree.
    - Optimise away the close-to-open getattr if there is no cached data"

    * tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (108 commits)
    NFSv4: Respect the server imposed limit on how many changes we may cache
    NFSv4: Express delegation limit in units of pages
    Revert "NFS: Make close(2) asynchronous when closing NFS O_DIRECT files"
    NFS: Optimise away the close-to-open getattr if there is no cached data
    NFSv4.1/flexfiles: Clean up ff_layout_write_done_cb/ff_layout_commit_done_cb
    NFSv4.1/flexfiles: Mark the layout for return in ff_layout_io_track_ds_error()
    nfs: Remove unneeded checking of the return value from scnprintf
    nfs: Fix truncated client owner id without proto type
    NFSv4.1/flexfiles: Mark layout for return if the mirrors are invalid
    NFSv4.1/flexfiles: RW layouts are valid only if all mirrors are valid
    NFSv4.1/flexfiles: Fix incorrect usage of pnfs_generic_mark_devid_invalid()
    NFSv4.1/flexfiles: Fix freeing of mirrors
    NFSv4.1/pNFS: Don't request a minimal read layout beyond the end of file
    NFSv4.1/pnfs: Handle LAYOUTGET return values correctly
    NFSv4.1/pnfs: Don't ask for a read layout for an empty file.
    NFSv4.1: Fix a protocol issue with CLOSE stateids
    NFSv4.1/flexfiles: Don't mark the entire deviceid as bad for file errors
    SUNRPC: Prevent SYN+SYNACK+RST storms
    SUNRPC: xs_reset_transport must mark the connection as disconnected
    NFSv4.1/pnfs: Ensure layoutreturn reserves space for the opaque payload
    ...

    Linus Torvalds
     

18 Aug, 2015

5 commits


29 Jul, 2015

1 commit

  • Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawback of not being persistent
    when bios are queued up, and of not being passed along from child to
    parent bio in the ever more popular chaining scenario. Having both
    mechanisms available has the additional drawback of utterly confusing
    driver authors and introducing bugs where various I/O submitters only
    deal with one of them, and the others have to add boilerplate code to
    deal with both kinds of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up.
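
    The idea can be sketched with a user-space mock. struct sketch_bio and
    the sketch_* functions below are hypothetical and much simpler than the
    kernel's struct bio; they only show how an errno stored in the object
    itself can carry any error value and survive being passed from a child
    to its parent.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mock of a bio that carries its own error; not the kernel's struct bio. */
struct sketch_bio {
    int error;                  /* 0 on success, negative errno otherwise */
    struct sketch_bio *parent;  /* non-NULL when this bio is chained */
    int completed;
};

/* Complete a bio. The first error sticks and is handed on to the parent. */
static void sketch_bio_endio(struct sketch_bio *bio, int error)
{
    assert(bio != NULL);
    if (error && !bio->error)
        bio->error = error;
    bio->completed = 1;
    if (bio->parent && bio->error && !bio->parent->error)
        bio->parent->error = bio->error;
}

/* Chain one child to one parent and report what the parent ends up with. */
static int sketch_chain_demo(int child_error)
{
    struct sketch_bio parent = { 0, NULL, 0 };
    struct sketch_bio child = { 0, &parent, 0 };

    sketch_bio_endio(&child, child_error);
    sketch_bio_endio(&parent, 0);
    return parent.error;
}
```

    In this mock, a child that fails with -ENOSPC leaves -ENOSPC in the
    parent, whereas a cleared "up to date" flag could only have signalled
    -EIO.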

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

28 Mar, 2015

2 commits


04 Feb, 2015

1 commit


11 Dec, 2014

1 commit

  • Pull VFS changes from Al Viro:
    "First pile out of several (there _definitely_ will be more). Stuff in
    this one:

    - unification of d_splice_alias()/d_materialize_unique()

    - iov_iter rewrite

    - killing a bunch of ->f_path.dentry users (and f_dentry macro).

    Getting that completed will make life much simpler for
    unionmount/overlayfs, since then we'll be able to limit the places
    sensitive to file _dentry_ to reasonably few. Which allows to have
    file_inode(file) pointing to inode in a covered layer, with dentry
    pointing to (negative) dentry in union one.

    Still not complete, but much closer now.

    - crapectomy in lustre (dead code removal, mostly)

    - "let's make seq_printf return nothing" preparations

    - assorted cleanups and fixes

    There _definitely_ will be more piles"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    copy_from_iter_nocache()
    new helper: iov_iter_kvec()
    csum_and_copy_..._iter()
    iov_iter.c: handle ITER_KVEC directly
    iov_iter.c: convert copy_to_iter() to iterate_and_advance
    iov_iter.c: convert copy_from_iter() to iterate_and_advance
    iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
    iov_iter.c: convert iov_iter_zero() to iterate_and_advance
    iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
    iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
    iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
    iov_iter.c: iterate_and_advance
    iov_iter.c: macros for iterating over iov_iter
    kill f_dentry macro
    dcache: fix kmemcheck warning in switch_names
    new helper: audit_file()
    nfsd_vfs_write(): use file_inode()
    ncpfs: use file_inode()
    kill f_dentry uses
    lockd: get rid of ->f_path.dentry->d_sb
    ...

    Linus Torvalds
     

09 Dec, 2014

1 commit


25 Nov, 2014

1 commit


20 Nov, 2014

1 commit


13 Nov, 2014

2 commits

  • Commit 3a6fd1f004fc (pnfs/blocklayout: remove read-modify-write handling
    in bl_write_pagelist) introduced a bogus assignment pg_index = pg_index
    in variable initialization. AFAICS it's just a typo so remove it.
    Spotted by Coverity (id 1248711).

    CC: Christoph Hellwig
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Trond Myklebust

    Jan Kara
     
  • The rpc_pipefs code isn't thread safe, leading to occasional use after
    frees when running xfstests generic/241 (dbench).

    Signed-off-by: Christoph Hellwig
    Link: http://lkml.kernel.org/r/1411740170-18611-2-git-send-email-hch@lst.de
    Cc: stable@vger.kernel.org # 3.17.x
    Signed-off-by: Trond Myklebust

    Christoph Hellwig
     

13 Oct, 2014

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
    Hansen)

    - Various sched/idle refinements for better idle handling (Nicolas
    Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

    - sched/numa updates and optimizations (Rik van Riel)

    - sysbench speedup (Vincent Guittot)

    - capacity calculation cleanups/refactoring (Vincent Guittot)

    - Various cleanups to thread group iteration (Oleg Nesterov)

    - Double-rq-lock removal optimization and various refactorings
    (Kirill Tkhai)

    - various sched/deadline fixes

    ... and lots of other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
    sched/fair: Delete resched_cpu() from idle_balance()
    sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
    sched: Improve sysbench performance by fixing spurious active migration
    sched/x86: Fix up typo in topology detection
    x86, sched: Add new topology for multi-NUMA-node CPUs
    sched/rt: Use resched_curr() in task_tick_rt()
    sched: Use rq->rd in sched_setaffinity() under RCU read lock
    sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
    sched: Use dl_bw_of() under RCU read lock
    sched/fair: Remove duplicate code from can_migrate_task()
    sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
    sched: print_rq(): Don't use tasklist_lock
    sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
    sched: Fix the task-group check in tg_has_rt_tasks()
    sched/fair: Leverage the idle state info when choosing the "idlest" cpu
    sched: Let the scheduler see CPU idle states
    sched/deadline: Fix inter- exclusive cpusets migrations
    sched/deadline: Clear dl_entity params when setscheduling to different class
    sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
    ...

    Linus Torvalds
     

22 Sep, 2014

1 commit

  • kbuild test robot reports:

    fs/built-in.o: In function `bl_map_stripe':
    >> :(.text+0x965b4): undefined reference to `__aeabi_uldivmod'
    >> :(.text+0x965cc): undefined reference to `__aeabi_uldivmod'
    >> :(.text+0x96604): undefined reference to `__aeabi_uldivmod'

    Fixes: 5c83746a0cf2 (pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing)
    Cc: Stephen Rothwell
    Cc: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Trond Myklebust
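
    The undefined `__aeabi_uldivmod` references reported above are what a
    plain 64-bit division or modulo typically compiles to on 32-bit ARM: the
    compiler emits a call to a library helper that the kernel does not
    provide. Kernel code normally uses helpers such as div_u64_rem() instead.
    The following user-space sketch shows the same idea with hypothetical
    names (sketch_div_u64_rem, sketch_map_stripe); it is not the code from
    this commit, and the stripe mapping is only an assumed example of where
    such a division appears.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a helper like div_u64_rem(): divide a 64-bit
 * value by a 32-bit divisor using shift-and-subtract, so no 64-bit '/' or
 * '%' operator is needed.
 */
static uint64_t sketch_div_u64_rem(uint64_t dividend, uint32_t divisor,
                                   uint32_t *remainder)
{
    uint64_t quotient = 0;
    uint64_t rem = 0;
    int bit;

    assert(divisor != 0);
    for (bit = 63; bit >= 0; bit--) {
        /* Bring down the next bit of the dividend. */
        rem = (rem << 1) | ((dividend >> bit) & 1);
        if (rem >= divisor) {
            rem -= divisor;
            quotient |= (uint64_t)1 << bit;
        }
    }
    *remainder = (uint32_t)rem;
    return quotient;
}

struct sketch_stripe_pos {
    uint64_t chunk;            /* index of the chunk containing the offset */
    uint32_t offset_in_chunk;  /* byte offset inside that chunk */
};

/* Split a 64-bit offset into chunk index and offset within the chunk. */
static struct sketch_stripe_pos sketch_map_stripe(uint64_t offset,
                                                  uint32_t chunk_size)
{
    struct sketch_stripe_pos pos;

    pos.chunk = sketch_div_u64_rem(offset, chunk_size, &pos.offset_in_chunk);
    return pos;
}
```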

    Trond Myklebust
     

19 Sep, 2014

1 commit

  • schedule(), io_schedule() and schedule_timeout() always return
    with TASK_RUNNING state set, so one more setting is unnecessary.

    (All places in the patch are visibly fine; the only exception is
    kiblnd_scheduler() from:

    drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c

    Its schedule() is one line above the standard 3 lines of unified diff
    context.)

    There are no places where set_current_state() is used for mb().

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai
    Cc: Alasdair Kergon
    Cc: Anil Belur
    Cc: Arnd Bergmann
    Cc: Dave Kleikamp
    Cc: David Airlie
    Cc: David Howells
    Cc: Dmitry Eremin
    Cc: Frank Blaschka
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Isaac Huang
    Cc: James E.J. Bottomley
    Cc: James E.J. Bottomley
    Cc: J. Bruce Fields
    Cc: Jeff Dike
    Cc: Jesper Nilsson
    Cc: Jiri Slaby
    Cc: Laura Abbott
    Cc: Liang Zhen
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Masaru Nomura
    Cc: Michael Opdenacker
    Cc: Mikael Starvik
    Cc: Mike Snitzer
    Cc: Neil Brown
    Cc: Oleg Drokin
    Cc: Peng Tao
    Cc: Richard Weinberger
    Cc: Robert Love
    Cc: Steven Rostedt
    Cc: Trond Myklebust
    Cc: Ursula Braun
    Cc: Zi Shen Lim
    Cc: devel@driverdev.osuosl.org
    Cc: dm-devel@redhat.com
    Cc: dri-devel@lists.freedesktop.org
    Cc: fcoe-devel@open-fcoe.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux390@de.ibm.com
    Cc: linux-afs@lists.infradead.org
    Cc: linux-cris-kernel@axis.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-parisc@vger.kernel.org
    Cc: linux-raid@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Cc: qla2xxx-upstream@qlogic.com
    Cc: user-mode-linux-devel@lists.sourceforge.net
    Cc: user-mode-linux-user@lists.sourceforge.net
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

16 Sep, 2014

1 commit


13 Sep, 2014

8 commits


11 Sep, 2014

6 commits

  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Trond Myklebust

    Christoph Hellwig
     
  • This speeds up truncate-heavy workloads like fsx by multiple orders of
    magnitude.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Trond Myklebust

    Christoph Hellwig
     
  • This allows removing extents from the extent tree, especially on truncate
    operations, and thus fixes reads from truncated and re-extended files
    that previously returned stale data.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Trond Myklebust

    Christoph Hellwig
     
  • Currently the block layout driver tracks extents in three separate
    data structures:

    - the two lists of pnfs_block_extent structures returned by the server
    - the list of sectors that were in invalid state but have been written to
    - a list of pnfs_block_short_extent structures for LAYOUTCOMMIT

    All of these share the property that they are not only highly inefficient
    data structures, but also that operations on them are even more inefficient
    than necessary.

    In addition there are various implementation defects like:

    - using an int to track sectors, causing corruption for large offsets
    - incorrect normalization of page or block granularity ranges
    - insufficient error handling
    - incorrect synchronization as extents can be modified while they are in
    use
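
    The first defect in the list above can be illustrated with a short
    sketch. It assumes 512-byte sectors and a 32-bit int, which holds on
    common platforms but is an assumption here; the sketch_* names are
    hypothetical and do not come from the driver.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9  /* 512-byte sectors, assumed */

/* Byte offset to sector number, kept in a 64-bit type. */
static uint64_t sketch_offset_to_sector(uint64_t offset)
{
    return offset >> SKETCH_SECTOR_SHIFT;
}

/* Returns 1 if the sector number for this offset does not fit in an int. */
static int sketch_sector_overflows_int(uint64_t offset)
{
    return sketch_offset_to_sector(offset) > (uint64_t)INT_MAX;
}

/* What a 32-bit variable would retain: only the low 32 bits of the sector. */
static uint32_t sketch_sector_low32(uint64_t offset)
{
    return (uint32_t)sketch_offset_to_sector(offset);
}
```

    Under these assumptions an offset of 1 TiB already maps to sector 2^31,
    which no longer fits in an int, and two distant offsets can collapse to
    the same truncated sector number.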

    This patch replaces all three data structures with a single unified
    rbtree structure tracking all extents, as well as their in-memory state,
    although we still need two instances, for read-only and read-write
    extents, due to the arcane client side COW feature in the block layouts
    spec.

    To fix the problem of extents possibly being modified while in use, we
    make sure to return a copy of the extent for use in the write path - the
    extent can only be invalidated by a layout recall or return, which has
    to wait until the I/O operations have finished due to refcounts on the
    layout
    segment.
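
    The "return a copy" approach can be sketched as follows. To keep the
    example short, a sorted array with binary search stands in for the
    rbtree, and struct sketch_extent and the sketch_* functions are
    hypothetical, not the driver's types. The point is only that the caller
    works on its own copy rather than on a pointer into the shared
    structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: a half-open sector range [start, start + len). */
struct sketch_extent {
    uint64_t start;
    uint64_t len;
    int state;
};

/*
 * Look up the extent covering `sector` in a sorted, non-overlapping array.
 * On success the extent is copied into *out and 1 is returned, so later
 * changes to the array cannot affect the caller's view.
 */
static int sketch_lookup(const struct sketch_extent *exts, size_t n,
                         uint64_t sector, struct sketch_extent *out)
{
    size_t lo = 0, hi = n;

    assert(out != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (sector < exts[mid].start) {
            hi = mid;
        } else if (sector >= exts[mid].start + exts[mid].len) {
            lo = mid + 1;
        } else {
            *out = exts[mid];  /* copy by value, not a pointer into the array */
            return 1;
        }
    }
    return 0;
}

/* Demo over a fixed table: length of the extent covering `sector`, or 0. */
static uint64_t sketch_demo_len(uint64_t sector)
{
    static const struct sketch_extent table[] = {
        { 0, 8, 0 }, { 8, 16, 1 }, { 32, 8, 0 },
    };
    struct sketch_extent copy;

    if (!sketch_lookup(table, sizeof(table) / sizeof(table[0]), sector, &copy))
        return 0;
    return copy.len;
}
```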

    The new extent tree works similarly to the schemes used by block based
    filesystems like XFS or ext4.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Trond Myklebust

    Christoph Hellwig
     
  • The core nfs code handles setting pages uptodate on reads, so there is no
    need to mess with the pageflags ourselves. Also remove a debug function
    to dump page flags.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Trond Myklebust

    Christoph Hellwig
     
  • Use the new PNFS_READ_WHOLE_PAGE flag to offload read-modify-write
    handling to core nfs code, and remove a huge chunk of deadlock prone
    mess from the block layout writeback path.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Trond Myklebust

    Christoph Hellwig