02 Dec, 2016

1 commit


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
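
The substitutions listed above can be illustrated with a small user-space sketch. It assumes the old macros were plain aliases of the PAGE_* ones, so the shift by (PAGE_CACHE_SHIFT - PAGE_SHIFT) is a shift by zero. The PAGE_SHIFT value of 12 and the two helper names are assumptions made for illustration, not kernel code.

```c
#include <assert.h>

/* Illustrative user-space stand-ins; real values come from arch headers. */
#define PAGE_SHIFT        12
#define PAGE_SIZE         (1UL << PAGE_SHIFT)

/* The old names, modeled here as plain aliases of the PAGE_* macros. */
#define PAGE_CACHE_SHIFT  PAGE_SHIFT
#define PAGE_CACHE_SIZE   PAGE_SIZE

/* Old style (hypothetical helper): page cache index to page count. */
static unsigned long old_index_to_pages(unsigned long index)
{
	/* PAGE_CACHE_SHIFT - PAGE_SHIFT is 0, so this shift changes nothing. */
	return index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

/* New style (hypothetical helper): the shift is simply dropped. */
static unsigned long new_index_to_pages(unsigned long index)
{
	return index;
}
```

Both helpers return the same value for every index, which is why the semantic patch can drop the shift expressions mechanically.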

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Dec, 2015

1 commit

  • Commit 42cb14b110a5 ("mm: migrate dirty page without
    clear_page_dirty_for_io etc") simplified the migration of a PageDirty
    pagecache page: one stat needs moving from zone to zone and that's about
    all.

    It's convenient and safest for it to shift the PageDirty bit from old
    page to new, just before updating the zone stats: before copying data
    and marking the new PageUptodate. This is all done while both pages are
    isolated and locked, just as before; and just as before, there's a
    moment when the new page is visible in the radix_tree, but not yet
    PageUptodate. What's new is that it may now be briefly visible as
    PageDirty before it is PageUptodate.

    When I scoured the tree to see if this could cause a problem anywhere,
    the only places I found were in two similar functions __r4w_get_page():
    which look up a page with find_get_page() (not using page lock), then
    claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate.

    I'm not sure whether that was right before, but now it might be wrong
    (on rare occasions): only claim the page is uptodate if PageUptodate.
    Or perhaps the page in question could never be migratable anyway?
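
A minimal sketch of the change in the check described above, using a stand-in structure instead of the kernel's struct page. The type fake_page and both function names are hypothetical; the real code tests PageUptodate(), PageDirty() and PageWriteback() on a page returned by find_get_page().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page flags consulted by the check. */
struct fake_page {
	bool uptodate;   /* models PageUptodate() */
	bool dirty;      /* models PageDirty() */
	bool writeback;  /* models PageWriteback() */
};

/* Old rule: dirty or under-writeback pages were also claimed uptodate. */
static bool r4w_uptodate_old(const struct fake_page *p)
{
	return p->uptodate || p->dirty || p->writeback;
}

/* New rule: only the uptodate flag is trusted. */
static bool r4w_uptodate_new(const struct fake_page *p)
{
	return p->uptodate;
}
```

A page that is briefly dirty but not yet uptodate during migration passes the old rule and fails the new one, which is the window the commit closes.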

    Signed-off-by: Hugh Dickins
    Tested-by: Boaz Harrosh
    Cc: Benny Halevy
    Cc: Trond Myklebust
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

29 Sep, 2015

1 commit

  • IS_ERR(_OR_NULL) already contains an 'unlikely' compiler hint, so there
    is no need to add one again in its callers. Drop it.
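
The redundancy can be seen in a user-space replica of the helpers. This is a sketch modeled on the kernel's err.h, not a copy of it; it assumes a GCC or Clang compiler for __builtin_expect, and check_before() and check_after() are hypothetical callers.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ERRNO 4095

/* Branch hint, assuming GCC or Clang. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The hint already lives inside the value check. */
#define IS_ERR_VALUE(x) \
	unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline bool IS_ERR(const void *ptr)
{
	return IS_ERR_VALUE((unsigned long)ptr);
}

/* Before: the caller wraps IS_ERR() in a second, redundant hint. */
static int check_before(const void *p)
{
	if (unlikely(IS_ERR(p)))
		return -1;
	return 0;
}

/* After: same behavior, the outer unlikely() is dropped. */
static int check_after(const void *p)
{
	if (IS_ERR(p))
		return -1;
	return 0;
}
```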

    Signed-off-by: Viresh Kumar
    Reviewed-by: Jeff Layton
    Reviewed-by: David Howells
    Reviewed-by: Steve French
    Signed-off-by: Jiri Kosina

    Viresh Kumar
     

28 Mar, 2015

2 commits


04 Feb, 2015

3 commits

  • Let it return the current nfs_pgio_mirror in use, depending on
    pg_mirror_count. For reads we always use pg_mirrors[0], so this effectively
    gives us the freedom to use pg_mirror_idx to track the actual mirror to read
    from throughout the IO stack.

    Signed-off-by: Peng Tao
    Signed-off-by: Tom Haynes

    Peng Tao
     
  • This patch adds mirrored write support to the pgio layer. The default
    is to use one mirror, but pgio callers may define callbacks to change
    this to any value up to the (arbitrarily selected) limit of 16.

    The basic idea is to break out members of nfs_pageio_descriptor that cannot
    be shared between mirrored DSes and put them in a new structure.

    Signed-off-by: Weston Andros Adamson

    Weston Andros Adamson
     
  • This is needed to support mirrored writes - the first write can't just
    trash the lseg, we need to keep it around until all mirrors have
    written.

    Signed-off-by: Weston Andros Adamson

    Weston Andros Adamson
     

22 Oct, 2014

1 commit


20 Oct, 2014

1 commit


13 Sep, 2014

1 commit

  • The kbuild test robot complained about a new sparse warning in
    objio_alloc_deviceid_node, but it turns out that this was just a moved
    reference to an existing variable. Fix it to have the right big endian
    annotated type.

    Note that there are some other endianness issues in this file that I didn't
    bother to sort out as they involve global headers.

    Reported-by: kbuild test robot
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Trond Myklebust

    Christoph Hellwig
     

11 Sep, 2014

1 commit

  • Add support to the common pNFS core to issue GETDEVICEINFO calls on
    a device ID cache miss. The code is taken from the well-debugged
    file layout implementation and calls out to the layoutdriver through
    a new alloc_deviceid_node method. The calling conventions for
    nfs4_find_get_deviceid are changed so that all information needed to
    send a GETDEVICEINFO request is passed to the common code.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Trond Myklebust

    Christoph Hellwig
     

25 Jun, 2014

3 commits

  • Remove duplicate writeverf structure from merge of nfs_pgio_header and
    nfs_pgio_data and remove writeverf related flags and logic to handle
    more than one RPC per nfs_pgio_header.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     
  • struct nfs_pgio_data only exists as a member of nfs_pgio_header, but is
    passed around everywhere, because there used to be multiple _data structs
    per _header. Many of these functions then use the _data to find a pointer
    to the _header. This patch cleans this up by merging the nfs_pgio_data
    structure into nfs_pgio_header and passing nfs_pgio_header around instead.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     
  • Rename "verf" to "writeverf" and "pages" to "page_array" to prepare for
    merge of nfs_pgio_data and nfs_pgio_header.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     

30 May, 2014

1 commit


29 May, 2014

3 commits

  • Now that pg_test can change the size of the request (by returning a non-zero
    size smaller than the request), pg_test functions that call other
    pg_test functions must return the minimum of the result - or 0 if any fail.

    Also clean up the logic of some pg_test functions so that all checks are
    for conditions where coalescing is not possible.
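
The "minimum of the results, or 0 if any fail" rule can be sketched as follows. The callback type and all function names here are hypothetical simplifications; the real pg_test callbacks take a pageio descriptor and two requests rather than a bare size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical pg_test-style callback: returns how many bytes of the
 * request may be coalesced, or 0 if coalescing is not possible.
 */
typedef size_t (*pg_test_fn)(size_t req_bytes);

/* Sample callbacks used only for illustration. */
static size_t accept_all(size_t req_bytes)
{
	return req_bytes;
}

static size_t limit_to_1024(size_t req_bytes)
{
	return req_bytes < 1024 ? req_bytes : 1024;
}

static size_t reject_all(size_t req_bytes)
{
	(void)req_bytes;
	return 0;
}

/* A pg_test that calls two others: 0 if either fails, else the minimum. */
static size_t combined_pg_test(pg_test_fn a, pg_test_fn b, size_t req_bytes)
{
	size_t size_a = a(req_bytes);
	size_t size_b;

	if (size_a == 0)
		return 0;
	size_b = b(req_bytes);
	if (size_b == 0)
		return 0;
	return size_a < size_b ? size_a : size_b;
}
```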

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     
  • This is a step toward allowing pg_test to inform the
    coalescing code to reduce the size of requests so they may fit in
    whatever scheme the pg_test callback wants to define.

    For now, just return the size of the request if there is space, or 0
    if there is not. This shouldn't change any behavior as it acts
    the same as when the pg_test functions returned bool.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     
  • At this point, the only difference between nfs_read_data and
    nfs_write_data is the write verifier.

    Signed-off-by: Anna Schumaker
    Signed-off-by: Trond Myklebust

    Anna Schumaker
     

29 Jun, 2013

1 commit


07 Jun, 2013

1 commit


12 Apr, 2013

1 commit


18 Feb, 2013

1 commit

  • When the pNFS client is using the block layout, the blocklayoutdriver
    module can still be removed first. If we umount later, this can cause
    an oops in unset_pnfs_layoutdriver, because
    nfss->pnfs_curr_ld->clear_layoutdriver is no longer valid.

    To reproduce it:
    modprobe blocklayoutdriver
    mount -t nfs4 -o minorversion=1 pnfsip:/ /mnt/
    rmmod blocklayoutdriver
    umount /mnt

    You can then see the following:

    CPU 0
    Pid: 17023, comm: umount.nfs4 Tainted: GF O 3.7.0-rc6-pnfs #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    RIP: 0010:[] [] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4]
    RSP: 0018:ffff8800022d9e48 EFLAGS: 00010286
    RAX: ffffffffa04a1b00 RBX: ffff88000b013800 RCX: 0000000000000001
    RDX: ffffffff81ae8ee0 RSI: ffff880001ee94b8 RDI: ffff88000b013800
    RBP: ffff8800022d9e58 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff880001ee9400
    R13: ffff8800105978c0 R14: 00007fff25846c08 R15: 0000000001bba550
    FS: 00007f45ae7f0700(0000) GS:ffff880012c00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffffffffa04a1b38 CR3: 0000000002c0c000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process umount.nfs4 (pid: 17023, threadinfo ffff8800022d8000, task ffff880006e48aa0)
    Stack:
    ffff8800105978c0 ffff88000b013800 ffff8800022d9e78 ffffffffa04cd0ce
    ffff8800022d9e78 ffff88000b013800 ffff8800022d9ea8 ffffffffa04755a7
    ffff8800022d9ea8 ffff880002f96400 ffff88000b013800 ffff880002f96400
    Call Trace:
    [] nfs4_destroy_server+0x1e/0x30 [nfsv4]
    [] nfs_free_server+0xb7/0x150 [nfs]
    [] nfs_kill_super+0x35/0x40 [nfs]
    [] deactivate_locked_super+0x45/0x70
    [] deactivate_super+0x4a/0x70
    [] mntput_no_expire+0xd2/0x130
    [] sys_umount+0x72/0xe0
    [] system_call_fastpath+0x16/0x1b
    Code: 06 e1 b8 ea ff ff ff eb 9e 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 87 80 03 00 00 48 89 fb 48 85 c0 74 29 8b 40 38 48 85 c0 74 02 ff d0 48 8b 03 3e ff 48 04 0f 94 c2
    RIP [] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4]
    RSP
    CR2: ffffffffa04a1b38
    ---[ end trace 29f75aaedda058bf ]---

    Signed-off-by: fanchaoting
    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    fanchaoting
     

05 Nov, 2012

1 commit


17 Oct, 2012

1 commit


09 Oct, 2012

1 commit

  • For buffered writes, the block layout client scans the inode mapping to
    find the next hole and uses offset-to-hole as the layoutget length. The
    object layout client uses offset-to-isize as the layoutget length.

    For direct writes, both block layout and object layout use dreq->bytes_left.

    Signed-off-by: Peng Tao
    Signed-off-by: Trond Myklebust

    Peng Tao
     

03 Aug, 2012

1 commit

  • Depending on layout and ARCH, ORE has some limits on max IO sizes,
    which are communicated in (what else) ore_layout->max_io_length,
    which is always stripe aligned.
    This was used as the pg_test boundary for splitting and starting
    a new IO.

    But in the case of a long IO where the start offset is not aligned
    what would happen is that both end of IO[N] and start of IO[N+1]
    would be unaligned, causing each IO boundary parity unit to be
    calculated and written twice.

    So what we do in this patch is split off the very start of an unaligned
    IO, up to a stripe boundary, so that the following IOs can continue fully
    aligned until the end.
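
A minimal sketch of that split, assuming a byte offset, a stripe size, and a stripe-aligned max_io_length. The function first_io_len() and its parameters are hypothetical and do not reproduce the driver's actual pg_test logic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Length of the first IO for a request starting at 'offset'.
 * An unaligned start is cut at the next stripe boundary; an aligned
 * start may use the full (stripe-aligned) max_io_length.
 */
static uint64_t first_io_len(uint64_t offset, uint64_t length,
			     uint64_t stripe_size, uint64_t max_io_length)
{
	uint64_t misalign, limit;

	assert(stripe_size > 0);
	misalign = offset % stripe_size;
	limit = misalign ? stripe_size - misalign : max_io_length;
	return length < limit ? length : limit;
}
```

With a 4096-byte stripe, a request starting at offset 100 gets a first IO of 3996 bytes, so the next IO starts at 4096 and every later boundary stays stripe aligned.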

    We might be sacrificing the case where the full unaligned IO would
    fit within a single max_io_length, but the sacrifice is well worth
    the elimination of the double calculation and double IO of parity units.
    In practice the sacrifice is marginal and almost unmeasurable.

    TODO:
    If we know the total expected linear segment that will
    be received, at pg_init, we could use that information
    in many places:
    1. blocks-layout get_layout write segment size
    2. Better mds-threshold
    3. In above situation for a better clean split

    I will do this in future submission.

    Signed-off-by: Boaz Harrosh
    Signed-off-by: Trond Myklebust

    Boaz Harrosh
     

20 Jul, 2012

2 commits

  • It is very common for the end of the file to be unaligned on
    stripe size. But since we know it's beyond the file's end, the
    XOR should be performed with all zeros.

    Old code used to just read zeros out of the OSD devices, which is a great
    waste. But what scares me more about this situation is that we now have
    pages attached to the file's mapping that are beyond i_size. I don't
    like the kind of bugs this calls for.

    Fix both problems by returning a global zero_page if the offset is
    beyond i_size.
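
The idea can be sketched in user space as follows. The structure fake_file, the constant SKETCH_PAGE_SIZE and the function r4w_get_page_sketch() are hypothetical stand-ins; the real callback returns a struct page from the file's mapping.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/* One shared, read-only page of zeros (static storage is zero-filled). */
static const unsigned char zero_page[SKETCH_PAGE_SIZE];

/* Hypothetical stand-in for an inode with in-memory contents. */
struct fake_file {
	uint64_t i_size;       /* file size in bytes */
	unsigned char *data;   /* backing buffer, at least i_size bytes */
};

/*
 * Return the data to XOR for the page at 'offset'. Beyond i_size there
 * is nothing to read, so hand back the shared zero page instead of
 * attaching a new page to the file.
 */
static const unsigned char *r4w_get_page_sketch(const struct fake_file *f,
						uint64_t offset)
{
	if (offset >= f->i_size)
		return zero_page;
	return f->data + offset;
}
```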

    TODO:
    Change the API to ->__r4w_get_page() so a NULL can be
    returned without being considered as error, since XOR API
    treats NULL entries as zero_pages.

    [Bug since 3.2. Should apply the same way to all Kernels since]
    CC: Stable Tree
    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     
  • [Bug since 3.2 Kernel]
    CC: Stable Tree
    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     

05 May, 2012

1 commit

  • Fix the following sparse warnings:

    fs/nfs/direct.c:221:6: warning: symbol 'nfs_direct_readpage_release' was
    not declared. Should it be static?
    fs/nfs/read.c:38:43: warning: non-ANSI function declaration of function
    'nfs_readhdr_alloc'
    fs/nfs/objlayout/objio_osd.c:214:5: warning: symbol '__alloc_objio_seg'
    was not declared. Should it be static?

    Reported-by: Dan Carpenter
    Signed-off-by: Trond Myklebust
    Cc: Fred Isaman
    Cc: Boaz Harrosh

    Trond Myklebust
     

28 Apr, 2012

1 commit

  • In order to avoid duplicating all the data in nfs_read_data whenever we
    split it up into multiple RPC calls (either due to a short read result
    or due to rsize < PAGE_SIZE), we split out the bits that are the same
    per RPC call into a separate "header" structure.

    The goal this patch moves towards is to have a single header
    refcounted by several rpc_data structures. Thus, we want to always refer
    from rpc_data to the header, and not the other way. This patch comes
    close to that ideal, but the directio code currently needs some
    special casing, isolated in the nfs_direct_[read_write]hdr_release()
    functions. This will be dealt with in a future patch.

    Signed-off-by: Fred Isaman
    Signed-off-by: Trond Myklebust

    Fred Isaman
     

27 Apr, 2012

1 commit


21 Mar, 2012

1 commit

  • The pnfs-objects protocol mandates that we autologin into devices not
    present in the system, according to information specified in the
    get_device_info returned from the server.

    The Protocol specifies two login hints.
    1. An IP address:port combination
    2. A string URI which is constructed as a URL with a protocol prefix
    followed by :// and a string as address. For each protocol prefix
    the string-address format might be different.

    We only support the second option. The first option is just redundant
    to the second one.
    NOTE: The Kernel part of autologin does not parse the URI string. It
    just channels it to a user-mode script. So any new login protocols should
    only update the user-mode script which is a part of the nfs-utils package,
    but the Kernel need not change.

    We implement the autologin by using the call_usermodehelper() API.
    (Thanks to Steve Dickson for pointing it out)
    So there is no running daemon needed, and/or special setup.

    We add the osd_login_prog kernel module parameter, which defaults to:
    /sbin/osd_login

    The kernel tries to upcall the program specified in osd_login_prog. If the
    file is not found or the execution fails, the kernel will disable any further
    upcalls by zeroing out osd_login_prog, until the admin re-enables it by
    setting the osd_login_prog parameter to a proper program.
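
The disable-on-failure behavior can be sketched in user space as follows. The callback type run_helper_fn stands in for call_usermodehelper(), and autologin() with its return codes, the sample helpers and the buffer size are all hypothetical; only the osd_login_prog name and its /sbin/osd_login default come from the text above.

```c
#include <assert.h>

/* Models the module parameter; an empty string means "upcalls disabled". */
static char osd_login_prog[64] = "/sbin/osd_login";

/* Hypothetical stand-in for call_usermodehelper(): < 0 means failure. */
typedef int (*run_helper_fn)(const char *path, const char *uri);

static int helper_ok(const char *path, const char *uri)
{
	(void)path;
	(void)uri;
	return 0;
}

static int helper_fails(const char *path, const char *uri)
{
	(void)path;
	(void)uri;
	return -2;
}

/* Try the upcall; on failure, zero out the program path to stop retries. */
static int autologin(run_helper_fn run, const char *uri)
{
	int ret;

	if (osd_login_prog[0] == '\0')
		return -1;              /* disabled until re-enabled by admin */
	ret = run(osd_login_prog, uri);
	if (ret < 0)
		osd_login_prog[0] = '\0';
	return ret;
}
```

After one failed upcall, later calls return early without running anything, which models the state that lasts until the parameter is set again.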

    Also add text about the osd_login program command line API to:
    Documentation/filesystems/nfs/pnfs.txt
    and documentation of the new osd_login_prog module parameter to:
    Documentation/kernel-parameters.txt

    TODO: Add timeout option in the case osd_login program gets
    stuck

    Signed-off-by: Sachin Bhamare
    Signed-off-by: Boaz Harrosh
    Signed-off-by: Trond Myklebust

    Sachin Bhamare
     

14 Mar, 2012

1 commit

  • At some point in the past Linus Torvalds wrote:
    > From: Linus Torvalds
    > commit a84a79e4d369a73c0130b5858199e949432da4c6 upstream.
    >
    > The size is always valid, but variable-length arrays generate worse code
    > for no good reason (unless the function happens to be inlined and the
    > compiler sees the length for the simple constant it is).
    >
    > Also, there seems to be some code generation problem on POWER, where
    > Henrik Bakken reports that register r28 can get corrupted under some
    > subtle circumstances (interrupt happening at the wrong time?). That all
    > indicates some seriously broken compiler issues, but since variable
    > length arrays are bad regardless, there's little point in trying to
    > chase it down.
    >
    > "Just don't do that, then".

    Since then any use of "variable length arrays" has become blasphemous.
    Even in perfectly good, beautiful, perfectly safe code like the one
    below where the variable length arrays are only used as a sizeof()
    parameter, for type-safe dynamic structure allocations. GCC is not
    executing any stack allocation code.

    I have produced a small file which defines two functions main1(unsigned numdevs)
    and main2(unsigned numdevs). main1 uses code as before with call to malloc
    and main2 uses code as of after this patch. I compiled it as:
    gcc -O2 -S see_asm.c
    and here is what I get:

    main1:
    .LFB7:
    .cfi_startproc
    mov %edi, %edi
    leaq 4(%rdi,%rdi), %rdi
    salq $3, %rdi
    jmp malloc
    .cfi_endproc
    .LFE7:
    .size main1, .-main1
    .p2align 4,,15
    .globl main2
    .type main2, @function
    main2:
    .LFB8:
    .cfi_startproc
    mov %edi, %edi
    addq $2, %rdi
    salq $4, %rdi
    jmp malloc
    .cfi_endproc
    .LFE8:
    .size main2, .-main2
    .section .text.startup,"ax",@progbits
    .p2align 4,,15

    *Exact* same code !!!

    So please seriously consider not accepting this patch and leave the
    perfectly good code intact.
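
The see_asm.c file itself is not shown in the message. The following is a hypothetical reconstruction of the comparison it describes, reduced to the size computations: one version uses a variable-length array type only as a sizeof() operand, the other uses explicit arithmetic on a flexible array member. All type and function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device entry and container with a flexible array. */
struct dev_entry {
	void *dev;
	unsigned long flags;
};

struct comp_hdr {
	unsigned int numdevs;
	void *private_data;
	struct dev_entry devs[];
};

/* Size via a VLA type inside sizeof(): no array is ever allocated. */
static size_t size_with_vla(unsigned int numdevs)
{
	assert(numdevs > 0);    /* a zero-length VLA type is not valid C */
	return sizeof(struct comp_hdr) + sizeof(struct dev_entry[numdevs]);
}

/* Size via explicit multiplication, with no VLA anywhere. */
static size_t size_without_vla(unsigned int numdevs)
{
	return sizeof(struct comp_hdr) + numdevs * sizeof(struct dev_entry);
}
```

Both functions compute the same value for any positive numdevs, which is consistent with the message's observation that the two variants compile to equivalent code.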

    CC: Linus Torvalds
    Signed-off-by: Boaz Harrosh
    Signed-off-by: Trond Myklebust

    Boaz Harrosh
     

12 Mar, 2012

1 commit

  • Fix a number of "warning: symbol 'foo' was not declared. Should it be
    static?" conditions.

    Fix 2 cases of "warning: Using plain integer as NULL pointer"

    fs/nfs/delegation.c:263:31: warning: restricted fmode_t degrades to integer
    - We want to allow upgrades to a WRITE delegation, but should otherwise
    consider servers that hand out duplicate delegations to be broken.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

07 Feb, 2012

1 commit


06 Jan, 2012

2 commits

  • As mandated by the standard, in case of an IO error a pNFS
    objects layout driver must return its layout. This is because
    all device errors are reported to the server as part of the
    layout return buffer.

    This is implemented the same way PNFS_LAYOUTRET_ON_SETATTR
    is done, through a bit flag on the pnfs_layoutdriver_type->flags
    member. The flag is set by the layout driver that wants a
    layout_return performed at pnfs_ld_{write,read}_done in case
    of an error.
    (Though I have not defined a wrapper like pnfs_ld_layoutret_on_setattr
    because this code is never called outside of pnfs.c and pnfs IO
    paths)

    Without this patch 3.[0-2] Kernels leak memory and have an annoying
    WARN_ON after every IO error utilizing the pnfs-obj driver.

    [This patch is for 3.2 Kernel. 3.1/0 Kernels need a different patch]
    CC: Stable Tree
    Signed-off-by: Boaz Harrosh
    Signed-off-by: Trond Myklebust

    Boaz Harrosh
     
  • Some time along the way pNFS IO errors were switched to
    communicate through a special iodata->pnfs_error member instead
    of the regular RPC members. But objlayout was not switched
    over.

    Fix that!
    Without this fix any IO error results in a hang, because IO is not
    switched to the MDS and pages are never cleared or read.

    [Applies to 3.2.0. Same bug different patch for 3.1/0 Kernels]
    CC: Stable Tree
    Signed-off-by: Boaz Harrosh
    Signed-off-by: Trond Myklebust

    Boaz Harrosh
     

03 Nov, 2011

1 commit