29 Jan, 2014

1 commit


28 Jan, 2014

1 commit

  • There is a possible race in how the nfs_invalidate_mapping function is
    handled. Currently, we go and invalidate the pages in the file and then
    clear NFS_INO_INVALID_DATA.

    The problem is that it's possible for a stale page to creep into the
    mapping after the page was invalidated (i.e., via readahead). If another
    writer comes along and sets the flag after that happens but before
    invalidate_inode_pages2 returns then we could clear the flag
    without the cache having been properly invalidated.

    So, we must clear the flag first and then invalidate the pages. Doing
    this however, opens another race:

    It's possible to have two concurrent read() calls that end up in
    nfs_revalidate_mapping at the same time. The first one clears the
    NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.

    Just before calling that though, the other task races in, checks the
    flag and finds it cleared. At that point, it trusts that the mapping is
    good and gets the lock on the page, allowing the read() to be satisfied
    from the cache even though the data is no longer valid.

    These effects are easily manifested by running diotest3 from the LTP
    test suite on NFS. That program does a series of DIO writes and buffered
    reads. The operations are serialized and page-aligned but the existing
    code fails the test since it occasionally allows a read to come out of
    the cache incorrectly. While mixing direct and buffered I/O isn't
    recommended, I believe it's possible to hit this in other ways that just
    use buffered I/O, though that situation is much harder to reproduce.

    The problem is that the checking/clearing of that flag and the
    invalidation of the mapping really need to be atomic. Fix this by
    serializing concurrent invalidations with a bitlock.

    At the same time, we also need to allow other places that check
    NFS_INO_INVALID_DATA to check whether we might be in the middle of
    invalidating the file, so fix up a couple of places that do that
    to look for the new NFS_INO_INVALIDATING flag.

    Doing this requires us to be careful not to set the bitlock
    unnecessarily, so this code only does that if it believes it will
    be doing an invalidation.

    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
     

18 Jan, 2014

1 commit

  • We should always make sure the cached page is up-to-date when we're
    determining whether we can extend a write to cover the full page -- even
    if we've received a write delegation from the server.

    Commit c7559663 added logic to skip this check if we have a write
    delegation, which can lead to data corruption such as the following
    scenario if client B receives a write delegation from the NFS server:

    Client A:
    # echo 123456789 > /mnt/file

    Client B:
    # echo abcdefghi >> /mnt/file
    # cat /mnt/file
    0�D0�abcdefghi

    Just because we hold a write delegation doesn't mean that we've read in
    the entire page contents.

    Cc: # v3.11+
    Signed-off-by: Scott Mayhew
    Signed-off-by: Trond Myklebust

    Scott Mayhew
     

06 Jan, 2014

1 commit


25 Oct, 2013

1 commit


06 Sep, 2013

1 commit

  • If we're doing buffered writes, and there is no file locking involved,
    then we don't have to worry about whether or not the lock owner information
    is identical.
    By relaxing this check, we ensure that fork()ed child processes can write
    to a page without having to first sync dirty data that was written
    by the parent to disk.

    Reported-by: Quentin Barnes
    Signed-off-by: Trond Myklebust
    Tested-by: Quentin Barnes

    Trond Myklebust
     

05 Sep, 2013

2 commits

  • WRITE and COMMIT can use the machine credential.

    If WRITE is supported and COMMIT is not, make all (mach cred) writes FILE_SYNC4.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     
  • When an NFSv4 client loses contact with the server it can lose any
    locks that it holds.

    Currently when it reconnects to the server it simply tries to reclaim
    those locks. This might succeed even though some other client has
    held and released a lock in the mean time. So the first client might
    think the file is unchanged, but it isn't. This isn't good.

    If, when recovery happens, the locks cannot be claimed because some
    other client still holds the lock, then we get a message in the kernel
    logs, but the client can still write. So two clients can both think
    they have a lock and can both write at the same time. This is equally
    not good.

    There was a patch a while ago
    http://comments.gmane.org/gmane.linux.nfs/41917

    which tried to address some of this, but it didn't seem to go
    anywhere. That patch would also send a signal to the process. That
    might be useful but for now this patch just causes writes to fail.

    For NFSv4 (unlike v2/v3) there is a strong link between the lock and
    the write request so we can fairly easily fail any IO of the lock is
    gone. While some applications might not expect this, it is still
    safer than allowing the write to succeed.

    Because this is a fairly big change in behaviour a module parameter,
    "recover_locks", is introduced which defaults to true (the current
    behaviour) but can be set to "false" to tell the client not to try to
    recover things that were lost.

    Signed-off-by: NeilBrown
    Signed-off-by: Trond Myklebust

    NeilBrown
     

04 Sep, 2013

1 commit

  • We must avoid buffering a WRITE that is using a credential key (e.g. a GSS
    context key) that is about to expire or has expired. We currently will
    paint ourselves into a corner by returning success to the applciation
    for such a buffered WRITE, only to discover that we do not have permission when
    we attempt to flush the WRITE (and potentially associated COMMIT) to disk.

    Use the RPC layer credential key timeout and expire routines which use a
    a watermark, gss_key_expire_timeo. We test the key in nfs_file_write.

    If a WRITE is using a credential with a key that will expire within
    watermark seconds, flush the inode in nfs_write_end and send only
    NFS_FILE_SYNC WRITEs by adding nfs_ctx_key_to_expire to nfs_need_sync_write.
    Note that this results in single page NFS_FILE_SYNC WRITEs.

    Signed-off-by: Andy Adamson
    [Trond: removed a pr_warn_ratelimited() for now]
    Signed-off-by: Trond Myklebust

    Andy Adamson
     

22 Aug, 2013

1 commit

  • Add tracepoints for inode attribute updates, attribute revalidation,
    writeback start/end fsync start/end, attribute change start/end,
    permission check start/end.

    The intention is to enable performance tracing using 'perf'as well as
    improving debugging.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

10 Jul, 2013

1 commit

  • Currently nfs_updatepage allows a write to be extended to cover a full
    page only if we don't have a byte range lock lock on the file... but if
    we have a write delegation on the file or if we have the whole file
    locked for writing then we should be allowed to extend the write as
    well.

    Signed-off-by: Scott Mayhew
    [Trond: fix up call to nfs_have_delegation()]
    Signed-off-by: Trond Myklebust

    Scott Mayhew
     

26 Mar, 2013

1 commit


05 Jan, 2013

1 commit


21 Dec, 2012

1 commit

  • nfs_migrate_page() does not wait for FS-Cache to finish with a page, probably
    leading to the following bad-page-state:

    BUG: Bad page state in process python-bin pfn:17d39b
    page:ffffea00053649e8 flags:004000000000100c count:0 mapcount:0 mapping:(null)
    index:38686 (Tainted: G B ---------------- )
    Pid: 31053, comm: python-bin Tainted: G B ----------------
    2.6.32-71.24.1.el6.x86_64 #1
    Call Trace:
    [] bad_page+0x107/0x160
    [] free_hot_cold_page+0x1c9/0x220
    [] __pagevec_free+0x59/0xb0
    [] ? flush_tlb_others_ipi+0x128/0x130
    [] release_pages+0x21c/0x250
    [] ? remove_migration_pte+0x28a/0x2b0
    [] ? mem_cgroup_get_reclaim_stat_from_page+0x18/0x70
    [] ____pagevec_lru_add+0x167/0x180
    [] __lru_cache_add+0x58/0x70
    [] lru_cache_add_lru+0x21/0x40
    [] putback_lru_page+0x69/0x100
    [] migrate_pages+0x13d/0x5d0
    [] ? ____pagevec_lru_add+0x167/0x180
    [] ? compaction_alloc+0x0/0x370
    [] compact_zone+0x4cc/0x600
    [] ? get_page_from_freelist+0x15c/0x820
    [] ? check_preempt_wakeup+0x1c4/0x3c0
    [] compact_zone_order+0x7e/0xb0
    [] try_to_compact_pages+0x109/0x170
    [] __alloc_pages_nodemask+0x5ed/0x850
    [] ? thread_return+0x4e/0x778
    [] alloc_pages_vma+0x93/0x150
    [] do_huge_pmd_anonymous_page+0x135/0x340
    [] ? rwsem_down_read_failed+0x26/0x30
    [] handle_mm_fault+0x245/0x2b0
    [] do_page_fault+0x123/0x3a0
    [] page_fault+0x25/0x30

    nfs_migrate_page() calls nfs_fscache_release_page() which doesn't actually wait
    - even if __GFP_WAIT is set. The reason that doesn't wait is that
    fscache_maybe_release_page() might deadlock the allocator as the work threads
    writing to the cache may all end up sleeping on memory allocation.

    However, I wonder if that is actually a problem. There are a number of things
    I can do to deal with this:

    (1) Make nfs_migrate_page() wait.

    (2) Make fscache_maybe_release_page() honour the __GFP_WAIT flag.

    (3) Set a timeout around the wait.

    (4) Make nfs_migrate_page() return an error if the page is still busy.

    For the moment, I'll select (2) and (4).

    Signed-off-by: David Howells
    Acked-by: Jeff Layton

    David Howells
     

16 Dec, 2012

1 commit

  • The writeback code is already capable of passing errors back to user space
    by means of the open_context->error. In the case of ENOSPC, Neil Brown
    is reporting seeing 2 errors being returned.

    Neil writes:

    "e.g. if /mnt2/ if an nfs mounted filesystem that has no space then

    strace dd if=/dev/zero conv=fsync >> /mnt2/afile count=1

    reported Input/output error and the relevant parts of the strace output are:

    write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
    fsync(1) = -1 EIO (Input/output error)
    close(1) = -1 ENOSPC (No space left on device)"

    Neil then shows that the duplication of error messages appears to be due to
    the use of the PageError() mechanism, which causes filemap_fdatawait_range
    to return the extra EIO. The regression was introduced by
    commit 7b281ee026552f10862b617a2a51acf49c829554 (NFS: fsync() must exit
    with an error if page writeback failed).

    Fix this by removing the call to SetPageError(), and just relying on
    open_context->error reporting the ENOSPC back to fsync().

    Reported-by: Neil Brown
    Tested-by: Neil Brown
    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org [3.6+]

    Trond Myklebust
     

11 Dec, 2012

2 commits

  • Trond Myklebust
     
  • Jian reported that the following sequence would leave "testfile" with
    corrupt data:

    # mount localhost:/export /mnt/nfs/ -o vers=3
    # echo abc > /mnt/nfs/testfile; echo def >> /export/testfile; echo ghi >> /mnt/nfs/testfile
    # cat -v /export/testfile
    abc
    ^@^@^@^@ghi

    While there's no locking involved here, the operations are serialized,
    so CTO should prevent corruption.

    The first write to the file is fine and writes 4 bytes. The file is then
    extended on the server. When it's reopened a GETATTR is issued and the
    size change is noticed. This causes NFS_INO_INVALID_DATA to be set on
    the file. Because the file is opened for write only,
    nfs_want_read_modify_write() returns 0 to nfs_write_begin().
    nfs_updatepage then calls nfs_write_pageuptodate() to see if it should
    extend the nfs_page to cover the whole page. NFS_INO_INVALID_DATA is
    still set on the file at that point, but that flag is ignored and
    nfs_pageuptodate erroneously extends the write to cover the whole page,
    with the write done on the server side filled in with zeroes.

    This patch just has that function check for NFS_INO_INVALID_DATA in
    addition to NFS_INO_REVAL_PAGECACHE. This fixes the bug, but looking
    over the code, I wonder if we might have a similar bug in
    nfs_revalidate_size(). The difference between those two flags is very
    subtle, so it seems like we ought to be checking for
    NFS_INO_INVALID_DATA in most of the places that we look for
    NFS_INO_REVAL_PAGECACHE.

    I believe this is regression introduced by commit 8d197a568. The code
    did check for NFS_INO_INVALID_DATA prior to that patch.

    Original bug report is here:

    https://bugzilla.redhat.com/show_bug.cgi?id=885743

    Cc: # 3.5+
    Reported-by: Jian Li
    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
     

26 Nov, 2012

1 commit

  • The slab cache in nfs_commit_mempool is wrong, and I think it is just a slip.
    I tested it on a x86-32 machine, the size of nfs_write_header is 544, and
    the size of nfs_commit_data is 408, so it works fine. It is also true that
    sizeof(struct nfs_write_header) > sizeof(struct nfs_commit_data) on other
    platforms in my opinoin. Just fix it.

    Signed-off-by: Yanchuan Nian
    Signed-off-by: Trond Myklebust

    Yanchuan Nian
     

05 Nov, 2012

1 commit


29 Sep, 2012

2 commits

  • If the server reboots before it can commit the unstable writes to disk,
    then nfs_commit_release_pages() will detect this when it compares the
    verifier returned by COMMIT to the one returned by WRITE. When this
    happens, the client needs to resend those writes in order to guarantee
    that they make it to stable storage.

    This patch adds a signalling mechanism to notify fsync() that it
    needs to retry all writes before it can exit.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • We want to be able to pass on the information that the page was not
    dirtied under a lock. Instead of adding a flag parameter, do this
    by passing a pointer to a 'struct nfs_lock_owner' that may be NULL.

    Also reuse this structure in struct nfs_lock_context to carry the
    fl_owner_t and pid_t.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

03 Aug, 2012

1 commit


01 Aug, 2012

4 commits

  • Merge Andrew's second set of patches:
    - MM
    - a few random fixes
    - a couple of RTC leftovers

    * emailed patches from Andrew Morton : (120 commits)
    rtc/rtc-88pm80x: remove unneed devm_kfree
    rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
    mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
    tmpfs: distribute interleave better across nodes
    mm: remove redundant initialization
    mm: warn if pg_data_t isn't initialized with zero
    mips: zero out pg_data_t when it's allocated
    memcg: gix memory accounting scalability in shrink_page_list
    mm/sparse: remove index_init_lock
    mm/sparse: more checks on mem_section number
    mm/sparse: optimize sparse_index_alloc
    memcg: add mem_cgroup_from_css() helper
    memcg: further prevent OOM with too many dirty pages
    memcg: prevent OOM with too many dirty pages
    mm: mmu_notifier: fix freed page still mapped in secondary MMU
    mm: memcg: only check anon swapin page charges for swap cache
    mm: memcg: only check swap cache pages for repeated charging
    mm: memcg: split swapin charge function into private and public part
    mm: memcg: remove needless !mm fixup to init_mm when charging
    mm: memcg: remove unneeded shmem charge type
    ...

    Linus Torvalds
     
  • GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate IO,
    just not of any filesystem data.

    The problem is that previously NOFS was correct because that avoids
    recursion into the NFS code. With swap-over-NFS, it is no longer correct
    as swap IO can lead to this recursion.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The VM does not like PG_private set on PG_swapcache pages. As suggested
    by Trond in http://lkml.org/lkml/2006/8/25/348, this patch disables NFS
    data cache revalidation on swap files. as it does not make sense to have
    other clients change the file while it is being used as swap. This avoids
    setting PG_private on swap pages, since there ought to be no further races
    with invalidate_inode_pages2() to deal with.

    Since we cannot set PG_private we cannot use page->private which is
    already used by PG_swapcache pages to store the nfs_page. Thus augment
    the new nfs_page_find_request logic.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Replace all relevant occurences of page->index and page->mapping in the
    NFS client with the new page_file_index() and page_file_mapping()
    functions.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

31 Jul, 2012

4 commits

  • This patch exports symbols needed by the v4 module. In addition, I also
    switch over to using IS_ENABLED() to check if CONFIG_NFS_V4 or
    CONFIG_NFS_V4_MODULE are set.

    The module (nfs4.ko) will be created in the same directory as nfs.ko and
    will be automatically loaded the first time you try to mount over NFS v4.

    Signed-off-by: Bryan Schumaker
    Signed-off-by: Trond Myklebust

    Bryan Schumaker
     
  • This patch exports symbols and moves over the final structures needed by
    the v3 module. In addition, I also switch over to using IS_ENABLED() to
    check if CONFIG_NFS_V3 or CONFIG_NFS_V3_MODULE are set.

    The module (nfs3.ko) will be created in the same directory as nfs.ko and
    will be automatically loaded the first time you try to mount over NFS v3.

    Signed-off-by: Bryan Schumaker
    Signed-off-by: Trond Myklebust

    Bryan Schumaker
     
  • The module (nfs2.ko) will be created in the same directory as nfs.ko and
    will be automatically loaded the first time you try to mount over NFS v2.

    Signed-off-by: Bryan Schumaker
    Signed-off-by: Trond Myklebust

    Bryan Schumaker
     
  • Somehow I missed this in my previous patch series, but these functions
    are only needed by the v4 code and should be moved to a v4-only file. I
    wasn't exactly sure where I should put these functions, so I moved them
    into nfs4super.c where I could make them static.

    Signed-off-by: Bryan Schumaker
    Signed-off-by: Trond Myklebust

    Bryan Schumaker
     

29 Jun, 2012

4 commits


06 Jun, 2012

1 commit

  • The new commit code fails to copy the verifier into the wb_verf field
    of _all_ the nfs_page structures; it only copies it into the first entry.
    The consequence is that most requests end up failing to match in
    nfs_commit_release.

    Fix is to copy the verifier into the req->wb_verf field in
    nfs_write_completion.

    Signed-off-by: Trond Myklebust
    Cc: Fred Isaman

    Trond Myklebust
     

20 May, 2012

1 commit


10 May, 2012

4 commits