29 Oct, 2013

32 commits

  • This patch adds support for multiple security options which can be
    specified using a colon-delimited list of security flavors (the same
    syntax as nfsd's exports file).

    This is useful, for instance, when NFSv4.x mounts cross SECINFO
    boundaries. With this patch a user can use "sec=krb5i:krb5p"
    to mount a remote filesystem using krb5i, but can still cross
    into krb5p-only exports.

    New mounts will try all security options before failing. NFSv4.x
    SECINFO results will be compared against the sec= flavors to
    find the first flavor in both lists; if no match is found,
    -EPERM is returned.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
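
    The flavor matching described in this commit can be sketched in
    user-space C. This is a minimal sketch, not the kernel code: struct
    auth_info, parse_sec() and pick_flavor() are hypothetical names,
    flavors are kept as strings for readability, and walking the server's
    SECINFO list in order is an assumption about which list sets the
    priority.

    ```c
    #include <assert.h>
    #include <errno.h>
    #include <stddef.h>
    #include <string.h>

    #define MAX_FLAVORS 12  /* illustrative cap, not the kernel's limit */

    /* Hypothetical stand-in for the parsed sec= list. */
    struct auth_info {
        size_t count;
        char flavors[MAX_FLAVORS][16];
    };

    /* Split a colon-delimited list such as "krb5i:krb5p"; 0 or -EINVAL. */
    static int parse_sec(const char *opt, struct auth_info *info)
    {
        assert(opt && info);
        info->count = 0;
        while (*opt) {
            size_t len = strcspn(opt, ":");

            if (len == 0 || len >= sizeof(info->flavors[0]) ||
                info->count == MAX_FLAVORS)
                return -EINVAL;
            memcpy(info->flavors[info->count], opt, len);
            info->flavors[info->count][len] = '\0';
            info->count++;
            opt += len;
            if (*opt == ':')
                opt++;
        }
        return info->count ? 0 : -EINVAL;
    }

    /*
     * Return the index of the first SECINFO flavor that also appears in
     * the sec= list, or -EPERM when the two lists have nothing in common.
     */
    static int pick_flavor(const char *const *secinfo, size_t n,
                           const struct auth_info *info)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < info->count; j++)
                if (strcmp(secinfo[i], info->flavors[j]) == 0)
                    return (int)i;
        return -EPERM;
    }
    ```

    With "krb5i:krb5p" parsed, a server that offers only sys and krb5p
    would select krb5p, while a sys-only server would yield -EPERM.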
     
  • Since the parsed sec= flavor is now stored in nfs_server->auth_info,
    we no longer need an nfs_server flag to determine if a sec= option was
    used.

    This flag has not been completely removed because it is still needed for
    ABI compatibility with the (old but still supported) non-text parsed
    mount options.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     
  • Cache the auth_info structure in nfs_server and pass these values to submounts.

    This lays the groundwork for supporting multiple sec= options.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     
  • When filling parsed_mount_data, store the parsed sec= mount option in
    the new struct nfs_auth_info and the chosen flavor in selected_flavor.

    This patch lays the groundwork for supporting multiple sec= options.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     
  • It's not used outside of nfs4namespace.c anymore.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     
  • In nfs4_wait_clnt_recover(), hold a reference to the clp being
    waited on. The state manager can reduce clp->cl_count to 1, in
    which case the nfs_put_client() in nfs4_run_state_manager() can
    free *clp before wait_on_bit() returns and allows
    nfs4_wait_clnt_recover() to run again.

    The behavior at that point is non-deterministic. If the waited-on
    bit still happens to be zero, wait_on_bit() will wake the waiter as
    expected. If the bit is set again (say, if the memory was poisoned
    when freed) wait_on_bit() can leave the waiter asleep.

    This is a narrow fix which ensures the safety of accessing *clp in
    nfs4_wait_clnt_recover(), but does not address the continued use
    of a possibly freed *clp after nfs4_wait_clnt_recover() returns
    (see nfs_end_delegation_return(), for example).

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
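
    The idea of pinning an object for the duration of a wait can be
    sketched with C11 atomics. This is a minimal single-process sketch
    under stated assumptions: struct client, client_get(), client_put()
    and wait_clnt_recover() are hypothetical stand-ins, the busy loop
    replaces wait_on_bit(), and a "freed" marker replaces the real free.

    ```c
    #include <assert.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    /* Hypothetical stand-in for struct nfs_client. */
    struct client {
        atomic_int refcount;
        atomic_bool recovering;  /* plays the role of the waited-on bit */
        bool freed;              /* set instead of calling free(), for the demo */
    };

    static void client_get(struct client *clp)
    {
        atomic_fetch_add(&clp->refcount, 1);
    }

    static void client_put(struct client *clp)
    {
        /* atomic_fetch_sub() returns the value before the decrement. */
        if (atomic_fetch_sub(&clp->refcount, 1) == 1)
            clp->freed = true;
    }

    /* Hold our own reference so the last put cannot happen while we wait. */
    static int wait_clnt_recover(struct client *clp)
    {
        client_get(clp);
        while (atomic_load(&clp->recovering))
            ;  /* a real waiter would sleep here, not spin */
        assert(!clp->freed);
        client_put(clp);
        return 0;
    }
    ```

    As the commit message notes, this only protects the wait itself; a
    caller that keeps using the object after the wait returns needs its
    own reference.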
     
  • Broadly speaking, v4.1 migration is untested. There are no servers
    in the wild that support NFSv4.1 migration. However, as server
    implementations become available, we do want to enable testing by
    developers, while leaving it disabled for environments for which
    broken migration support would be an unpleasant surprise.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • With the advent of NFSv4 sessions in NFSv4.1 and following, a "lease
    moved" condition is reported differently than it is in NFSv4.0.

    NFSv4 minor version 0 servers return an error status code,
    NFS4ERR_LEASE_MOVED, to signal that a lease has moved. This error
    causes the whole compound operation to fail. Normal compounds
    against this server continue to fail until the client performs
    migration recovery on the migrated share.

    Minor version 1 and later servers assert a bit flag in the reply to
    a compound's SEQUENCE operation to signal LEASE_MOVED. This is not
    a fatal condition: operations against this server continue normally.
    The server asserts this flag until the client performs migration
    recovery on the migrated share.

    Note that servers MUST NOT return NFS4ERR_LEASE_MOVED to NFSv4
    clients not using NFSv4.0.

    After the server asserts any of the sr_status_flags in the SEQUENCE
    operation in a typical compound, our client initiates standard lease
    recovery. For NFSv4.1+, a stand-alone SEQUENCE operation is
    performed to discover what recovery is needed.

    If SEQ4_STATUS_LEASE_MOVED is asserted in this stand-alone SEQUENCE
    operation, our client attempts to discover which FSIDs have been
    migrated, and then performs migration recovery on each.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
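
    The two ways of signalling a moved lease can be contrasted in a short
    sketch. The numeric values are the ones assigned by the NFSv4.0 and
    NFSv4.1 specifications; struct compound_result and both helper
    functions are hypothetical names introduced for illustration only.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define NFS4ERR_LEASE_MOVED      10031        /* NFSv4.0 status code */
    #define SEQ4_STATUS_LEASE_MOVED  0x00000080u  /* NFSv4.1 sr_status_flags bit */

    /* Hypothetical summary of one compound reply. */
    struct compound_result {
        uint32_t minorversion;
        int32_t  status;           /* status of the whole compound */
        uint32_t sr_status_flags;  /* only meaningful for minorversion >= 1 */
    };

    /* True if the reply signals that a lease has moved. */
    static bool lease_moved(const struct compound_result *res)
    {
        assert(res);
        if (res->minorversion == 0)
            return res->status == NFS4ERR_LEASE_MOVED;
        return (res->sr_status_flags & SEQ4_STATUS_LEASE_MOVED) != 0;
    }

    /* Only NFSv4.0 fails the compound itself when the lease has moved. */
    static bool compound_failed_by_lease_move(const struct compound_result *res)
    {
        assert(res);
        return res->minorversion == 0 && res->status == NFS4ERR_LEASE_MOVED;
    }
    ```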
     
  • With NFSv4 minor version 0, the asynchronous lease RENEW
    heartbeat can return NFS4ERR_LEASE_MOVED. Error recovery logic for
    async RENEW is a separate code path from the generic NFS proc paths,
    so it must be updated to handle NFS4ERR_LEASE_MOVED as well.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Currently the Linux NFS client ignores the operation status code for
    the RELEASE_LOCKOWNER operation. Like NFSv3's UMNT operation,
    RELEASE_LOCKOWNER is a courtesy to help servers manage their
    resources, and the outcome is not consequential for the client.

    During a migration, a server may report NFS4ERR_LEASE_MOVED, in
    which case the client really should retry, since typically
    LEASE_MOVED has nothing to do with the current operation, but does
    prevent it from going forward.

    Also, it's important for a client to respond as soon as possible to
    a moved lease condition, since the client's lease could expire on
    the destination without further action by the client.

    NFS4ERR_DELAY is not included in the list of valid status codes for
    RELEASE_LOCKOWNER in RFC 3530bis. However, rfc3530-migration-update
    does permit migration-capable servers to return DELAY to clients,
    but only in the context of an ongoing migration. In this case the
    server has frozen lock state in preparation for migration, and a
    client retry would help the destination server purge unneeded state
    once migration recovery is complete.

    Interestingly, NFS4ERR_MOVED is not valid for RELEASE_LOCKOWNER, even
    though lock owners can be migrated with Transparent State Migration.

    Note that RFC 3530bis section 9.5 includes RELEASE_LOCKOWNER in the
    list of operations that renew a client's lease on the server if they
    succeed. Now that our client pays attention to the operation's
    status code, we can note that renewal appropriately.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Trigger lease-moved recovery when a request returns
    NFS4ERR_LEASE_MOVED.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • A migration on the FSID in play for the current NFS operation
    is reported via the error status code NFS4ERR_MOVED.

    "Lease moved" means that a migration has occurred on some other
    FSID than the one for the current operation. It's a signal that
    the client should take action immediately to handle a migration
    that it may not have noticed otherwise. This is so that the
    client's lease does not expire unnoticed on the destination server.

    In NFSv4.0, a moved lease is reported with the NFS4ERR_LEASE_MOVED
    error status code.

    To recover from NFS4ERR_LEASE_MOVED, check each FSID for that server
    to see if it is still present. Invoke nfs4_try_migration() if the
    FSID is no longer present on the server.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
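
    The recovery walk described in this commit can be sketched as a loop
    over the known FSIDs. This is a sketch under stated assumptions:
    struct fs_entry, fsid_present() and try_migration() are hypothetical
    stand-ins (the last one for nfs4_try_migration()), and the probe
    result is stored in the entry instead of being fetched over the wire.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical per-filesystem record. */
    struct fs_entry {
        unsigned long long fsid;
        bool present_on_server;  /* what the presence probe would report */
        bool migrated;           /* set once migration recovery has run */
    };

    /* Stand-in for the on-the-wire presence probe. */
    static bool fsid_present(const struct fs_entry *fs)
    {
        return fs->present_on_server;
    }

    /* Stand-in for nfs4_try_migration(). */
    static void try_migration(struct fs_entry *fs)
    {
        fs->migrated = true;
    }

    /* Probe every FSID and recover the absent ones; returns how many moved. */
    static size_t handle_lease_moved(struct fs_entry *list, size_t n)
    {
        size_t recovered = 0;

        assert(list || n == 0);
        for (size_t i = 0; i < n; i++) {
            if (fsid_present(&list[i]))
                continue;
            try_migration(&list[i]);
            recovered++;
        }
        return recovered;
    }
    ```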
     
  • Introduce a mechanism for probing a server to determine if an FSID
    is present or absent.

    The on-the-wire compound is different between minor version 0 and 1.
    Minor version 0 appends a RENEW operation to identify which client
    ID is probing. Minor version 1 has a SEQUENCE operation in the
    compound which effectively carries the same information.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • When a server returns NFS4ERR_MOVED during a delegation recall,
    trigger the new migration recovery logic in the state manager.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • When a server returns NFS4ERR_MOVED, trigger the new migration
    recovery logic in the state manager.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • I'm going to use this exit label also for migration recovery
    failures.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Clean up.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Migration recovery and state recovery must be serialized, so handle
    both in the state manager thread.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • NFS_SB() returns the pointer to an nfs_server struct, given a
    pointer to a super_block. But we have no way to go back the other
    way.

    Add a super_block backpointer field so that, given an nfs_server
    struct, it is easy to get to the filesystem's root dentry.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • The nfs4_proc_fs_locations() function is invoked during referral
    processing to perform a GETATTR(fs_locations) on an object's parent
    directory in order to discover the target of the referral. It
    performs a LOOKUP in the compound, so the client needs to know the
    parent's file handle a priori.

    Unfortunately this function is not adequate for handling migration
    recovery. We need to probe fs_locations information on an FSID, but
    there's no parent directory available for many operations that
    can return NFS4ERR_MOVED.

    Another subtlety: recovering from NFS4ERR_LEASE_MOVED is a process
    of walking over a list of known FSIDs that reside on the server, and
    probing whether they have migrated. Once the server has detected
    that the client has probed all migrated file systems, it stops
    returning NFS4ERR_LEASE_MOVED.

    A minor version zero server needs to know what client ID is
    requesting fs_locations information so it can clear the flag that
    forces it to continue returning NFS4ERR_LEASE_MOVED. This flag is
    set per client ID and per FSID. However, the client ID is not an
    argument of either the PUTFH or GETATTR operations. Later minor
    versions have client ID information embedded in the compound's
    SEQUENCE operation.

    Therefore, by convention, minor version zero clients send a RENEW
    operation in the same compound as the GETATTR(fs_locations), since
    RENEW's one argument is a clientid4. This allows a minor version
    zero server to identify correctly the client that is probing for a
    migration.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
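
    The difference between the two compounds can be sketched as follows.
    This is an illustration only: enum nfs_op and
    build_fs_locations_probe() are hypothetical, the enum values are not
    wire opcodes, and placing SEQUENCE first follows the NFSv4.1 rule
    that SEQUENCE leads a compound.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Symbolic operation names; these are not the on-the-wire opcodes. */
    enum nfs_op { OP_SEQUENCE, OP_PUTFH, OP_GETATTR, OP_RENEW };

    /* Fill ops[] with the probe for a minor version; returns the op count. */
    static size_t build_fs_locations_probe(unsigned minorversion,
                                           enum nfs_op *ops, size_t cap)
    {
        size_t n = 0;

        assert(ops && cap >= 3);
        if (minorversion >= 1)
            ops[n++] = OP_SEQUENCE;  /* identifies the client via its session */
        ops[n++] = OP_PUTFH;
        ops[n++] = OP_GETATTR;       /* GETATTR(fs_locations) */
        if (minorversion == 0)
            ops[n++] = OP_RENEW;     /* its one argument is the clientid4 */
        return n;
    }
    ```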
     
  • Allow code in nfsv4.ko to use _nfs_display_fhandle().

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • The differences between minor version 0 and minor version 1
    migration will be abstracted by the addition of a set of migration
    recovery ops.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Introduce functions that can walk through an array of returned
    fs_locations information and connect a transport to one of the
    destination servers listed therein.

    Note that NFS minor version 1 introduces "fs_locations_info" which
    extends the locations array sorting criteria available to clients.
    This is not supported yet.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • New function nfs4_update_server() moves an nfs_server to a different
    nfs_client. This is done as part of migration recovery.

    Though it may be appealing to think of them as the same thing,
    migration recovery is not the same as following a referral.

    For a referral, the client has not descended into the file system
    yet: it has no nfs_server, no super block, no inodes or open state.
    It is enough to simply instantiate the nfs_server and super block,
    and perform a referral mount.

    For a migration, however, we have all of those things already, and
    they have to be moved to a different nfs_client. No local namespace
    changes are needed here.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Cached opens have already been handled by _nfs4_opendata_reclaim_to_nfs4_state
    and can safely skip being reprocessed, but must still call update_open_stateid
    to make sure that all active fmodes are recovered.

    Signed-off-by: Weston Andros Adamson
    Cc: stable@vger.kernel.org # 3.7.x: f494a6071d3: NFSv4: fix NULL dereference
    Cc: stable@vger.kernel.org # 3.7.x: a43ec98b72a: NFSv4: don't fail on missin
    Cc: stable@vger.kernel.org # 3.7.x
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     
  • Currently, if the call to nfs_refresh_inode fails, then we end up leaking
    a reference count, due to the call to nfs4_get_open_state.
    While we're at it, replace nfs4_get_open_state with a simple call to
    atomic_inc(); there is no need to do a full lookup of the struct nfs_state
    since it is passed as an argument in the struct nfs4_opendata, and
    is already assigned to the variable 'state'.

    Cc: stable@vger.kernel.org # 3.7.x: a43ec98b72a: NFSv4: don't fail on missing
    Cc: stable@vger.kernel.org # 3.7.x
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • This is an unneeded check that could cause the client to fail to recover
    opens.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     
  • _nfs4_opendata_reclaim_to_nfs4_state doesn't expect to see a cached
    open CLAIM_PREVIOUS, but this can happen. An example is when there are
    RDWR openers and RDONLY openers on a delegation stateid. The recovery
    path will first try an open CLAIM_PREVIOUS for the RDWR openers; this
    marks the delegation as not needing RECLAIM anymore, so the open
    CLAIM_PREVIOUS for the RDONLY openers will not actually send an rpc.

    The NULL dereference is due to _nfs4_opendata_reclaim_to_nfs4_state
    returning ERR_PTR(rpc_status) when !rpc_done. When the open is
    cached, rpc_done == 0 and rpc_status == 0, thus
    _nfs4_opendata_reclaim_to_nfs4_state returns NULL - this is unexpected
    by callers of nfs4_opendata_to_nfs4_state().

    This can be reproduced easily by opening the same file two times on an
    NFSv4.0 mount with delegations enabled, once as RDWR and once as RDONLY then
    sleeping for a long time. While the files are held open, kick off state
    recovery and this NULL dereference will be hit every time.

    An example OOPS:

    [ 65.003602] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
    [ 65.005312] IP: [] __nfs4_close+0x1e/0x160 [nfsv4]
    [ 65.006820] PGD 7b0ea067 PUD 791ff067 PMD 0
    [ 65.008075] Oops: 0000 [#1] SMP
    [ 65.008802] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache snd_ens1371 gameport nfsd snd_rawmidi snd_ac97_codec ac97_bus btusb snd_seq snd_seq_device snd_pcm ppdev bluetooth auth_rpcgss coretemp snd_page_alloc crc32_pclmul crc32c_intel ghash_clmulni_intel microcode rfkill nfs_acl vmw_balloon serio_raw snd_timer lockd parport_pc e1000 snd soundcore parport i2c_piix4 shpchp vmw_vmci sunrpc ata_generic mperf pata_acpi mptspi vmwgfx ttm scsi_transport_spi drm mptscsih mptbase i2c_core
    [ 65.018684] CPU: 0 PID: 473 Comm: 192.168.10.85-m Not tainted 3.11.2-201.fc19.x86_64 #1
    [ 65.020113] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
    [ 65.022012] task: ffff88003707e320 ti: ffff88007b906000 task.ti: ffff88007b906000
    [ 65.023414] RIP: 0010:[] [] __nfs4_close+0x1e/0x160 [nfsv4]
    [ 65.025079] RSP: 0018:ffff88007b907d10 EFLAGS: 00010246
    [ 65.026042] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [ 65.027321] RDX: 0000000000000050 RSI: 0000000000000001 RDI: 0000000000000000
    [ 65.028691] RBP: ffff88007b907d38 R08: 0000000000016f60 R09: 0000000000000000
    [ 65.029990] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
    [ 65.031295] R13: 0000000000000050 R14: 0000000000000000 R15: 0000000000000001
    [ 65.032527] FS: 0000000000000000(0000) GS:ffff88007f600000(0000) knlGS:0000000000000000
    [ 65.033981] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 65.035177] CR2: 0000000000000030 CR3: 000000007b27f000 CR4: 00000000000407f0
    [ 65.036568] Stack:
    [ 65.037011] 0000000000000000 0000000000000001 ffff88007b907d90 ffff88007a880220
    [ 65.038472] ffff88007b768de8 ffff88007b907d48 ffffffffa037e4a5 ffff88007b907d80
    [ 65.039935] ffffffffa036a6c8 ffff880037020e40 ffff88007a880000 ffff880037020e40
    [ 65.041468] Call Trace:
    [ 65.042050] [] nfs4_close_state+0x15/0x20 [nfsv4]
    [ 65.043209] [] nfs4_open_recover_helper+0x148/0x1f0 [nfsv4]
    [ 65.044529] [] nfs4_open_recover+0x116/0x150 [nfsv4]
    [ 65.045730] [] nfs4_open_reclaim+0xad/0x150 [nfsv4]
    [ 65.046905] [] nfs4_do_reclaim+0x149/0x5f0 [nfsv4]
    [ 65.048071] [] nfs4_run_state_manager+0x3bc/0x670 [nfsv4]
    [ 65.049436] [] ? nfs4_do_reclaim+0x5f0/0x5f0 [nfsv4]
    [ 65.050686] [] ? nfs4_do_reclaim+0x5f0/0x5f0 [nfsv4]
    [ 65.051943] [] kthread+0xc0/0xd0
    [ 65.052831] [] ? insert_kthread_work+0x40/0x40
    [ 65.054697] [] ret_from_fork+0x7c/0xb0
    [ 65.056396] [] ? insert_kthread_work+0x40/0x40
    [ 65.058208] Code: 5c 41 5d 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 89 d5 41 54 53 48 89 fb 8b 67 30 f0 41 ff 44 24 44 49 8d 7c 24 40 e8 0e 0a 2d e1 44
    [ 65.065225] RIP [] __nfs4_close+0x1e/0x160 [nfsv4]
    [ 65.067175] RSP
    [ 65.068570] CR2: 0000000000000030
    [ 65.070098] ---[ end trace 0d1fe4f5c7dd6f8b ]---

    Cc: #3.7+
    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
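
    Why a zero status turns into a NULL return can be shown with a
    user-space re-creation of the kernel's error-pointer helpers. This is
    a minimal sketch: ERR_PTR() and IS_ERR() below are simplified copies
    of the kernel idiom, while struct state and reclaim_to_state() are
    hypothetical and only mimic the shape of the faulty path.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_ERRNO 4095

    /* Encode a negative errno value in a pointer. */
    static inline void *ERR_PTR(long error)
    {
        return (void *)(intptr_t)error;
    }

    /* True only for pointers in the top MAX_ERRNO addresses. */
    static inline bool IS_ERR(const void *ptr)
    {
        return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
    }

    struct state { int dummy; };

    /*
     * Faulty shape: when the RPC was never sent, rpc_done is false and
     * rpc_status is 0, so the "error" return is ERR_PTR(0), i.e. NULL.
     */
    static struct state *reclaim_to_state(bool rpc_done, int rpc_status,
                                          struct state *cached)
    {
        assert(rpc_status <= 0);
        if (!rpc_done)
            return ERR_PTR(rpc_status);
        return cached;
    }
    ```

    A caller that only checks IS_ERR() treats the NULL result as a valid
    state pointer and dereferences it, which matches the fault address
    seen in the trace above.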
     
  • The current caching model calls for the security label to be set on
    first lookup and/or on any subsequent label changes. There is no
    need to do it as part of an open reclaim.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • nfs_parse_mount_options returns 0 on error, not -errno.

    Reported-by: Karel Zak
    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
     
  • Reported-by: Eric Doutreleau
    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
     
  • As of commit 5d422301f97b821301efcdb6fc9d1a83a5c102d6 we no longer zero the
    state.

    Signed-off-by: Andy Adamson
    Signed-off-by: Trond Myklebust

    Andy Adamson
     

02 Oct, 2013

2 commits

  • The spec states that the client should not resend requests because
    the server will disconnect if it needs to drop an RPC request.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • In nfs4_proc_getlk(), when some error causes a retry of the call to
    _nfs4_proc_getlk(), we can end up with Oopses of the form

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000134
    IP: [] _raw_spin_lock+0xe/0x30

    Call Trace:
    [] _atomic_dec_and_lock+0x4d/0x70
    [] nfs4_put_lock_state+0x32/0xb0 [nfsv4]
    [] nfs4_fl_release_lock+0x15/0x20 [nfsv4]
    [] _nfs4_proc_getlk.isra.40+0x146/0x170 [nfsv4]
    [] nfs4_proc_lock+0x399/0x5a0 [nfsv4]

    The problem is that we don't clear the request->fl_ops after the first
    try and so when we retry, nfs4_set_lock_state() exits early without
    setting the lock stateid.
    Regression introduced by commit 70cc6487a4e08b8698c0e2ec935fb48d10490162
    (locks: make ->lock release private data before returning in GETLK case)

    Reported-by: Weston Andros Adamson
    Reported-by: Jorge Mora
    Signed-off-by: Trond Myklebust
    Cc: #2.6.22+

    Trond Myklebust
     

01 Oct, 2013

4 commits

  • Pull NFS client bugfixes from Trond Myklebust:
    - Stable fix for Oopses in the pNFS files layout driver
    - Fix a regression when doing a non-exclusive file create on NFSv4.x
    - NFSv4.1 security negotiation fixes when looking up the root
    filesystem
    - Fix a memory ordering issue in the pNFS files layout driver

    * tag 'nfs-for-3.12-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFS: Give "flavor" an initial value to fix a compile warning
    NFSv4.1: try SECINFO_NO_NAME flavs until one works
    NFSv4.1: Ensure memory ordering between nfs4_ds_connect and nfs4_fl_prepare_ds
    NFSv4.1: nfs4_fl_prepare_ds - fix bugs when the connect attempt fails
    NFSv4: Honour the 'opened' parameter in the atomic_open() filesystem method

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton.

    * emailed patches from Andrew Morton: (22 commits)
    pidns: fix free_pid() to handle the first fork failure
    ipc,msg: prevent race with rmid in msgsnd,msgrcv
    ipc/sem.c: update sem_otime for all operations
    mm/hwpoison: fix the lack of one reference count against poisoned page
    mm/hwpoison: fix false report on 2nd attempt at page recovery
    mm/hwpoison: fix test for a transparent huge page
    mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood
    block: change config option name for cmdline partition parsing
    mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration
    mm: avoid reinserting isolated balloon pages into LRU lists
    arch/parisc/mm/fault.c: fix uninitialized variable usage
    include/asm-generic/vtime.h: avoid zero-length file
    nilfs2: fix issue with race condition of competition between segments for dirty blocks
    Documentation/kernel-parameters.txt: replace kernelcore with Movable
    mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored
    kernel/kmod.c: check for NULL in call_usermodehelper_exec()
    ipc/sem.c: synchronize the proc interface
    ipc/sem.c: optimize sem_lock()
    ipc/sem.c: fix race in sem_lock()
    mm/compaction.c: periodically schedule when freeing pages
    ...

    Linus Torvalds
     
    Many NILFS2 users have reported strange file system corruption,
    for example:

    NILFS: bad btree node (blocknr=185027): level = 0, flags = 0x0, nchildren = 768
    NILFS error (device sda4): nilfs_bmap_last_key: broken bmap (inode number=11540)

    But such error messages are the consequence of a file system issue that
    takes place much earlier. Fortunately, Jerome Poulin and Anton Eliasson
    recently reported another issue. Their reports describe a crash of the
    segctor thread:

    BUG: unable to handle kernel paging request at 0000000000004c83
    IP: nilfs_end_page_io+0x12/0xd0 [nilfs2]

    Call Trace:
    nilfs_segctor_do_construct+0xf25/0x1b20 [nilfs2]
    nilfs_segctor_construct+0x17b/0x290 [nilfs2]
    nilfs_segctor_thread+0x122/0x3b0 [nilfs2]
    kthread+0xc0/0xd0
    ret_from_fork+0x7c/0xb0

    These two issues have a single cause. The same cause can also raise a
    third issue, in which the segctor thread hangs while consuming 100% CPU.

    REPRODUCING PATH:

    One possible way of reproducing the issue was described by
    Jerome Poulin:

    1. init S to get to single user mode.
    2. sysrq+E to make sure only my shell is running
    3. start network-manager to get my wifi connection up
    4. login as root and launch "screen"
    5. cd /boot/log/nilfs which is a ext3 mount point and can log when NILFS dies.
    6. lscp | xz -9e > lscp.txt.xz
    7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs
    8. start a screen to dump /proc/kmsg to text file since rsyslog is killed
    9. start a screen and launch strace -f -o find-cat.log -t find
    /mnt/nilfs -type f -exec cat {} > /dev/null \;
    10. start a screen and launch strace -f -o apt-get.log -t apt-get update
    11. launch the last command again as it did not crash the first time
    12. apt-get crashes
    13. ps aux > ps-aux-crashed.log
    14. sysrq+W
    15. sysrq+E wait for everything to terminate
    16. sysrq+SUSB

    A simplified way of reproducing the issue is to start a kernel
    compilation task and "apt-get update" in parallel.

    REPRODUCIBILITY:

    The issue does not reproduce reliably [60% - 80%]. It is very important
    to have the proper environment for reproducing it. The critical
    conditions for successful reproduction are:

    (1) There should be a big file modified by means of mmap().

    (2) From time to time during processing, the dirty blocks of this file
    should amount to more than several segments in size (for example, two
    or three).

    (3) There should be intensive background file-modification activity
    in another thread.

    INVESTIGATION:

    First of all, it is possible to see that the cause of the crash is an
    invalid page address:

    NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
    NILFS [nilfs_segctor_complete_write]:2101 segbuf->sb_segnum 6783

    Moreover, the value of b_page (0x1a82) is 6786 in decimal, which looks
    like a segment number, and the b_blocknr and b_size values look like
    block numbers. So the buffer_head pointer points to an improper address.

    Detailed investigation of the issue revealed the following picture:

    [-----------------------------SEGMENT 6783-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111149024, segbuf->sb_segnum 6783

    [-----------------------------SEGMENT 6784-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff8802174a6798, bh->b_assoc_buffers.prev ffff880221cffee8
    NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6784
    NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111150080, segbuf->sb_segnum 6784, segbuf->sb_nbio 0
    [----------] ditto
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111164416, segbuf->sb_segnum 6784, segbuf->sb_nbio 15

    [-----------------------------SEGMENT 6785-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff880219277e80, bh->b_assoc_buffers.prev ffff880221cffc88
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6785
    NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111165440, segbuf->sb_segnum 6785, segbuf->sb_nbio 0
    [----------] ditto
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111177728, segbuf->sb_segnum 6785, segbuf->sb_nbio 12

    NILFS [nilfs_segctor_do_construct]:2399 nilfs_segctor_wait
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6783
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6784
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6785

    NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82

    BUG: unable to handle kernel paging request at 0000000000001a82
    IP: [] nilfs_end_page_io+0x12/0xd0 [nilfs2]

    Usually, for every segment we collect dirty files in a list. Then, dirty
    blocks are gathered for every dirty file, prepared for write and
    submitted by means of a nilfs_segbuf_submit_bh() call. Finally, the
    complete write phase takes place after nilfs_end_bio_write() is called
    on the block layer. In this final phase, buffers/pages are marked as not
    dirty and processed files are removed from the list of dirty files.

    It is possible to see that we had three prepare_write and submit_bio
    phases before the segbuf_wait and complete_write phases. Moreover,
    segments compete with each other for dirty blocks because on every
    iteration of segment processing, dirty buffer_heads are added to several
    lists of payload_buffers:

    [SEGMENT 6784]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
    [SEGMENT 6785]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8

    The next pointer is the same but the prev pointer has changed. It means
    that the buffer_head has a next pointer from one list but a prev pointer
    from another. Such modification can happen several times and, finally,
    it can result in various issues: (1) segctor hanging, (2) segctor
    crashing, (3) file system metadata corruption.

    FIX:
    This patch adds:

    (1) setting of BH_Async_Write flag in nilfs_segctor_prepare_write()
    for every processed dirty block;

    (2) checking of BH_Async_Write flag in
    nilfs_lookup_dirty_data_buffers() and
    nilfs_lookup_dirty_node_buffers();

    (3) clearing of BH_Async_Write flag in nilfs_segctor_complete_write(),
    nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page().

    Reported-by: Jerome Poulin
    Reported-by: Anton Eliasson
    Cc: Paul Fertser
    Cc: ARAI Shun-ichi
    Cc: Piotr Szymaniak
    Cc: Juan Barry Manuel Canham
    Cc: Zahid Chowdhury
    Cc: Elmer Zhang
    Cc: Kenneth Langga
    Signed-off-by: Vyacheslav Dubeyko
    Acked-by: Ryusuke Konishi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
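
    The set/check/clear protocol listed under FIX can be sketched on a
    miniature buffer model. This is a sketch under stated assumptions,
    not the nilfs2 code: struct buf and the three functions are
    hypothetical, and a plain bool plays the role of BH_Async_Write.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical miniature of a buffer_head. */
    struct buf {
        bool dirty;
        bool async_write;  /* plays the role of BH_Async_Write */
    };

    /* Collect phase: skip dirty buffers already claimed by another segment. */
    static size_t collect_dirty(struct buf *bufs, size_t n, struct buf **out)
    {
        size_t picked = 0;

        assert(bufs && out);
        for (size_t i = 0; i < n; i++) {
            if (!bufs[i].dirty || bufs[i].async_write)
                continue;
            out[picked++] = &bufs[i];
        }
        return picked;
    }

    /* Prepare-write phase: mark every collected buffer as in flight. */
    static void prepare_write(struct buf **list, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            list[i]->async_write = true;
    }

    /* Complete-write phase: the buffer is clean and can be collected again. */
    static void complete_write(struct buf **list, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            list[i]->dirty = false;
            list[i]->async_write = false;
        }
    }
    ```

    With the flag set during prepare_write, a second segment that runs
    its collect phase before the first write completes finds nothing to
    claim, so no buffer ends up on two payload lists at once.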
     
  • A high setting of max_map_count, and a process core-dumping with a large
    enough vm_map_count could result in an NT_FILE note not being written,
    and the kernel crashing immediately later because it has assumed
    otherwise.

    Reproduction of the oops-causing bug described here:

    https://lkml.org/lkml/2013/8/30/50

    The issue originated in commit 2aa362c49c31 ("coredump: extend core dump
    note section to contain file names of mapped file") from Oct 4, 2012.

    This patch makes that section optional in that case. fill_files_note()
    should signify the error, and also let the info struct in
    elf_core_dump() be zero-initialized so that we can check for the
    optionally written note.

    [akpm@linux-foundation.org: avoid abusing E2BIG, remove a couple of not-really-needed local variables]
    [akpm@linux-foundation.org: fix sparse warning]
    Signed-off-by: Dan Aloni
    Cc: Al Viro
    Cc: Denys Vlasenko
    Reported-by: Martin MOKREJS
    Tested-by: Martin MOKREJS
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Aloni
     

30 Sep, 2013

2 commits