04 Jun, 2011

2 commits

  • * 'for-linus' of git://git.kernel.dk/linux-block:
    block: Use hlist_entry() for io_context.cic_list.first
    cfq-iosched: Remove bogus check in queue_fail path
    xen/blkback: potential null dereference in error handling
    xen/blkback: don't call vbd_size() if bd_disk is NULL
    block: blkdev_get() should access ->bd_disk only after success
    CFQ: Fix typo and remove unnecessary semicolon
    block: remove unwanted semicolons
    Revert "block: Remove extra discard_alignment from hd_struct."
    nbd: adjust 'max_part' according to part_shift
    nbd: limit module parameters to a sane value
    nbd: pass MSG_* flags to kernel_recvmsg()
    block: improve the bio_add_page() and bio_add_pc_page() descriptions

    Linus Torvalds
     
  • * 'linux-next' of git://git.infradead.org/ubifs-2.6:
    UBIFS: fix-up free space earlier
    UBIFS: intialize LPT earlier
    UBIFS: assert no fixup when writing a node
    UBIFS: fix clean znode counter corruption in error cases
    UBIFS: fix memory leak on error path
    UBIFS: fix shrinker object count reports
    UBIFS: fix recovery broken by the previous recovery fix
    UBIFS: amend ubifs_recover_leb interface
    UBIFS: introduce a "grouped" journal head flag
    UBIFS: supress false error messages

    Linus Torvalds
     

03 Jun, 2011

6 commits

  • The free space fixup is currently initiated during mount after the call to
    ubifs_write_master() which results in a write to PEBs; this has been observed
    with the patch 'assert no fixup when writing a node' applied:

    Move the free space fixup on mount to before the calls to
    ubifs_recover_inl_heads() and ubifs_write_master(). This results in no
    assertions with the previously mentioned patch applied.

    Artem: tweaked the patch a bit

    Signed-off-by: Ben Gardiner
    Reviewed-by: Matthew L. Creech
    Signed-off-by: Artem Bityutskiy

    Ben Gardiner
     
  • The current 'mount_ubifs()' implementation does not initialize the LPT until
    the master node is marked dirty. Move the LPT initialization to before marking
    the master node dirty. This is a preparation for the next patch which will move
    the free-space-fixup check to before marking the master node dirty, because we
    have to fix-up the free space before doing any writes.

    Artem: massaged the patch and commit message.

    Signed-off-by: Ben Gardiner
    Reviewed-by: Matthew L. Creech
    Signed-off-by: Artem Bityutskiy

    Ben Gardiner
     
  • The current free space fixup can result in some writing to the UBI volume
    when the space_fixup flag is set.

    To catch instances where UBIFS is writing to the NAND while the space_fixup
    flag is set, add an assert to ubifs_write_node().

    Artem: tweaked the patch, added similar assertion to the write buffer
    write path.

    Signed-off-by: Ben Gardiner
    Reviewed-by: Matthew L. Creech
    Signed-off-by: Artem Bityutskiy

    Ben Gardiner
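
The assertion described above can be pictured with a minimal sketch. The `fs_info` structure, its fields, and `write_node()` are hypothetical stand-ins rather than the real UBIFS definitions; the only point is that the write path checks the fixup flag before touching the medium.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-filesystem state. */
struct fs_info {
    int space_fixup;       /* non-zero while free space fixup is pending */
    size_t bytes_written;  /* illustration only: counts "written" bytes */
};

/* Sketch of a node write path that refuses to run before the fixup. */
static int write_node(struct fs_info *c, const void *buf, size_t len)
{
    (void)buf;
    /* Catch any write issued while the space_fixup flag is still set. */
    assert(!c->space_fixup);
    c->bytes_written += len;
    return 0;
}
```

With `space_fixup` cleared the write proceeds as usual; with it set, the assertion fires and points directly at the offending caller.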
     
  • UBIFS maintains per-filesystem and global clean znode counters
    ('c->clean_zn_cnt' and 'ubifs_clean_zn_cnt'). It is important to maintain
    correct values there since the shrinker relies on 'ubifs_clean_zn_cnt'.

    However, in case of failures during commit the counters were corrupted. E.g.,
    if a failure happens in the middle of 'write_index()', then some nodes in the
    commit list ('c->cnext') are marked as clean, and some are marked as dirty.
    'ubifs_destroy_tnc_subtree()' then does not return the correct count of freed
    znodes, and we end up with a non-zero 'c->clean_zn_cnt' when unmounting. This
    means that if we have 2 file-systems and one of them fails, and we unmount it,
    'ubifs_clean_zn_cnt' stays incorrect and confuses the shrinker.

    Signed-off-by: Artem Bityutskiy

    Artem Bityutskiy
     
  • UBIFS leaks memory on error path in 'ubifs_jnl_update()' in case of write
    failure because it forgets to free the 'struct ubifs_dent_node *dent' object.
    Although the object is small, the alignment can make it large - e.g., 2KiB
    if the min. I/O unit is 2KiB.

    Signed-off-by: Artem Bityutskiy
    Cc: stable@kernel.org

    Artem Bityutskiy
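
Two points in the message above lend themselves to a small sketch: how alignment inflates a small allocation, and how a single exit label keeps the error path from leaking. The names `align_up()` and `journal_update()` and the `write_fails` switch are hypothetical; this is not the UBIFS code itself.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Round len up to the next multiple of unit (unit must be non-zero). */
static size_t align_up(size_t len, size_t unit)
{
    return (len + unit - 1) / unit * unit;
}

/* Sketch: allocate an aligned buffer, "write" it, and always free it. */
static int journal_update(size_t dent_len, size_t min_io, int write_fails)
{
    int err = 0;
    void *dent = malloc(align_up(dent_len, min_io));

    if (!dent)
        return -ENOMEM;

    if (write_fails) {
        err = -EIO;      /* simulated write failure */
        goto out_free;   /* the error path must not skip the free */
    }

    /* ... successful write would happen here ... */

out_free:
    free(dent);
    return err;
}
```

For example, a 100-byte object with a 2048-byte min. I/O unit occupies `align_up(100, 2048)` bytes, that is 2048, which is why a leak of a "small" object is not small.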
     
  • Sometimes the VM asks the shrinker to return the number of objects it can shrink,
    and we return 'ubifs_clean_zn_cnt' in that case. However, it is possible
    that this counter is negative for a short period of time, due to the way
    the UBIFS TNC code updates it. I can sometimes observe the following warning:

    shrink_slab: ubifs_shrinker+0x0/0x2b7 [ubifs] negative objects to delete nr=-8541616642706119788

    This patch makes sure UBIFS never returns a negative count of objects.

    Signed-off-by: Artem Bityutskiy
    Cc: stable@kernel.org

    Artem Bityutskiy
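
A minimal sketch of the idea above: the value reported to the VM is clamped so that a transiently negative counter never escapes. The function name is hypothetical, and clamping to zero is an assumption made for illustration; the message only states that the result is never negative.

```c
/*
 * Sketch: report how many objects could be shrunk.
 * clean_cnt models a counter that may be briefly negative.
 * Assumption: a negative reading is reported as 0.
 */
static long shrinker_object_count(long clean_cnt)
{
    return clean_cnt >= 0 ? clean_cnt : 0;
}
```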
     

01 Jun, 2011

5 commits

  • Unfortunately, the recovery fix d1606a59b6be4ea392eabd40d1250aa1eeb19efb
    (UBIFS: fix extremely rare mount failure) broke recovery. That commit makes
    UBIFS drop the last min. I/O unit in all journal heads, but this is needed only
    for the GC head, and it does not work for non-GC heads. For example,
    suppose we have min. I/O units A and B, and A contains a valid node X, which
    was fsynced, and then a group of nodes Y which spans the rest of A and B. In
    this case we'll drop not only Y, but also X, which is obviously incorrect.

    This patch fixes the issue and additionally makes recovery drop the last min.
    I/O unit only for the GC head, and leaves things as they have been for ages for
    the other heads - this is safer.

    Signed-off-by: Artem Bityutskiy

    Artem Bityutskiy
     
  • Instead of passing "grouped" parameter to 'ubifs_recover_leb()' which tells
    whether the nodes are grouped in the LEB to recover, pass the journal head
    number and let 'ubifs_recover_leb()' look at the journal head's 'grouped' flag.

    This patch is a preparation for a further fix where we'll need to know the
    journal head number for other purposes.

    Signed-off-by: Artem Bityutskiy

    Artem Bityutskiy
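
The interface change above can be sketched as follows. The head numbers, structures, and function names are hypothetical and chosen only for illustration: the caller passes a journal head number, and the callee derives the "grouped" property from per-head state instead of receiving it as a separate argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical journal head numbers. */
enum { HEAD_GC = 0, HEAD_BASE = 1, HEAD_DATA = 2, HEAD_CNT = 3 };

struct jhead {
    bool grouped;  /* true if nodes in this head are written in groups */
};

struct fs_info {
    struct jhead jheads[HEAD_CNT];
};

/* Normal heads receive grouped nodes; the GC head does not. */
static void init_heads(struct fs_info *c)
{
    for (int i = 0; i < HEAD_CNT; i++)
        c->jheads[i].grouped = (i != HEAD_GC);
}

/* The callee looks up the flag itself, given only the head number. */
static bool recover_expects_groups(const struct fs_info *c, int jhead)
{
    assert(jhead >= 0 && jhead < HEAD_CNT);
    return c->jheads[jhead].grouped;
}
```

Passing the head number rather than a boolean also leaves room for later head-specific decisions without another signature change.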
     
  • Journal heads differ in how UBIFS writes nodes to them. All normal
    journal heads receive grouped nodes, while the GC journal head receives
    ungrouped nodes. This patch adds a 'grouped' flag to 'struct ubifs_jhead' which
    describes this property.

    This patch is a preparation to a further recovery fix.

    Signed-off-by: Artem Bityutskiy

    Artem Bityutskiy
     
  • Commit ab51afe05273741f72383529ef488aa1ea598ec6 was a good clean-up, but
    it introduced a regression - now UBIFS prints scary error messages during
    recovery on all corrupted nodes, even though the corruptions are expected
    (due to a power cut). This patch fixes the issue.

    Additionally fix a typo in a comment introduced by the same commit.

    Signed-off-by: Artem Bityutskiy

    Artem Bityutskiy
     
  • d4dc210f69 (block: don't block events on excl write for non-optical
    devices) added dereferencing of bdev->bd_disk to test
    GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; however, bdev->bd_disk can be
    %NULL if open failed which can lead to an oops.

    Test the flag after testing open was successful, not before.

    Signed-off-by: Tejun Heo
    Reported-by: David Miller
    Tested-by: David Miller
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
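
The ordering fix above can be sketched in a few lines. All types, the flag value, and the function names here are hypothetical stand-ins for the block layer structures: the pointer that may be NULL after a failed open is only dereferenced once the open is known to have succeeded.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the real flag bit. */
#define FL_BLOCK_EVENTS_ON_EXCL_WRITE 0x1u

struct disk { unsigned int flags; };
struct bdev { struct disk *bd_disk; };

/* Simulated open: fails (leaving bd_disk NULL) when no disk is given. */
static int do_open(struct bdev *b, struct disk *d)
{
    if (!d) {
        b->bd_disk = NULL;
        return -ENXIO;
    }
    b->bd_disk = d;
    return 0;
}

/* Check the result first, then look at bd_disk->flags. */
static int get_bdev(struct bdev *b, struct disk *d, int *events_blocked)
{
    int res = do_open(b, d);

    *events_blocked = 0;
    if (res)
        return res;  /* bd_disk may be NULL here: do not touch it */

    if (b->bd_disk->flags & FL_BLOCK_EVENTS_ON_EXCL_WRITE)
        *events_blocked = 1;
    return 0;
}
```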
     

30 May, 2011

27 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • The dentry_unhash push-down series missed that shrink_dcache_parent needs to
    be called prior to rmdir or dir rename to clear DCACHE_REFERENCED and
    allow efficient dentry reclaim.

    Reported-by: Dave Chinner
    Signed-off-by: Sage Weil
    Signed-off-by: Al Viro

    Sage Weil
     
  • It was not a good idea to start dereferencing disk->queue from
    the fs sysfs strategy for displaying discard alignment. We first ran
    into a NULL pointer deref, and after fixing that we sometimes
    see invalid disk->queue pointer values.

    Since discard is the only one of the bunch actually looking into
    the queue, just revert the change.

    This reverts commit 23ceb5b7719e9276d4fa72a3ecf94dd396755276.

    Conflicts:
    fs/partitions/check.c

    Jens Axboe
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
    eCryptfs: Remove ecryptfs_header_cache_2
    eCryptfs: Cleanup and optimize ecryptfs_lookup_interpose()
    eCryptfs: Return useful code from contains_ecryptfs_marker
    eCryptfs: Fix new inode race condition
    eCryptfs: Cleanup inode initialization code
    eCryptfs: Consolidate inode functions into inode.c

    Linus Torvalds
     
  • * 'pnfs-submit' of git://git.open-osd.org/linux-open-osd: (32 commits)
    pnfs-obj: pg_test check for max_io_size
    NFSv4.1: define nfs_generic_pg_test
    NFSv4.1: use pnfs_generic_pg_test directly by layout driver
    NFSv4.1: change pg_test return type to bool
    NFSv4.1: unify pnfs_pageio_init functions
    pnfs-obj: objlayout_encode_layoutcommit implementation
    pnfs: encode_layoutcommit
    pnfs-obj: report errors and .encode_layoutreturn Implementation.
    pnfs: encode_layoutreturn
    pnfs: layoutret_on_setattr
    pnfs: layoutreturn
    pnfs-obj: osd raid engine read/write implementation
    pnfs: support for non-rpc layout drivers
    pnfs-obj: define per-inode private structure
    pnfs: alloc and free layout_hdr layoutdriver methods
    pnfs-obj: objio_osd device information retrieval and caching
    pnfs-obj: decode layout, alloc/free lseg
    pnfs-obj: pnfs_osd XDR client implementation
    pnfs-obj: pnfs_osd XDR definitions
    pnfs-obj: objlayoutdriver module skeleton
    ...

    Linus Torvalds
     
  • Now that ecryptfs_lookup_interpose() is no longer using
    ecryptfs_header_cache_2 to read in metadata, the kmem_cache can be
    removed and the ecryptfs_header_cache_1 kmem_cache can be renamed to
    ecryptfs_header_cache.

    Signed-off-by: Tyler Hicks

    Tyler Hicks
     
  • ecryptfs_lookup_interpose() has turned into spaghetti code over the
    years. This is an effort to clean it up.

    - Shorten overly descriptive variable names such as ecryptfs_dentry
    - Simplify gotos and error paths
    - Create helper function for reading plaintext i_size from metadata

    It also includes an optimization when reading i_size from the metadata.
    A complete page-sized kmem_cache_alloc() was being done to read in 16
    bytes of metadata. The buffer for that is now statically declared.

    Signed-off-by: Tyler Hicks

    Tyler Hicks
     
  • Instead of having the calling functions translate the true/false return
    code to either 0 or -EINVAL, have contains_ecryptfs_marker() return 0 or
    -EINVAL so that the calling functions can just reuse the return code.

    Also, rename the function to ecryptfs_validate_marker() to avoid callers
    mistakenly thinking that it returns true/false codes.

    Signed-off-by: Tyler Hicks

    Tyler Hicks
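
A sketch of the return-code convention described above. The magic constant, the 8-byte marker layout, and the function names are assumptions made for illustration, not the eCryptfs definitions: the validator returns 0 or -EINVAL so callers can pass its result straight through.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAGIC 0x12345678u  /* hypothetical marker constant */

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 if the marker is valid, -EINVAL otherwise (not true/false). */
static int validate_marker(const unsigned char *data)
{
    uint32_t m1 = load_be32(data);
    uint32_t m2 = load_be32(data + 4);

    return ((m1 ^ SKETCH_MAGIC) == m2) ? 0 : -EINVAL;
}

/* The caller reuses the code instead of translating a boolean. */
static int read_header(const unsigned char *data)
{
    int rc = validate_marker(data);

    if (rc)
        return rc;
    /* ... continue parsing the header ... */
    return 0;
}
```

The "validate" name signals an errno-style result, which is harder to misread as a boolean than a "contains" predicate.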
     
  • Only unlock and d_add() new inodes after the plaintext inode size has
    been read from the lower filesystem. This fixes a race condition that
    was sometimes seen during a multi-job kernel build in an eCryptfs mount.

    https://bugzilla.kernel.org/show_bug.cgi?id=36002

    Signed-off-by: Tyler Hicks
    Reported-by: David
    Tested-by: David

    Tyler Hicks
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
    arch/tile: more /proc and /sys file support

    Linus Torvalds
     
  • * 'for-2.6.40' of git://linux-nfs.org/~bfields/linux: (22 commits)
    nfsd: make local functions static
    NFSD: Remove unused variable from nfsd4_decode_bind_conn_to_session()
    NFSD: Check status from nfsd4_map_bcts_dir()
    NFSD: Remove setting unused variable in nfsd_vfs_read()
    nfsd41: error out on repeated RECLAIM_COMPLETE
    nfsd41: compare request's opcnt with session's maxops at nfsd4_sequence
    nfsd v4.1 lOCKT clientid field must be ignored
    nfsd41: add flag checking for create_session
    nfsd41: make sure nfs server process OPEN with EXCLUSIVE4_1 correctly
    nfsd4: fix wrongsec handling for PUTFH + op cases
    nfsd4: make fh_verify responsibility of nfsd_lookup_dentry caller
    nfsd4: introduce OPDESC helper
    nfsd4: allow fh_verify caller to skip pseudoflavor checks
    nfsd: distinguish functions of NFSD_MAY_* flags
    svcrpc: complete svsk processing on cb receive failure
    svcrpc: take advantage of tcp autotuning
    SUNRPC: Don't wait for full record to receive tcp data
    svcrpc: copy cb reply instead of pages
    svcrpc: close connection if client sends short packet
    svcrpc: note network-order types in svc_process_calldir
    ...

    Linus Torvalds
     
  • * 'nfs-for-2.6.40' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
    SUNRPC: Support for RPC over AF_LOCAL transports
    SUNRPC: Remove obsolete comment
    SUNRPC: Use AF_LOCAL for rpcbind upcalls
    SUNRPC: Clean up use of curly braces in switch cases
    NFS: Revert NFSROOT default mount options
    SUNRPC: Rename xs_encode_tcp_fragment_header()
    nfs,rcu: convert call_rcu(nfs_free_delegation_callback) to kfree_rcu()
    nfs41: Correct offset for LAYOUTCOMMIT
    NFS: nfs_update_inode: print current and new inode size in debug output
    NFSv4.1: Fix the handling of NFS4ERR_SEQ_MISORDERED errors
    NFSv4: Handle expired stateids when the lease is still valid
    SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
    Squashfs: Fix sanity check patches on big-endian systems

    Linus Torvalds
     
  • Commit 1495f230fa77 ("vmscan: change shrinker API by passing
    shrink_control struct") changed the API of ->shrink(), but missed ubifs
    and cifs instances.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Implement pg_test vector to test for max IO sizes. We calculate
    the max_io_size member only once and cache it in the lseg, so that it
    is not recomputed on every page insert.

    Signed-off-by: Boaz Harrosh
    [simplify logic]
    Signed-off-by: Benny Halevy

    Boaz Harrosh
     
  • By default, unless pnfs is used, coalesce pages until pg_bsize
    (rsize or wsize) is reached.

    pnfs layout drivers define their own pg_test methods that use
    pnfs_generic_pg_test and need to define their own I/O size
    limits (e.g. based on the file stripe size).

    [Move a check from nfs_pageio_do_add_request to nfs_generic_pg_test]
    Signed-off-by: Boaz Harrosh
    Signed-off-by: Benny Halevy

    Boaz Harrosh
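
The default coalescing rule above can be sketched as a size test applied before each request is added. The descriptor layout and function names are simplified, hypothetical stand-ins for the NFS page I/O structures; a layout driver would substitute its own limit (for example one derived from the stripe size) in place of `pg_bsize`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical page I/O descriptor. */
struct pageio_desc {
    size_t pg_count;  /* bytes coalesced so far */
    size_t pg_bsize;  /* upper bound (rsize or wsize) */
};

/* Generic test: may req_bytes more be coalesced into this I/O? */
static bool generic_pg_test(const struct pageio_desc *desc, size_t req_bytes)
{
    return desc->pg_count + req_bytes <= desc->pg_bsize;
}

/* Add a request only if the test allows it. */
static bool add_request(struct pageio_desc *desc, size_t req_bytes)
{
    if (!generic_pg_test(desc, req_bytes))
        return false;  /* caller would flush and start a new I/O */
    desc->pg_count += req_bytes;
    return true;
}
```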
     
  • Signed-off-by: Benny Halevy

    Benny Halevy
     
  • Signed-off-by: Benny Halevy

    Benny Halevy
     
  • Use common code for pnfs_pageio_init_{read,write} and use
    a common generic pg_test function.

    Note that this function always assumes that the layout driver's
    pg_test method is implemented.

    [Fix BUG]
    Signed-off-by: Boaz Harrosh
    Signed-off-by: Benny Halevy

    Benny Halevy
     
  • * Define API for io-engines to report delta_space_used in IOs
    * Encode the osd-layout specific information of the layoutcommit
    XDR buffer.

    Signed-off-by: Boaz Harrosh
    Signed-off-by: Benny Halevy

    Boaz Harrosh
     
  • Add a layout driver method to encode the layout type specific
    opaque part of layout commit in-line in the xdr stream.

    Currently, the pnfs-objects layout driver uses it to encode metadata hints
    to the MDS and the blocks layout driver to commit provisionally allocated
    extents to the file.

    Signed-off-by: Benny Halevy

    Benny Halevy
     
  • An io_state pre-allocates an error information structure for each
    possible osd-device that might error during IO. When the IO is done and all
    went well, the io_state is freed (as today). If the I/O has ended with an
    error, the io_state is queued on a per-layout err_list. When eventually
    encode_layoutreturn() is called, each error is properly encoded on the
    XDR buffer and only then the io_state is removed from err_list and
    de-allocated.

    It is up to the io_engine to fill in the segment that faulted and the type
    of osd_error that occurred, by calling objlayout_io_set_result() for
    each failing device.

    In objio_osd:
    * Allocate io-error descriptors space as part of io_state
    * Use generic objlayout error reporting at end of io.

    Signed-off-by: Boaz Harrosh
    Signed-off-by: Benny Halevy

    Boaz Harrosh
     
  • Add a layout driver method to encode the layout type specific
    opaque part of layout return in-line in the xdr stream.

    Currently the pnfs-objects layout driver uses it to encode i/o error
    information on LAYOUTRETURN.

    Signed-off-by: Andy Adamson
    [fixup layout header pointer for encode_layoutreturn]
    Signed-off-by: Benny Halevy

    Andy Adamson
     
  • With the objects layout security model, we have object capabilities
    that are associated with the layout and we anticipate that the server
    will issue a cb_layoutrecall for any setattr that changes security
    related attributes (user/group/mode/acl) or truncates the file.

    Therefore, the layout is returned before issuing the setattr to avoid
    the anticipated cb_layoutrecall.

    Signed-off-by: Benny Halevy

    Benny Halevy
     
  • NFSv4.1 LAYOUTRETURN implementation

    Currently, does not support layout-type payload encoding.

    Signed-off-by: Alexandros Batsakis
    Signed-off-by: Andy Adamson
    Signed-off-by: Andy Adamson
    Signed-off-by: Dean Hildebrand
    Signed-off-by: Fred Isaman
    Signed-off-by: Fred Isaman
    Signed-off-by: Marc Eshel
    Signed-off-by: Zhang Jingwang
    [call pnfs_return_layout right before pnfs_destroy_layout]
    [remove assert_spin_locked from pnfs_clear_lseg_list]
    [remove wait parameter from the layoutreturn path.]
    [remove return_type field from nfs4_layoutreturn_args]
    [remove range from nfs4_layoutreturn_args]
    [no need to send layoutcommit from _pnfs_return_layout]
    [don't wait on sync layoutreturn]
    [fix layout stateid in layoutreturn args]
    [fixed NULL deref in _pnfs_return_layout]
    [removed reclaim member of nfs4_layoutreturn_args]
    Signed-off-by: Benny Halevy

    Benny Halevy
     
  • Using the in-kernel osd library, implement read/write
    of data from/to osd-objects according to information specified
    in the objects-layout.

    Support for striping over mirrors with a received stripe_unit.
    There are however a few constraints:
    1. Stripe unit must be a multiple of PAGE_SIZE
    2. Stripe length (stripe_unit * number_of_stripes) cannot be
    bigger than 32 bits.

    Also support raid-groups and partial-layout. Partial-layout is
    when not all the groups are received on the line, addressing
    only a partial range of the file.

    TODO:
    Only raid0! raid 4/5/6 support will come at a later stage

    An unsupported layout will send IO through the MDS

    [Important fallout from the last rebase]
    Signed-off-by: Boaz Harrosh
    [gfp_flags]
    Signed-off-by: Benny Halevy

    Boaz Harrosh
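
The two striping constraints listed above can be expressed as a small validity check. The page size constant and the function name are assumptions for illustration only; a layout that fails the check would, per the message, fall back to I/O through the MDS.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/*
 * Returns true if the striping parameters satisfy both constraints:
 * 1. stripe_unit is a non-zero multiple of the page size;
 * 2. stripe_unit * nstripes fits in 32 bits.
 */
static bool layout_supported(uint64_t stripe_unit, uint64_t nstripes)
{
    if (stripe_unit == 0 || stripe_unit % SKETCH_PAGE_SIZE != 0)
        return false;
    if (nstripes == 0)
        return false;
    /* Overflow-safe form of: stripe_unit * nstripes <= UINT32_MAX */
    if (stripe_unit > UINT32_MAX / nstripes)
        return false;
    return true;
}
```

The division form avoids computing a product that could itself overflow before it is compared.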
     
  • Non-rpc layout drivers, such as those for objects and blocks,
    implement their own I/O path and error handling logic.
    Therefore bypass NFS-based error handling for these layout drivers.

    [fix lseg ref-count bugs, and null de-refs]
    [Fall out from: non-rpc layout drivers]
    Signed-off-by: Boaz Harrosh
    [get rid of PNFS_USE_RPC_CODE]
    [get rid of __nfs4_write_done_cb]
    [revert useless change in nfs4_write_done_cb]
    Signed-off-by: Benny Halevy

    Benny Halevy