Commit 12debc4248a4a7f1873e47cda2cdd7faca80b099

Authored by David Howells
Committed by Linus Torvalds
1 parent 755aedc159

iget: remove iget() and the read_inode() super op as being obsolete

Remove the old iget() call and the read_inode() superblock operation it uses,
as these are now obsolete: read_inode() does not allow for proper error
handling (there is no way to distinguish ENOMEM from EIO when marking an
inode bad).

Furthermore, this removes the temptation for code outside a filesystem to use
iget() to look up an inode in that filesystem by number.

iget_locked() should be used instead.  A new function, iget_failed(), was
added in an earlier patch; it should be called to mark an inode as bad,
unlock it and release it should the get routine fail.  iget() and
read_inode() are marked as being obsolete, and references to them are
removed from the documentation.
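
For reference, the iget_failed() helper introduced by that earlier patch
behaves roughly as sketched below (the real implementation lives in
fs/inode.c):

	/* Mark an under-construction inode as bad, unlock it and drop the
	 * reference - roughly what the iget_failed() helper does. */
	void iget_failed(struct inode *inode)
	{
		make_bad_inode(inode);
		unlock_new_inode(inode);
		iput(inode);
	}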

Typically a filesystem will be modified such that the read_inode function
becomes an internal iget function, for example the following:

	void thingyfs_read_inode(struct inode *inode)
	{
		...
	}

would be changed into something like:

	struct inode *thingyfs_iget(struct super_block *sb, unsigned long ino)
	{
		struct inode *inode;
		int ret;

		inode = iget_locked(sb, ino);
		if (!inode)
			return ERR_PTR(-ENOMEM);
		if (!(inode->i_state & I_NEW))
			return inode;

		/* ... read and check the on-disk inode here; on failure,
		 * set ret appropriately and goto error ... */
		unlock_new_inode(inode);
		return inode;
	error:
		iget_failed(inode);
		return ERR_PTR(ret);
	}

and then thingyfs_iget() would be called rather than iget(), for example:

	ret = -EINVAL;
	inode = iget(sb, ino);
	if (!inode || is_bad_inode(inode))
		goto error;

becomes:

	inode = thingyfs_iget(sb, ino);
	if (IS_ERR(inode)) {
		ret = PTR_ERR(inode);
		goto error;
	}

Note that is_bad_inode() does not need to be called.  The error returned by
thingyfs_iget() should render it unnecessary.
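
As a further illustration, a directory lookup built on the new helper might
look like the following minimal sketch.  thingyfs_find_entry() is a made-up
helper that returns the inode number for a name (or 0 if the name is absent)
and is not part of this patch:

	static struct dentry *thingyfs_lookup(struct inode *dir,
					      struct dentry *dentry,
					      struct nameidata *nd)
	{
		struct inode *inode = NULL;
		unsigned long ino;

		/* hypothetical directory search; 0 means "no such entry" */
		ino = thingyfs_find_entry(dir, &dentry->d_name);
		if (ino) {
			inode = thingyfs_iget(dir->i_sb, ino);
			if (IS_ERR(inode))
				return ERR_PTR(PTR_ERR(inode));
			/* no is_bad_inode() check is needed here */
		}
		return d_splice_alias(inode, dentry);
	}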

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 5 changed files with 9 additions and 41 deletions

Documentation/filesystems/Locking
@@ -90,7 +90,6 @@
 prototypes:
 	struct inode *(*alloc_inode)(struct super_block *sb);
 	void (*destroy_inode)(struct inode *);
-	void (*read_inode) (struct inode *);
 	void (*dirty_inode) (struct inode *);
 	int (*write_inode) (struct inode *, int);
 	void (*put_inode) (struct inode *);
@@ -114,7 +113,6 @@
 		BKL	s_lock	s_umount
 alloc_inode:		no	no	no
 destroy_inode:		no
-read_inode:		no				(see below)
 dirty_inode:		no				(must not sleep)
 write_inode:		no
 put_inode:		no
@@ -133,7 +131,6 @@
 quota_read:		no	no	no	(see below)
 quota_write:		no	no	no	(see below)
 
-	->read_inode() is not a method - it's a callback used in iget().
 	->remount_fs() will have the s_umount lock if it's already mounted.
 When called from get_sb_single, it does NOT have the s_umount lock.
 	->quota_read() and ->quota_write() functions are both guaranteed to

Documentation/filesystems/porting
1 Changes since 2.5.0: 1 Changes since 2.5.0:
2 2
3 --- 3 ---
4 [recommended] 4 [recommended]
5 5
6 New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(), 6 New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
7 sb_set_blocksize() and sb_min_blocksize(). 7 sb_set_blocksize() and sb_min_blocksize().
8 8
9 Use them. 9 Use them.
10 10
11 (sb_find_get_block() replaces 2.4's get_hash_table()) 11 (sb_find_get_block() replaces 2.4's get_hash_table())
12 12
13 --- 13 ---
14 [recommended] 14 [recommended]
15 15
16 New methods: ->alloc_inode() and ->destroy_inode(). 16 New methods: ->alloc_inode() and ->destroy_inode().
17 17
18 Remove inode->u.foo_inode_i 18 Remove inode->u.foo_inode_i
19 Declare 19 Declare
20 struct foo_inode_info { 20 struct foo_inode_info {
21 /* fs-private stuff */ 21 /* fs-private stuff */
22 struct inode vfs_inode; 22 struct inode vfs_inode;
23 }; 23 };
24 static inline struct foo_inode_info *FOO_I(struct inode *inode) 24 static inline struct foo_inode_info *FOO_I(struct inode *inode)
25 { 25 {
26 return list_entry(inode, struct foo_inode_info, vfs_inode); 26 return list_entry(inode, struct foo_inode_info, vfs_inode);
27 } 27 }
28 28
29 Use FOO_I(inode) instead of &inode->u.foo_inode_i; 29 Use FOO_I(inode) instead of &inode->u.foo_inode_i;
30 30
31 Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate 31 Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate
32 foo_inode_info and return the address of ->vfs_inode, the latter should free 32 foo_inode_info and return the address of ->vfs_inode, the latter should free
33 FOO_I(inode) (see in-tree filesystems for examples). 33 FOO_I(inode) (see in-tree filesystems for examples).
34 34
35 Make them ->alloc_inode and ->destroy_inode in your super_operations. 35 Make them ->alloc_inode and ->destroy_inode in your super_operations.
36 36
37 Keep in mind that now you need explicit initialization of private data - 37 Keep in mind that now you need explicit initialization of private data
38 typically in ->read_inode() and after getting an inode from new_inode(). 38 typically between calling iget_locked() and unlocking the inode.
39 39
40 At some point that will become mandatory. 40 At some point that will become mandatory.
41 41
42 --- 42 ---
43 [mandatory] 43 [mandatory]
44 44
45 Change of file_system_type method (->read_super to ->get_sb) 45 Change of file_system_type method (->read_super to ->get_sb)
46 46
47 ->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV. 47 ->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.
48 48
49 Turn your foo_read_super() into a function that would return 0 in case of 49 Turn your foo_read_super() into a function that would return 0 in case of
50 success and negative number in case of error (-EINVAL unless you have more 50 success and negative number in case of error (-EINVAL unless you have more
51 informative error value to report). Call it foo_fill_super(). Now declare 51 informative error value to report). Call it foo_fill_super(). Now declare
52 52
53 int foo_get_sb(struct file_system_type *fs_type, 53 int foo_get_sb(struct file_system_type *fs_type,
54 int flags, const char *dev_name, void *data, struct vfsmount *mnt) 54 int flags, const char *dev_name, void *data, struct vfsmount *mnt)
55 { 55 {
56 return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super, 56 return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
57 mnt); 57 mnt);
58 } 58 }
59 59
60 (or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of 60 (or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
61 filesystem). 61 filesystem).
62 62
63 Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as 63 Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as
64 foo_get_sb. 64 foo_get_sb.
65 65
66 --- 66 ---
67 [mandatory] 67 [mandatory]
68 68
69 Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames. 69 Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
70 Most likely there is no need to change anything, but if you relied on 70 Most likely there is no need to change anything, but if you relied on
71 global exclusion between renames for some internal purpose - you need to 71 global exclusion between renames for some internal purpose - you need to
72 change your internal locking. Otherwise exclusion warranties remain the 72 change your internal locking. Otherwise exclusion warranties remain the
73 same (i.e. parents and victim are locked, etc.). 73 same (i.e. parents and victim are locked, etc.).
74 74
75 --- 75 ---
76 [informational] 76 [informational]
77 77
78 Now we have the exclusion between ->lookup() and directory removal (by 78 Now we have the exclusion between ->lookup() and directory removal (by
79 ->rmdir() and ->rename()). If you used to need that exclusion and do 79 ->rmdir() and ->rename()). If you used to need that exclusion and do
80 it by internal locking (most of filesystems couldn't care less) - you 80 it by internal locking (most of filesystems couldn't care less) - you
81 can relax your locking. 81 can relax your locking.
82 82
83 --- 83 ---
84 [mandatory] 84 [mandatory]
85 85
86 ->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(), 86 ->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
87 ->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename() 87 ->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
88 and ->readdir() are called without BKL now. Grab it on entry, drop upon return 88 and ->readdir() are called without BKL now. Grab it on entry, drop upon return
89 - that will guarantee the same locking you used to have. If your method or its 89 - that will guarantee the same locking you used to have. If your method or its
90 parts do not need BKL - better yet, now you can shift lock_kernel() and 90 parts do not need BKL - better yet, now you can shift lock_kernel() and
91 unlock_kernel() so that they would protect exactly what needs to be 91 unlock_kernel() so that they would protect exactly what needs to be
92 protected. 92 protected.
93 93
94 --- 94 ---
95 [mandatory] 95 [mandatory]
96 96
97 BKL is also moved from around sb operations. ->write_super() Is now called 97 BKL is also moved from around sb operations. ->write_super() Is now called
98 without BKL held. BKL should have been shifted into individual fs sb_op 98 without BKL held. BKL should have been shifted into individual fs sb_op
99 functions. If you don't need it, remove it. 99 functions. If you don't need it, remove it.
100 100
101 --- 101 ---
102 [informational] 102 [informational]
103 103
104 check for ->link() target not being a directory is done by callers. Feel 104 check for ->link() target not being a directory is done by callers. Feel
105 free to drop it... 105 free to drop it...
106 106
107 --- 107 ---
108 [informational] 108 [informational]
109 109
110 ->link() callers hold ->i_mutex on the object we are linking to. Some of your 110 ->link() callers hold ->i_mutex on the object we are linking to. Some of your
111 problems might be over... 111 problems might be over...
112 112
113 --- 113 ---
114 [mandatory] 114 [mandatory]
115 115
116 new file_system_type method - kill_sb(superblock). If you are converting 116 new file_system_type method - kill_sb(superblock). If you are converting
117 an existing filesystem, set it according to ->fs_flags: 117 an existing filesystem, set it according to ->fs_flags:
118 FS_REQUIRES_DEV - kill_block_super 118 FS_REQUIRES_DEV - kill_block_super
119 FS_LITTER - kill_litter_super 119 FS_LITTER - kill_litter_super
120 neither - kill_anon_super 120 neither - kill_anon_super
121 FS_LITTER is gone - just remove it from fs_flags. 121 FS_LITTER is gone - just remove it from fs_flags.
122 122
123 --- 123 ---
124 [mandatory] 124 [mandatory]
125 125
126 FS_SINGLE is gone (actually, that had happened back when ->get_sb() 126 FS_SINGLE is gone (actually, that had happened back when ->get_sb()
127 went in - and hadn't been documented ;-/). Just remove it from fs_flags 127 went in - and hadn't been documented ;-/). Just remove it from fs_flags
128 (and see ->get_sb() entry for other actions). 128 (and see ->get_sb() entry for other actions).
129 129
130 --- 130 ---
131 [mandatory] 131 [mandatory]
132 132
133 ->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so 133 ->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so
134 watch for ->i_mutex-grabbing code that might be used by your ->setattr(). 134 watch for ->i_mutex-grabbing code that might be used by your ->setattr().
135 Callers of notify_change() need ->i_mutex now. 135 Callers of notify_change() need ->i_mutex now.
136 136
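For example, a caller of notify_change() now looks something like the
following sketch (inode, dentry, error and newattrs being whatever the caller
already has in hand):

	mutex_lock(&inode->i_mutex);
	error = notify_change(dentry, &newattrs);
	mutex_unlock(&inode->i_mutex);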
137 --- 137 ---
138 [recommended] 138 [recommended]
139 139
140 New super_block field "struct export_operations *s_export_op" for 140 New super_block field "struct export_operations *s_export_op" for
141 explicit support for exporting, e.g. via NFS. The structure is fully 141 explicit support for exporting, e.g. via NFS. The structure is fully
142 documented at its declaration in include/linux/fs.h, and in 142 documented at its declaration in include/linux/fs.h, and in
143 Documentation/filesystems/Exporting. 143 Documentation/filesystems/Exporting.
144 144
145 Briefly it allows for the definition of decode_fh and encode_fh operations 145 Briefly it allows for the definition of decode_fh and encode_fh operations
146 to encode and decode filehandles, and allows the filesystem to use 146 to encode and decode filehandles, and allows the filesystem to use
147 a standard helper function for decode_fh, and provide file-system specific 147 a standard helper function for decode_fh, and provide file-system specific
148 support for this helper, particularly get_parent. 148 support for this helper, particularly get_parent.
149 149
150 It is planned that this will be required for exporting once the code 150 It is planned that this will be required for exporting once the code
151 settles down a bit. 151 settles down a bit.
152 152
153 [mandatory] 153 [mandatory]
154 154
155 s_export_op is now required for exporting a filesystem. 155 s_export_op is now required for exporting a filesystem.
156 isofs, ext2, ext3, reiserfs, fat 156 isofs, ext2, ext3, reiserfs, fat
157 can be used as examples of very different filesystems. 157 can be used as examples of very different filesystems.
158 158
159 --- 159 ---
160 [mandatory] 160 [mandatory]
161 161
162 iget4() and the read_inode2 callback have been superseded by iget5_locked() 162 iget4() and the read_inode2 callback have been superseded by iget5_locked()
163 which has the following prototype, 163 which has the following prototype,
164 164
165 struct inode *iget5_locked(struct super_block *sb, unsigned long ino, 165 struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
166 int (*test)(struct inode *, void *), 166 int (*test)(struct inode *, void *),
167 int (*set)(struct inode *, void *), 167 int (*set)(struct inode *, void *),
168 void *data); 168 void *data);
169 169
170 'test' is an additional function that can be used when the inode 170 'test' is an additional function that can be used when the inode
171 number is not sufficient to identify the actual file object. 'set' 171 number is not sufficient to identify the actual file object. 'set'
172 should be a non-blocking function that initializes those parts of a 172 should be a non-blocking function that initializes those parts of a
173 newly created inode to allow the test function to succeed. 'data' is 173 newly created inode to allow the test function to succeed. 'data' is
174 passed as an opaque value to both test and set functions. 174 passed as an opaque value to both test and set functions.
175 175
176 When the inode has been created by iget5_locked(), it will be returned with 176 When the inode has been created by iget5_locked(), it will be returned with the
177 the I_NEW flag set and will still be locked. read_inode has not been 177 I_NEW flag set and will still be locked. The filesystem then needs to finalize
178 called so the file system still has to finalize the initialization. Once 178 the initialization. Once the inode is initialized it must be unlocked by
179 the inode is initialized it must be unlocked by calling unlock_new_inode(). 179 calling unlock_new_inode().
180 180
181 The filesystem is responsible for setting (and possibly testing) i_ino 181 The filesystem is responsible for setting (and possibly testing) i_ino
182 when appropriate. There is also a simpler iget_locked function that 182 when appropriate. There is also a simpler iget_locked function that
183 just takes the superblock and inode number as arguments and does the 183 just takes the superblock and inode number as arguments and does the
184 test and set for you. 184 test and set for you.
185 185
186 e.g. 186 e.g.
187 inode = iget_locked(sb, ino); 187 inode = iget_locked(sb, ino);
188 if (inode->i_state & I_NEW) { 188 if (inode->i_state & I_NEW) {
189 err = read_inode_from_disk(inode); 189 err = read_inode_from_disk(inode);
190 if (err < 0) { 190 if (err < 0) {
191 iget_failed(inode); 191 iget_failed(inode);
192 return err; 192 return err;
193 } 193 }
194 unlock_new_inode(inode); 194 unlock_new_inode(inode);
195 } 195 }
196 196
197 Note that if the process of setting up a new inode fails, then iget_failed() 197 Note that if the process of setting up a new inode fails, then iget_failed()
198 should be called on the inode to render it dead, and an appropriate error 198 should be called on the inode to render it dead, and an appropriate error
199 should be passed back to the caller. 199 should be passed back to the caller.
200 200
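As a complement to the iget_locked() example above, a hedged sketch of the
iget5_locked() form; the thingyfs_* names and the args structure are made up,
and note the NULL check for allocation failure, which the shorter example
above omits:

	struct thingyfs_args {
		unsigned long	ino;
		__u32		generation;
	};

	static int thingyfs_inode_test(struct inode *inode, void *opaque)
	{
		struct thingyfs_args *args = opaque;

		return inode->i_ino == args->ino &&
			inode->i_generation == args->generation;
	}

	/* 'set' must not block */
	static int thingyfs_inode_set(struct inode *inode, void *opaque)
	{
		struct thingyfs_args *args = opaque;

		inode->i_ino = args->ino;
		inode->i_generation = args->generation;
		return 0;
	}

	...
	inode = iget5_locked(sb, args.ino, thingyfs_inode_test,
			     thingyfs_inode_set, &args);
	if (!inode)
		return ERR_PTR(-ENOMEM);
	if (inode->i_state & I_NEW) {
		/* read the rest of the inode from disc here */
		unlock_new_inode(inode);
	}
	return inode;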
201 --- 201 ---
202 [recommended] 202 [recommended]
203 203
204 ->getattr() finally getting used. See instances in nfs, minix, etc. 204 ->getattr() finally getting used. See instances in nfs, minix, etc.
205 205
206 --- 206 ---
207 [mandatory] 207 [mandatory]
208 208
209 ->revalidate() is gone. If your filesystem had it - provide ->getattr() 209 ->revalidate() is gone. If your filesystem had it - provide ->getattr()
210 and let it call whatever you had as ->revalidate() + (for symlinks that 210 and let it call whatever you had as ->revalidate() + (for symlinks that
211 had ->revalidate()) add calls in ->follow_link()/->readlink(). 211 had ->revalidate()) add calls in ->follow_link()/->readlink().
212 212
213 --- 213 ---
214 [mandatory] 214 [mandatory]
215 215
216 ->d_parent changes are not protected by BKL anymore. Read access is safe 216 ->d_parent changes are not protected by BKL anymore. Read access is safe
217 if at least one of the following is true: 217 if at least one of the following is true:
218 * filesystem has no cross-directory rename() 218 * filesystem has no cross-directory rename()
219 * dcache_lock is held 219 * dcache_lock is held
220 * we know that parent had been locked (e.g. we are looking at 220 * we know that parent had been locked (e.g. we are looking at
221 ->d_parent of ->lookup() argument). 221 ->d_parent of ->lookup() argument).
222 * we are called from ->rename(). 222 * we are called from ->rename().
223 * the child's ->d_lock is held 223 * the child's ->d_lock is held
224 Audit your code and add locking if needed. Notice that any place that is 224 Audit your code and add locking if needed. Notice that any place that is
225 not protected by the conditions above is risky even in the old tree - you 225 not protected by the conditions above is risky even in the old tree - you
226 had been relying on BKL and that's prone to screwups. Old tree had quite 226 had been relying on BKL and that's prone to screwups. Old tree had quite
227 a few holes of that kind - unprotected access to ->d_parent leading to 227 a few holes of that kind - unprotected access to ->d_parent leading to
228 anything from oops to silent memory corruption. 228 anything from oops to silent memory corruption.
229 229
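As one example of such locking, reading ->d_parent under the child's ->d_lock
might look like this sketch:

	struct dentry *parent;

	spin_lock(&dentry->d_lock);
	parent = dentry->d_parent;
	/* use parent only while ->d_lock is held */
	spin_unlock(&dentry->d_lock);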
230 --- 230 ---
231 [mandatory] 231 [mandatory]
232 232
233 FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags 233 FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags
234 (see rootfs for one kind of solution and bdev/socket/pipe for another). 234 (see rootfs for one kind of solution and bdev/socket/pipe for another).
235 235
236 --- 236 ---
237 [recommended] 237 [recommended]
238 238
239 Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter 239 Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter
240 is still alive, but only because of the mess in drivers/s390/block/dasd.c. 240 is still alive, but only because of the mess in drivers/s390/block/dasd.c.
241 As soon as it gets fixed is_read_only() will die. 241 As soon as it gets fixed is_read_only() will die.
242 242
243 --- 243 ---
244 [mandatory] 244 [mandatory]
245 245
246 ->permission() is called without BKL now. Grab it on entry, drop upon 246 ->permission() is called without BKL now. Grab it on entry, drop upon
247 return - that will guarantee the same locking you used to have. If 247 return - that will guarantee the same locking you used to have. If
248 your method or its parts do not need BKL - better yet, now you can 248 your method or its parts do not need BKL - better yet, now you can
249 shift lock_kernel() and unlock_kernel() so that they would protect 249 shift lock_kernel() and unlock_kernel() so that they would protect
250 exactly what needs to be protected. 250 exactly what needs to be protected.
251 251
252 --- 252 ---
253 [mandatory] 253 [mandatory]
254 254
255 ->statfs() is now called without BKL held. BKL should have been 255 ->statfs() is now called without BKL held. BKL should have been
256 shifted into individual fs sb_op functions where it's not clear that 256 shifted into individual fs sb_op functions where it's not clear that
257 it's safe to remove it. If you don't need it, remove it. 257 it's safe to remove it. If you don't need it, remove it.
258 258
259 --- 259 ---
260 [mandatory] 260 [mandatory]
261 261
262 is_read_only() is gone; use bdev_read_only() instead. 262 is_read_only() is gone; use bdev_read_only() instead.
263 263
264 --- 264 ---
265 [mandatory] 265 [mandatory]
266 266
267 destroy_buffers() is gone; use invalidate_bdev(). 267 destroy_buffers() is gone; use invalidate_bdev().
268 268
269 --- 269 ---
270 [mandatory] 270 [mandatory]
271 271
272 fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is 272 fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is
273 deliberate; as soon as struct block_device * is propagated in a reasonable 273 deliberate; as soon as struct block_device * is propagated in a reasonable
274 way by that code, fixing will become trivial; until then nothing can be 274 way by that code, fixing will become trivial; until then nothing can be
275 done. 275 done.
276 276
Documentation/filesystems/vfs.txt
1 1
2 Overview of the Linux Virtual File System 2 Overview of the Linux Virtual File System
3 3
4 Original author: Richard Gooch <rgooch@atnf.csiro.au> 4 Original author: Richard Gooch <rgooch@atnf.csiro.au>
5 5
6 Last updated on June 24, 2007. 6 Last updated on June 24, 2007.
7 7
8 Copyright (C) 1999 Richard Gooch 8 Copyright (C) 1999 Richard Gooch
9 Copyright (C) 2005 Pekka Enberg 9 Copyright (C) 2005 Pekka Enberg
10 10
11 This file is released under the GPLv2. 11 This file is released under the GPLv2.
12 12
13 13
14 Introduction 14 Introduction
15 ============ 15 ============
16 16
17 The Virtual File System (also known as the Virtual Filesystem Switch) 17 The Virtual File System (also known as the Virtual Filesystem Switch)
18 is the software layer in the kernel that provides the filesystem 18 is the software layer in the kernel that provides the filesystem
19 interface to userspace programs. It also provides an abstraction 19 interface to userspace programs. It also provides an abstraction
20 within the kernel which allows different filesystem implementations to 20 within the kernel which allows different filesystem implementations to
21 coexist. 21 coexist.
22 22
23 VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so 23 VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
24 on are called from a process context. Filesystem locking is described 24 on are called from a process context. Filesystem locking is described
25 in the document Documentation/filesystems/Locking. 25 in the document Documentation/filesystems/Locking.
26 26
27 27
28 Directory Entry Cache (dcache) 28 Directory Entry Cache (dcache)
29 ------------------------------ 29 ------------------------------
30 30
31 The VFS implements the open(2), stat(2), chmod(2), and similar system 31 The VFS implements the open(2), stat(2), chmod(2), and similar system
32 calls. The pathname argument that is passed to them is used by the VFS 32 calls. The pathname argument that is passed to them is used by the VFS
33 to search through the directory entry cache (also known as the dentry 33 to search through the directory entry cache (also known as the dentry
34 cache or dcache). This provides a very fast look-up mechanism to 34 cache or dcache). This provides a very fast look-up mechanism to
35 translate a pathname (filename) into a specific dentry. Dentries live 35 translate a pathname (filename) into a specific dentry. Dentries live
36 in RAM and are never saved to disc: they exist only for performance. 36 in RAM and are never saved to disc: they exist only for performance.
37 37
38 The dentry cache is meant to be a view into your entire filespace. As 38 The dentry cache is meant to be a view into your entire filespace. As
39 most computers cannot fit all dentries in the RAM at the same time, 39 most computers cannot fit all dentries in the RAM at the same time,
40 some bits of the cache are missing. In order to resolve your pathname 40 some bits of the cache are missing. In order to resolve your pathname
41 into a dentry, the VFS may have to resort to creating dentries along 41 into a dentry, the VFS may have to resort to creating dentries along
42 the way, and then loading the inode. This is done by looking up the 42 the way, and then loading the inode. This is done by looking up the
43 inode. 43 inode.
44 44
45 45
46 The Inode Object 46 The Inode Object
47 ---------------- 47 ----------------
48 48
49 An individual dentry usually has a pointer to an inode. Inodes are 49 An individual dentry usually has a pointer to an inode. Inodes are
50 filesystem objects such as regular files, directories, FIFOs and other 50 filesystem objects such as regular files, directories, FIFOs and other
51 beasts. They live either on the disc (for block device filesystems) 51 beasts. They live either on the disc (for block device filesystems)
52 or in the memory (for pseudo filesystems). Inodes that live on the 52 or in the memory (for pseudo filesystems). Inodes that live on the
53 disc are copied into the memory when required and changes to the inode 53 disc are copied into the memory when required and changes to the inode
54 are written back to disc. A single inode can be pointed to by multiple 54 are written back to disc. A single inode can be pointed to by multiple
55 dentries (hard links, for example, do this). 55 dentries (hard links, for example, do this).
56 56
57 To look up an inode, the VFS calls the lookup() method of 57 To look up an inode, the VFS calls the lookup() method of
58 the parent directory inode. This method is installed by the specific 58 the parent directory inode. This method is installed by the specific
59 filesystem implementation that the inode lives in. Once the VFS has 59 filesystem implementation that the inode lives in. Once the VFS has
60 the required dentry (and hence the inode), we can do all those boring 60 the required dentry (and hence the inode), we can do all those boring
61 things like open(2) the file, or stat(2) it to peek at the inode 61 things like open(2) the file, or stat(2) it to peek at the inode
62 data. The stat(2) operation is fairly simple: once the VFS has the 62 data. The stat(2) operation is fairly simple: once the VFS has the
63 dentry, it peeks at the inode data and passes some of it back to 63 dentry, it peeks at the inode data and passes some of it back to
64 userspace. 64 userspace.
65 65
66 66
67 The File Object 67 The File Object
68 --------------- 68 ---------------
69 69
70 Opening a file requires another operation: allocation of a file 70 Opening a file requires another operation: allocation of a file
71 structure (this is the kernel-side implementation of file 71 structure (this is the kernel-side implementation of file
72 descriptors). The freshly allocated file structure is initialized with 72 descriptors). The freshly allocated file structure is initialized with
73 a pointer to the dentry and a set of file operation member functions. 73 a pointer to the dentry and a set of file operation member functions.
74 These are taken from the inode data. The open() file method is then 74 These are taken from the inode data. The open() file method is then
75 called so the specific filesystem implementation can do its work. You 75 called so the specific filesystem implementation can do its work. You
76 can see that this is another switch performed by the VFS. The file 76 can see that this is another switch performed by the VFS. The file
77 structure is placed into the file descriptor table for the process. 77 structure is placed into the file descriptor table for the process.
78 78
79 Reading, writing and closing files (and other assorted VFS operations) 79 Reading, writing and closing files (and other assorted VFS operations)
80 is done by using the userspace file descriptor to grab the appropriate 80 is done by using the userspace file descriptor to grab the appropriate
81 file structure, and then calling the required file structure method to 81 file structure, and then calling the required file structure method to
82 do whatever is required. For as long as the file is open, it keeps the 82 do whatever is required. For as long as the file is open, it keeps the
83 dentry in use, which in turn means that the VFS inode is still in use. 83 dentry in use, which in turn means that the VFS inode is still in use.
84 84
85 85
86 Registering and Mounting a Filesystem 86 Registering and Mounting a Filesystem
87 ===================================== 87 =====================================
88 88
89 To register and unregister a filesystem, use the following API 89 To register and unregister a filesystem, use the following API
90 functions: 90 functions:
91 91
92 #include <linux/fs.h> 92 #include <linux/fs.h>
93 93
94 extern int register_filesystem(struct file_system_type *); 94 extern int register_filesystem(struct file_system_type *);
95 extern int unregister_filesystem(struct file_system_type *); 95 extern int unregister_filesystem(struct file_system_type *);
96 96
97 The passed struct file_system_type describes your filesystem. When a 97 The passed struct file_system_type describes your filesystem. When a
98 request is made to mount a device onto a directory in your filespace, 98 request is made to mount a device onto a directory in your filespace,
99 the VFS will call the appropriate get_sb() method for the specific 99 the VFS will call the appropriate get_sb() method for the specific
100 filesystem. The dentry for the mount point will then be updated to 100 filesystem. The dentry for the mount point will then be updated to
101 point to the root inode for the new filesystem. 101 point to the root inode for the new filesystem.
102 102
103 You can see all filesystems that are registered to the kernel in the 103 You can see all filesystems that are registered to the kernel in the
104 file /proc/filesystems. 104 file /proc/filesystems.
105 105
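A minimal sketch of module init/exit for a hypothetical thingyfs
(thingyfs_fs_type being its struct file_system_type, described in the next
section):

	static int __init init_thingyfs(void)
	{
		return register_filesystem(&thingyfs_fs_type);
	}

	static void __exit exit_thingyfs(void)
	{
		unregister_filesystem(&thingyfs_fs_type);
	}

	module_init(init_thingyfs);
	module_exit(exit_thingyfs);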
106 106
107 struct file_system_type 107 struct file_system_type
108 ----------------------- 108 -----------------------
109 109
110 This describes the filesystem. As of kernel 2.6.22, the following 110 This describes the filesystem. As of kernel 2.6.22, the following
111 members are defined: 111 members are defined:
112 112
113 struct file_system_type { 113 struct file_system_type {
114 const char *name; 114 const char *name;
115 int fs_flags; 115 int fs_flags;
116 int (*get_sb) (struct file_system_type *, int, 116 int (*get_sb) (struct file_system_type *, int,
117 const char *, void *, struct vfsmount *); 117 const char *, void *, struct vfsmount *);
118 void (*kill_sb) (struct super_block *); 118 void (*kill_sb) (struct super_block *);
119 struct module *owner; 119 struct module *owner;
120 struct file_system_type * next; 120 struct file_system_type * next;
121 struct list_head fs_supers; 121 struct list_head fs_supers;
122 struct lock_class_key s_lock_key; 122 struct lock_class_key s_lock_key;
123 struct lock_class_key s_umount_key; 123 struct lock_class_key s_umount_key;
124 }; 124 };
125 125
126 name: the name of the filesystem type, such as "ext2", "iso9660", 126 name: the name of the filesystem type, such as "ext2", "iso9660",
127 "msdos" and so on 127 "msdos" and so on
128 128
129 fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) 129 fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
130 130
131 get_sb: the method to call when a new instance of this 131 get_sb: the method to call when a new instance of this
132 filesystem should be mounted 132 filesystem should be mounted
133 133
134 kill_sb: the method to call when an instance of this filesystem 134 kill_sb: the method to call when an instance of this filesystem
135 should be unmounted 135 should be unmounted
136 136
137 owner: for internal VFS use: you should initialize this to THIS_MODULE in 137 owner: for internal VFS use: you should initialize this to THIS_MODULE in
138 most cases. 138 most cases.
139 139
140 next: for internal VFS use: you should initialize this to NULL 140 next: for internal VFS use: you should initialize this to NULL
141 141
142 s_lock_key, s_umount_key: lockdep-specific 142 s_lock_key, s_umount_key: lockdep-specific
143 143
144 The get_sb() method has the following arguments: 144 The get_sb() method has the following arguments:
145 145
146 struct file_system_type *fs_type: describes the filesystem, partly initialized 146 struct file_system_type *fs_type: describes the filesystem, partly initialized
147 by the specific filesystem code 147 by the specific filesystem code
148 148
149 int flags: mount flags 149 int flags: mount flags
150 150
151 const char *dev_name: the device name we are mounting. 151 const char *dev_name: the device name we are mounting.
152 152
153 void *data: arbitrary mount options, usually comes as an ASCII 153 void *data: arbitrary mount options, usually comes as an ASCII
154 string 154 string
155 155
156 struct vfsmount *mnt: a vfs-internal representation of a mount point 156 struct vfsmount *mnt: a vfs-internal representation of a mount point
157 157
158 The get_sb() method must determine if the block device specified 158 The get_sb() method must determine if the block device specified
159 in the dev_name and fs_type contains a filesystem of the type the method 159 in the dev_name and fs_type contains a filesystem of the type the method
160 supports. If it succeeds in opening the named block device, it initializes a 160 supports. If it succeeds in opening the named block device, it initializes a
161 struct super_block descriptor for the filesystem contained by the block device. 161 struct super_block descriptor for the filesystem contained by the block device.
162 On failure it returns an error. 162 On failure it returns an error.
163 163
164 The most interesting member of the superblock structure that the 164 The most interesting member of the superblock structure that the
165 get_sb() method fills in is the "s_op" field. This is a pointer to 165 get_sb() method fills in is the "s_op" field. This is a pointer to
166 a "struct super_operations" which describes the next level of the 166 a "struct super_operations" which describes the next level of the
167 filesystem implementation. 167 filesystem implementation.
168 168
169 Usually, a filesystem uses one of the generic get_sb() implementations 169 Usually, a filesystem uses one of the generic get_sb() implementations
170 and provides a fill_super() method instead. The generic methods are: 170 and provides a fill_super() method instead. The generic methods are:
171 171
172 get_sb_bdev: mount a filesystem residing on a block device 172 get_sb_bdev: mount a filesystem residing on a block device
173 173
174 get_sb_nodev: mount a filesystem that is not backed by a device 174 get_sb_nodev: mount a filesystem that is not backed by a device
175 175
176 get_sb_single: mount a filesystem which shares the instance between 176 get_sb_single: mount a filesystem which shares the instance between
177 all mounts 177 all mounts
178 178
179 A fill_super() method implementation has the following arguments: 179 A fill_super() method implementation has the following arguments:
180 180
181 struct super_block *sb: the superblock structure. The method fill_super() 181 struct super_block *sb: the superblock structure. The method fill_super()
182 must initialize this properly. 182 must initialize this properly.
183 183
184 void *data: arbitrary mount options, usually comes as an ASCII 184 void *data: arbitrary mount options, usually comes as an ASCII
185 string 185 string
186 186
187 int silent: whether or not to be silent on error 187 int silent: whether or not to be silent on error
188 188
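For illustration, a sketch of a get_sb() that delegates to get_sb_bdev(); the
thingyfs names and thingyfs_super_ops are hypothetical:

	static int thingyfs_fill_super(struct super_block *sb, void *data, int silent)
	{
		/* read the on-disc superblock, set s_blocksize, s_root, ... */
		sb->s_op = &thingyfs_super_ops;		/* hypothetical */
		return 0;
	}

	static int thingyfs_get_sb(struct file_system_type *fs_type,
				   int flags, const char *dev_name, void *data,
				   struct vfsmount *mnt)
	{
		return get_sb_bdev(fs_type, flags, dev_name, data,
				   thingyfs_fill_super, mnt);
	}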
189 189
190 The Superblock Object 190 The Superblock Object
191 ===================== 191 =====================
192 192
193 A superblock object represents a mounted filesystem. 193 A superblock object represents a mounted filesystem.
194 194
195 195
196 struct super_operations 196 struct super_operations
197 ----------------------- 197 -----------------------
198 198
199 This describes how the VFS can manipulate the superblock of your 199 This describes how the VFS can manipulate the superblock of your
200 filesystem. As of kernel 2.6.22, the following members are defined: 200 filesystem. As of kernel 2.6.22, the following members are defined:
201 201
202 struct super_operations { 202 struct super_operations {
203 struct inode *(*alloc_inode)(struct super_block *sb); 203 struct inode *(*alloc_inode)(struct super_block *sb);
204 void (*destroy_inode)(struct inode *); 204 void (*destroy_inode)(struct inode *);
205 205
206 void (*read_inode) (struct inode *);
207
208 void (*dirty_inode) (struct inode *); 206 void (*dirty_inode) (struct inode *);
209 int (*write_inode) (struct inode *, int); 207 int (*write_inode) (struct inode *, int);
210 void (*put_inode) (struct inode *); 208 void (*put_inode) (struct inode *);
211 void (*drop_inode) (struct inode *); 209 void (*drop_inode) (struct inode *);
212 void (*delete_inode) (struct inode *); 210 void (*delete_inode) (struct inode *);
213 void (*put_super) (struct super_block *); 211 void (*put_super) (struct super_block *);
214 void (*write_super) (struct super_block *); 212 void (*write_super) (struct super_block *);
215 int (*sync_fs)(struct super_block *sb, int wait); 213 int (*sync_fs)(struct super_block *sb, int wait);
216 void (*write_super_lockfs) (struct super_block *); 214 void (*write_super_lockfs) (struct super_block *);
217 void (*unlockfs) (struct super_block *); 215 void (*unlockfs) (struct super_block *);
218 int (*statfs) (struct dentry *, struct kstatfs *); 216 int (*statfs) (struct dentry *, struct kstatfs *);
219 int (*remount_fs) (struct super_block *, int *, char *); 217 int (*remount_fs) (struct super_block *, int *, char *);
220 void (*clear_inode) (struct inode *); 218 void (*clear_inode) (struct inode *);
221 void (*umount_begin) (struct super_block *); 219 void (*umount_begin) (struct super_block *);
222 220
223 int (*show_options)(struct seq_file *, struct vfsmount *); 221 int (*show_options)(struct seq_file *, struct vfsmount *);
224 222
225 ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); 223 ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
226 ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); 224 ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
227 }; 225 };
228 226
229 All methods are called without any locks being held, unless otherwise 227 All methods are called without any locks being held, unless otherwise
230 noted. This means that most methods can block safely. All methods are 228 noted. This means that most methods can block safely. All methods are
231 only called from a process context (i.e. not from an interrupt handler 229 only called from a process context (i.e. not from an interrupt handler
232 or bottom half). 230 or bottom half).
233 231
234 alloc_inode: this method is called by alloc_inode() to allocate memory 232 alloc_inode: this method is called by alloc_inode() to allocate memory
235 for struct inode and initialize it. If this function is not 233 for struct inode and initialize it. If this function is not
236 defined, a simple 'struct inode' is allocated. Normally 234 defined, a simple 'struct inode' is allocated. Normally
237 alloc_inode will be used to allocate a larger structure which 235 alloc_inode will be used to allocate a larger structure which
238 contains a 'struct inode' embedded within it. 236 contains a 'struct inode' embedded within it.
239 237
240 destroy_inode: this method is called by destroy_inode() to release 238 destroy_inode: this method is called by destroy_inode() to release
241 resources allocated for struct inode. It is only required if 239 resources allocated for struct inode. It is only required if
242 ->alloc_inode was defined and simply undoes anything done by 240 ->alloc_inode was defined and simply undoes anything done by
243 ->alloc_inode. 241 ->alloc_inode.
244 242
245 read_inode: this method is called to read a specific inode from the
246 mounted filesystem. The i_ino member in the struct inode is
247 initialized by the VFS to indicate which inode to read. Other
248 members are filled in by this method.
249
250 You can set this to NULL and use iget5_locked() instead of iget()
251 to read inodes. This is necessary for filesystems for which the
252 inode number is not sufficient to identify an inode.
253
254 dirty_inode: this method is called by the VFS to mark an inode dirty. 243 dirty_inode: this method is called by the VFS to mark an inode dirty.
255 244
256 write_inode: this method is called when the VFS needs to write an 245 write_inode: this method is called when the VFS needs to write an
257 inode to disc. The second parameter indicates whether the write 246 inode to disc. The second parameter indicates whether the write
258 should be synchronous or not, not all filesystems check this flag. 247 should be synchronous or not, not all filesystems check this flag.
259 248
260 put_inode: called when the VFS inode is removed from the inode 249 put_inode: called when the VFS inode is removed from the inode
261 cache. 250 cache.
262 251
263 drop_inode: called when the last access to the inode is dropped, 252 drop_inode: called when the last access to the inode is dropped,
264 with the inode_lock spinlock held. 253 with the inode_lock spinlock held.
265 254
266 This method should be either NULL (normal UNIX filesystem 255 This method should be either NULL (normal UNIX filesystem
267 semantics) or "generic_delete_inode" (for filesystems that do not 256 semantics) or "generic_delete_inode" (for filesystems that do not
268 want to cache inodes - causing "delete_inode" to always be 257 want to cache inodes - causing "delete_inode" to always be
269 called regardless of the value of i_nlink) 258 called regardless of the value of i_nlink)
270 259
271 The "generic_delete_inode()" behavior is equivalent to the 260 The "generic_delete_inode()" behavior is equivalent to the
272 old practice of using "force_delete" in the put_inode() case, 261 old practice of using "force_delete" in the put_inode() case,
273 but does not have the races that the "force_delete()" approach 262 but does not have the races that the "force_delete()" approach
274 had. 263 had.
275 264
276 delete_inode: called when the VFS wants to delete an inode 265 delete_inode: called when the VFS wants to delete an inode
277 266
278 put_super: called when the VFS wishes to free the superblock 267 put_super: called when the VFS wishes to free the superblock
279 (i.e. unmount). This is called with the superblock lock held 268 (i.e. unmount). This is called with the superblock lock held
280 269
281 write_super: called when the VFS superblock needs to be written to 270 write_super: called when the VFS superblock needs to be written to
282 disc. This method is optional 271 disc. This method is optional
283 272
284 sync_fs: called when VFS is writing out all dirty data associated with 273 sync_fs: called when VFS is writing out all dirty data associated with
285 a superblock. The second parameter indicates whether the method 274 a superblock. The second parameter indicates whether the method
286 should wait until the write out has been completed. Optional. 275 should wait until the write out has been completed. Optional.
287 276
288 write_super_lockfs: called when VFS is locking a filesystem and 277 write_super_lockfs: called when VFS is locking a filesystem and
289 forcing it into a consistent state. This method is currently 278 forcing it into a consistent state. This method is currently
290 used by the Logical Volume Manager (LVM). 279 used by the Logical Volume Manager (LVM).
291 280
292 unlockfs: called when VFS is unlocking a filesystem and making it writable 281 unlockfs: called when VFS is unlocking a filesystem and making it writable
293 again. 282 again.
294 283
295 statfs: called when the VFS needs to get filesystem statistics. This 284 statfs: called when the VFS needs to get filesystem statistics. This
296 is called with the kernel lock held 285 is called with the kernel lock held
297 286
298 remount_fs: called when the filesystem is remounted. This is called 287 remount_fs: called when the filesystem is remounted. This is called
299 with the kernel lock held 288 with the kernel lock held
300 289
301 clear_inode: called when the VFS clears the inode. Optional 290 clear_inode: called when the VFS clears the inode. Optional
302 291
303 umount_begin: called when the VFS is unmounting a filesystem. 292 umount_begin: called when the VFS is unmounting a filesystem.
304 293
305 show_options: called by the VFS to show mount options for /proc/<pid>/mounts. 294 show_options: called by the VFS to show mount options for /proc/<pid>/mounts.
306 295
307 quota_read: called by the VFS to read from filesystem quota file. 296 quota_read: called by the VFS to read from filesystem quota file.
308 297
309 quota_write: called by the VFS to write to filesystem quota file. 298 quota_write: called by the VFS to write to filesystem quota file.
310 299
311 The read_inode() method is responsible for filling in the "i_op" 300 Whoever sets up the inode is responsible for filling in the "i_op" field. This
312 field. This is a pointer to a "struct inode_operations" which 301 is a pointer to a "struct inode_operations" which describes the methods that
313 describes the methods that can be performed on individual inodes. 302 can be performed on individual inodes.
314 303
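To illustrate the alloc_inode()/destroy_inode() pairing described above, a
sketch of the usual embedded-inode pattern; the thingyfs structure and inode
cache are hypothetical:

	struct thingyfs_inode_info {
		unsigned long	i_flags;	/* fs-private state */
		struct inode	vfs_inode;	/* embedded VFS inode */
	};

	static struct inode *thingyfs_alloc_inode(struct super_block *sb)
	{
		struct thingyfs_inode_info *ei;

		ei = kmem_cache_alloc(thingyfs_inode_cachep, GFP_KERNEL);
		if (!ei)
			return NULL;
		return &ei->vfs_inode;
	}

	static void thingyfs_destroy_inode(struct inode *inode)
	{
		struct thingyfs_inode_info *ei =
			container_of(inode, struct thingyfs_inode_info, vfs_inode);

		kmem_cache_free(thingyfs_inode_cachep, ei);
	}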
315 304
316 The Inode Object 305 The Inode Object
317 ================ 306 ================
318 307
319 An inode object represents an object within the filesystem. 308 An inode object represents an object within the filesystem.
320 309
321 310
322 struct inode_operations 311 struct inode_operations
323 ----------------------- 312 -----------------------
324 313
325 This describes how the VFS can manipulate an inode in your 314 This describes how the VFS can manipulate an inode in your
326 filesystem. As of kernel 2.6.22, the following members are defined: 315 filesystem. As of kernel 2.6.22, the following members are defined:
327 316
328 struct inode_operations { 317 struct inode_operations {
329 int (*create) (struct inode *,struct dentry *,int, struct nameidata *); 318 int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
330 struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); 319 struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
331 int (*link) (struct dentry *,struct inode *,struct dentry *); 320 int (*link) (struct dentry *,struct inode *,struct dentry *);
332 int (*unlink) (struct inode *,struct dentry *); 321 int (*unlink) (struct inode *,struct dentry *);
333 int (*symlink) (struct inode *,struct dentry *,const char *); 322 int (*symlink) (struct inode *,struct dentry *,const char *);
334 int (*mkdir) (struct inode *,struct dentry *,int); 323 int (*mkdir) (struct inode *,struct dentry *,int);
335 int (*rmdir) (struct inode *,struct dentry *); 324 int (*rmdir) (struct inode *,struct dentry *);
336 int (*mknod) (struct inode *,struct dentry *,int,dev_t); 325 int (*mknod) (struct inode *,struct dentry *,int,dev_t);
337 int (*rename) (struct inode *, struct dentry *, 326 int (*rename) (struct inode *, struct dentry *,
338 struct inode *, struct dentry *); 327 struct inode *, struct dentry *);
339 int (*readlink) (struct dentry *, char __user *,int); 328 int (*readlink) (struct dentry *, char __user *,int);
340 void * (*follow_link) (struct dentry *, struct nameidata *); 329 void * (*follow_link) (struct dentry *, struct nameidata *);
341 void (*put_link) (struct dentry *, struct nameidata *, void *); 330 void (*put_link) (struct dentry *, struct nameidata *, void *);
342 void (*truncate) (struct inode *); 331 void (*truncate) (struct inode *);
343 int (*permission) (struct inode *, int, struct nameidata *); 332 int (*permission) (struct inode *, int, struct nameidata *);
344 int (*setattr) (struct dentry *, struct iattr *); 333 int (*setattr) (struct dentry *, struct iattr *);
345 int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); 334 int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
346 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); 335 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
347 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); 336 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
348 ssize_t (*listxattr) (struct dentry *, char *, size_t); 337 ssize_t (*listxattr) (struct dentry *, char *, size_t);
349 int (*removexattr) (struct dentry *, const char *); 338 int (*removexattr) (struct dentry *, const char *);
350 void (*truncate_range)(struct inode *, loff_t, loff_t); 339 void (*truncate_range)(struct inode *, loff_t, loff_t);
351 }; 340 };
352 341
353 Again, all methods are called without any locks being held, unless 342 Again, all methods are called without any locks being held, unless
354 otherwise noted. 343 otherwise noted.
355 344
356 create: called by the open(2) and creat(2) system calls. Only 345 create: called by the open(2) and creat(2) system calls. Only
357 required if you want to support regular files. The dentry you 346 required if you want to support regular files. The dentry you
358 get should not have an inode (i.e. it should be a negative 347 get should not have an inode (i.e. it should be a negative
359 dentry). Here you will probably call d_instantiate() with the 348 dentry). Here you will probably call d_instantiate() with the
360 dentry and the newly created inode 349 dentry and the newly created inode
361 350
362 lookup: called when the VFS needs to look up an inode in a parent 351 lookup: called when the VFS needs to look up an inode in a parent
363 directory. The name to look for is found in the dentry. This 352 directory. The name to look for is found in the dentry. This
364 method must call d_add() to insert the found inode into the 353 method must call d_add() to insert the found inode into the
365 dentry. The "i_count" field in the inode structure should be 354 dentry. The "i_count" field in the inode structure should be
366 incremented. If the named inode does not exist a NULL inode 355 incremented. If the named inode does not exist a NULL inode
367 should be inserted into the dentry (this is called a negative 356 should be inserted into the dentry (this is called a negative
368 dentry). Returning an error code from this routine must only 357 dentry). Returning an error code from this routine must only
369 be done on a real error, otherwise creating inodes with system 358 be done on a real error, otherwise creating inodes with system
370 calls like create(2), mknod(2), mkdir(2) and so on will fail. 359 calls like create(2), mknod(2), mkdir(2) and so on will fail.
371 If you wish to overload the dentry methods then you should 360 If you wish to overload the dentry methods then you should
372 initialise the "d_op" field in the dentry; this is a pointer 361 initialise the "d_op" field in the dentry; this is a pointer
373 to a struct "dentry_operations". 362 to a struct "dentry_operations".
374 This method is called with the directory inode semaphore held 363 This method is called with the directory inode semaphore held
375 364
376 link: called by the link(2) system call. Only required if you want 365 link: called by the link(2) system call. Only required if you want
377 to support hard links. You will probably need to call 366 to support hard links. You will probably need to call
378 d_instantiate() just as you would in the create() method 367 d_instantiate() just as you would in the create() method
379 368
380 unlink: called by the unlink(2) system call. Only required if you 369 unlink: called by the unlink(2) system call. Only required if you
381 want to support deleting inodes 370 want to support deleting inodes
382 371
383 symlink: called by the symlink(2) system call. Only required if you 372 symlink: called by the symlink(2) system call. Only required if you
384 want to support symlinks. You will probably need to call 373 want to support symlinks. You will probably need to call
385 d_instantiate() just as you would in the create() method 374 d_instantiate() just as you would in the create() method
386 375
387 mkdir: called by the mkdir(2) system call. Only required if you want 376 mkdir: called by the mkdir(2) system call. Only required if you want
388 to support creating subdirectories. You will probably need to 377 to support creating subdirectories. You will probably need to
389 call d_instantiate() just as you would in the create() method 378 call d_instantiate() just as you would in the create() method
390 379
391 rmdir: called by the rmdir(2) system call. Only required if you want 380 rmdir: called by the rmdir(2) system call. Only required if you want
392 to support deleting subdirectories 381 to support deleting subdirectories
393 382
394 mknod: called by the mknod(2) system call to create a device (char, 383 mknod: called by the mknod(2) system call to create a device (char,
395 block) inode or a named pipe (FIFO) or socket. Only required 384 block) inode or a named pipe (FIFO) or socket. Only required
396 if you want to support creating these types of inodes. You 385 if you want to support creating these types of inodes. You
397 will probably need to call d_instantiate() just as you would 386 will probably need to call d_instantiate() just as you would
398 in the create() method 387 in the create() method
399 388
400 rename: called by the rename(2) system call to rename the object to 389 rename: called by the rename(2) system call to rename the object to
401 have the parent and name given by the second inode and dentry. 390 have the parent and name given by the second inode and dentry.
402 391
403 readlink: called by the readlink(2) system call. Only required if 392 readlink: called by the readlink(2) system call. Only required if
404 you want to support reading symbolic links 393 you want to support reading symbolic links
405 394
406 follow_link: called by the VFS to follow a symbolic link to the 395 follow_link: called by the VFS to follow a symbolic link to the
407 inode it points to. Only required if you want to support 396 inode it points to. Only required if you want to support
408 symbolic links. This method returns a void pointer cookie 397 symbolic links. This method returns a void pointer cookie
409 that is passed to put_link(). 398 that is passed to put_link().
410 399
411 put_link: called by the VFS to release resources allocated by 400 put_link: called by the VFS to release resources allocated by
412 follow_link(). The cookie returned by follow_link() is passed 401 follow_link(). The cookie returned by follow_link() is passed
413 to this method as the last parameter. It is used by 402 to this method as the last parameter. It is used by
414 filesystems such as NFS where page cache is not stable 403 filesystems such as NFS where page cache is not stable
415 (i.e. page that was installed when the symbolic link walk 404 (i.e. page that was installed when the symbolic link walk
416 started might not be in the page cache at the end of the 405 started might not be in the page cache at the end of the
417 walk). 406 walk).
418 407
419 truncate: called by the VFS to change the size of a file. The 408 truncate: called by the VFS to change the size of a file. The
420 i_size field of the inode is set to the desired size by the 409 i_size field of the inode is set to the desired size by the
421 VFS before this method is called. This method is called by 410 VFS before this method is called. This method is called by
422 the truncate(2) system call and related functionality. 411 the truncate(2) system call and related functionality.
423 412
424 permission: called by the VFS to check for access rights on a POSIX-like 413 permission: called by the VFS to check for access rights on a POSIX-like
425 filesystem. 414 filesystem.
426 415
427 setattr: called by the VFS to set attributes for a file. This method 416 setattr: called by the VFS to set attributes for a file. This method
428 is called by chmod(2) and related system calls. 417 is called by chmod(2) and related system calls.
429 418
430 getattr: called by the VFS to get attributes of a file. This method 419 getattr: called by the VFS to get attributes of a file. This method
431 is called by stat(2) and related system calls. 420 is called by stat(2) and related system calls.
432 421
433 setxattr: called by the VFS to set an extended attribute for a file. 422 setxattr: called by the VFS to set an extended attribute for a file.
434 Extended attribute is a name:value pair associated with an 423 Extended attribute is a name:value pair associated with an
435 inode. This method is called by setxattr(2) system call. 424 inode. This method is called by setxattr(2) system call.
436 425
437 getxattr: called by the VFS to retrieve the value of an extended 426 getxattr: called by the VFS to retrieve the value of an extended
438 attribute name. This method is called by getxattr(2) function 427 attribute name. This method is called by getxattr(2) function
439 call. 428 call.
440 429
441 listxattr: called by the VFS to list all extended attributes for a 430 listxattr: called by the VFS to list all extended attributes for a
442 given file. This method is called by listxattr(2) system call. 431 given file. This method is called by listxattr(2) system call.
443 432
444 removexattr: called by the VFS to remove an extended attribute from 433 removexattr: called by the VFS to remove an extended attribute from
445 a file. This method is called by removexattr(2) system call. 434 a file. This method is called by removexattr(2) system call.
446 435
447 truncate_range: a method provided by the underlying filesystem to truncate a 436 truncate_range: a method provided by the underlying filesystem to truncate a
448 range of blocks, i.e. punch a hole somewhere in a file. 437 range of blocks, i.e. punch a hole somewhere in a file.
449 438
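As a sketch of the lookup() behaviour described above (the thingyfs helpers
are hypothetical; thingyfs_iget() is assumed to return an already-referenced
inode or an ERR_PTR):

	static struct dentry *thingyfs_lookup(struct inode *dir,
					      struct dentry *dentry,
					      struct nameidata *nd)
	{
		struct inode *inode = NULL;
		unsigned long ino;

		ino = thingyfs_find_entry(dir, &dentry->d_name);	/* hypothetical */
		if (ino) {
			inode = thingyfs_iget(dir->i_sb, ino);		/* hypothetical */
			if (IS_ERR(inode))
				return ERR_PTR(PTR_ERR(inode));
		}
		d_add(dentry, inode);	/* a NULL inode makes this a negative dentry */
		return NULL;
	}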
450 439
451 The Address Space Object 440 The Address Space Object
452 ======================== 441 ========================
453 442
454 The address space object is used to group and manage pages in the page 443 The address space object is used to group and manage pages in the page
455 cache. It can be used to keep track of the pages in a file (or 444 cache. It can be used to keep track of the pages in a file (or
456 anything else) and also track the mapping of sections of the file into 445 anything else) and also track the mapping of sections of the file into
457 process address spaces. 446 process address spaces.
458 447
459 There are a number of distinct yet related services that an 448 There are a number of distinct yet related services that an
460 address-space can provide. These include communicating memory 449 address-space can provide. These include communicating memory
461 pressure, page lookup by address, and keeping track of pages tagged as 450 pressure, page lookup by address, and keeping track of pages tagged as
462 Dirty or Writeback. 451 Dirty or Writeback.
463 452
464 The first can be used independently of the others. The VM can try to 453 The first can be used independently of the others. The VM can try to
465 either write dirty pages in order to clean them, or release clean 454 either write dirty pages in order to clean them, or release clean
466 pages in order to reuse them. To do this it can call the ->writepage 455 pages in order to reuse them. To do this it can call the ->writepage
467 method on dirty pages, and ->releasepage on clean pages with 456 method on dirty pages, and ->releasepage on clean pages with
468 PagePrivate set. Clean pages without PagePrivate and with no external 457 PagePrivate set. Clean pages without PagePrivate and with no external
469 references will be released without notice being given to the 458 references will be released without notice being given to the
470 address_space. 459 address_space.
471 460
472 To achieve this functionality, pages need to be placed on an LRU with 461 To achieve this functionality, pages need to be placed on an LRU with
473 lru_cache_add and mark_page_accessed needs to be called whenever the 462 lru_cache_add and mark_page_accessed needs to be called whenever the
474 page is used. 463 page is used.
475 464
476 Pages are normally kept in a radix tree indexed by ->index. This tree 465 Pages are normally kept in a radix tree indexed by ->index. This tree
477 maintains information about the PG_Dirty and PG_Writeback status of 466 maintains information about the PG_Dirty and PG_Writeback status of
478 each page, so that pages with either of these flags can be found 467 each page, so that pages with either of these flags can be found
479 quickly. 468 quickly.
480 469
481 The Dirty tag is primarily used by mpage_writepages - the default 470 The Dirty tag is primarily used by mpage_writepages - the default
482 ->writepages method. It uses the tag to find dirty pages to call 471 ->writepages method. It uses the tag to find dirty pages to call
483 ->writepage on. If mpage_writepages is not used (i.e. the address_space 472 ->writepage on. If mpage_writepages is not used (i.e. the address_space
484 provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is 473 provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is
485 almost unused. write_inode_now and sync_inode do use it (through 474 almost unused. write_inode_now and sync_inode do use it (through
486 __sync_single_inode) to check if ->writepages has been successful in 475 __sync_single_inode) to check if ->writepages has been successful in
487 writing out the whole address_space. 476 writing out the whole address_space.
488 477
489 The Writeback tag is used by filemap*wait* and sync_page* functions, 478 The Writeback tag is used by filemap*wait* and sync_page* functions,
490 via wait_on_page_writeback_range, to wait for all writeback to 479 via wait_on_page_writeback_range, to wait for all writeback to
491 complete. While waiting, ->sync_page (if defined) will be called on 480 complete. While waiting, ->sync_page (if defined) will be called on
492 each page that is found to require writeback. 481 each page that is found to require writeback.
493 482
494 An address_space handler may attach extra information to a page, 483 An address_space handler may attach extra information to a page,
495 typically using the 'private' field in the 'struct page'. If such 484 typically using the 'private' field in the 'struct page'. If such
496 information is attached, the PG_Private flag should be set. This will 485 information is attached, the PG_Private flag should be set. This will
497 cause various VM routines to make extra calls into the address_space 486 cause various VM routines to make extra calls into the address_space
498 handler to deal with that data. 487 handler to deal with that data.
499 488
500 An address space acts as an intermediary between storage and 489 An address space acts as an intermediary between storage and
501 application. Data is read into the address space a whole page at a 490 application. Data is read into the address space a whole page at a
502 time, and provided to the application either by copying of the page, 491 time, and provided to the application either by copying of the page,
503 or by memory-mapping the page. 492 or by memory-mapping the page.
504 Data is written into the address space by the application, and then 493 Data is written into the address space by the application, and then
505 written back to storage, typically in whole pages; however, the 494 written back to storage, typically in whole pages; however, the
506 address_space has finer control of write sizes. 495 address_space has finer control of write sizes.
507 496
508 The read process essentially only requires 'readpage'. The write 497 The read process essentially only requires 'readpage'. The write
509 process is more complicated and uses prepare_write/commit_write or 498 process is more complicated and uses prepare_write/commit_write or
510 set_page_dirty to write data into the address_space, and writepage, 499 set_page_dirty to write data into the address_space, and writepage,
511 sync_page, and writepages to write back data to storage. 500 sync_page, and writepages to write back data to storage.
512 501
513 Adding and removing pages to/from an address_space is protected by the 502 Adding and removing pages to/from an address_space is protected by the
514 inode's i_mutex. 503 inode's i_mutex.
515 504
516 When data is written to a page, the PG_Dirty flag should be set. It 505 When data is written to a page, the PG_Dirty flag should be set. It
517 typically remains set until writepage asks for it to be written. This 506 typically remains set until writepage asks for it to be written. This
518 should clear PG_Dirty and set PG_Writeback. It can be actually 507 should clear PG_Dirty and set PG_Writeback. It can be actually
519 written at any point after PG_Dirty is clear. Once it is known to be 508 written at any point after PG_Dirty is clear. Once it is known to be
520 safe, PG_Writeback is cleared. 509 safe, PG_Writeback is cleared.
521 510
522 Writeback makes use of a writeback_control structure... 511 Writeback makes use of a writeback_control structure...
523 512
524 struct address_space_operations 513 struct address_space_operations
525 ------------------------------- 514 -------------------------------
526 515
527 This describes how the VFS can manipulate mapping of a file to page cache in 516 This describes how the VFS can manipulate mapping of a file to page cache in
528 your filesystem. As of kernel 2.6.22, the following members are defined: 517 your filesystem. As of kernel 2.6.22, the following members are defined:
529 518
530 struct address_space_operations { 519 struct address_space_operations {
531 int (*writepage)(struct page *page, struct writeback_control *wbc); 520 int (*writepage)(struct page *page, struct writeback_control *wbc);
532 int (*readpage)(struct file *, struct page *); 521 int (*readpage)(struct file *, struct page *);
533 int (*sync_page)(struct page *); 522 int (*sync_page)(struct page *);
534 int (*writepages)(struct address_space *, struct writeback_control *); 523 int (*writepages)(struct address_space *, struct writeback_control *);
535 int (*set_page_dirty)(struct page *page); 524 int (*set_page_dirty)(struct page *page);
536 int (*readpages)(struct file *filp, struct address_space *mapping, 525 int (*readpages)(struct file *filp, struct address_space *mapping,
537 struct list_head *pages, unsigned nr_pages); 526 struct list_head *pages, unsigned nr_pages);
538 int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); 527 int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
539 int (*commit_write)(struct file *, struct page *, unsigned, unsigned); 528 int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
540 int (*write_begin)(struct file *, struct address_space *mapping, 529 int (*write_begin)(struct file *, struct address_space *mapping,
541 loff_t pos, unsigned len, unsigned flags, 530 loff_t pos, unsigned len, unsigned flags,
542 struct page **pagep, void **fsdata); 531 struct page **pagep, void **fsdata);
543 int (*write_end)(struct file *, struct address_space *mapping, 532 int (*write_end)(struct file *, struct address_space *mapping,
544 loff_t pos, unsigned len, unsigned copied, 533 loff_t pos, unsigned len, unsigned copied,
545 struct page *page, void *fsdata); 534 struct page *page, void *fsdata);
546 sector_t (*bmap)(struct address_space *, sector_t); 535 sector_t (*bmap)(struct address_space *, sector_t);
547 int (*invalidatepage) (struct page *, unsigned long); 536 int (*invalidatepage) (struct page *, unsigned long);
548 int (*releasepage) (struct page *, int); 537 int (*releasepage) (struct page *, int);
549 ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, 538 ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
550 loff_t offset, unsigned long nr_segs); 539 loff_t offset, unsigned long nr_segs);
551 struct page* (*get_xip_page)(struct address_space *, sector_t, 540 struct page* (*get_xip_page)(struct address_space *, sector_t,
552 int); 541 int);
553 /* migrate the contents of a page to the specified target */ 542 /* migrate the contents of a page to the specified target */
554 int (*migratepage) (struct page *, struct page *); 543 int (*migratepage) (struct page *, struct page *);
555 int (*launder_page) (struct page *); 544 int (*launder_page) (struct page *);
556 }; 545 };
557 546
558 writepage: called by the VM to write a dirty page to backing store. 547 writepage: called by the VM to write a dirty page to backing store.
559 This may happen for data integrity reasons (i.e. 'sync'), or 548 This may happen for data integrity reasons (i.e. 'sync'), or
560 to free up memory (flush). The difference can be seen in 549 to free up memory (flush). The difference can be seen in
561 wbc->sync_mode. 550 wbc->sync_mode.
562 The PG_Dirty flag has been cleared and PageLocked is true. 551 The PG_Dirty flag has been cleared and PageLocked is true.
563 writepage should start writeout, should set PG_Writeback, 552 writepage should start writeout, should set PG_Writeback,
564 and should make sure the page is unlocked, either synchronously 553 and should make sure the page is unlocked, either synchronously
565 or asynchronously when the write operation completes. 554 or asynchronously when the write operation completes.
566 555
567 If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to 556 If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
568 try too hard if there are problems, and may choose to write out 557 try too hard if there are problems, and may choose to write out
569 other pages from the mapping if that is easier (e.g. due to 558 other pages from the mapping if that is easier (e.g. due to
570 internal dependencies). If it chooses not to start writeout, it 559 internal dependencies). If it chooses not to start writeout, it
571 should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep 560 should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
572 calling ->writepage on that page. 561 calling ->writepage on that page.
573 562
574 See the file "Locking" for more details. 563 See the file "Locking" for more details.
575 564
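For a simple block-based filesystem, ->writepage can often be a thin wrapper around the generic buffer-layer helper. A minimal sketch, assuming a hypothetical "foofs" with its own foofs_get_block() block-mapping routine and <linux/buffer_head.h> included:

        static int foofs_writepage(struct page *page, struct writeback_control *wbc)
        {
                /* block_write_full_page() maps the page with foofs_get_block(),
                 * sets PG_Writeback, submits the I/O and unlocks the page. */
                return block_write_full_page(page, foofs_get_block, wbc);
        }
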
576 readpage: called by the VM to read a page from backing store. 565 readpage: called by the VM to read a page from backing store.
577 The page will be Locked when readpage is called, and should be 566 The page will be Locked when readpage is called, and should be
578 unlocked and marked uptodate once the read completes. 567 unlocked and marked uptodate once the read completes.
579 If ->readpage discovers that it needs to unlock the page for 568 If ->readpage discovers that it needs to unlock the page for
580 some reason, it can do so, and then return AOP_TRUNCATED_PAGE. 569 some reason, it can do so, and then return AOP_TRUNCATED_PAGE.
581 In this case, the page will be relocated, relocked and if 570 In this case, the page will be relocated, relocked and if
582 that all succeeds, ->readpage will be called again. 571 that all succeeds, ->readpage will be called again.
583 572
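Block-based filesystems can usually lean on the mpage helpers here. A sketch under the same hypothetical foofs_get_block() assumption (needs <linux/mpage.h>):

        static int foofs_readpage(struct file *file, struct page *page)
        {
                /* mpage_readpage() builds and submits a bio for the page; the
                 * completion path marks the page uptodate and unlocks it. */
                return mpage_readpage(page, foofs_get_block);
        }
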
584 sync_page: called by the VM to notify the backing store to perform all 573 sync_page: called by the VM to notify the backing store to perform all
585 queued I/O operations for a page. I/O operations for other pages 574 queued I/O operations for a page. I/O operations for other pages
586 associated with this address_space object may also be performed. 575 associated with this address_space object may also be performed.
587 576
588 This function is optional and is called only for pages with 577 This function is optional and is called only for pages with
589 PG_Writeback set while waiting for the writeback to complete. 578 PG_Writeback set while waiting for the writeback to complete.
590 579
591 writepages: called by the VM to write out pages associated with the 580 writepages: called by the VM to write out pages associated with the
592 address_space object. If wbc->sync_mode is WB_SYNC_ALL, then 581 address_space object. If wbc->sync_mode is WB_SYNC_ALL, then
593 the writeback_control will specify a range of pages that must be 582 the writeback_control will specify a range of pages that must be
594 written out. If it is WB_SYNC_NONE, then an nr_to_write is given 583 written out. If it is WB_SYNC_NONE, then an nr_to_write is given
595 and that many pages should be written if possible. 584 and that many pages should be written if possible.
596 If no ->writepages is given, then mpage_writepages is used 585 If no ->writepages is given, then mpage_writepages is used
597 instead. This will choose pages from the address space that are 586 instead. This will choose pages from the address space that are
598 tagged as DIRTY and will pass them to ->writepage. 587 tagged as DIRTY and will pass them to ->writepage.
599 588
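Again, a block-based filesystem will often hand the whole job to the mpage helper; a sketch with the hypothetical foofs_get_block():

        static int foofs_writepages(struct address_space *mapping,
                                    struct writeback_control *wbc)
        {
                /* Walks the dirty pages selected by wbc and writes them out
                 * in as few bios as possible. */
                return mpage_writepages(mapping, wbc, foofs_get_block);
        }
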
600 set_page_dirty: called by the VM to set a page dirty. 589 set_page_dirty: called by the VM to set a page dirty.
601 This is particularly needed if an address space attaches 590 This is particularly needed if an address space attaches
602 private data to a page, and that data needs to be updated when 591 private data to a page, and that data needs to be updated when
603 a page is dirtied. This is called, for example, when a memory 592 a page is dirtied. This is called, for example, when a memory
604 mapped page gets modified. 593 mapped page gets modified.
605 If defined, it should set the PageDirty flag, and the 594 If defined, it should set the PageDirty flag, and the
606 PAGECACHE_TAG_DIRTY tag in the radix tree. 595 PAGECACHE_TAG_DIRTY tag in the radix tree.
607 596
608 readpages: called by the VM to read pages associated with the address_space 597 readpages: called by the VM to read pages associated with the address_space
609 object. This is essentially just a vector version of 598 object. This is essentially just a vector version of
610 readpage. Instead of just one page, several pages are 599 readpage. Instead of just one page, several pages are
611 requested. 600 requested.
612 readpages is only used for read-ahead, so read errors are 601 readpages is only used for read-ahead, so read errors are
613 ignored. If anything goes wrong, feel free to give up. 602 ignored. If anything goes wrong, feel free to give up.
614 603
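A typical block-based implementation, again hypothetical, simply forwards to mpage_readpages():

        static int foofs_readpages(struct file *file, struct address_space *mapping,
                                   struct list_head *pages, unsigned nr_pages)
        {
                /* Read-ahead only: any error here is silently ignored. */
                return mpage_readpages(mapping, pages, nr_pages, foofs_get_block);
        }
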
615 prepare_write: called by the generic write path in VM to set up a write 604 prepare_write: called by the generic write path in VM to set up a write
616 request for a page. This indicates to the address space that 605 request for a page. This indicates to the address space that
617 the given range of bytes is about to be written. The 606 the given range of bytes is about to be written. The
618 address_space should check that the write will be able to 607 address_space should check that the write will be able to
619 complete, by allocating space if necessary and doing any other 608 complete, by allocating space if necessary and doing any other
620 internal housekeeping. If the write will update parts of 609 internal housekeeping. If the write will update parts of
621 any basic-blocks on storage, then those blocks should be 610 any basic-blocks on storage, then those blocks should be
622 pre-read (if they haven't been read already) so that the 611 pre-read (if they haven't been read already) so that the
623 updated blocks can be written out properly. 612 updated blocks can be written out properly.
624 The page will be locked. 613 The page will be locked.
625 614
626 Note: the page _must not_ be marked uptodate in this function 615 Note: the page _must not_ be marked uptodate in this function
627 (or anywhere else) unless it actually is uptodate right now. As 616 (or anywhere else) unless it actually is uptodate right now. As
628 soon as a page is marked uptodate, it is possible for a concurrent 617 soon as a page is marked uptodate, it is possible for a concurrent
629 read(2) to copy it to userspace. 618 read(2) to copy it to userspace.
630 619
631 commit_write: If prepare_write succeeds, new data will be copied 620 commit_write: If prepare_write succeeds, new data will be copied
632 into the page and then commit_write will be called. It will 621 into the page and then commit_write will be called. It will
633 typically update the size of the file (if appropriate) and 622 typically update the size of the file (if appropriate) and
634 mark the inode as dirty, and do any other related housekeeping 623 mark the inode as dirty, and do any other related housekeeping
635 operations. It should avoid returning an error if possible - 624 operations. It should avoid returning an error if possible -
636 errors should have been handled by prepare_write. 625 errors should have been handled by prepare_write.
637 626
638 write_begin: This is intended as a replacement for prepare_write. The 627 write_begin: This is intended as a replacement for prepare_write. The
639 key differences are that: 628 key differences are that:
640 - it returns a locked page (in *pagep) rather than being 629 - it returns a locked page (in *pagep) rather than being
641 given a pre-locked page; 630 given a pre-locked page;
642 - it must be able to cope with short writes (where the 631 - it must be able to cope with short writes (where the
643 length passed to write_begin is greater than the number 632 length passed to write_begin is greater than the number
644 of bytes copied into the page). 633 of bytes copied into the page).
645 634
646 Called by the generic buffered write code to ask the filesystem to 635 Called by the generic buffered write code to ask the filesystem to
647 prepare to write len bytes at the given offset in the file. The 636 prepare to write len bytes at the given offset in the file. The
648 address_space should check that the write will be able to complete, 637 address_space should check that the write will be able to complete,
649 by allocating space if necessary and doing any other internal 638 by allocating space if necessary and doing any other internal
650 housekeeping. If the write will update parts of any basic-blocks on 639 housekeeping. If the write will update parts of any basic-blocks on
651 storage, then those blocks should be pre-read (if they haven't been 640 storage, then those blocks should be pre-read (if they haven't been
652 read already) so that the updated blocks can be written out properly. 641 read already) so that the updated blocks can be written out properly.
653 642
654 The filesystem must return the locked pagecache page for the specified 643 The filesystem must return the locked pagecache page for the specified
655 offset, in *pagep, for the caller to write into. 644 offset, in *pagep, for the caller to write into.
656 645
657 flags is a field for AOP_FLAG_xxx flags, described in 646 flags is a field for AOP_FLAG_xxx flags, described in
658 include/linux/fs.h. 647 include/linux/fs.h.
659 648
660 A void * may be returned in fsdata, which then gets passed into 649 A void * may be returned in fsdata, which then gets passed into
661 write_end. 650 write_end.
662 651
663 Returns 0 on success; < 0 on failure (which is the error code), in 652 Returns 0 on success; < 0 on failure (which is the error code), in
664 which case write_end is not called. 653 which case write_end is not called.
665 654
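For filesystems built on the buffer layer, write_begin is commonly a wrapper around block_write_begin(), which finds and locks the page, instantiates buffers via the get_block callback and pre-reads any partially overwritten blocks. A sketch, again assuming the hypothetical foofs_get_block():

        static int foofs_write_begin(struct file *file, struct address_space *mapping,
                                     loff_t pos, unsigned len, unsigned flags,
                                     struct page **pagep, void **fsdata)
        {
                *pagep = NULL;  /* let block_write_begin() find and lock the page */
                return block_write_begin(file, mapping, pos, len, flags,
                                         pagep, fsdata, foofs_get_block);
        }
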
666 write_end: After a successful write_begin, and data copy, write_end must 655 write_end: After a successful write_begin, and data copy, write_end must
667 be called. len is the original len passed to write_begin, and copied 656 be called. len is the original len passed to write_begin, and copied
668 is the amount that was able to be copied (copied == len is always true 657 is the amount that was able to be copied (copied == len is always true
669 if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag). 658 if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag).
670 659
671 The filesystem must take care of unlocking the page and releasing its 660 The filesystem must take care of unlocking the page and releasing its
672 refcount, and updating i_size. 661 refcount, and updating i_size.
673 662
674 Returns < 0 on failure, otherwise the number of bytes (<= 'copied') 663 Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
675 that were able to be copied into pagecache. 664 that were able to be copied into pagecache.
676 665
677 bmap: called by the VFS to map a logical block offset within object to 666 bmap: called by the VFS to map a logical block offset within object to
678 physical block number. This method is used by the FIBMAP 667 physical block number. This method is used by the FIBMAP
679 ioctl and for working with swap-files. To be able to swap to 668 ioctl and for working with swap-files. To be able to swap to
680 a file, the file must have a stable mapping to a block 669 a file, the file must have a stable mapping to a block
681 device. The swap system does not go through the filesystem 670 device. The swap system does not go through the filesystem
682 but instead uses bmap to find out where the blocks in the file 671 but instead uses bmap to find out where the blocks in the file
683 are and uses those addresses directly. 672 are and uses those addresses directly.
684 673
685 674
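A filesystem using the buffer-layer helpers throughout can wire most of these methods straight to the generic implementations. A hedged sketch of how the hypothetical foofs might assemble its table, using generic_write_end() for write_end and generic_block_bmap() for bmap:

        static sector_t foofs_bmap(struct address_space *mapping, sector_t block)
        {
                return generic_block_bmap(mapping, block, foofs_get_block);
        }

        static const struct address_space_operations foofs_aops = {
                .readpage       = foofs_readpage,
                .readpages      = foofs_readpages,
                .writepage      = foofs_writepage,
                .writepages     = foofs_writepages,
                .write_begin    = foofs_write_begin,
                .write_end      = generic_write_end,
                .bmap           = foofs_bmap,
        };
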
686 invalidatepage: If a page has PagePrivate set, then invalidatepage 675 invalidatepage: If a page has PagePrivate set, then invalidatepage
687 will be called when part or all of the page is to be removed 676 will be called when part or all of the page is to be removed
688 from the address space. This generally corresponds to either a 677 from the address space. This generally corresponds to either a
689 truncation or a complete invalidation of the address space 678 truncation or a complete invalidation of the address space
690 (in the latter case 'offset' will always be 0). 679 (in the latter case 'offset' will always be 0).
691 Any private data associated with the page should be updated 680 Any private data associated with the page should be updated
692 to reflect this truncation. If offset is 0, then 681 to reflect this truncation. If offset is 0, then
693 the private data should be released, because the page 682 the private data should be released, because the page
694 must be able to be completely discarded. This may be done by 683 must be able to be completely discarded. This may be done by
695 calling the ->releasepage function, but in this case the 684 calling the ->releasepage function, but in this case the
696 release MUST succeed. 685 release MUST succeed.
697 686
698 releasepage: releasepage is called on PagePrivate pages to indicate 687 releasepage: releasepage is called on PagePrivate pages to indicate
699 that the page should be freed if possible. ->releasepage 688 that the page should be freed if possible. ->releasepage
700 should remove any private data from the page and clear the 689 should remove any private data from the page and clear the
701 PagePrivate flag. It may also remove the page from the 690 PagePrivate flag. It may also remove the page from the
702 address_space. If this fails for some reason, it may indicate 691 address_space. If this fails for some reason, it may indicate
703 failure with a 0 return value. 692 failure with a 0 return value.
704 This is used in two distinct though related cases. The first 693 This is used in two distinct though related cases. The first
705 is when the VM finds a clean page with no active users and 694 is when the VM finds a clean page with no active users and
706 wants to make it a free page. If ->releasepage succeeds, the 695 wants to make it a free page. If ->releasepage succeeds, the
707 page will be removed from the address_space and become free. 696 page will be removed from the address_space and become free.
708 697
709 The second case is when a request has been made to invalidate 698 The second case is when a request has been made to invalidate
710 some or all pages in an address_space. This can happen 699 some or all pages in an address_space. This can happen
711 through the fadvise(POSIX_FADV_DONTNEED) system call or by the 700 through the fadvise(POSIX_FADV_DONTNEED) system call or by the
712 filesystem explicitly requesting it as nfs and 9p do (when 701 filesystem explicitly requesting it as nfs and 9p do (when
713 they believe the cache may be out of date with storage) by 702 they believe the cache may be out of date with storage) by
714 calling invalidate_inode_pages2(). 703 calling invalidate_inode_pages2().
715 If the filesystem makes such a call, and needs to be certain 704 If the filesystem makes such a call, and needs to be certain
716 that all pages are invalidated, then its releasepage will 705 that all pages are invalidated, then its releasepage will
717 need to ensure this. Possibly it can clear the PageUptodate 706 need to ensure this. Possibly it can clear the PageUptodate
718 bit if it cannot free private data yet. 707 bit if it cannot free private data yet.
719 708
720 direct_IO: called by the generic read/write routines to perform 709 direct_IO: called by the generic read/write routines to perform
721 direct_IO - that is IO requests which bypass the page cache 710 direct_IO - that is IO requests which bypass the page cache
722 and transfer data directly between the storage and the 711 and transfer data directly between the storage and the
723 application's address space. 712 application's address space.
724 713
725 get_xip_page: called by the VM to translate a block number to a page. 714 get_xip_page: called by the VM to translate a block number to a page.
726 The page is valid until the corresponding filesystem is unmounted. 715 The page is valid until the corresponding filesystem is unmounted.
727 Filesystems that want to use execute-in-place (XIP) need to implement 716 Filesystems that want to use execute-in-place (XIP) need to implement
728 it. An example implementation can be found in fs/ext2/xip.c. 717 it. An example implementation can be found in fs/ext2/xip.c.
729 718
730 migrate_page: This is used to compact the physical memory usage. 719 migrate_page: This is used to compact the physical memory usage.
731 If the VM wants to relocate a page (maybe off a memory card 720 If the VM wants to relocate a page (maybe off a memory card
732 that is signalling imminent failure) it will pass a new page 721 that is signalling imminent failure) it will pass a new page
733 and an old page to this function. migrate_page should 722 and an old page to this function. migrate_page should
734 transfer any private data across and update any references 723 transfer any private data across and update any references
735 that it has to the page. 724 that it has to the page.
736 725
737 launder_page: Called before freeing a page - it writes back the dirty page. To 726 launder_page: Called before freeing a page - it writes back the dirty page. To
738 prevent redirtying the page, it is kept locked during the whole 727 prevent redirtying the page, it is kept locked during the whole
739 operation. 728 operation.
740 729
741 The File Object 730 The File Object
742 =============== 731 ===============
743 732
744 A file object represents a file opened by a process. 733 A file object represents a file opened by a process.
745 734
746 735
747 struct file_operations 736 struct file_operations
748 ---------------------- 737 ----------------------
749 738
750 This describes how the VFS can manipulate an open file. As of kernel 739 This describes how the VFS can manipulate an open file. As of kernel
751 2.6.22, the following members are defined: 740 2.6.22, the following members are defined:
752 741
753 struct file_operations { 742 struct file_operations {
754 struct module *owner; 743 struct module *owner;
755 loff_t (*llseek) (struct file *, loff_t, int); 744 loff_t (*llseek) (struct file *, loff_t, int);
756 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); 745 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
757 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); 746 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
758 ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); 747 ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
759 ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); 748 ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
760 int (*readdir) (struct file *, void *, filldir_t); 749 int (*readdir) (struct file *, void *, filldir_t);
761 unsigned int (*poll) (struct file *, struct poll_table_struct *); 750 unsigned int (*poll) (struct file *, struct poll_table_struct *);
762 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); 751 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
763 long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); 752 long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
764 long (*compat_ioctl) (struct file *, unsigned int, unsigned long); 753 long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
765 int (*mmap) (struct file *, struct vm_area_struct *); 754 int (*mmap) (struct file *, struct vm_area_struct *);
766 int (*open) (struct inode *, struct file *); 755 int (*open) (struct inode *, struct file *);
767 int (*flush) (struct file *); 756 int (*flush) (struct file *);
768 int (*release) (struct inode *, struct file *); 757 int (*release) (struct inode *, struct file *);
769 int (*fsync) (struct file *, struct dentry *, int datasync); 758 int (*fsync) (struct file *, struct dentry *, int datasync);
770 int (*aio_fsync) (struct kiocb *, int datasync); 759 int (*aio_fsync) (struct kiocb *, int datasync);
771 int (*fasync) (int, struct file *, int); 760 int (*fasync) (int, struct file *, int);
772 int (*lock) (struct file *, int, struct file_lock *); 761 int (*lock) (struct file *, int, struct file_lock *);
773 ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); 762 ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
774 ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); 763 ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
775 ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); 764 ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
776 ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); 765 ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
777 unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); 766 unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
778 int (*check_flags)(int); 767 int (*check_flags)(int);
779 int (*dir_notify)(struct file *filp, unsigned long arg); 768 int (*dir_notify)(struct file *filp, unsigned long arg);
780 int (*flock) (struct file *, int, struct file_lock *); 769 int (*flock) (struct file *, int, struct file_lock *);
781 ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int); 770 ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int);
782 ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int); 771 ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int);
783 }; 772 };
784 773
785 Again, all methods are called without any locks being held, unless 774 Again, all methods are called without any locks being held, unless
786 otherwise noted. 775 otherwise noted.
787 776
788 llseek: called when the VFS needs to move the file position index 777 llseek: called when the VFS needs to move the file position index
789 778
790 read: called by read(2) and related system calls 779 read: called by read(2) and related system calls
791 780
792 aio_read: called by io_submit(2) and other asynchronous I/O operations 781 aio_read: called by io_submit(2) and other asynchronous I/O operations
793 782
794 write: called by write(2) and related system calls 783 write: called by write(2) and related system calls
795 784
796 aio_write: called by io_submit(2) and other asynchronous I/O operations 785 aio_write: called by io_submit(2) and other asynchronous I/O operations
797 786
798 readdir: called when the VFS needs to read the directory contents 787 readdir: called when the VFS needs to read the directory contents
799 788
800 poll: called by the VFS when a process wants to check if there is 789 poll: called by the VFS when a process wants to check if there is
801 activity on this file and (optionally) go to sleep until there 790 activity on this file and (optionally) go to sleep until there
802 is activity. Called by the select(2) and poll(2) system calls 791 is activity. Called by the select(2) and poll(2) system calls
803 792
804 ioctl: called by the ioctl(2) system call 793 ioctl: called by the ioctl(2) system call
805 794
806 unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not 795 unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
807 require the BKL should use this method instead of the ioctl() above. 796 require the BKL should use this method instead of the ioctl() above.
808 797
809 compat_ioctl: called by the ioctl(2) system call when 32 bit system calls 798 compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
810 are used on 64 bit kernels. 799 are used on 64 bit kernels.
811 800
812 mmap: called by the mmap(2) system call 801 mmap: called by the mmap(2) system call
813 802
814 open: called by the VFS when an inode should be opened. When the VFS 803 open: called by the VFS when an inode should be opened. When the VFS
815 opens a file, it creates a new "struct file". It then calls the 804 opens a file, it creates a new "struct file". It then calls the
816 open method for the newly allocated file structure. You might 805 open method for the newly allocated file structure. You might
817 think that the open method really belongs in 806 think that the open method really belongs in
818 "struct inode_operations", and you may be right. I think it's 807 "struct inode_operations", and you may be right. I think it's
819 done the way it is because it makes filesystems simpler to 808 done the way it is because it makes filesystems simpler to
820 implement. The open() method is a good place to initialize the 809 implement. The open() method is a good place to initialize the
821 "private_data" member in the file structure if you want to point 810 "private_data" member in the file structure if you want to point
822 to a device structure. 811 to a device structure.
823 812
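A common pattern for a character device driver, sketched here with hypothetical names, is to recover the driver's per-device structure from the inode and stash it in private_data for the later methods to use:

        static int foo_open(struct inode *inode, struct file *file)
        {
                /* assumes a hypothetical foo_device that embeds a struct cdev
                 * member called 'cdev' */
                struct foo_device *dev;

                dev = container_of(inode->i_cdev, struct foo_device, cdev);
                file->private_data = dev;   /* read/write/ioctl can now find it */
                return 0;
        }
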
824 flush: called by the close(2) system call to flush a file 813 flush: called by the close(2) system call to flush a file
825 814
826 release: called when the last reference to an open file is closed 815 release: called when the last reference to an open file is closed
827 816
828 fsync: called by the fsync(2) system call 817 fsync: called by the fsync(2) system call
829 818
830 fasync: called by the fcntl(2) system call when asynchronous 819 fasync: called by the fcntl(2) system call when asynchronous
831 (non-blocking) mode is enabled for a file 820 (non-blocking) mode is enabled for a file
832 821
833 lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW 822 lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
834 commands 823 commands
835 824
836 readv: called by the readv(2) system call 825 readv: called by the readv(2) system call
837 826
838 writev: called by the writev(2) system call 827 writev: called by the writev(2) system call
839 828
840 sendfile: called by the sendfile(2) system call 829 sendfile: called by the sendfile(2) system call
841 830
842 get_unmapped_area: called by the mmap(2) system call 831 get_unmapped_area: called by the mmap(2) system call
843 832
844 check_flags: called by the fcntl(2) system call for F_SETFL command 833 check_flags: called by the fcntl(2) system call for F_SETFL command
845 834
846 dir_notify: called by the fcntl(2) system call for F_NOTIFY command 835 dir_notify: called by the fcntl(2) system call for F_NOTIFY command
847 836
848 flock: called by the flock(2) system call 837 flock: called by the flock(2) system call
849 838
850 splice_write: called by the VFS to splice data from a pipe to a file. This 839 splice_write: called by the VFS to splice data from a pipe to a file. This
851 method is used by the splice(2) system call 840 method is used by the splice(2) system call
852 841
853 splice_read: called by the VFS to splice data from file to a pipe. This 842 splice_read: called by the VFS to splice data from file to a pipe. This
854 method is used by the splice(2) system call 843 method is used by the splice(2) system call
855 844
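For a filesystem whose regular files go through the page cache, most of these methods can be the generic ones. A sketch of what the hypothetical foofs might use, with helper names as of this kernel era (foofs_fsync is assumed to be filesystem-specific):

        const struct file_operations foofs_file_operations = {
                .llseek         = generic_file_llseek,
                .read           = do_sync_read,
                .write          = do_sync_write,
                .aio_read       = generic_file_aio_read,
                .aio_write      = generic_file_aio_write,
                .mmap           = generic_file_mmap,
                .fsync          = foofs_fsync,
                .splice_read    = generic_file_splice_read,
        };
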
856 Note that the file operations are implemented by the specific 845 Note that the file operations are implemented by the specific
857 filesystem in which the inode resides. When opening a device node 846 filesystem in which the inode resides. When opening a device node
858 (character or block special) most filesystems will call special 847 (character or block special) most filesystems will call special
859 support routines in the VFS which will locate the required device 848 support routines in the VFS which will locate the required device
860 driver information. These support routines replace the filesystem file 849 driver information. These support routines replace the filesystem file
861 operations with those for the device driver, and then proceed to call 850 operations with those for the device driver, and then proceed to call
862 the new open() method for the file. This is how opening a device file 851 the new open() method for the file. This is how opening a device file
863 in the filesystem eventually ends up calling the device driver open() 852 in the filesystem eventually ends up calling the device driver open()
864 method. 853 method.
865 854
866 855
867 Directory Entry Cache (dcache) 856 Directory Entry Cache (dcache)
868 ============================== 857 ==============================
869 858
870 859
871 struct dentry_operations 860 struct dentry_operations
872 ------------------------ 861 ------------------------
873 862
874 This describes how a filesystem can overload the standard dentry 863 This describes how a filesystem can overload the standard dentry
875 operations. Dentries and the dcache are the domain of the VFS and the 864 operations. Dentries and the dcache are the domain of the VFS and the
876 individual filesystem implementations. Device drivers have no business 865 individual filesystem implementations. Device drivers have no business
877 here. These methods may be set to NULL, as they are either optional or 866 here. These methods may be set to NULL, as they are either optional or
878 the VFS uses a default. As of kernel 2.6.22, the following members are 867 the VFS uses a default. As of kernel 2.6.22, the following members are
879 defined: 868 defined:
880 869
881 struct dentry_operations { 870 struct dentry_operations {
882 int (*d_revalidate)(struct dentry *, struct nameidata *); 871 int (*d_revalidate)(struct dentry *, struct nameidata *);
883 int (*d_hash) (struct dentry *, struct qstr *); 872 int (*d_hash) (struct dentry *, struct qstr *);
884 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); 873 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
885 int (*d_delete)(struct dentry *); 874 int (*d_delete)(struct dentry *);
886 void (*d_release)(struct dentry *); 875 void (*d_release)(struct dentry *);
887 void (*d_iput)(struct dentry *, struct inode *); 876 void (*d_iput)(struct dentry *, struct inode *);
888 char *(*d_dname)(struct dentry *, char *, int); 877 char *(*d_dname)(struct dentry *, char *, int);
889 }; 878 };
890 879
891 d_revalidate: called when the VFS needs to revalidate a dentry. This 880 d_revalidate: called when the VFS needs to revalidate a dentry. This
892 is called whenever a name look-up finds a dentry in the 881 is called whenever a name look-up finds a dentry in the
893 dcache. Most filesystems leave this as NULL, because all their 882 dcache. Most filesystems leave this as NULL, because all their
894 dentries in the dcache are valid 883 dentries in the dcache are valid
895 884
896 d_hash: called when the VFS adds a dentry to the hash table 885 d_hash: called when the VFS adds a dentry to the hash table
897 886
898 d_compare: called when a dentry should be compared with another 887 d_compare: called when a dentry should be compared with another
899 888
900 d_delete: called when the last reference to a dentry is 889 d_delete: called when the last reference to a dentry is
901 deleted. This means no-one is using the dentry, however it is 890 deleted. This means no-one is using the dentry, however it is
902 still valid and in the dcache 891 still valid and in the dcache
903 892
904 d_release: called when a dentry is really deallocated 893 d_release: called when a dentry is really deallocated
905 894
906 d_iput: called when a dentry loses its inode (just prior to its 895 d_iput: called when a dentry loses its inode (just prior to its
907 being deallocated). The default when this is NULL is that the 896 being deallocated). The default when this is NULL is that the
908 VFS calls iput(). If you define this method, you must call 897 VFS calls iput(). If you define this method, you must call
909 iput() yourself 898 iput() yourself
910 899
911 d_dname: called when the pathname of a dentry should be generated. 900 d_dname: called when the pathname of a dentry should be generated.
912 Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay 901 Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay
913 pathname generation. (Instead of doing it when the dentry is created, 902 pathname generation. (Instead of doing it when the dentry is created,
914 it's done only when the path is needed.) Real filesystems probably 903 it's done only when the path is needed.) Real filesystems probably
915 don't want to use it, because their dentries are present in the global 904 don't want to use it, because their dentries are present in the global
916 dcache hash, so their hash should be an invariant. As no lock is 905 dcache hash, so their hash should be an invariant. As no lock is
917 held, d_dname() should not try to modify the dentry itself, unless 906 held, d_dname() should not try to modify the dentry itself, unless
918 appropriate SMP safety is used. CAUTION: d_path() logic is quite 907 appropriate SMP safety is used. CAUTION: d_path() logic is quite
919 tricky. The correct way to return, for example, "Hello" is to put it 908 tricky. The correct way to return, for example, "Hello" is to put it
920 at the end of the buffer and return a pointer to the first char. 909 at the end of the buffer and return a pointer to the first char.
921 The dynamic_dname() helper function is provided to take care of this. 910 The dynamic_dname() helper function is provided to take care of this.
922 911
923 Example : 912 Example :
924 913
925 static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen) 914 static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
926 { 915 {
927 return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]", 916 return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
928 dentry->d_inode->i_ino); 917 dentry->d_inode->i_ino);
929 } 918 }
930 919
931 Each dentry has a pointer to its parent dentry, as well as a hash list 920 Each dentry has a pointer to its parent dentry, as well as a hash list
932 of child dentries. Child dentries are basically like files in a 921 of child dentries. Child dentries are basically like files in a
933 directory. 922 directory.
934 923
935 924
936 Directory Entry Cache API 925 Directory Entry Cache API
937 -------------------------- 926 --------------------------
938 927
939 There are a number of functions defined which permit a filesystem to 928 There are a number of functions defined which permit a filesystem to
940 manipulate dentries: 929 manipulate dentries:
941 930
942 dget: open a new handle for an existing dentry (this just increments 931 dget: open a new handle for an existing dentry (this just increments
943 the usage count) 932 the usage count)
944 933
945 dput: close a handle for a dentry (decrements the usage count). If 934 dput: close a handle for a dentry (decrements the usage count). If
946 the usage count drops to 0, the "d_delete" method is called 935 the usage count drops to 0, the "d_delete" method is called
947 and the dentry is placed on the unused list if the dentry is 936 and the dentry is placed on the unused list if the dentry is
948 still in its parent's hash list. Putting the dentry on the 937 still in its parent's hash list. Putting the dentry on the
949 unused list just means that if the system needs some RAM, it 938 unused list just means that if the system needs some RAM, it
950 goes through the unused list of dentries and deallocates them. 939 goes through the unused list of dentries and deallocates them.
951 If the dentry has already been unhashed and the usage count 940 If the dentry has already been unhashed and the usage count
952 drops to 0, the dentry is deallocated after the 941 drops to 0, the dentry is deallocated after the
953 "d_delete" method is called 942 "d_delete" method is called
954 943
955 d_drop: this unhashes a dentry from its parent's hash list. A 944 d_drop: this unhashes a dentry from its parent's hash list. A
956 subsequent call to dput() will deallocate the dentry if its 945 subsequent call to dput() will deallocate the dentry if its
957 usage count drops to 0 946 usage count drops to 0
958 947
959 d_delete: delete a dentry. If there are no other open references to 948 d_delete: delete a dentry. If there are no other open references to
960 the dentry then the dentry is turned into a negative dentry 949 the dentry then the dentry is turned into a negative dentry
961 (the d_iput() method is called). If there are other 950 (the d_iput() method is called). If there are other
962 references, then d_drop() is called instead 951 references, then d_drop() is called instead
963 952
964 d_add: add a dentry to its parent's hash list and then calls 953 d_add: add a dentry to its parent's hash list and then calls
965 d_instantiate() 954 d_instantiate()
966 955
967 d_instantiate: add a dentry to the alias hash list for the inode and 956 d_instantiate: add a dentry to the alias hash list for the inode and
968 updates the "d_inode" member. The "i_count" member in the 957 updates the "d_inode" member. The "i_count" member in the
969 inode structure should be set/incremented. If the inode 958 inode structure should be set/incremented. If the inode
970 pointer is NULL, the dentry is called a "negative 959 pointer is NULL, the dentry is called a "negative
971 dentry". This function is commonly called when an inode is 960 dentry". This function is commonly called when an inode is
972 created for an existing negative dentry 961 created for an existing negative dentry
973 962
974 d_lookup: look up a dentry given its parent and path name component. 963 d_lookup: look up a dentry given its parent and path name component.
975 It looks up the child of that given name from the dcache 964 It looks up the child of that given name from the dcache
976 hash table. If it is found, the reference count is incremented 965 hash table. If it is found, the reference count is incremented
977 and the dentry is returned. The caller must use dput() 966 and the dentry is returned. The caller must use dput()
978 to free the dentry when it finishes using it. 967 to free the dentry when it finishes using it.
979 968
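To illustrate how d_add() and negative dentries fit together, here is a hedged sketch of a lookup method for the hypothetical foofs; foofs_find_entry() and foofs_iget() stand in for whatever the filesystem actually uses to map a name to an inode number and to read that inode:

        static struct dentry *foofs_lookup(struct inode *dir, struct dentry *dentry,
                                           struct nameidata *nd)
        {
                struct inode *inode = NULL;
                ino_t ino;

                ino = foofs_find_entry(dir, &dentry->d_name);   /* hypothetical */
                if (ino) {
                        inode = foofs_iget(dir->i_sb, ino);     /* hypothetical */
                        if (IS_ERR(inode))
                                return ERR_PTR(PTR_ERR(inode));
                }
                d_add(dentry, inode);   /* a NULL inode makes this a negative dentry */
                return NULL;
        }
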
980 For further information on dentry locking, please refer to the document 969 For further information on dentry locking, please refer to the document
981 Documentation/filesystems/dentry-locking.txt. 970 Documentation/filesystems/dentry-locking.txt.
982 971
983 972
984 Resources 973 Resources
985 ========= 974 =========
986 975
987 (Note some of these resources are not up-to-date with the latest kernel 976 (Note some of these resources are not up-to-date with the latest kernel
988 version.) 977 version.)
989 978
990 Creating Linux virtual filesystems. 2002 979 Creating Linux virtual filesystems. 2002
991 <http://lwn.net/Articles/13325/> 980 <http://lwn.net/Articles/13325/>
992 981
993 The Linux Virtual File-system Layer by Neil Brown. 1999 982 The Linux Virtual File-system Layer by Neil Brown. 1999
994 <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html> 983 <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
995 984
996 A tour of the Linux VFS by Michael K. Johnson. 1996 985 A tour of the Linux VFS by Michael K. Johnson. 1996
997 <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html> 986 <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
998 987
999 A small trail through the Linux kernel by Andries Brouwer. 2001 988 A small trail through the Linux kernel by Andries Brouwer. 2001
1000 <http://www.win.tue.nl/~aeb/linux/vfs/trail.html> 989 <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
1001 990
1 /* 1 /*
2 * linux/fs/inode.c 2 * linux/fs/inode.c
3 * 3 *
4 * (C) 1997 Linus Torvalds 4 * (C) 1997 Linus Torvalds
5 */ 5 */
6 6
7 #include <linux/fs.h> 7 #include <linux/fs.h>
8 #include <linux/mm.h> 8 #include <linux/mm.h>
9 #include <linux/dcache.h> 9 #include <linux/dcache.h>
10 #include <linux/init.h> 10 #include <linux/init.h>
11 #include <linux/quotaops.h> 11 #include <linux/quotaops.h>
12 #include <linux/slab.h> 12 #include <linux/slab.h>
13 #include <linux/writeback.h> 13 #include <linux/writeback.h>
14 #include <linux/module.h> 14 #include <linux/module.h>
15 #include <linux/backing-dev.h> 15 #include <linux/backing-dev.h>
16 #include <linux/wait.h> 16 #include <linux/wait.h>
17 #include <linux/hash.h> 17 #include <linux/hash.h>
18 #include <linux/swap.h> 18 #include <linux/swap.h>
19 #include <linux/security.h> 19 #include <linux/security.h>
20 #include <linux/pagemap.h> 20 #include <linux/pagemap.h>
21 #include <linux/cdev.h> 21 #include <linux/cdev.h>
22 #include <linux/bootmem.h> 22 #include <linux/bootmem.h>
23 #include <linux/inotify.h> 23 #include <linux/inotify.h>
24 #include <linux/mount.h> 24 #include <linux/mount.h>
25 25
26 /* 26 /*
27 * This is needed for the following functions: 27 * This is needed for the following functions:
28 * - inode_has_buffers 28 * - inode_has_buffers
29 * - invalidate_inode_buffers 29 * - invalidate_inode_buffers
30 * - invalidate_bdev 30 * - invalidate_bdev
31 * 31 *
32 * FIXME: remove all knowledge of the buffer layer from this file 32 * FIXME: remove all knowledge of the buffer layer from this file
33 */ 33 */
34 #include <linux/buffer_head.h> 34 #include <linux/buffer_head.h>
35 35
36 /* 36 /*
37 * New inode.c implementation. 37 * New inode.c implementation.
38 * 38 *
39 * This implementation has the basic premise of trying 39 * This implementation has the basic premise of trying
40 * to be extremely low-overhead and SMP-safe, yet be 40 * to be extremely low-overhead and SMP-safe, yet be
41 * simple enough to be "obviously correct". 41 * simple enough to be "obviously correct".
42 * 42 *
43 * Famous last words. 43 * Famous last words.
44 */ 44 */
45 45
46 /* inode dynamic allocation 1999, Andrea Arcangeli <andrea@suse.de> */ 46 /* inode dynamic allocation 1999, Andrea Arcangeli <andrea@suse.de> */
47 47
48 /* #define INODE_PARANOIA 1 */ 48 /* #define INODE_PARANOIA 1 */
49 /* #define INODE_DEBUG 1 */ 49 /* #define INODE_DEBUG 1 */
50 50
51 /* 51 /*
52 * Inode lookup is no longer as critical as it used to be: 52 * Inode lookup is no longer as critical as it used to be:
53 * most of the lookups are going to be through the dcache. 53 * most of the lookups are going to be through the dcache.
54 */ 54 */
55 #define I_HASHBITS i_hash_shift 55 #define I_HASHBITS i_hash_shift
56 #define I_HASHMASK i_hash_mask 56 #define I_HASHMASK i_hash_mask
57 57
58 static unsigned int i_hash_mask __read_mostly; 58 static unsigned int i_hash_mask __read_mostly;
59 static unsigned int i_hash_shift __read_mostly; 59 static unsigned int i_hash_shift __read_mostly;
60 60
61 /* 61 /*
62 * Each inode can be on two separate lists. One is 62 * Each inode can be on two separate lists. One is
63 * the hash list of the inode, used for lookups. The 63 * the hash list of the inode, used for lookups. The
64 * other linked list is the "type" list: 64 * other linked list is the "type" list:
65 * "in_use" - valid inode, i_count > 0, i_nlink > 0 65 * "in_use" - valid inode, i_count > 0, i_nlink > 0
66 * "dirty" - as "in_use" but also dirty 66 * "dirty" - as "in_use" but also dirty
67 * "unused" - valid inode, i_count = 0 67 * "unused" - valid inode, i_count = 0
68 * 68 *
69 * A "dirty" list is maintained for each super block, 69 * A "dirty" list is maintained for each super block,
70 * allowing for low-overhead inode sync() operations. 70 * allowing for low-overhead inode sync() operations.
71 */ 71 */
72 72
73 LIST_HEAD(inode_in_use); 73 LIST_HEAD(inode_in_use);
74 LIST_HEAD(inode_unused); 74 LIST_HEAD(inode_unused);
75 static struct hlist_head *inode_hashtable __read_mostly; 75 static struct hlist_head *inode_hashtable __read_mostly;
76 76
77 /* 77 /*
78 * A simple spinlock to protect the list manipulations. 78 * A simple spinlock to protect the list manipulations.
79 * 79 *
80 * NOTE! You also have to own the lock if you change 80 * NOTE! You also have to own the lock if you change
81 * the i_state of an inode while it is in use.. 81 * the i_state of an inode while it is in use..
82 */ 82 */
83 DEFINE_SPINLOCK(inode_lock); 83 DEFINE_SPINLOCK(inode_lock);
84 84
85 /* 85 /*
86 * iprune_mutex provides exclusion between the kswapd or try_to_free_pages 86 * iprune_mutex provides exclusion between the kswapd or try_to_free_pages
87 * icache shrinking path, and the umount path. Without this exclusion, 87 * icache shrinking path, and the umount path. Without this exclusion,
88 * by the time prune_icache calls iput for the inode whose pages it has 88 * by the time prune_icache calls iput for the inode whose pages it has
89 * been invalidating, or by the time it calls clear_inode & destroy_inode 89 * been invalidating, or by the time it calls clear_inode & destroy_inode
90 * from its final dispose_list, the struct super_block they refer to 90 * from its final dispose_list, the struct super_block they refer to
91 * (for inode->i_sb->s_op) may already have been freed and reused. 91 * (for inode->i_sb->s_op) may already have been freed and reused.
92 */ 92 */
93 static DEFINE_MUTEX(iprune_mutex); 93 static DEFINE_MUTEX(iprune_mutex);
94 94
95 /* 95 /*
96 * Statistics gathering.. 96 * Statistics gathering..
97 */ 97 */
98 struct inodes_stat_t inodes_stat; 98 struct inodes_stat_t inodes_stat;
99 99
100 static struct kmem_cache * inode_cachep __read_mostly; 100 static struct kmem_cache * inode_cachep __read_mostly;
101 101
102 static void wake_up_inode(struct inode *inode) 102 static void wake_up_inode(struct inode *inode)
103 { 103 {
104 /* 104 /*
105 * Prevent speculative execution through spin_unlock(&inode_lock); 105 * Prevent speculative execution through spin_unlock(&inode_lock);
106 */ 106 */
107 smp_mb(); 107 smp_mb();
108 wake_up_bit(&inode->i_state, __I_LOCK); 108 wake_up_bit(&inode->i_state, __I_LOCK);
109 } 109 }
110 110
111 static struct inode *alloc_inode(struct super_block *sb) 111 static struct inode *alloc_inode(struct super_block *sb)
112 { 112 {
113 static const struct address_space_operations empty_aops; 113 static const struct address_space_operations empty_aops;
114 static struct inode_operations empty_iops; 114 static struct inode_operations empty_iops;
115 static const struct file_operations empty_fops; 115 static const struct file_operations empty_fops;
116 struct inode *inode; 116 struct inode *inode;
117 117
118 if (sb->s_op->alloc_inode) 118 if (sb->s_op->alloc_inode)
119 inode = sb->s_op->alloc_inode(sb); 119 inode = sb->s_op->alloc_inode(sb);
120 else 120 else
121 inode = (struct inode *) kmem_cache_alloc(inode_cachep, GFP_KERNEL); 121 inode = (struct inode *) kmem_cache_alloc(inode_cachep, GFP_KERNEL);
122 122
123 if (inode) { 123 if (inode) {
124 struct address_space * const mapping = &inode->i_data; 124 struct address_space * const mapping = &inode->i_data;
125 125
126 inode->i_sb = sb; 126 inode->i_sb = sb;
127 inode->i_blkbits = sb->s_blocksize_bits; 127 inode->i_blkbits = sb->s_blocksize_bits;
128 inode->i_flags = 0; 128 inode->i_flags = 0;
129 atomic_set(&inode->i_count, 1); 129 atomic_set(&inode->i_count, 1);
130 inode->i_op = &empty_iops; 130 inode->i_op = &empty_iops;
131 inode->i_fop = &empty_fops; 131 inode->i_fop = &empty_fops;
132 inode->i_nlink = 1; 132 inode->i_nlink = 1;
133 atomic_set(&inode->i_writecount, 0); 133 atomic_set(&inode->i_writecount, 0);
134 inode->i_size = 0; 134 inode->i_size = 0;
135 inode->i_blocks = 0; 135 inode->i_blocks = 0;
136 inode->i_bytes = 0; 136 inode->i_bytes = 0;
137 inode->i_generation = 0; 137 inode->i_generation = 0;
138 #ifdef CONFIG_QUOTA 138 #ifdef CONFIG_QUOTA
139 memset(&inode->i_dquot, 0, sizeof(inode->i_dquot)); 139 memset(&inode->i_dquot, 0, sizeof(inode->i_dquot));
140 #endif 140 #endif
141 inode->i_pipe = NULL; 141 inode->i_pipe = NULL;
142 inode->i_bdev = NULL; 142 inode->i_bdev = NULL;
143 inode->i_cdev = NULL; 143 inode->i_cdev = NULL;
144 inode->i_rdev = 0; 144 inode->i_rdev = 0;
145 inode->dirtied_when = 0; 145 inode->dirtied_when = 0;
146 if (security_inode_alloc(inode)) { 146 if (security_inode_alloc(inode)) {
147 if (inode->i_sb->s_op->destroy_inode) 147 if (inode->i_sb->s_op->destroy_inode)
148 inode->i_sb->s_op->destroy_inode(inode); 148 inode->i_sb->s_op->destroy_inode(inode);
149 else 149 else
150 kmem_cache_free(inode_cachep, (inode)); 150 kmem_cache_free(inode_cachep, (inode));
151 return NULL; 151 return NULL;
152 } 152 }
153 153
154 spin_lock_init(&inode->i_lock); 154 spin_lock_init(&inode->i_lock);
155 lockdep_set_class(&inode->i_lock, &sb->s_type->i_lock_key); 155 lockdep_set_class(&inode->i_lock, &sb->s_type->i_lock_key);
156 156
157 mutex_init(&inode->i_mutex); 157 mutex_init(&inode->i_mutex);
158 lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key); 158 lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key);
159 159
160 init_rwsem(&inode->i_alloc_sem); 160 init_rwsem(&inode->i_alloc_sem);
161 lockdep_set_class(&inode->i_alloc_sem, &sb->s_type->i_alloc_sem_key); 161 lockdep_set_class(&inode->i_alloc_sem, &sb->s_type->i_alloc_sem_key);
162 162
163 mapping->a_ops = &empty_aops; 163 mapping->a_ops = &empty_aops;
164 mapping->host = inode; 164 mapping->host = inode;
165 mapping->flags = 0; 165 mapping->flags = 0;
166 mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE); 166 mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE);
167 mapping->assoc_mapping = NULL; 167 mapping->assoc_mapping = NULL;
168 mapping->backing_dev_info = &default_backing_dev_info; 168 mapping->backing_dev_info = &default_backing_dev_info;
169 169
170 /* 170 /*
171 * If the block_device provides a backing_dev_info for client 171 * If the block_device provides a backing_dev_info for client
172 * inodes then use that. Otherwise the inode share the bdev's 172 * inodes then use that. Otherwise the inode share the bdev's
173 * backing_dev_info. 173 * backing_dev_info.
174 */ 174 */
175 if (sb->s_bdev) { 175 if (sb->s_bdev) {
176 struct backing_dev_info *bdi; 176 struct backing_dev_info *bdi;
177 177
178 bdi = sb->s_bdev->bd_inode_backing_dev_info; 178 bdi = sb->s_bdev->bd_inode_backing_dev_info;
179 if (!bdi) 179 if (!bdi)
180 bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info; 180 bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
181 mapping->backing_dev_info = bdi; 181 mapping->backing_dev_info = bdi;
182 } 182 }
183 inode->i_private = NULL; 183 inode->i_private = NULL;
184 inode->i_mapping = mapping; 184 inode->i_mapping = mapping;
185 } 185 }
186 return inode; 186 return inode;
187 } 187 }
188 188
189 void destroy_inode(struct inode *inode) 189 void destroy_inode(struct inode *inode)
190 { 190 {
191 BUG_ON(inode_has_buffers(inode)); 191 BUG_ON(inode_has_buffers(inode));
192 security_inode_free(inode); 192 security_inode_free(inode);
193 if (inode->i_sb->s_op->destroy_inode) 193 if (inode->i_sb->s_op->destroy_inode)
194 inode->i_sb->s_op->destroy_inode(inode); 194 inode->i_sb->s_op->destroy_inode(inode);
195 else 195 else
196 kmem_cache_free(inode_cachep, (inode)); 196 kmem_cache_free(inode_cachep, (inode));
197 } 197 }
198 198
199 199
200 /* 200 /*
201 * These are initializations that only need to be done 201 * These are initializations that only need to be done
202 * once, because the fields are idempotent across use 202 * once, because the fields are idempotent across use
203 * of the inode, so let the slab aware of that. 203 * of the inode, so let the slab aware of that.
204 */ 204 */
205 void inode_init_once(struct inode *inode) 205 void inode_init_once(struct inode *inode)
206 { 206 {
207 memset(inode, 0, sizeof(*inode)); 207 memset(inode, 0, sizeof(*inode));
208 INIT_HLIST_NODE(&inode->i_hash); 208 INIT_HLIST_NODE(&inode->i_hash);
209 INIT_LIST_HEAD(&inode->i_dentry); 209 INIT_LIST_HEAD(&inode->i_dentry);
210 INIT_LIST_HEAD(&inode->i_devices); 210 INIT_LIST_HEAD(&inode->i_devices);
211 INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); 211 INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
212 rwlock_init(&inode->i_data.tree_lock); 212 rwlock_init(&inode->i_data.tree_lock);
213 spin_lock_init(&inode->i_data.i_mmap_lock); 213 spin_lock_init(&inode->i_data.i_mmap_lock);
214 INIT_LIST_HEAD(&inode->i_data.private_list); 214 INIT_LIST_HEAD(&inode->i_data.private_list);
215 spin_lock_init(&inode->i_data.private_lock); 215 spin_lock_init(&inode->i_data.private_lock);
216 INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap); 216 INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
217 INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear); 217 INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
218 i_size_ordered_init(inode); 218 i_size_ordered_init(inode);
219 #ifdef CONFIG_INOTIFY 219 #ifdef CONFIG_INOTIFY
220 INIT_LIST_HEAD(&inode->inotify_watches); 220 INIT_LIST_HEAD(&inode->inotify_watches);
221 mutex_init(&inode->inotify_mutex); 221 mutex_init(&inode->inotify_mutex);
222 #endif 222 #endif
223 } 223 }
224 224
225 EXPORT_SYMBOL(inode_init_once); 225 EXPORT_SYMBOL(inode_init_once);
226 226
227 static void init_once(struct kmem_cache * cachep, void *foo) 227 static void init_once(struct kmem_cache * cachep, void *foo)
228 { 228 {
229 struct inode * inode = (struct inode *) foo; 229 struct inode * inode = (struct inode *) foo;
230 230
231 inode_init_once(inode); 231 inode_init_once(inode);
232 } 232 }
233 233
234 /* 234 /*
235 * inode_lock must be held 235 * inode_lock must be held
236 */ 236 */
237 void __iget(struct inode * inode) 237 void __iget(struct inode * inode)
238 { 238 {
239 if (atomic_read(&inode->i_count)) { 239 if (atomic_read(&inode->i_count)) {
240 atomic_inc(&inode->i_count); 240 atomic_inc(&inode->i_count);
241 return; 241 return;
242 } 242 }
243 atomic_inc(&inode->i_count); 243 atomic_inc(&inode->i_count);
244 if (!(inode->i_state & (I_DIRTY|I_SYNC))) 244 if (!(inode->i_state & (I_DIRTY|I_SYNC)))
245 list_move(&inode->i_list, &inode_in_use); 245 list_move(&inode->i_list, &inode_in_use);
246 inodes_stat.nr_unused--; 246 inodes_stat.nr_unused--;
247 } 247 }
248 248
249 /** 249 /**
250 * clear_inode - clear an inode 250 * clear_inode - clear an inode
251 * @inode: inode to clear 251 * @inode: inode to clear
252 * 252 *
253 * This is called by the filesystem to tell us 253 * This is called by the filesystem to tell us
254 * that the inode is no longer useful. We just 254 * that the inode is no longer useful. We just
255 * terminate it with extreme prejudice. 255 * terminate it with extreme prejudice.
256 */ 256 */
257 void clear_inode(struct inode *inode) 257 void clear_inode(struct inode *inode)
258 { 258 {
259 might_sleep(); 259 might_sleep();
260 invalidate_inode_buffers(inode); 260 invalidate_inode_buffers(inode);
261 261
262 BUG_ON(inode->i_data.nrpages); 262 BUG_ON(inode->i_data.nrpages);
263 BUG_ON(!(inode->i_state & I_FREEING)); 263 BUG_ON(!(inode->i_state & I_FREEING));
264 BUG_ON(inode->i_state & I_CLEAR); 264 BUG_ON(inode->i_state & I_CLEAR);
265 inode_sync_wait(inode); 265 inode_sync_wait(inode);
266 DQUOT_DROP(inode); 266 DQUOT_DROP(inode);
267 if (inode->i_sb->s_op->clear_inode) 267 if (inode->i_sb->s_op->clear_inode)
268 inode->i_sb->s_op->clear_inode(inode); 268 inode->i_sb->s_op->clear_inode(inode);
269 if (S_ISBLK(inode->i_mode) && inode->i_bdev) 269 if (S_ISBLK(inode->i_mode) && inode->i_bdev)
270 bd_forget(inode); 270 bd_forget(inode);
271 if (S_ISCHR(inode->i_mode) && inode->i_cdev) 271 if (S_ISCHR(inode->i_mode) && inode->i_cdev)
272 cd_forget(inode); 272 cd_forget(inode);
273 inode->i_state = I_CLEAR; 273 inode->i_state = I_CLEAR;
274 } 274 }
275 275
276 EXPORT_SYMBOL(clear_inode); 276 EXPORT_SYMBOL(clear_inode);
277 277
278 /* 278 /*
279 * dispose_list - dispose of the contents of a local list 279 * dispose_list - dispose of the contents of a local list
280 * @head: the head of the list to free 280 * @head: the head of the list to free
281 * 281 *
282 * Dispose-list gets a local list with local inodes in it, so it doesn't 282 * Dispose-list gets a local list with local inodes in it, so it doesn't
283 * need to worry about list corruption and SMP locks. 283 * need to worry about list corruption and SMP locks.
284 */ 284 */
285 static void dispose_list(struct list_head *head) 285 static void dispose_list(struct list_head *head)
286 { 286 {
287 int nr_disposed = 0; 287 int nr_disposed = 0;
288 288
289 while (!list_empty(head)) { 289 while (!list_empty(head)) {
290 struct inode *inode; 290 struct inode *inode;
291 291
292 inode = list_first_entry(head, struct inode, i_list); 292 inode = list_first_entry(head, struct inode, i_list);
293 list_del(&inode->i_list); 293 list_del(&inode->i_list);
294 294
295 if (inode->i_data.nrpages) 295 if (inode->i_data.nrpages)
296 truncate_inode_pages(&inode->i_data, 0); 296 truncate_inode_pages(&inode->i_data, 0);
297 clear_inode(inode); 297 clear_inode(inode);
298 298
299 spin_lock(&inode_lock); 299 spin_lock(&inode_lock);
300 hlist_del_init(&inode->i_hash); 300 hlist_del_init(&inode->i_hash);
301 list_del_init(&inode->i_sb_list); 301 list_del_init(&inode->i_sb_list);
302 spin_unlock(&inode_lock); 302 spin_unlock(&inode_lock);
303 303
304 wake_up_inode(inode); 304 wake_up_inode(inode);
305 destroy_inode(inode); 305 destroy_inode(inode);
306 nr_disposed++; 306 nr_disposed++;
307 } 307 }
308 spin_lock(&inode_lock); 308 spin_lock(&inode_lock);
309 inodes_stat.nr_inodes -= nr_disposed; 309 inodes_stat.nr_inodes -= nr_disposed;
310 spin_unlock(&inode_lock); 310 spin_unlock(&inode_lock);
311 } 311 }
312 312
313 /* 313 /*
314 * Invalidate all inodes for a device. 314 * Invalidate all inodes for a device.
315 */ 315 */
316 static int invalidate_list(struct list_head *head, struct list_head *dispose) 316 static int invalidate_list(struct list_head *head, struct list_head *dispose)
317 { 317 {
318 struct list_head *next; 318 struct list_head *next;
319 int busy = 0, count = 0; 319 int busy = 0, count = 0;
320 320
321 next = head->next; 321 next = head->next;
322 for (;;) { 322 for (;;) {
323 struct list_head * tmp = next; 323 struct list_head * tmp = next;
324 struct inode * inode; 324 struct inode * inode;
325 325
326 /* 326 /*
327 * We can reschedule here without worrying about the list's 327 * We can reschedule here without worrying about the list's
328 * consistency because the per-sb list of inodes must not 328 * consistency because the per-sb list of inodes must not
329 * change during umount anymore, and because iprune_mutex keeps 329 * change during umount anymore, and because iprune_mutex keeps
330 * shrink_icache_memory() away. 330 * shrink_icache_memory() away.
331 */ 331 */
332 cond_resched_lock(&inode_lock); 332 cond_resched_lock(&inode_lock);
333 333
334 next = next->next; 334 next = next->next;
335 if (tmp == head) 335 if (tmp == head)
336 break; 336 break;
337 inode = list_entry(tmp, struct inode, i_sb_list); 337 inode = list_entry(tmp, struct inode, i_sb_list);
338 invalidate_inode_buffers(inode); 338 invalidate_inode_buffers(inode);
339 if (!atomic_read(&inode->i_count)) { 339 if (!atomic_read(&inode->i_count)) {
340 list_move(&inode->i_list, dispose); 340 list_move(&inode->i_list, dispose);
341 inode->i_state |= I_FREEING; 341 inode->i_state |= I_FREEING;
342 count++; 342 count++;
343 continue; 343 continue;
344 } 344 }
345 busy = 1; 345 busy = 1;
346 } 346 }
347 /* only unused inodes may be cached with i_count zero */ 347 /* only unused inodes may be cached with i_count zero */
348 inodes_stat.nr_unused -= count; 348 inodes_stat.nr_unused -= count;
349 return busy; 349 return busy;
350 } 350 }
351 351
352 /** 352 /**
353 * invalidate_inodes - discard the inodes on a device 353 * invalidate_inodes - discard the inodes on a device
354 * @sb: superblock 354 * @sb: superblock
355 * 355 *
356 * Discard all of the inodes for a given superblock. If the discard 356 * Discard all of the inodes for a given superblock. If the discard
357 * fails because there are busy inodes then a non-zero value is returned. 357 * fails because there are busy inodes then a non-zero value is returned.
358 * If the discard is successful all the inodes have been discarded. 358 * If the discard is successful all the inodes have been discarded.
359 */ 359 */
360 int invalidate_inodes(struct super_block * sb) 360 int invalidate_inodes(struct super_block * sb)
361 { 361 {
362 int busy; 362 int busy;
363 LIST_HEAD(throw_away); 363 LIST_HEAD(throw_away);
364 364
365 mutex_lock(&iprune_mutex); 365 mutex_lock(&iprune_mutex);
366 spin_lock(&inode_lock); 366 spin_lock(&inode_lock);
367 inotify_unmount_inodes(&sb->s_inodes); 367 inotify_unmount_inodes(&sb->s_inodes);
368 busy = invalidate_list(&sb->s_inodes, &throw_away); 368 busy = invalidate_list(&sb->s_inodes, &throw_away);
369 spin_unlock(&inode_lock); 369 spin_unlock(&inode_lock);
370 370
371 dispose_list(&throw_away); 371 dispose_list(&throw_away);
372 mutex_unlock(&iprune_mutex); 372 mutex_unlock(&iprune_mutex);
373 373
374 return busy; 374 return busy;
375 } 375 }
376 376
377 EXPORT_SYMBOL(invalidate_inodes); 377 EXPORT_SYMBOL(invalidate_inodes);
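For illustration, a hedged sketch of how a caller typically treats the return value; the foofs name and message text are made up:

	if (invalidate_inodes(sb))
		printk(KERN_WARNING "foofs: busy inodes on %s after discard\n",
		       sb->s_id);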
378 378
379 static int can_unuse(struct inode *inode) 379 static int can_unuse(struct inode *inode)
380 { 380 {
381 if (inode->i_state) 381 if (inode->i_state)
382 return 0; 382 return 0;
383 if (inode_has_buffers(inode)) 383 if (inode_has_buffers(inode))
384 return 0; 384 return 0;
385 if (atomic_read(&inode->i_count)) 385 if (atomic_read(&inode->i_count))
386 return 0; 386 return 0;
387 if (inode->i_data.nrpages) 387 if (inode->i_data.nrpages)
388 return 0; 388 return 0;
389 return 1; 389 return 1;
390 } 390 }
391 391
392 /* 392 /*
393 * Scan `goal' inodes on the unused list for freeable ones. They are moved to 393 * Scan `goal' inodes on the unused list for freeable ones. They are moved to
394 * a temporary list and then are freed outside inode_lock by dispose_list(). 394 * a temporary list and then are freed outside inode_lock by dispose_list().
395 * 395 *
396 * Any inodes which are pinned purely because of attached pagecache have their 396 * Any inodes which are pinned purely because of attached pagecache have their
397 * pagecache removed. We expect the final iput() on that inode to add it to 397 * pagecache removed. We expect the final iput() on that inode to add it to
398 * the front of the inode_unused list. So look for it there and if the 398 * the front of the inode_unused list. So look for it there and if the
399 * inode is still freeable, proceed. The right inode is found 99.9% of the 399 * inode is still freeable, proceed. The right inode is found 99.9% of the
400 * time in testing on a 4-way. 400 * time in testing on a 4-way.
401 * 401 *
402 * If the inode has metadata buffers attached to mapping->private_list then 402 * If the inode has metadata buffers attached to mapping->private_list then
403 * try to remove them. 403 * try to remove them.
404 */ 404 */
405 static void prune_icache(int nr_to_scan) 405 static void prune_icache(int nr_to_scan)
406 { 406 {
407 LIST_HEAD(freeable); 407 LIST_HEAD(freeable);
408 int nr_pruned = 0; 408 int nr_pruned = 0;
409 int nr_scanned; 409 int nr_scanned;
410 unsigned long reap = 0; 410 unsigned long reap = 0;
411 411
412 mutex_lock(&iprune_mutex); 412 mutex_lock(&iprune_mutex);
413 spin_lock(&inode_lock); 413 spin_lock(&inode_lock);
414 for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) { 414 for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
415 struct inode *inode; 415 struct inode *inode;
416 416
417 if (list_empty(&inode_unused)) 417 if (list_empty(&inode_unused))
418 break; 418 break;
419 419
420 inode = list_entry(inode_unused.prev, struct inode, i_list); 420 inode = list_entry(inode_unused.prev, struct inode, i_list);
421 421
422 if (inode->i_state || atomic_read(&inode->i_count)) { 422 if (inode->i_state || atomic_read(&inode->i_count)) {
423 list_move(&inode->i_list, &inode_unused); 423 list_move(&inode->i_list, &inode_unused);
424 continue; 424 continue;
425 } 425 }
426 if (inode_has_buffers(inode) || inode->i_data.nrpages) { 426 if (inode_has_buffers(inode) || inode->i_data.nrpages) {
427 __iget(inode); 427 __iget(inode);
428 spin_unlock(&inode_lock); 428 spin_unlock(&inode_lock);
429 if (remove_inode_buffers(inode)) 429 if (remove_inode_buffers(inode))
430 reap += invalidate_mapping_pages(&inode->i_data, 430 reap += invalidate_mapping_pages(&inode->i_data,
431 0, -1); 431 0, -1);
432 iput(inode); 432 iput(inode);
433 spin_lock(&inode_lock); 433 spin_lock(&inode_lock);
434 434
435 if (inode != list_entry(inode_unused.next, 435 if (inode != list_entry(inode_unused.next,
436 struct inode, i_list)) 436 struct inode, i_list))
437 continue; /* wrong inode or list_empty */ 437 continue; /* wrong inode or list_empty */
438 if (!can_unuse(inode)) 438 if (!can_unuse(inode))
439 continue; 439 continue;
440 } 440 }
441 list_move(&inode->i_list, &freeable); 441 list_move(&inode->i_list, &freeable);
442 inode->i_state |= I_FREEING; 442 inode->i_state |= I_FREEING;
443 nr_pruned++; 443 nr_pruned++;
444 } 444 }
445 inodes_stat.nr_unused -= nr_pruned; 445 inodes_stat.nr_unused -= nr_pruned;
446 if (current_is_kswapd()) 446 if (current_is_kswapd())
447 __count_vm_events(KSWAPD_INODESTEAL, reap); 447 __count_vm_events(KSWAPD_INODESTEAL, reap);
448 else 448 else
449 __count_vm_events(PGINODESTEAL, reap); 449 __count_vm_events(PGINODESTEAL, reap);
450 spin_unlock(&inode_lock); 450 spin_unlock(&inode_lock);
451 451
452 dispose_list(&freeable); 452 dispose_list(&freeable);
453 mutex_unlock(&iprune_mutex); 453 mutex_unlock(&iprune_mutex);
454 } 454 }
455 455
456 /* 456 /*
457 * shrink_icache_memory() will attempt to reclaim some unused inodes. Here, 457 * shrink_icache_memory() will attempt to reclaim some unused inodes. Here,
458 * "unused" means that no dentries are referring to the inodes: the files are 458 * "unused" means that no dentries are referring to the inodes: the files are
459 * not open and the dcache references to those inodes have already been 459 * not open and the dcache references to those inodes have already been
460 * reclaimed. 460 * reclaimed.
461 * 461 *
462 * This function is passed the number of inodes to scan, and it returns the 462 * This function is passed the number of inodes to scan, and it returns the
463 * total number of remaining possibly-reclaimable inodes. 463 * total number of remaining possibly-reclaimable inodes.
464 */ 464 */
465 static int shrink_icache_memory(int nr, gfp_t gfp_mask) 465 static int shrink_icache_memory(int nr, gfp_t gfp_mask)
466 { 466 {
467 if (nr) { 467 if (nr) {
468 /* 468 /*
469 * Nasty deadlock avoidance. We may hold various FS locks, 469 * Nasty deadlock avoidance. We may hold various FS locks,
470 * and we don't want to recurse into the FS that called us 470 * and we don't want to recurse into the FS that called us
471 * in clear_inode() and friends.. 471 * in clear_inode() and friends..
472 */ 472 */
473 if (!(gfp_mask & __GFP_FS)) 473 if (!(gfp_mask & __GFP_FS))
474 return -1; 474 return -1;
475 prune_icache(nr); 475 prune_icache(nr);
476 } 476 }
477 return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure; 477 return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
478 } 478 }
479 479
480 static struct shrinker icache_shrinker = { 480 static struct shrinker icache_shrinker = {
481 .shrink = shrink_icache_memory, 481 .shrink = shrink_icache_memory,
482 .seeks = DEFAULT_SEEKS, 482 .seeks = DEFAULT_SEEKS,
483 }; 483 };
484 484
485 static void __wait_on_freeing_inode(struct inode *inode); 485 static void __wait_on_freeing_inode(struct inode *inode);
486 /* 486 /*
487 * Called with the inode lock held. 487 * Called with the inode lock held.
488 * NOTE: we are not increasing the inode refcount; you must call __iget() 488 * NOTE: we are not increasing the inode refcount; you must call __iget()
489 * by hand after calling find_inode now! This simplifies iunique and won't 489 * by hand after calling find_inode now! This simplifies iunique and won't
490 * add any additional branch in the common code. 490 * add any additional branch in the common code.
491 */ 491 */
492 static struct inode * find_inode(struct super_block * sb, struct hlist_head *head, int (*test)(struct inode *, void *), void *data) 492 static struct inode * find_inode(struct super_block * sb, struct hlist_head *head, int (*test)(struct inode *, void *), void *data)
493 { 493 {
494 struct hlist_node *node; 494 struct hlist_node *node;
495 struct inode * inode = NULL; 495 struct inode * inode = NULL;
496 496
497 repeat: 497 repeat:
498 hlist_for_each (node, head) { 498 hlist_for_each (node, head) {
499 inode = hlist_entry(node, struct inode, i_hash); 499 inode = hlist_entry(node, struct inode, i_hash);
500 if (inode->i_sb != sb) 500 if (inode->i_sb != sb)
501 continue; 501 continue;
502 if (!test(inode, data)) 502 if (!test(inode, data))
503 continue; 503 continue;
504 if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) { 504 if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
505 __wait_on_freeing_inode(inode); 505 __wait_on_freeing_inode(inode);
506 goto repeat; 506 goto repeat;
507 } 507 }
508 break; 508 break;
509 } 509 }
510 return node ? inode : NULL; 510 return node ? inode : NULL;
511 } 511 }
512 512
513 /* 513 /*
514 * find_inode_fast is the fast path version of find_inode, see the comment at 514 * find_inode_fast is the fast path version of find_inode, see the comment at
515 * iget_locked for details. 515 * iget_locked for details.
516 */ 516 */
517 static struct inode * find_inode_fast(struct super_block * sb, struct hlist_head *head, unsigned long ino) 517 static struct inode * find_inode_fast(struct super_block * sb, struct hlist_head *head, unsigned long ino)
518 { 518 {
519 struct hlist_node *node; 519 struct hlist_node *node;
520 struct inode * inode = NULL; 520 struct inode * inode = NULL;
521 521
522 repeat: 522 repeat:
523 hlist_for_each (node, head) { 523 hlist_for_each (node, head) {
524 inode = hlist_entry(node, struct inode, i_hash); 524 inode = hlist_entry(node, struct inode, i_hash);
525 if (inode->i_ino != ino) 525 if (inode->i_ino != ino)
526 continue; 526 continue;
527 if (inode->i_sb != sb) 527 if (inode->i_sb != sb)
528 continue; 528 continue;
529 if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) { 529 if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
530 __wait_on_freeing_inode(inode); 530 __wait_on_freeing_inode(inode);
531 goto repeat; 531 goto repeat;
532 } 532 }
533 break; 533 break;
534 } 534 }
535 return node ? inode : NULL; 535 return node ? inode : NULL;
536 } 536 }
537 537
538 /** 538 /**
539 * new_inode - obtain an inode 539 * new_inode - obtain an inode
540 * @sb: superblock 540 * @sb: superblock
541 * 541 *
542 * Allocates a new inode for the given superblock. The default gfp_mask 542 * Allocates a new inode for the given superblock. The default gfp_mask
543 * for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE. 543 * for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
544 * If HIGHMEM pages are unsuitable or it is known that pages allocated 544 * If HIGHMEM pages are unsuitable or it is known that pages allocated
545 * for the page cache are not reclaimable or migratable, 545 * for the page cache are not reclaimable or migratable,
546 * mapping_set_gfp_mask() must be called with suitable flags on the 546 * mapping_set_gfp_mask() must be called with suitable flags on the
547 * newly created inode's mapping 547 * newly created inode's mapping
548 * 548 *
549 */ 549 */
550 struct inode *new_inode(struct super_block *sb) 550 struct inode *new_inode(struct super_block *sb)
551 { 551 {
552 /* 552 /*
553 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW 553 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
554 * error if st_ino won't fit in target struct field. Use 32bit counter 554 * error if st_ino won't fit in target struct field. Use 32bit counter
555 * here to attempt to avoid that. 555 * here to attempt to avoid that.
556 */ 556 */
557 static unsigned int last_ino; 557 static unsigned int last_ino;
558 struct inode * inode; 558 struct inode * inode;
559 559
560 spin_lock_prefetch(&inode_lock); 560 spin_lock_prefetch(&inode_lock);
561 561
562 inode = alloc_inode(sb); 562 inode = alloc_inode(sb);
563 if (inode) { 563 if (inode) {
564 spin_lock(&inode_lock); 564 spin_lock(&inode_lock);
565 inodes_stat.nr_inodes++; 565 inodes_stat.nr_inodes++;
566 list_add(&inode->i_list, &inode_in_use); 566 list_add(&inode->i_list, &inode_in_use);
567 list_add(&inode->i_sb_list, &sb->s_inodes); 567 list_add(&inode->i_sb_list, &sb->s_inodes);
568 inode->i_ino = ++last_ino; 568 inode->i_ino = ++last_ino;
569 inode->i_state = 0; 569 inode->i_state = 0;
570 spin_unlock(&inode_lock); 570 spin_unlock(&inode_lock);
571 } 571 }
572 return inode; 572 return inode;
573 } 573 }
574 574
575 EXPORT_SYMBOL(new_inode); 575 EXPORT_SYMBOL(new_inode);
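To illustrate the note above about mapping_set_gfp_mask(), a rough sketch of a pseudo-filesystem creating an in-memory inode; the GFP_USER choice and the field values are only plausible examples, not requirements:

	struct inode *inode = new_inode(sb);

	if (!inode)
		return -ENOMEM;

	/* Assume this filesystem cannot tolerate highmem pagecache pages;
	 * drop __GFP_HIGHMEM from the mapping's allocation mask. */
	mapping_set_gfp_mask(inode->i_mapping, GFP_USER);

	inode->i_mode = S_IFREG | 0644;
	inode->i_uid = current->fsuid;
	inode->i_gid = current->fsgid;
	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;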
576 576
577 void unlock_new_inode(struct inode *inode) 577 void unlock_new_inode(struct inode *inode)
578 { 578 {
579 #ifdef CONFIG_DEBUG_LOCK_ALLOC 579 #ifdef CONFIG_DEBUG_LOCK_ALLOC
580 if (inode->i_mode & S_IFDIR) { 580 if (inode->i_mode & S_IFDIR) {
581 struct file_system_type *type = inode->i_sb->s_type; 581 struct file_system_type *type = inode->i_sb->s_type;
582 582
583 /* 583 /*
584 * ensure nobody is actually holding i_mutex 584 * ensure nobody is actually holding i_mutex
585 */ 585 */
586 mutex_destroy(&inode->i_mutex); 586 mutex_destroy(&inode->i_mutex);
587 mutex_init(&inode->i_mutex); 587 mutex_init(&inode->i_mutex);
588 lockdep_set_class(&inode->i_mutex, &type->i_mutex_dir_key); 588 lockdep_set_class(&inode->i_mutex, &type->i_mutex_dir_key);
589 } 589 }
590 #endif 590 #endif
591 /* 591 /*
592 * This is special! We do not need the spinlock 592 * This is special! We do not need the spinlock
593 * when clearing I_LOCK, because we're guaranteed 593 * when clearing I_LOCK, because we're guaranteed
594 * that nobody else tries to do anything about the 594 * that nobody else tries to do anything about the
595 * state of the inode when it is locked, as we 595 * state of the inode when it is locked, as we
596 * just created it (so there can be no old holders 596 * just created it (so there can be no old holders
597 * that haven't tested I_LOCK). 597 * that haven't tested I_LOCK).
598 */ 598 */
599 inode->i_state &= ~(I_LOCK|I_NEW); 599 inode->i_state &= ~(I_LOCK|I_NEW);
600 wake_up_inode(inode); 600 wake_up_inode(inode);
601 } 601 }
602 602
603 EXPORT_SYMBOL(unlock_new_inode); 603 EXPORT_SYMBOL(unlock_new_inode);
604 604
605 /* 605 /*
606 * This is called without the inode lock held. Be careful. 606 * This is called without the inode lock held. Be careful.
607 * 607 *
608 * We no longer cache the sb_flags in i_flags - see fs.h 608 * We no longer cache the sb_flags in i_flags - see fs.h
609 * -- rmk@arm.uk.linux.org 609 * -- rmk@arm.uk.linux.org
610 */ 610 */
611 static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *head, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *data) 611 static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *head, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *data)
612 { 612 {
613 struct inode * inode; 613 struct inode * inode;
614 614
615 inode = alloc_inode(sb); 615 inode = alloc_inode(sb);
616 if (inode) { 616 if (inode) {
617 struct inode * old; 617 struct inode * old;
618 618
619 spin_lock(&inode_lock); 619 spin_lock(&inode_lock);
620 /* We released the lock, so.. */ 620 /* We released the lock, so.. */
621 old = find_inode(sb, head, test, data); 621 old = find_inode(sb, head, test, data);
622 if (!old) { 622 if (!old) {
623 if (set(inode, data)) 623 if (set(inode, data))
624 goto set_failed; 624 goto set_failed;
625 625
626 inodes_stat.nr_inodes++; 626 inodes_stat.nr_inodes++;
627 list_add(&inode->i_list, &inode_in_use); 627 list_add(&inode->i_list, &inode_in_use);
628 list_add(&inode->i_sb_list, &sb->s_inodes); 628 list_add(&inode->i_sb_list, &sb->s_inodes);
629 hlist_add_head(&inode->i_hash, head); 629 hlist_add_head(&inode->i_hash, head);
630 inode->i_state = I_LOCK|I_NEW; 630 inode->i_state = I_LOCK|I_NEW;
631 spin_unlock(&inode_lock); 631 spin_unlock(&inode_lock);
632 632
633 /* Return the locked inode with I_NEW set, the 633 /* Return the locked inode with I_NEW set, the
634 * caller is responsible for filling in the contents 634 * caller is responsible for filling in the contents
635 */ 635 */
636 return inode; 636 return inode;
637 } 637 }
638 638
639 /* 639 /*
640 * Uhhuh, somebody else created the same inode under 640 * Uhhuh, somebody else created the same inode under
641 * us. Use the old inode instead of the one we just 641 * us. Use the old inode instead of the one we just
642 * allocated. 642 * allocated.
643 */ 643 */
644 __iget(old); 644 __iget(old);
645 spin_unlock(&inode_lock); 645 spin_unlock(&inode_lock);
646 destroy_inode(inode); 646 destroy_inode(inode);
647 inode = old; 647 inode = old;
648 wait_on_inode(inode); 648 wait_on_inode(inode);
649 } 649 }
650 return inode; 650 return inode;
651 651
652 set_failed: 652 set_failed:
653 spin_unlock(&inode_lock); 653 spin_unlock(&inode_lock);
654 destroy_inode(inode); 654 destroy_inode(inode);
655 return NULL; 655 return NULL;
656 } 656 }
657 657
658 /* 658 /*
659 * get_new_inode_fast is the fast path version of get_new_inode, see the 659 * get_new_inode_fast is the fast path version of get_new_inode, see the
660 * comment at iget_locked for details. 660 * comment at iget_locked for details.
661 */ 661 */
662 static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_head *head, unsigned long ino) 662 static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_head *head, unsigned long ino)
663 { 663 {
664 struct inode * inode; 664 struct inode * inode;
665 665
666 inode = alloc_inode(sb); 666 inode = alloc_inode(sb);
667 if (inode) { 667 if (inode) {
668 struct inode * old; 668 struct inode * old;
669 669
670 spin_lock(&inode_lock); 670 spin_lock(&inode_lock);
671 /* We released the lock, so.. */ 671 /* We released the lock, so.. */
672 old = find_inode_fast(sb, head, ino); 672 old = find_inode_fast(sb, head, ino);
673 if (!old) { 673 if (!old) {
674 inode->i_ino = ino; 674 inode->i_ino = ino;
675 inodes_stat.nr_inodes++; 675 inodes_stat.nr_inodes++;
676 list_add(&inode->i_list, &inode_in_use); 676 list_add(&inode->i_list, &inode_in_use);
677 list_add(&inode->i_sb_list, &sb->s_inodes); 677 list_add(&inode->i_sb_list, &sb->s_inodes);
678 hlist_add_head(&inode->i_hash, head); 678 hlist_add_head(&inode->i_hash, head);
679 inode->i_state = I_LOCK|I_NEW; 679 inode->i_state = I_LOCK|I_NEW;
680 spin_unlock(&inode_lock); 680 spin_unlock(&inode_lock);
681 681
682 /* Return the locked inode with I_NEW set, the 682 /* Return the locked inode with I_NEW set, the
683 * caller is responsible for filling in the contents 683 * caller is responsible for filling in the contents
684 */ 684 */
685 return inode; 685 return inode;
686 } 686 }
687 687
688 /* 688 /*
689 * Uhhuh, somebody else created the same inode under 689 * Uhhuh, somebody else created the same inode under
690 * us. Use the old inode instead of the one we just 690 * us. Use the old inode instead of the one we just
691 * allocated. 691 * allocated.
692 */ 692 */
693 __iget(old); 693 __iget(old);
694 spin_unlock(&inode_lock); 694 spin_unlock(&inode_lock);
695 destroy_inode(inode); 695 destroy_inode(inode);
696 inode = old; 696 inode = old;
697 wait_on_inode(inode); 697 wait_on_inode(inode);
698 } 698 }
699 return inode; 699 return inode;
700 } 700 }
701 701
702 static unsigned long hash(struct super_block *sb, unsigned long hashval) 702 static unsigned long hash(struct super_block *sb, unsigned long hashval)
703 { 703 {
704 unsigned long tmp; 704 unsigned long tmp;
705 705
706 tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) / 706 tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
707 L1_CACHE_BYTES; 707 L1_CACHE_BYTES;
708 tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> I_HASHBITS); 708 tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> I_HASHBITS);
709 return tmp & I_HASHMASK; 709 return tmp & I_HASHMASK;
710 } 710 }
711 711
712 /** 712 /**
713 * iunique - get a unique inode number 713 * iunique - get a unique inode number
714 * @sb: superblock 714 * @sb: superblock
715 * @max_reserved: highest reserved inode number 715 * @max_reserved: highest reserved inode number
716 * 716 *
717 * Obtain an inode number that is unique on the system for a given 717 * Obtain an inode number that is unique on the system for a given
718 * superblock. This is used by file systems that have no natural 718 * superblock. This is used by file systems that have no natural
719 * permanent inode numbering system. An inode number is returned that 719 * permanent inode numbering system. An inode number is returned that
720 * is higher than the reserved limit but unique. 720 * is higher than the reserved limit but unique.
721 * 721 *
722 * BUGS: 722 * BUGS:
723 * With a large number of inodes live on the file system this function 723 * With a large number of inodes live on the file system this function
724 * currently becomes quite slow. 724 * currently becomes quite slow.
725 */ 725 */
726 ino_t iunique(struct super_block *sb, ino_t max_reserved) 726 ino_t iunique(struct super_block *sb, ino_t max_reserved)
727 { 727 {
728 /* 728 /*
729 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW 729 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
730 * error if st_ino won't fit in target struct field. Use 32bit counter 730 * error if st_ino won't fit in target struct field. Use 32bit counter
731 * here to attempt to avoid that. 731 * here to attempt to avoid that.
732 */ 732 */
733 static unsigned int counter; 733 static unsigned int counter;
734 struct inode *inode; 734 struct inode *inode;
735 struct hlist_head *head; 735 struct hlist_head *head;
736 ino_t res; 736 ino_t res;
737 737
738 spin_lock(&inode_lock); 738 spin_lock(&inode_lock);
739 do { 739 do {
740 if (counter <= max_reserved) 740 if (counter <= max_reserved)
741 counter = max_reserved + 1; 741 counter = max_reserved + 1;
742 res = counter++; 742 res = counter++;
743 head = inode_hashtable + hash(sb, res); 743 head = inode_hashtable + hash(sb, res);
744 inode = find_inode_fast(sb, head, res); 744 inode = find_inode_fast(sb, head, res);
745 } while (inode != NULL); 745 } while (inode != NULL);
746 spin_unlock(&inode_lock); 746 spin_unlock(&inode_lock);
747 747
748 return res; 748 return res;
749 } 749 }
750 EXPORT_SYMBOL(iunique); 750 EXPORT_SYMBOL(iunique);
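A hedged sketch of the intended use, for a filesystem with no permanent inode numbering; FOOFS_MAX_RESERVED_INO and foofs_new_inode() are hypothetical:

	static struct inode *foofs_new_inode(struct super_block *sb)
	{
		struct inode *inode = new_inode(sb);

		if (!inode)
			return NULL;
		/* Pick a number above the reserved range, then hash the
		 * inode so later iunique() calls see the number as taken. */
		inode->i_ino = iunique(sb, FOOFS_MAX_RESERVED_INO);
		insert_inode_hash(inode);
		return inode;
	}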
751 751
752 struct inode *igrab(struct inode *inode) 752 struct inode *igrab(struct inode *inode)
753 { 753 {
754 spin_lock(&inode_lock); 754 spin_lock(&inode_lock);
755 if (!(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))) 755 if (!(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)))
756 __iget(inode); 756 __iget(inode);
757 else 757 else
758 /* 758 /*
759 * Handle the case where s_op->clear_inode has not been 759 * Handle the case where s_op->clear_inode has not been
760 * called yet, and somebody is calling igrab 760 * called yet, and somebody is calling igrab
761 * while the inode is getting freed. 761 * while the inode is getting freed.
762 */ 762 */
763 inode = NULL; 763 inode = NULL;
764 spin_unlock(&inode_lock); 764 spin_unlock(&inode_lock);
765 return inode; 765 return inode;
766 } 766 }
767 767
768 EXPORT_SYMBOL(igrab); 768 EXPORT_SYMBOL(igrab);
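A small sketch of typical use: pinning an inode found through fs-private bookkeeping before sleeping. The -ESTALE return is an arbitrary illustrative choice:

	struct inode *pinned;

	/* igrab() returns NULL if the inode is already on its way out
	 * (I_FREEING/I_WILL_FREE/I_CLEAR); it must not be touched then. */
	pinned = igrab(inode);
	if (!pinned)
		return -ESTALE;
	/* ... work that may sleep or drop other locks ... */
	iput(pinned);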
769 769
770 /** 770 /**
771 * ifind - internal function, you want ilookup5() or iget5(). 771 * ifind - internal function, you want ilookup5() or iget5().
772 * @sb: super block of file system to search 772 * @sb: super block of file system to search
773 * @head: the head of the list to search 773 * @head: the head of the list to search
774 * @test: callback used for comparisons between inodes 774 * @test: callback used for comparisons between inodes
775 * @data: opaque data pointer to pass to @test 775 * @data: opaque data pointer to pass to @test
776 * @wait: if true wait for the inode to be unlocked, if false do not 776 * @wait: if true wait for the inode to be unlocked, if false do not
777 * 777 *
778 * ifind() searches for the inode specified by @data in the inode 778 * ifind() searches for the inode specified by @data in the inode
779 * cache. This is a generalized version of ifind_fast() for file systems where 779 * cache. This is a generalized version of ifind_fast() for file systems where
780 * the inode number is not sufficient for unique identification of an inode. 780 * the inode number is not sufficient for unique identification of an inode.
781 * 781 *
782 * If the inode is in the cache, the inode is returned with an incremented 782 * If the inode is in the cache, the inode is returned with an incremented
783 * reference count. 783 * reference count.
784 * 784 *
785 * Otherwise NULL is returned. 785 * Otherwise NULL is returned.
786 * 786 *
787 * Note, @test is called with the inode_lock held, so can't sleep. 787 * Note, @test is called with the inode_lock held, so can't sleep.
788 */ 788 */
789 static struct inode *ifind(struct super_block *sb, 789 static struct inode *ifind(struct super_block *sb,
790 struct hlist_head *head, int (*test)(struct inode *, void *), 790 struct hlist_head *head, int (*test)(struct inode *, void *),
791 void *data, const int wait) 791 void *data, const int wait)
792 { 792 {
793 struct inode *inode; 793 struct inode *inode;
794 794
795 spin_lock(&inode_lock); 795 spin_lock(&inode_lock);
796 inode = find_inode(sb, head, test, data); 796 inode = find_inode(sb, head, test, data);
797 if (inode) { 797 if (inode) {
798 __iget(inode); 798 __iget(inode);
799 spin_unlock(&inode_lock); 799 spin_unlock(&inode_lock);
800 if (likely(wait)) 800 if (likely(wait))
801 wait_on_inode(inode); 801 wait_on_inode(inode);
802 return inode; 802 return inode;
803 } 803 }
804 spin_unlock(&inode_lock); 804 spin_unlock(&inode_lock);
805 return NULL; 805 return NULL;
806 } 806 }
807 807
808 /** 808 /**
809 * ifind_fast - internal function, you want ilookup() or iget(). 809 * ifind_fast - internal function, you want ilookup() or iget().
810 * @sb: super block of file system to search 810 * @sb: super block of file system to search
811 * @head: head of the list to search 811 * @head: head of the list to search
812 * @ino: inode number to search for 812 * @ino: inode number to search for
813 * 813 *
814 * ifind_fast() searches for the inode @ino in the inode cache. This is for 814 * ifind_fast() searches for the inode @ino in the inode cache. This is for
815 * file systems where the inode number is sufficient for unique identification 815 * file systems where the inode number is sufficient for unique identification
816 * of an inode. 816 * of an inode.
817 * 817 *
818 * If the inode is in the cache, the inode is returned with an incremented 818 * If the inode is in the cache, the inode is returned with an incremented
819 * reference count. 819 * reference count.
820 * 820 *
821 * Otherwise NULL is returned. 821 * Otherwise NULL is returned.
822 */ 822 */
823 static struct inode *ifind_fast(struct super_block *sb, 823 static struct inode *ifind_fast(struct super_block *sb,
824 struct hlist_head *head, unsigned long ino) 824 struct hlist_head *head, unsigned long ino)
825 { 825 {
826 struct inode *inode; 826 struct inode *inode;
827 827
828 spin_lock(&inode_lock); 828 spin_lock(&inode_lock);
829 inode = find_inode_fast(sb, head, ino); 829 inode = find_inode_fast(sb, head, ino);
830 if (inode) { 830 if (inode) {
831 __iget(inode); 831 __iget(inode);
832 spin_unlock(&inode_lock); 832 spin_unlock(&inode_lock);
833 wait_on_inode(inode); 833 wait_on_inode(inode);
834 return inode; 834 return inode;
835 } 835 }
836 spin_unlock(&inode_lock); 836 spin_unlock(&inode_lock);
837 return NULL; 837 return NULL;
838 } 838 }
839 839
840 /** 840 /**
841 * ilookup5_nowait - search for an inode in the inode cache 841 * ilookup5_nowait - search for an inode in the inode cache
842 * @sb: super block of file system to search 842 * @sb: super block of file system to search
843 * @hashval: hash value (usually inode number) to search for 843 * @hashval: hash value (usually inode number) to search for
844 * @test: callback used for comparisons between inodes 844 * @test: callback used for comparisons between inodes
845 * @data: opaque data pointer to pass to @test 845 * @data: opaque data pointer to pass to @test
846 * 846 *
847 * ilookup5_nowait() uses ifind() to search for the inode specified by @hashval and 847 * ilookup5_nowait() uses ifind() to search for the inode specified by @hashval and
848 * @data in the inode cache. This is a generalized version of ilookup() for 848 * @data in the inode cache. This is a generalized version of ilookup() for
849 * file systems where the inode number is not sufficient for unique 849 * file systems where the inode number is not sufficient for unique
850 * identification of an inode. 850 * identification of an inode.
851 * 851 *
852 * If the inode is in the cache, the inode is returned with an incremented 852 * If the inode is in the cache, the inode is returned with an incremented
853 * reference count. Note, the inode lock is not waited upon so you have to be 853 * reference count. Note, the inode lock is not waited upon so you have to be
854 * very careful what you do with the returned inode. You probably should be 854 * very careful what you do with the returned inode. You probably should be
855 * using ilookup5() instead. 855 * using ilookup5() instead.
856 * 856 *
857 * Otherwise NULL is returned. 857 * Otherwise NULL is returned.
858 * 858 *
859 * Note, @test is called with the inode_lock held, so can't sleep. 859 * Note, @test is called with the inode_lock held, so can't sleep.
860 */ 860 */
861 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval, 861 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
862 int (*test)(struct inode *, void *), void *data) 862 int (*test)(struct inode *, void *), void *data)
863 { 863 {
864 struct hlist_head *head = inode_hashtable + hash(sb, hashval); 864 struct hlist_head *head = inode_hashtable + hash(sb, hashval);
865 865
866 return ifind(sb, head, test, data, 0); 866 return ifind(sb, head, test, data, 0);
867 } 867 }
868 868
869 EXPORT_SYMBOL(ilookup5_nowait); 869 EXPORT_SYMBOL(ilookup5_nowait);
870 870
871 /** 871 /**
872 * ilookup5 - search for an inode in the inode cache 872 * ilookup5 - search for an inode in the inode cache
873 * @sb: super block of file system to search 873 * @sb: super block of file system to search
874 * @hashval: hash value (usually inode number) to search for 874 * @hashval: hash value (usually inode number) to search for
875 * @test: callback used for comparisons between inodes 875 * @test: callback used for comparisons between inodes
876 * @data: opaque data pointer to pass to @test 876 * @data: opaque data pointer to pass to @test
877 * 877 *
878 * ilookup5() uses ifind() to search for the inode specified by @hashval and 878 * ilookup5() uses ifind() to search for the inode specified by @hashval and
879 * @data in the inode cache. This is a generalized version of ilookup() for 879 * @data in the inode cache. This is a generalized version of ilookup() for
880 * file systems where the inode number is not sufficient for unique 880 * file systems where the inode number is not sufficient for unique
881 * identification of an inode. 881 * identification of an inode.
882 * 882 *
883 * If the inode is in the cache, the inode lock is waited upon and the inode is 883 * If the inode is in the cache, the inode lock is waited upon and the inode is
884 * returned with an incremented reference count. 884 * returned with an incremented reference count.
885 * 885 *
886 * Otherwise NULL is returned. 886 * Otherwise NULL is returned.
887 * 887 *
888 * Note, @test is called with the inode_lock held, so can't sleep. 888 * Note, @test is called with the inode_lock held, so can't sleep.
889 */ 889 */
890 struct inode *ilookup5(struct super_block *sb, unsigned long hashval, 890 struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
891 int (*test)(struct inode *, void *), void *data) 891 int (*test)(struct inode *, void *), void *data)
892 { 892 {
893 struct hlist_head *head = inode_hashtable + hash(sb, hashval); 893 struct hlist_head *head = inode_hashtable + hash(sb, hashval);
894 894
895 return ifind(sb, head, test, data, 1); 895 return ifind(sb, head, test, data, 1);
896 } 896 }
897 897
898 EXPORT_SYMBOL(ilookup5); 898 EXPORT_SYMBOL(ilookup5);
899 899
900 /** 900 /**
901 * ilookup - search for an inode in the inode cache 901 * ilookup - search for an inode in the inode cache
902 * @sb: super block of file system to search 902 * @sb: super block of file system to search
903 * @ino: inode number to search for 903 * @ino: inode number to search for
904 * 904 *
905 * ilookup() uses ifind_fast() to search for the inode @ino in the inode cache. 905 * ilookup() uses ifind_fast() to search for the inode @ino in the inode cache.
906 * This is for file systems where the inode number is sufficient for unique 906 * This is for file systems where the inode number is sufficient for unique
907 * identification of an inode. 907 * identification of an inode.
908 * 908 *
909 * If the inode is in the cache, the inode is returned with an incremented 909 * If the inode is in the cache, the inode is returned with an incremented
910 * reference count. 910 * reference count.
911 * 911 *
912 * Otherwise NULL is returned. 912 * Otherwise NULL is returned.
913 */ 913 */
914 struct inode *ilookup(struct super_block *sb, unsigned long ino) 914 struct inode *ilookup(struct super_block *sb, unsigned long ino)
915 { 915 {
916 struct hlist_head *head = inode_hashtable + hash(sb, ino); 916 struct hlist_head *head = inode_hashtable + hash(sb, ino);
917 917
918 return ifind_fast(sb, head, ino); 918 return ifind_fast(sb, head, ino);
919 } 919 }
920 920
921 EXPORT_SYMBOL(ilookup); 921 EXPORT_SYMBOL(ilookup);
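A sketch of a cache-only probe, e.g. dropping cached pages for an inode only if it is already in core; the invalidation step is just one plausible use:

	struct inode *inode = ilookup(sb, ino);

	if (inode) {
		/* Hit: we own an extra reference and must iput() it. */
		invalidate_mapping_pages(inode->i_mapping, 0, -1);
		iput(inode);
	}
	/* A miss (NULL) is not an error; nothing was read from disk. */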
922 922
923 /** 923 /**
924 * iget5_locked - obtain an inode from a mounted file system 924 * iget5_locked - obtain an inode from a mounted file system
925 * @sb: super block of file system 925 * @sb: super block of file system
926 * @hashval: hash value (usually inode number) to get 926 * @hashval: hash value (usually inode number) to get
927 * @test: callback used for comparisons between inodes 927 * @test: callback used for comparisons between inodes
928 * @set: callback used to initialize a new struct inode 928 * @set: callback used to initialize a new struct inode
929 * @data: opaque data pointer to pass to @test and @set 929 * @data: opaque data pointer to pass to @test and @set
930 * 930 *
931 * This is iget() without the read_inode() portion of get_new_inode().
932 *
933 * iget5_locked() uses ifind() to search for the inode specified by @hashval 931 * iget5_locked() uses ifind() to search for the inode specified by @hashval
934 * and @data in the inode cache and if present it is returned with an increased 932 * and @data in the inode cache and if present it is returned with an increased
935 * reference count. This is a generalized version of iget_locked() for file 933 * reference count. This is a generalized version of iget_locked() for file
936 * systems where the inode number is not sufficient for unique identification 934 * systems where the inode number is not sufficient for unique identification
937 * of an inode. 935 * of an inode.
938 * 936 *
939 * If the inode is not in cache, get_new_inode() is called to allocate a new 937 * If the inode is not in cache, get_new_inode() is called to allocate a new
940 * inode and this is returned locked, hashed, and with the I_NEW flag set. The 938 * inode and this is returned locked, hashed, and with the I_NEW flag set. The
941 * file system gets to fill it in before unlocking it via unlock_new_inode(). 939 * file system gets to fill it in before unlocking it via unlock_new_inode().
942 * 940 *
943 * Note both @test and @set are called with the inode_lock held, so can't sleep. 941 * Note both @test and @set are called with the inode_lock held, so can't sleep.
944 */ 942 */
945 struct inode *iget5_locked(struct super_block *sb, unsigned long hashval, 943 struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
946 int (*test)(struct inode *, void *), 944 int (*test)(struct inode *, void *),
947 int (*set)(struct inode *, void *), void *data) 945 int (*set)(struct inode *, void *), void *data)
948 { 946 {
949 struct hlist_head *head = inode_hashtable + hash(sb, hashval); 947 struct hlist_head *head = inode_hashtable + hash(sb, hashval);
950 struct inode *inode; 948 struct inode *inode;
951 949
952 inode = ifind(sb, head, test, data, 1); 950 inode = ifind(sb, head, test, data, 1);
953 if (inode) 951 if (inode)
954 return inode; 952 return inode;
955 /* 953 /*
956 * get_new_inode() will do the right thing, re-trying the search 954 * get_new_inode() will do the right thing, re-trying the search
957 * in case it had to block at any point. 955 * in case it had to block at any point.
958 */ 956 */
959 return get_new_inode(sb, head, test, set, data); 957 return get_new_inode(sb, head, test, set, data);
960 } 958 }
961 959
962 EXPORT_SYMBOL(iget5_locked); 960 EXPORT_SYMBOL(iget5_locked);
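A hedged sketch of the test/set pair for a filesystem whose inodes are identified by a key wider than i_ino; FOOFS_I() (a container_of() accessor) and the objectid field are hypothetical:

	struct foofs_iget_args {
		u64 objectid;
	};

	static int foofs_test(struct inode *inode, void *p)
	{
		struct foofs_iget_args *args = p;

		return FOOFS_I(inode)->objectid == args->objectid;
	}

	static int foofs_set(struct inode *inode, void *p)
	{
		struct foofs_iget_args *args = p;

		FOOFS_I(inode)->objectid = args->objectid;
		return 0;
	}

	/* ... in the lookup path; neither callback may sleep ... */
	struct foofs_iget_args args = { .objectid = objectid };
	struct inode *inode;

	inode = iget5_locked(sb, (unsigned long)objectid,
			     foofs_test, foofs_set, &args);
	if (!inode)
		return ERR_PTR(-ENOMEM);
	if (inode->i_state & I_NEW) {
		/* fill the inode in from backing store, then: */
		unlock_new_inode(inode);
	}
	return inode;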
963 961
964 /** 962 /**
965 * iget_locked - obtain an inode from a mounted file system 963 * iget_locked - obtain an inode from a mounted file system
966 * @sb: super block of file system 964 * @sb: super block of file system
967 * @ino: inode number to get 965 * @ino: inode number to get
968 *
969 * This is iget() without the read_inode() portion of get_new_inode_fast().
970 * 966 *
971 * iget_locked() uses ifind_fast() to search for the inode specified by @ino in 967 * iget_locked() uses ifind_fast() to search for the inode specified by @ino in
972 * the inode cache and if present it is returned with an increased reference 968 * the inode cache and if present it is returned with an increased reference
973 * count. This is for file systems where the inode number is sufficient for 969 * count. This is for file systems where the inode number is sufficient for
974 * unique identification of an inode. 970 * unique identification of an inode.
975 * 971 *
976 * If the inode is not in cache, get_new_inode_fast() is called to allocate a 972 * If the inode is not in cache, get_new_inode_fast() is called to allocate a
977 * new inode and this is returned locked, hashed, and with the I_NEW flag set. 973 * new inode and this is returned locked, hashed, and with the I_NEW flag set.
978 * The file system gets to fill it in before unlocking it via 974 * The file system gets to fill it in before unlocking it via
979 * unlock_new_inode(). 975 * unlock_new_inode().
980 */ 976 */
981 struct inode *iget_locked(struct super_block *sb, unsigned long ino) 977 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
982 { 978 {
983 struct hlist_head *head = inode_hashtable + hash(sb, ino); 979 struct hlist_head *head = inode_hashtable + hash(sb, ino);
984 struct inode *inode; 980 struct inode *inode;
985 981
986 inode = ifind_fast(sb, head, ino); 982 inode = ifind_fast(sb, head, ino);
987 if (inode) 983 if (inode)
988 return inode; 984 return inode;
989 /* 985 /*
990 * get_new_inode_fast() will do the right thing, re-trying the search 986 * get_new_inode_fast() will do the right thing, re-trying the search
991 * in case it had to block at any point. 987 * in case it had to block at any point.
992 */ 988 */
993 return get_new_inode_fast(sb, head, ino); 989 return get_new_inode_fast(sb, head, ino);
994 } 990 }
995 991
996 EXPORT_SYMBOL(iget_locked); 992 EXPORT_SYMBOL(iget_locked);
997 993
998 /** 994 /**
999 * __insert_inode_hash - hash an inode 995 * __insert_inode_hash - hash an inode
1000 * @inode: unhashed inode 996 * @inode: unhashed inode
1001 * @hashval: unsigned long value used to locate this object in the 997 * @hashval: unsigned long value used to locate this object in the
1002 * inode_hashtable. 998 * inode_hashtable.
1003 * 999 *
1004 * Add an inode to the inode hash for this superblock. 1000 * Add an inode to the inode hash for this superblock.
1005 */ 1001 */
1006 void __insert_inode_hash(struct inode *inode, unsigned long hashval) 1002 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
1007 { 1003 {
1008 struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval); 1004 struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
1009 spin_lock(&inode_lock); 1005 spin_lock(&inode_lock);
1010 hlist_add_head(&inode->i_hash, head); 1006 hlist_add_head(&inode->i_hash, head);
1011 spin_unlock(&inode_lock); 1007 spin_unlock(&inode_lock);
1012 } 1008 }
1013 1009
1014 EXPORT_SYMBOL(__insert_inode_hash); 1010 EXPORT_SYMBOL(__insert_inode_hash);
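A brief sketch of why the hash value is a separate argument: an inode built by hand can be hashed under an fs-chosen key rather than i_ino, provided later lookups use ilookup5()/iget5_locked() with the same value. FOOFS_I() and objectid are hypothetical:

	/* i_ino may be a non-unique, user-visible number; the hash key is
	 * the (hypothetical) 64-bit object id truncated to unsigned long. */
	__insert_inode_hash(inode, (unsigned long)FOOFS_I(inode)->objectid);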
1015 1011
1016 /** 1012 /**
1017 * remove_inode_hash - remove an inode from the hash 1013 * remove_inode_hash - remove an inode from the hash
1018 * @inode: inode to unhash 1014 * @inode: inode to unhash
1019 * 1015 *
1020 * Remove an inode from the inode hash table. 1016 * Remove an inode from the inode hash table.
1021 */ 1017 */
1022 void remove_inode_hash(struct inode *inode) 1018 void remove_inode_hash(struct inode *inode)
1023 { 1019 {
1024 spin_lock(&inode_lock); 1020 spin_lock(&inode_lock);
1025 hlist_del_init(&inode->i_hash); 1021 hlist_del_init(&inode->i_hash);
1026 spin_unlock(&inode_lock); 1022 spin_unlock(&inode_lock);
1027 } 1023 }
1028 1024
1029 EXPORT_SYMBOL(remove_inode_hash); 1025 EXPORT_SYMBOL(remove_inode_hash);
1030 1026
1031 /* 1027 /*
1032 * Tell the filesystem that this inode is no longer of any interest and should 1028 * Tell the filesystem that this inode is no longer of any interest and should
1033 * be completely destroyed. 1029 * be completely destroyed.
1034 * 1030 *
1035 * We leave the inode in the inode hash table until *after* the filesystem's 1031 * We leave the inode in the inode hash table until *after* the filesystem's
1036 * ->delete_inode completes. This ensures that an iget (such as nfsd might 1032 * ->delete_inode completes. This ensures that an iget (such as nfsd might
1037 * instigate) will always find up-to-date information either in the hash or on 1033 * instigate) will always find up-to-date information either in the hash or on
1038 * disk. 1034 * disk.
1039 * 1035 *
1040 * I_FREEING is set so that no-one will take a new reference to the inode while 1036 * I_FREEING is set so that no-one will take a new reference to the inode while
1041 * it is being deleted. 1037 * it is being deleted.
1042 */ 1038 */
1043 void generic_delete_inode(struct inode *inode) 1039 void generic_delete_inode(struct inode *inode)
1044 { 1040 {
1045 const struct super_operations *op = inode->i_sb->s_op; 1041 const struct super_operations *op = inode->i_sb->s_op;
1046 1042
1047 list_del_init(&inode->i_list); 1043 list_del_init(&inode->i_list);
1048 list_del_init(&inode->i_sb_list); 1044 list_del_init(&inode->i_sb_list);
1049 inode->i_state |= I_FREEING; 1045 inode->i_state |= I_FREEING;
1050 inodes_stat.nr_inodes--; 1046 inodes_stat.nr_inodes--;
1051 spin_unlock(&inode_lock); 1047 spin_unlock(&inode_lock);
1052 1048
1053 security_inode_delete(inode); 1049 security_inode_delete(inode);
1054 1050
1055 if (op->delete_inode) { 1051 if (op->delete_inode) {
1056 void (*delete)(struct inode *) = op->delete_inode; 1052 void (*delete)(struct inode *) = op->delete_inode;
1057 if (!is_bad_inode(inode)) 1053 if (!is_bad_inode(inode))
1058 DQUOT_INIT(inode); 1054 DQUOT_INIT(inode);
1059 /* Filesystems implementing their own 1055 /* Filesystems implementing their own
1060 * s_op->delete_inode are required to call 1056 * s_op->delete_inode are required to call
1061 * truncate_inode_pages and clear_inode() 1057 * truncate_inode_pages and clear_inode()
1062 * internally */ 1058 * internally */
1063 delete(inode); 1059 delete(inode);
1064 } else { 1060 } else {
1065 truncate_inode_pages(&inode->i_data, 0); 1061 truncate_inode_pages(&inode->i_data, 0);
1066 clear_inode(inode); 1062 clear_inode(inode);
1067 } 1063 }
1068 spin_lock(&inode_lock); 1064 spin_lock(&inode_lock);
1069 hlist_del_init(&inode->i_hash); 1065 hlist_del_init(&inode->i_hash);
1070 spin_unlock(&inode_lock); 1066 spin_unlock(&inode_lock);
1071 wake_up_inode(inode); 1067 wake_up_inode(inode);
1072 BUG_ON(inode->i_state != I_CLEAR); 1068 BUG_ON(inode->i_state != I_CLEAR);
1073 destroy_inode(inode); 1069 destroy_inode(inode);
1074 } 1070 }
1075 1071
1076 EXPORT_SYMBOL(generic_delete_inode); 1072 EXPORT_SYMBOL(generic_delete_inode);
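As the comment inside generic_delete_inode() says, a filesystem supplying its own ->delete_inode must truncate the pagecache and call clear_inode() itself. A rough sketch; foofs_free_blocks() stands in for whatever on-disk cleanup the filesystem actually needs:

	static void foofs_delete_inode(struct inode *inode)
	{
		truncate_inode_pages(&inode->i_data, 0);

		/* I_FREEING is already set, so no new references can appear
		 * while the on-disk object is being released. */
		foofs_free_blocks(inode);

		clear_inode(inode);
	}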
1077 1073
1078 static void generic_forget_inode(struct inode *inode) 1074 static void generic_forget_inode(struct inode *inode)
1079 { 1075 {
1080 struct super_block *sb = inode->i_sb; 1076 struct super_block *sb = inode->i_sb;
1081 1077
1082 if (!hlist_unhashed(&inode->i_hash)) { 1078 if (!hlist_unhashed(&inode->i_hash)) {
1083 if (!(inode->i_state & (I_DIRTY|I_SYNC))) 1079 if (!(inode->i_state & (I_DIRTY|I_SYNC)))
1084 list_move(&inode->i_list, &inode_unused); 1080 list_move(&inode->i_list, &inode_unused);
1085 inodes_stat.nr_unused++; 1081 inodes_stat.nr_unused++;
1086 if (sb->s_flags & MS_ACTIVE) { 1082 if (sb->s_flags & MS_ACTIVE) {
1087 spin_unlock(&inode_lock); 1083 spin_unlock(&inode_lock);
1088 return; 1084 return;
1089 } 1085 }
1090 inode->i_state |= I_WILL_FREE; 1086 inode->i_state |= I_WILL_FREE;
1091 spin_unlock(&inode_lock); 1087 spin_unlock(&inode_lock);
1092 write_inode_now(inode, 1); 1088 write_inode_now(inode, 1);
1093 spin_lock(&inode_lock); 1089 spin_lock(&inode_lock);
1094 inode->i_state &= ~I_WILL_FREE; 1090 inode->i_state &= ~I_WILL_FREE;
1095 inodes_stat.nr_unused--; 1091 inodes_stat.nr_unused--;
1096 hlist_del_init(&inode->i_hash); 1092 hlist_del_init(&inode->i_hash);
1097 } 1093 }
1098 list_del_init(&inode->i_list); 1094 list_del_init(&inode->i_list);
1099 list_del_init(&inode->i_sb_list); 1095 list_del_init(&inode->i_sb_list);
1100 inode->i_state |= I_FREEING; 1096 inode->i_state |= I_FREEING;
1101 inodes_stat.nr_inodes--; 1097 inodes_stat.nr_inodes--;
1102 spin_unlock(&inode_lock); 1098 spin_unlock(&inode_lock);
1103 if (inode->i_data.nrpages) 1099 if (inode->i_data.nrpages)
1104 truncate_inode_pages(&inode->i_data, 0); 1100 truncate_inode_pages(&inode->i_data, 0);
1105 clear_inode(inode); 1101 clear_inode(inode);
1106 wake_up_inode(inode); 1102 wake_up_inode(inode);
1107 destroy_inode(inode); 1103 destroy_inode(inode);
1108 } 1104 }
1109 1105
1110 /* 1106 /*
1111 * Normal UNIX filesystem behaviour: delete the 1107 * Normal UNIX filesystem behaviour: delete the
1112 * inode when the usage count drops to zero, and 1108 * inode when the usage count drops to zero, and
1113 * i_nlink is zero. 1109 * i_nlink is zero.
1114 */ 1110 */
1115 void generic_drop_inode(struct inode *inode) 1111 void generic_drop_inode(struct inode *inode)
1116 { 1112 {
1117 if (!inode->i_nlink) 1113 if (!inode->i_nlink)
1118 generic_delete_inode(inode); 1114 generic_delete_inode(inode);
1119 else 1115 else
1120 generic_forget_inode(inode); 1116 generic_forget_inode(inode);
1121 } 1117 }
1122 1118
1123 EXPORT_SYMBOL_GPL(generic_drop_inode); 1119 EXPORT_SYMBOL_GPL(generic_drop_inode);
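A filesystem that never wants unreferenced inodes cached can point ->drop_inode at generic_delete_inode() instead; a minimal sketch (the foofs name is hypothetical and the policy is only an example):

	static const struct super_operations foofs_super_ops = {
		/* Delete on final iput() rather than parking the inode on
		 * inode_unused.  Like any drop_inode hook, this is entered
		 * with inode_lock held and is responsible for releasing it,
		 * which generic_delete_inode() does. */
		.drop_inode	= generic_delete_inode,
	};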
1124 1120
1125 /* 1121 /*
1126 * Called when we're dropping the last reference 1122 * Called when we're dropping the last reference
1127 * to an inode. 1123 * to an inode.
1128 * 1124 *
1129 * Call the FS "drop()" function, defaulting to 1125 * Call the FS "drop()" function, defaulting to
1130 * the legacy UNIX filesystem behaviour.. 1126 * the legacy UNIX filesystem behaviour..
1131 * 1127 *
1132 * NOTE! NOTE! NOTE! We're called with the inode lock 1128 * NOTE! NOTE! NOTE! We're called with the inode lock
1133 * held, and the drop function is supposed to release 1129 * held, and the drop function is supposed to release
1134 * the lock! 1130 * the lock!
1135 */ 1131 */
1136 static inline void iput_final(struct inode *inode) 1132 static inline void iput_final(struct inode *inode)
1137 { 1133 {
1138 const struct super_operations *op = inode->i_sb->s_op; 1134 const struct super_operations *op = inode->i_sb->s_op;
1139 void (*drop)(struct inode *) = generic_drop_inode; 1135 void (*drop)(struct inode *) = generic_drop_inode;
1140 1136
1141 if (op && op->drop_inode) 1137 if (op && op->drop_inode)
1142 drop = op->drop_inode; 1138 drop = op->drop_inode;
1143 drop(inode); 1139 drop(inode);
1144 } 1140 }
1145 1141
1146 /** 1142 /**
1147 * iput - put an inode 1143 * iput - put an inode
1148 * @inode: inode to put 1144 * @inode: inode to put
1149 * 1145 *
1150 * Puts an inode, dropping its usage count. If the inode use count hits 1146 * Puts an inode, dropping its usage count. If the inode use count hits
1151 * zero, the inode is then freed and may also be destroyed. 1147 * zero, the inode is then freed and may also be destroyed.
1152 * 1148 *
1153 * Consequently, iput() can sleep. 1149 * Consequently, iput() can sleep.
1154 */ 1150 */
1155 void iput(struct inode *inode) 1151 void iput(struct inode *inode)
1156 { 1152 {
1157 if (inode) { 1153 if (inode) {
1158 const struct super_operations *op = inode->i_sb->s_op; 1154 const struct super_operations *op = inode->i_sb->s_op;
1159 1155
1160 BUG_ON(inode->i_state == I_CLEAR); 1156 BUG_ON(inode->i_state == I_CLEAR);
1161 1157
1162 if (op && op->put_inode) 1158 if (op && op->put_inode)
1163 op->put_inode(inode); 1159 op->put_inode(inode);
1164 1160
1165 if (atomic_dec_and_lock(&inode->i_count, &inode_lock)) 1161 if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
1166 iput_final(inode); 1162 iput_final(inode);
1167 } 1163 }
1168 } 1164 }
1169 1165
1170 EXPORT_SYMBOL(iput); 1166 EXPORT_SYMBOL(iput);
1171 1167
1172 /** 1168 /**
1173 * bmap - find a block number in a file 1169 * bmap - find a block number in a file
1174 * @inode: inode of file 1170 * @inode: inode of file
1175 * @block: block to find 1171 * @block: block to find
1176 * 1172 *
1177 * Returns the block number on the device holding the inode that 1173 * Returns the block number on the device holding the inode that
1178 * is the disk block number for the block of the file requested. 1174 * is the disk block number for the block of the file requested.
1179 * That is, if asked for block 4 of inode 1, the function will return the 1175 * That is, if asked for block 4 of inode 1, the function will return the
1180 * disk block relative to the disk start that holds that block of the 1176 * disk block relative to the disk start that holds that block of the
1181 * file. 1177 * file.
1182 */ 1178 */
1183 sector_t bmap(struct inode * inode, sector_t block) 1179 sector_t bmap(struct inode * inode, sector_t block)
1184 { 1180 {
1185 sector_t res = 0; 1181 sector_t res = 0;
1186 if (inode->i_mapping->a_ops->bmap) 1182 if (inode->i_mapping->a_ops->bmap)
1187 res = inode->i_mapping->a_ops->bmap(inode->i_mapping, block); 1183 res = inode->i_mapping->a_ops->bmap(inode->i_mapping, block);
1188 return res; 1184 return res;
1189 } 1185 }
1190 EXPORT_SYMBOL(bmap); 1186 EXPORT_SYMBOL(bmap);
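A hedged example of a caller, roughly what the FIBMAP ioctl path boils down to; a result of 0 means the block is a hole or the filesystem has no bmap method:

	sector_t blkno;

	blkno = bmap(inode, 4);		/* file block 4 -> device-relative block */
	if (!blkno)
		printk(KERN_DEBUG "block 4 is unmapped or bmap unsupported\n");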
1191 1187
1192 /** 1188 /**
1193 * touch_atime - update the access time 1189 * touch_atime - update the access time
1194 * @mnt: mount the inode is accessed on 1190 * @mnt: mount the inode is accessed on
1195 * @dentry: dentry accessed 1191 * @dentry: dentry accessed
1196 * 1192 *
1197 * Update the accessed time on an inode and mark it for writeback. 1193 * Update the accessed time on an inode and mark it for writeback.
1198 * This function automatically handles read only file systems and media, 1194 * This function automatically handles read only file systems and media,
1199 * as well as the "noatime" flag and inode specific "noatime" markers. 1195 * as well as the "noatime" flag and inode specific "noatime" markers.
1200 */ 1196 */
1201 void touch_atime(struct vfsmount *mnt, struct dentry *dentry) 1197 void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
1202 { 1198 {
1203 struct inode *inode = dentry->d_inode; 1199 struct inode *inode = dentry->d_inode;
1204 struct timespec now; 1200 struct timespec now;
1205 1201
1206 if (inode->i_flags & S_NOATIME) 1202 if (inode->i_flags & S_NOATIME)
1207 return; 1203 return;
1208 if (IS_NOATIME(inode)) 1204 if (IS_NOATIME(inode))
1209 return; 1205 return;
1210 if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode)) 1206 if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
1211 return; 1207 return;
1212 1208
1213 /* 1209 /*
1214 * We may have a NULL vfsmount when coming from NFSD 1210 * We may have a NULL vfsmount when coming from NFSD
1215 */ 1211 */
1216 if (mnt) { 1212 if (mnt) {
1217 if (mnt->mnt_flags & MNT_NOATIME) 1213 if (mnt->mnt_flags & MNT_NOATIME)
1218 return; 1214 return;
1219 if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)) 1215 if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
1220 return; 1216 return;
1221 1217
1222 if (mnt->mnt_flags & MNT_RELATIME) { 1218 if (mnt->mnt_flags & MNT_RELATIME) {
1223 /* 1219 /*
1224 * With relative atime, only update atime if the 1220 * With relative atime, only update atime if the
1225 * previous atime is earlier than either the ctime or 1221 * previous atime is earlier than either the ctime or
1226 * mtime. 1222 * mtime.
1227 */ 1223 */
1228 if (timespec_compare(&inode->i_mtime, 1224 if (timespec_compare(&inode->i_mtime,
1229 &inode->i_atime) < 0 && 1225 &inode->i_atime) < 0 &&
1230 timespec_compare(&inode->i_ctime, 1226 timespec_compare(&inode->i_ctime,
1231 &inode->i_atime) < 0) 1227 &inode->i_atime) < 0)
1232 return; 1228 return;
1233 } 1229 }
1234 } 1230 }
1235 1231
1236 now = current_fs_time(inode->i_sb); 1232 now = current_fs_time(inode->i_sb);
1237 if (timespec_equal(&inode->i_atime, &now)) 1233 if (timespec_equal(&inode->i_atime, &now))
1238 return; 1234 return;
1239 1235
1240 inode->i_atime = now; 1236 inode->i_atime = now;
1241 mark_inode_dirty_sync(inode); 1237 mark_inode_dirty_sync(inode);
1242 } 1238 }
1243 EXPORT_SYMBOL(touch_atime); 1239 EXPORT_SYMBOL(touch_atime);
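For illustration, a read path would typically call this once the data has been copied out, along the lines of the following (file is assumed to be the struct file being read):

	/* record the access on the file's inode and mount */
	touch_atime(file->f_path.mnt, file->f_path.dentry);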
1244 1240
1245 /** 1241 /**
1246 * file_update_time - update mtime and ctime time 1242 * file_update_time - update mtime and ctime time
1247 * @file: file accessed 1243 * @file: file accessed
1248 * 1244 *
1249 * Update the mtime and ctime members of an inode and mark the inode 1245 * Update the mtime and ctime members of an inode and mark the inode
1250 * for writeback. Note that this function is meant exclusively for 1246 * for writeback. Note that this function is meant exclusively for
1251 * usage in the file write path of filesystems, and filesystems may 1247 * usage in the file write path of filesystems, and filesystems may
1252 * choose to explicitly ignore updates via this function with the 1248 * choose to explicitly ignore updates via this function with the
1253 * S_NOCMTIME inode flag, e.g. for network filesystems where these 1249 * S_NOCMTIME inode flag, e.g. for network filesystems where these
1254 * timestamps are handled by the server. 1250 * timestamps are handled by the server.
1255 */ 1251 */
1256 1252
1257 void file_update_time(struct file *file) 1253 void file_update_time(struct file *file)
1258 { 1254 {
1259 struct inode *inode = file->f_path.dentry->d_inode; 1255 struct inode *inode = file->f_path.dentry->d_inode;
1260 struct timespec now; 1256 struct timespec now;
1261 int sync_it = 0; 1257 int sync_it = 0;
1262 1258
1263 if (IS_NOCMTIME(inode)) 1259 if (IS_NOCMTIME(inode))
1264 return; 1260 return;
1265 if (IS_RDONLY(inode)) 1261 if (IS_RDONLY(inode))
1266 return; 1262 return;
1267 1263
1268 now = current_fs_time(inode->i_sb); 1264 now = current_fs_time(inode->i_sb);
1269 if (!timespec_equal(&inode->i_mtime, &now)) { 1265 if (!timespec_equal(&inode->i_mtime, &now)) {
1270 inode->i_mtime = now; 1266 inode->i_mtime = now;
1271 sync_it = 1; 1267 sync_it = 1;
1272 } 1268 }
1273 1269
1274 if (!timespec_equal(&inode->i_ctime, &now)) { 1270 if (!timespec_equal(&inode->i_ctime, &now)) {
1275 inode->i_ctime = now; 1271 inode->i_ctime = now;
1276 sync_it = 1; 1272 sync_it = 1;
1277 } 1273 }
1278 1274
1279 if (IS_I_VERSION(inode)) { 1275 if (IS_I_VERSION(inode)) {
1280 inode_inc_iversion(inode); 1276 inode_inc_iversion(inode);
1281 sync_it = 1; 1277 sync_it = 1;
1282 } 1278 }
1283 1279
1284 if (sync_it) 1280 if (sync_it)
1285 mark_inode_dirty_sync(inode); 1281 mark_inode_dirty_sync(inode);
1286 } 1282 }
1287 1283
1288 EXPORT_SYMBOL(file_update_time); 1284 EXPORT_SYMBOL(file_update_time);
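A sketch of the intended call site, with thingyfs_write() as a hypothetical ->write implementation; the point is to bump the timestamps once, before the data copy:

	static ssize_t thingyfs_write(struct file *file, const char __user *buf,
				      size_t len, loff_t *ppos)
	{
		file_update_time(file);		/* mtime/ctime (and i_version) */
		/* ... copy the data and advance *ppos ... */
		return len;
	}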
1289 1285
1290 int inode_needs_sync(struct inode *inode) 1286 int inode_needs_sync(struct inode *inode)
1291 { 1287 {
1292 if (IS_SYNC(inode)) 1288 if (IS_SYNC(inode))
1293 return 1; 1289 return 1;
1294 if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode)) 1290 if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode))
1295 return 1; 1291 return 1;
1296 return 0; 1292 return 0;
1297 } 1293 }
1298 1294
1299 EXPORT_SYMBOL(inode_needs_sync); 1295 EXPORT_SYMBOL(inode_needs_sync);
1300 1296
1301 int inode_wait(void *word) 1297 int inode_wait(void *word)
1302 { 1298 {
1303 schedule(); 1299 schedule();
1304 return 0; 1300 return 0;
1305 } 1301 }
1306 1302
1307 /* 1303 /*
1308 * If we try to find an inode in the inode hash while it is being 1304 * If we try to find an inode in the inode hash while it is being
1309 * deleted, we have to wait until the filesystem completes its 1305 * deleted, we have to wait until the filesystem completes its
1310 * deletion before reporting that it isn't found. This function waits 1306 * deletion before reporting that it isn't found. This function waits
1311 * until the deletion _might_ have completed. Callers are responsible 1307 * until the deletion _might_ have completed. Callers are responsible
1312 * to recheck inode state. 1308 * to recheck inode state.
1313 * 1309 *
1314 * It doesn't matter if I_LOCK is not set initially, a call to 1310 * It doesn't matter if I_LOCK is not set initially, a call to
1315 * wake_up_inode() after removing from the hash list will DTRT. 1311 * wake_up_inode() after removing from the hash list will DTRT.
1316 * 1312 *
1317 * This is called with inode_lock held. 1313 * This is called with inode_lock held.
1318 */ 1314 */
1319 static void __wait_on_freeing_inode(struct inode *inode) 1315 static void __wait_on_freeing_inode(struct inode *inode)
1320 { 1316 {
1321 wait_queue_head_t *wq; 1317 wait_queue_head_t *wq;
1322 DEFINE_WAIT_BIT(wait, &inode->i_state, __I_LOCK); 1318 DEFINE_WAIT_BIT(wait, &inode->i_state, __I_LOCK);
1323 wq = bit_waitqueue(&inode->i_state, __I_LOCK); 1319 wq = bit_waitqueue(&inode->i_state, __I_LOCK);
1324 prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE); 1320 prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
1325 spin_unlock(&inode_lock); 1321 spin_unlock(&inode_lock);
1326 schedule(); 1322 schedule();
1327 finish_wait(wq, &wait.wait); 1323 finish_wait(wq, &wait.wait);
1328 spin_lock(&inode_lock); 1324 spin_lock(&inode_lock);
1329 } 1325 }
1330 1326
1331 /* 1327 /*
1332 * We rarely want to lock two inodes that do not have a parent/child 1328 * We rarely want to lock two inodes that do not have a parent/child
1333 * relationship (such as directory, child inode) simultaneously. The 1329 * relationship (such as directory, child inode) simultaneously. The
1334 * vast majority of file systems should be able to get along fine 1330 * vast majority of file systems should be able to get along fine
1335 * without this. Do not use these functions except as a last resort. 1331 * without this. Do not use these functions except as a last resort.
1336 */ 1332 */
1337 void inode_double_lock(struct inode *inode1, struct inode *inode2) 1333 void inode_double_lock(struct inode *inode1, struct inode *inode2)
1338 { 1334 {
1339 if (inode1 == NULL || inode2 == NULL || inode1 == inode2) { 1335 if (inode1 == NULL || inode2 == NULL || inode1 == inode2) {
1340 if (inode1) 1336 if (inode1)
1341 mutex_lock(&inode1->i_mutex); 1337 mutex_lock(&inode1->i_mutex);
1342 else if (inode2) 1338 else if (inode2)
1343 mutex_lock(&inode2->i_mutex); 1339 mutex_lock(&inode2->i_mutex);
1344 return; 1340 return;
1345 } 1341 }
1346 1342
1347 if (inode1 < inode2) { 1343 if (inode1 < inode2) {
1348 mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT); 1344 mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
1349 mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD); 1345 mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
1350 } else { 1346 } else {
1351 mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT); 1347 mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
1352 mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD); 1348 mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
1353 } 1349 }
1354 } 1350 }
1355 EXPORT_SYMBOL(inode_double_lock); 1351 EXPORT_SYMBOL(inode_double_lock);
1356 1352
1357 void inode_double_unlock(struct inode *inode1, struct inode *inode2) 1353 void inode_double_unlock(struct inode *inode1, struct inode *inode2)
1358 { 1354 {
1359 if (inode1) 1355 if (inode1)
1360 mutex_unlock(&inode1->i_mutex); 1356 mutex_unlock(&inode1->i_mutex);
1361 1357
1362 if (inode2 && inode2 != inode1) 1358 if (inode2 && inode2 != inode1)
1363 mutex_unlock(&inode2->i_mutex); 1359 mutex_unlock(&inode2->i_mutex);
1364 } 1360 }
1365 EXPORT_SYMBOL(inode_double_unlock); 1361 EXPORT_SYMBOL(inode_double_unlock);
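Used as a pair, and with the ordering handled internally, a caller only has to bracket its critical section. A minimal sketch with two arbitrary inodes:

	inode_double_lock(inode1, inode2);	/* ordered by address, avoids ABBA deadlock */
	/* ... operate on both inodes ... */
	inode_double_unlock(inode1, inode2);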
1366 1362
1367 static __initdata unsigned long ihash_entries; 1363 static __initdata unsigned long ihash_entries;
1368 static int __init set_ihash_entries(char *str) 1364 static int __init set_ihash_entries(char *str)
1369 { 1365 {
1370 if (!str) 1366 if (!str)
1371 return 0; 1367 return 0;
1372 ihash_entries = simple_strtoul(str, &str, 0); 1368 ihash_entries = simple_strtoul(str, &str, 0);
1373 return 1; 1369 return 1;
1374 } 1370 }
1375 __setup("ihash_entries=", set_ihash_entries); 1371 __setup("ihash_entries=", set_ihash_entries);
1376 1372
1377 /* 1373 /*
1378 * Initialize the waitqueues and inode hash table. 1374 * Initialize the waitqueues and inode hash table.
1379 */ 1375 */
1380 void __init inode_init_early(void) 1376 void __init inode_init_early(void)
1381 { 1377 {
1382 int loop; 1378 int loop;
1383 1379
1384 /* If hashes are distributed across NUMA nodes, defer 1380 /* If hashes are distributed across NUMA nodes, defer
1385 * hash allocation until vmalloc space is available. 1381 * hash allocation until vmalloc space is available.
1386 */ 1382 */
1387 if (hashdist) 1383 if (hashdist)
1388 return; 1384 return;
1389 1385
1390 inode_hashtable = 1386 inode_hashtable =
1391 alloc_large_system_hash("Inode-cache", 1387 alloc_large_system_hash("Inode-cache",
1392 sizeof(struct hlist_head), 1388 sizeof(struct hlist_head),
1393 ihash_entries, 1389 ihash_entries,
1394 14, 1390 14,
1395 HASH_EARLY, 1391 HASH_EARLY,
1396 &i_hash_shift, 1392 &i_hash_shift,
1397 &i_hash_mask, 1393 &i_hash_mask,
1398 0); 1394 0);
1399 1395
1400 for (loop = 0; loop < (1 << i_hash_shift); loop++) 1396 for (loop = 0; loop < (1 << i_hash_shift); loop++)
1401 INIT_HLIST_HEAD(&inode_hashtable[loop]); 1397 INIT_HLIST_HEAD(&inode_hashtable[loop]);
1402 } 1398 }
1403 1399
1404 void __init inode_init(void) 1400 void __init inode_init(void)
1405 { 1401 {
1406 int loop; 1402 int loop;
1407 1403
1408 /* inode slab cache */ 1404 /* inode slab cache */
1409 inode_cachep = kmem_cache_create("inode_cache", 1405 inode_cachep = kmem_cache_create("inode_cache",
1410 sizeof(struct inode), 1406 sizeof(struct inode),
1411 0, 1407 0,
1412 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC| 1408 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
1413 SLAB_MEM_SPREAD), 1409 SLAB_MEM_SPREAD),
1414 init_once); 1410 init_once);
1415 register_shrinker(&icache_shrinker); 1411 register_shrinker(&icache_shrinker);
1416 1412
1417 /* Hash may have been set up in inode_init_early */ 1413 /* Hash may have been set up in inode_init_early */
1418 if (!hashdist) 1414 if (!hashdist)
1419 return; 1415 return;
1420 1416
1421 inode_hashtable = 1417 inode_hashtable =
1422 alloc_large_system_hash("Inode-cache", 1418 alloc_large_system_hash("Inode-cache",
1423 sizeof(struct hlist_head), 1419 sizeof(struct hlist_head),
1424 ihash_entries, 1420 ihash_entries,
1425 14, 1421 14,
1426 0, 1422 0,
1427 &i_hash_shift, 1423 &i_hash_shift,
1428 &i_hash_mask, 1424 &i_hash_mask,
1429 0); 1425 0);
1430 1426
1431 for (loop = 0; loop < (1 << i_hash_shift); loop++) 1427 for (loop = 0; loop < (1 << i_hash_shift); loop++)
1432 INIT_HLIST_HEAD(&inode_hashtable[loop]); 1428 INIT_HLIST_HEAD(&inode_hashtable[loop]);
1433 } 1429 }
1434 1430
1435 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev) 1431 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
1436 { 1432 {
1437 inode->i_mode = mode; 1433 inode->i_mode = mode;
1438 if (S_ISCHR(mode)) { 1434 if (S_ISCHR(mode)) {
1439 inode->i_fop = &def_chr_fops; 1435 inode->i_fop = &def_chr_fops;
1440 inode->i_rdev = rdev; 1436 inode->i_rdev = rdev;
1441 } else if (S_ISBLK(mode)) { 1437 } else if (S_ISBLK(mode)) {
1442 inode->i_fop = &def_blk_fops; 1438 inode->i_fop = &def_blk_fops;
1443 inode->i_rdev = rdev; 1439 inode->i_rdev = rdev;
1444 } else if (S_ISFIFO(mode)) 1440 } else if (S_ISFIFO(mode))
1445 inode->i_fop = &def_fifo_fops; 1441 inode->i_fop = &def_fifo_fops;
1446 else if (S_ISSOCK(mode)) 1442 else if (S_ISSOCK(mode))
1447 inode->i_fop = &bad_sock_fops; 1443 inode->i_fop = &bad_sock_fops;
1448 else 1444 else
1449 printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o)\n", 1445 printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o)\n",
1450 mode); 1446 mode);
1451 } 1447 }
1452 EXPORT_SYMBOL(init_special_inode); 1448 EXPORT_SYMBOL(init_special_inode);
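A hedged sketch of the usual caller, a filesystem's mknod method; thingyfs_mknod and thingyfs_new_inode are hypothetical names:

	static int thingyfs_mknod(struct inode *dir, struct dentry *dentry,
				  int mode, dev_t rdev)
	{
		struct inode *inode = thingyfs_new_inode(dir, mode);	/* hypothetical allocator */

		if (!inode)
			return -ENOMEM;
		init_special_inode(inode, mode, rdev);	/* pick chr/blk/fifo/sock fops */
		d_instantiate(dentry, inode);
		return 0;
	}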
1453 1449
1 #ifndef _LINUX_FS_H 1 #ifndef _LINUX_FS_H
2 #define _LINUX_FS_H 2 #define _LINUX_FS_H
3 3
4 /* 4 /*
5 * This file has definitions for some important file table 5 * This file has definitions for some important file table
6 * structures etc. 6 * structures etc.
7 */ 7 */
8 8
9 #include <linux/limits.h> 9 #include <linux/limits.h>
10 #include <linux/ioctl.h> 10 #include <linux/ioctl.h>
11 11
12 /* 12 /*
13 * It's silly to have NR_OPEN bigger than NR_FILE, but you can change 13 * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
14 * the file limit at runtime and only root can increase the per-process 14 * the file limit at runtime and only root can increase the per-process
15 * nr_file rlimit, so it's safe to set up a ridiculously high absolute 15 * nr_file rlimit, so it's safe to set up a ridiculously high absolute
16 * upper limit on files-per-process. 16 * upper limit on files-per-process.
17 * 17 *
18 * Some programs (notably those using select()) may have to be 18 * Some programs (notably those using select()) may have to be
19 * recompiled to take full advantage of the new limits.. 19 * recompiled to take full advantage of the new limits..
20 */ 20 */
21 21
22 /* Fixed constants first: */ 22 /* Fixed constants first: */
23 #undef NR_OPEN 23 #undef NR_OPEN
24 extern int sysctl_nr_open; 24 extern int sysctl_nr_open;
25 #define INR_OPEN 1024 /* Initial setting for nfile rlimits */ 25 #define INR_OPEN 1024 /* Initial setting for nfile rlimits */
26 26
27 #define BLOCK_SIZE_BITS 10 27 #define BLOCK_SIZE_BITS 10
28 #define BLOCK_SIZE (1<<BLOCK_SIZE_BITS) 28 #define BLOCK_SIZE (1<<BLOCK_SIZE_BITS)
29 29
30 #define SEEK_SET 0 /* seek relative to beginning of file */ 30 #define SEEK_SET 0 /* seek relative to beginning of file */
31 #define SEEK_CUR 1 /* seek relative to current file position */ 31 #define SEEK_CUR 1 /* seek relative to current file position */
32 #define SEEK_END 2 /* seek relative to end of file */ 32 #define SEEK_END 2 /* seek relative to end of file */
33 #define SEEK_MAX SEEK_END 33 #define SEEK_MAX SEEK_END
34 34
35 /* And dynamically-tunable limits and defaults: */ 35 /* And dynamically-tunable limits and defaults: */
36 struct files_stat_struct { 36 struct files_stat_struct {
37 int nr_files; /* read only */ 37 int nr_files; /* read only */
38 int nr_free_files; /* read only */ 38 int nr_free_files; /* read only */
39 int max_files; /* tunable */ 39 int max_files; /* tunable */
40 }; 40 };
41 extern struct files_stat_struct files_stat; 41 extern struct files_stat_struct files_stat;
42 extern int get_max_files(void); 42 extern int get_max_files(void);
43 43
44 struct inodes_stat_t { 44 struct inodes_stat_t {
45 int nr_inodes; 45 int nr_inodes;
46 int nr_unused; 46 int nr_unused;
47 int dummy[5]; /* padding for sysctl ABI compatibility */ 47 int dummy[5]; /* padding for sysctl ABI compatibility */
48 }; 48 };
49 extern struct inodes_stat_t inodes_stat; 49 extern struct inodes_stat_t inodes_stat;
50 50
51 extern int leases_enable, lease_break_time; 51 extern int leases_enable, lease_break_time;
52 52
53 #ifdef CONFIG_DNOTIFY 53 #ifdef CONFIG_DNOTIFY
54 extern int dir_notify_enable; 54 extern int dir_notify_enable;
55 #endif 55 #endif
56 56
57 #define NR_FILE 8192 /* this can well be larger on a larger system */ 57 #define NR_FILE 8192 /* this can well be larger on a larger system */
58 58
59 #define MAY_EXEC 1 59 #define MAY_EXEC 1
60 #define MAY_WRITE 2 60 #define MAY_WRITE 2
61 #define MAY_READ 4 61 #define MAY_READ 4
62 #define MAY_APPEND 8 62 #define MAY_APPEND 8
63 63
64 #define FMODE_READ 1 64 #define FMODE_READ 1
65 #define FMODE_WRITE 2 65 #define FMODE_WRITE 2
66 66
67 /* Internal kernel extensions */ 67 /* Internal kernel extensions */
68 #define FMODE_LSEEK 4 68 #define FMODE_LSEEK 4
69 #define FMODE_PREAD 8 69 #define FMODE_PREAD 8
70 #define FMODE_PWRITE FMODE_PREAD /* These go hand in hand */ 70 #define FMODE_PWRITE FMODE_PREAD /* These go hand in hand */
71 71
72 /* File is being opened for execution. Primary users of this flag are 72 /* File is being opened for execution. Primary users of this flag are
73 distributed filesystems that can use it to achieve correct ETXTBUSY 73 distributed filesystems that can use it to achieve correct ETXTBUSY
74 behavior for cross-node execution/opening_for_writing of files */ 74 behavior for cross-node execution/opening_for_writing of files */
75 #define FMODE_EXEC 16 75 #define FMODE_EXEC 16
76 76
77 #define RW_MASK 1 77 #define RW_MASK 1
78 #define RWA_MASK 2 78 #define RWA_MASK 2
79 #define READ 0 79 #define READ 0
80 #define WRITE 1 80 #define WRITE 1
81 #define READA 2 /* read-ahead - don't block if no resources */ 81 #define READA 2 /* read-ahead - don't block if no resources */
82 #define SWRITE 3 /* for ll_rw_block() - wait for buffer lock */ 82 #define SWRITE 3 /* for ll_rw_block() - wait for buffer lock */
83 #define READ_SYNC (READ | (1 << BIO_RW_SYNC)) 83 #define READ_SYNC (READ | (1 << BIO_RW_SYNC))
84 #define READ_META (READ | (1 << BIO_RW_META)) 84 #define READ_META (READ | (1 << BIO_RW_META))
85 #define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNC)) 85 #define WRITE_SYNC (WRITE | (1 << BIO_RW_SYNC))
86 #define WRITE_BARRIER ((1 << BIO_RW) | (1 << BIO_RW_BARRIER)) 86 #define WRITE_BARRIER ((1 << BIO_RW) | (1 << BIO_RW_BARRIER))
87 87
88 #define SEL_IN 1 88 #define SEL_IN 1
89 #define SEL_OUT 2 89 #define SEL_OUT 2
90 #define SEL_EX 4 90 #define SEL_EX 4
91 91
92 /* public flags for file_system_type */ 92 /* public flags for file_system_type */
93 #define FS_REQUIRES_DEV 1 93 #define FS_REQUIRES_DEV 1
94 #define FS_BINARY_MOUNTDATA 2 94 #define FS_BINARY_MOUNTDATA 2
95 #define FS_HAS_SUBTYPE 4 95 #define FS_HAS_SUBTYPE 4
96 #define FS_REVAL_DOT 16384 /* Check the paths ".", ".." for staleness */ 96 #define FS_REVAL_DOT 16384 /* Check the paths ".", ".." for staleness */
97 #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() 97 #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move()
98 * during rename() internally. 98 * during rename() internally.
99 */ 99 */
100 100
101 /* 101 /*
102 * These are the fs-independent mount-flags: up to 32 flags are supported 102 * These are the fs-independent mount-flags: up to 32 flags are supported
103 */ 103 */
104 #define MS_RDONLY 1 /* Mount read-only */ 104 #define MS_RDONLY 1 /* Mount read-only */
105 #define MS_NOSUID 2 /* Ignore suid and sgid bits */ 105 #define MS_NOSUID 2 /* Ignore suid and sgid bits */
106 #define MS_NODEV 4 /* Disallow access to device special files */ 106 #define MS_NODEV 4 /* Disallow access to device special files */
107 #define MS_NOEXEC 8 /* Disallow program execution */ 107 #define MS_NOEXEC 8 /* Disallow program execution */
108 #define MS_SYNCHRONOUS 16 /* Writes are synced at once */ 108 #define MS_SYNCHRONOUS 16 /* Writes are synced at once */
109 #define MS_REMOUNT 32 /* Alter flags of a mounted FS */ 109 #define MS_REMOUNT 32 /* Alter flags of a mounted FS */
110 #define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */ 110 #define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
111 #define MS_DIRSYNC 128 /* Directory modifications are synchronous */ 111 #define MS_DIRSYNC 128 /* Directory modifications are synchronous */
112 #define MS_NOATIME 1024 /* Do not update access times. */ 112 #define MS_NOATIME 1024 /* Do not update access times. */
113 #define MS_NODIRATIME 2048 /* Do not update directory access times */ 113 #define MS_NODIRATIME 2048 /* Do not update directory access times */
114 #define MS_BIND 4096 114 #define MS_BIND 4096
115 #define MS_MOVE 8192 115 #define MS_MOVE 8192
116 #define MS_REC 16384 116 #define MS_REC 16384
117 #define MS_VERBOSE 32768 /* War is peace. Verbosity is silence. 117 #define MS_VERBOSE 32768 /* War is peace. Verbosity is silence.
118 MS_VERBOSE is deprecated. */ 118 MS_VERBOSE is deprecated. */
119 #define MS_SILENT 32768 119 #define MS_SILENT 32768
120 #define MS_POSIXACL (1<<16) /* VFS does not apply the umask */ 120 #define MS_POSIXACL (1<<16) /* VFS does not apply the umask */
121 #define MS_UNBINDABLE (1<<17) /* change to unbindable */ 121 #define MS_UNBINDABLE (1<<17) /* change to unbindable */
122 #define MS_PRIVATE (1<<18) /* change to private */ 122 #define MS_PRIVATE (1<<18) /* change to private */
123 #define MS_SLAVE (1<<19) /* change to slave */ 123 #define MS_SLAVE (1<<19) /* change to slave */
124 #define MS_SHARED (1<<20) /* change to shared */ 124 #define MS_SHARED (1<<20) /* change to shared */
125 #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */ 125 #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
126 #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */ 126 #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
127 #define MS_I_VERSION (1<<23) /* Update inode I_version field */ 127 #define MS_I_VERSION (1<<23) /* Update inode I_version field */
128 #define MS_ACTIVE (1<<30) 128 #define MS_ACTIVE (1<<30)
129 #define MS_NOUSER (1<<31) 129 #define MS_NOUSER (1<<31)
130 130
131 /* 131 /*
132 * Superblock flags that can be altered by MS_REMOUNT 132 * Superblock flags that can be altered by MS_REMOUNT
133 */ 133 */
134 #define MS_RMT_MASK (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK) 134 #define MS_RMT_MASK (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK)
135 135
136 /* 136 /*
137 * Old magic mount flag and mask 137 * Old magic mount flag and mask
138 */ 138 */
139 #define MS_MGC_VAL 0xC0ED0000 139 #define MS_MGC_VAL 0xC0ED0000
140 #define MS_MGC_MSK 0xffff0000 140 #define MS_MGC_MSK 0xffff0000
141 141
142 /* Inode flags - they have nothing to do with superblock flags now */ 142 /* Inode flags - they have nothing to do with superblock flags now */
143 143
144 #define S_SYNC 1 /* Writes are synced at once */ 144 #define S_SYNC 1 /* Writes are synced at once */
145 #define S_NOATIME 2 /* Do not update access times */ 145 #define S_NOATIME 2 /* Do not update access times */
146 #define S_APPEND 4 /* Append-only file */ 146 #define S_APPEND 4 /* Append-only file */
147 #define S_IMMUTABLE 8 /* Immutable file */ 147 #define S_IMMUTABLE 8 /* Immutable file */
148 #define S_DEAD 16 /* removed, but still open directory */ 148 #define S_DEAD 16 /* removed, but still open directory */
149 #define S_NOQUOTA 32 /* Inode is not counted to quota */ 149 #define S_NOQUOTA 32 /* Inode is not counted to quota */
150 #define S_DIRSYNC 64 /* Directory modifications are synchronous */ 150 #define S_DIRSYNC 64 /* Directory modifications are synchronous */
151 #define S_NOCMTIME 128 /* Do not update file c/mtime */ 151 #define S_NOCMTIME 128 /* Do not update file c/mtime */
152 #define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */ 152 #define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
153 #define S_PRIVATE 512 /* Inode is fs-internal */ 153 #define S_PRIVATE 512 /* Inode is fs-internal */
154 154
155 /* 155 /*
156 * Note that nosuid etc flags are inode-specific: setting some file-system 156 * Note that nosuid etc flags are inode-specific: setting some file-system
157 * flags just means all the inodes inherit those flags by default. It might be 157 * flags just means all the inodes inherit those flags by default. It might be
158 * possible to override it selectively if you really wanted to with some 158 * possible to override it selectively if you really wanted to with some
159 * ioctl() that is not currently implemented. 159 * ioctl() that is not currently implemented.
160 * 160 *
161 * Exception: MS_RDONLY is always applied to the entire file system. 161 * Exception: MS_RDONLY is always applied to the entire file system.
162 * 162 *
163 * Unfortunately, it is possible to change a filesystem's flags while it is mounted 163 * Unfortunately, it is possible to change a filesystem's flags while it is mounted
164 * with files in use. This means that the existing inodes will not have their 164 * with files in use. This means that the existing inodes will not have their
165 * i_flags updated. Hence, i_flags no longer inherit the superblock mount 165 * i_flags updated. Hence, i_flags no longer inherit the superblock mount
166 * flags, so these have to be checked separately. -- rmk@arm.uk.linux.org 166 * flags, so these have to be checked separately. -- rmk@arm.uk.linux.org
167 */ 167 */
168 #define __IS_FLG(inode,flg) ((inode)->i_sb->s_flags & (flg)) 168 #define __IS_FLG(inode,flg) ((inode)->i_sb->s_flags & (flg))
169 169
170 #define IS_RDONLY(inode) ((inode)->i_sb->s_flags & MS_RDONLY) 170 #define IS_RDONLY(inode) ((inode)->i_sb->s_flags & MS_RDONLY)
171 #define IS_SYNC(inode) (__IS_FLG(inode, MS_SYNCHRONOUS) || \ 171 #define IS_SYNC(inode) (__IS_FLG(inode, MS_SYNCHRONOUS) || \
172 ((inode)->i_flags & S_SYNC)) 172 ((inode)->i_flags & S_SYNC))
173 #define IS_DIRSYNC(inode) (__IS_FLG(inode, MS_SYNCHRONOUS|MS_DIRSYNC) || \ 173 #define IS_DIRSYNC(inode) (__IS_FLG(inode, MS_SYNCHRONOUS|MS_DIRSYNC) || \
174 ((inode)->i_flags & (S_SYNC|S_DIRSYNC))) 174 ((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
175 #define IS_MANDLOCK(inode) __IS_FLG(inode, MS_MANDLOCK) 175 #define IS_MANDLOCK(inode) __IS_FLG(inode, MS_MANDLOCK)
176 #define IS_NOATIME(inode) __IS_FLG(inode, MS_RDONLY|MS_NOATIME) 176 #define IS_NOATIME(inode) __IS_FLG(inode, MS_RDONLY|MS_NOATIME)
177 #define IS_I_VERSION(inode) __IS_FLG(inode, MS_I_VERSION) 177 #define IS_I_VERSION(inode) __IS_FLG(inode, MS_I_VERSION)
178 178
179 #define IS_NOQUOTA(inode) ((inode)->i_flags & S_NOQUOTA) 179 #define IS_NOQUOTA(inode) ((inode)->i_flags & S_NOQUOTA)
180 #define IS_APPEND(inode) ((inode)->i_flags & S_APPEND) 180 #define IS_APPEND(inode) ((inode)->i_flags & S_APPEND)
181 #define IS_IMMUTABLE(inode) ((inode)->i_flags & S_IMMUTABLE) 181 #define IS_IMMUTABLE(inode) ((inode)->i_flags & S_IMMUTABLE)
182 #define IS_POSIXACL(inode) __IS_FLG(inode, MS_POSIXACL) 182 #define IS_POSIXACL(inode) __IS_FLG(inode, MS_POSIXACL)
183 183
184 #define IS_DEADDIR(inode) ((inode)->i_flags & S_DEAD) 184 #define IS_DEADDIR(inode) ((inode)->i_flags & S_DEAD)
185 #define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME) 185 #define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
186 #define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE) 186 #define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
187 #define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE) 187 #define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
188 188
189 /* the read-only stuff doesn't really belong here, but any other place is 189 /* the read-only stuff doesn't really belong here, but any other place is
190 probably as bad and I don't want to create yet another include file. */ 190 probably as bad and I don't want to create yet another include file. */
191 191
192 #define BLKROSET _IO(0x12,93) /* set device read-only (0 = read-write) */ 192 #define BLKROSET _IO(0x12,93) /* set device read-only (0 = read-write) */
193 #define BLKROGET _IO(0x12,94) /* get read-only status (0 = read_write) */ 193 #define BLKROGET _IO(0x12,94) /* get read-only status (0 = read_write) */
194 #define BLKRRPART _IO(0x12,95) /* re-read partition table */ 194 #define BLKRRPART _IO(0x12,95) /* re-read partition table */
195 #define BLKGETSIZE _IO(0x12,96) /* return device size /512 (long *arg) */ 195 #define BLKGETSIZE _IO(0x12,96) /* return device size /512 (long *arg) */
196 #define BLKFLSBUF _IO(0x12,97) /* flush buffer cache */ 196 #define BLKFLSBUF _IO(0x12,97) /* flush buffer cache */
197 #define BLKRASET _IO(0x12,98) /* set read ahead for block device */ 197 #define BLKRASET _IO(0x12,98) /* set read ahead for block device */
198 #define BLKRAGET _IO(0x12,99) /* get current read ahead setting */ 198 #define BLKRAGET _IO(0x12,99) /* get current read ahead setting */
199 #define BLKFRASET _IO(0x12,100)/* set filesystem (mm/filemap.c) read-ahead */ 199 #define BLKFRASET _IO(0x12,100)/* set filesystem (mm/filemap.c) read-ahead */
200 #define BLKFRAGET _IO(0x12,101)/* get filesystem (mm/filemap.c) read-ahead */ 200 #define BLKFRAGET _IO(0x12,101)/* get filesystem (mm/filemap.c) read-ahead */
201 #define BLKSECTSET _IO(0x12,102)/* set max sectors per request (ll_rw_blk.c) */ 201 #define BLKSECTSET _IO(0x12,102)/* set max sectors per request (ll_rw_blk.c) */
202 #define BLKSECTGET _IO(0x12,103)/* get max sectors per request (ll_rw_blk.c) */ 202 #define BLKSECTGET _IO(0x12,103)/* get max sectors per request (ll_rw_blk.c) */
203 #define BLKSSZGET _IO(0x12,104)/* get block device sector size */ 203 #define BLKSSZGET _IO(0x12,104)/* get block device sector size */
204 #if 0 204 #if 0
205 #define BLKPG _IO(0x12,105)/* See blkpg.h */ 205 #define BLKPG _IO(0x12,105)/* See blkpg.h */
206 206
207 /* Some people are morons. Do not use sizeof! */ 207 /* Some people are morons. Do not use sizeof! */
208 208
209 #define BLKELVGET _IOR(0x12,106,size_t)/* elevator get */ 209 #define BLKELVGET _IOR(0x12,106,size_t)/* elevator get */
210 #define BLKELVSET _IOW(0x12,107,size_t)/* elevator set */ 210 #define BLKELVSET _IOW(0x12,107,size_t)/* elevator set */
211 /* This was here just to show that the number is taken - 211 /* This was here just to show that the number is taken -
212 probably all these _IO(0x12,*) ioctls should be moved to blkpg.h. */ 212 probably all these _IO(0x12,*) ioctls should be moved to blkpg.h. */
213 #endif 213 #endif
214 /* A jump here: 108-111 have been used for various private purposes. */ 214 /* A jump here: 108-111 have been used for various private purposes. */
215 #define BLKBSZGET _IOR(0x12,112,size_t) 215 #define BLKBSZGET _IOR(0x12,112,size_t)
216 #define BLKBSZSET _IOW(0x12,113,size_t) 216 #define BLKBSZSET _IOW(0x12,113,size_t)
217 #define BLKGETSIZE64 _IOR(0x12,114,size_t) /* return device size in bytes (u64 *arg) */ 217 #define BLKGETSIZE64 _IOR(0x12,114,size_t) /* return device size in bytes (u64 *arg) */
218 #define BLKTRACESETUP _IOWR(0x12,115,struct blk_user_trace_setup) 218 #define BLKTRACESETUP _IOWR(0x12,115,struct blk_user_trace_setup)
219 #define BLKTRACESTART _IO(0x12,116) 219 #define BLKTRACESTART _IO(0x12,116)
220 #define BLKTRACESTOP _IO(0x12,117) 220 #define BLKTRACESTOP _IO(0x12,117)
221 #define BLKTRACETEARDOWN _IO(0x12,118) 221 #define BLKTRACETEARDOWN _IO(0x12,118)
222 222
223 #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */ 223 #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
224 #define FIBMAP _IO(0x00,1) /* bmap access */ 224 #define FIBMAP _IO(0x00,1) /* bmap access */
225 #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ 225 #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */
226 226
227 #define FS_IOC_GETFLAGS _IOR('f', 1, long) 227 #define FS_IOC_GETFLAGS _IOR('f', 1, long)
228 #define FS_IOC_SETFLAGS _IOW('f', 2, long) 228 #define FS_IOC_SETFLAGS _IOW('f', 2, long)
229 #define FS_IOC_GETVERSION _IOR('v', 1, long) 229 #define FS_IOC_GETVERSION _IOR('v', 1, long)
230 #define FS_IOC_SETVERSION _IOW('v', 2, long) 230 #define FS_IOC_SETVERSION _IOW('v', 2, long)
231 #define FS_IOC32_GETFLAGS _IOR('f', 1, int) 231 #define FS_IOC32_GETFLAGS _IOR('f', 1, int)
232 #define FS_IOC32_SETFLAGS _IOW('f', 2, int) 232 #define FS_IOC32_SETFLAGS _IOW('f', 2, int)
233 #define FS_IOC32_GETVERSION _IOR('v', 1, int) 233 #define FS_IOC32_GETVERSION _IOR('v', 1, int)
234 #define FS_IOC32_SETVERSION _IOW('v', 2, int) 234 #define FS_IOC32_SETVERSION _IOW('v', 2, int)
235 235
236 /* 236 /*
237 * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS) 237 * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
238 */ 238 */
239 #define FS_SECRM_FL 0x00000001 /* Secure deletion */ 239 #define FS_SECRM_FL 0x00000001 /* Secure deletion */
240 #define FS_UNRM_FL 0x00000002 /* Undelete */ 240 #define FS_UNRM_FL 0x00000002 /* Undelete */
241 #define FS_COMPR_FL 0x00000004 /* Compress file */ 241 #define FS_COMPR_FL 0x00000004 /* Compress file */
242 #define FS_SYNC_FL 0x00000008 /* Synchronous updates */ 242 #define FS_SYNC_FL 0x00000008 /* Synchronous updates */
243 #define FS_IMMUTABLE_FL 0x00000010 /* Immutable file */ 243 #define FS_IMMUTABLE_FL 0x00000010 /* Immutable file */
244 #define FS_APPEND_FL 0x00000020 /* writes to file may only append */ 244 #define FS_APPEND_FL 0x00000020 /* writes to file may only append */
245 #define FS_NODUMP_FL 0x00000040 /* do not dump file */ 245 #define FS_NODUMP_FL 0x00000040 /* do not dump file */
246 #define FS_NOATIME_FL 0x00000080 /* do not update atime */ 246 #define FS_NOATIME_FL 0x00000080 /* do not update atime */
247 /* Reserved for compression usage... */ 247 /* Reserved for compression usage... */
248 #define FS_DIRTY_FL 0x00000100 248 #define FS_DIRTY_FL 0x00000100
249 #define FS_COMPRBLK_FL 0x00000200 /* One or more compressed clusters */ 249 #define FS_COMPRBLK_FL 0x00000200 /* One or more compressed clusters */
250 #define FS_NOCOMP_FL 0x00000400 /* Don't compress */ 250 #define FS_NOCOMP_FL 0x00000400 /* Don't compress */
251 #define FS_ECOMPR_FL 0x00000800 /* Compression error */ 251 #define FS_ECOMPR_FL 0x00000800 /* Compression error */
252 /* End compression flags --- maybe not all used */ 252 /* End compression flags --- maybe not all used */
253 #define FS_BTREE_FL 0x00001000 /* btree format dir */ 253 #define FS_BTREE_FL 0x00001000 /* btree format dir */
254 #define FS_INDEX_FL 0x00001000 /* hash-indexed directory */ 254 #define FS_INDEX_FL 0x00001000 /* hash-indexed directory */
255 #define FS_IMAGIC_FL 0x00002000 /* AFS directory */ 255 #define FS_IMAGIC_FL 0x00002000 /* AFS directory */
256 #define FS_JOURNAL_DATA_FL 0x00004000 /* Reserved for ext3 */ 256 #define FS_JOURNAL_DATA_FL 0x00004000 /* Reserved for ext3 */
257 #define FS_NOTAIL_FL 0x00008000 /* file tail should not be merged */ 257 #define FS_NOTAIL_FL 0x00008000 /* file tail should not be merged */
258 #define FS_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */ 258 #define FS_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */
259 #define FS_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/ 259 #define FS_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/
260 #define FS_EXTENT_FL 0x00080000 /* Extents */ 260 #define FS_EXTENT_FL 0x00080000 /* Extents */
261 #define FS_DIRECTIO_FL 0x00100000 /* Use direct i/o */ 261 #define FS_DIRECTIO_FL 0x00100000 /* Use direct i/o */
262 #define FS_RESERVED_FL 0x80000000 /* reserved for ext2 lib */ 262 #define FS_RESERVED_FL 0x80000000 /* reserved for ext2 lib */
263 263
264 #define FS_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */ 264 #define FS_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
265 #define FS_FL_USER_MODIFIABLE 0x000380FF /* User modifiable flags */ 265 #define FS_FL_USER_MODIFIABLE 0x000380FF /* User modifiable flags */
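These are the flags that lsattr(1) and chattr(1) manipulate from userspace. A minimal userspace sketch, assuming an already-open fd on the target file:

	#include <sys/ioctl.h>
	#include <linux/fs.h>

	static int set_noatime_flag(int fd)
	{
		long flags;

		if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
			return -1;
		flags |= FS_NOATIME_FL;			/* per-inode "no atime updates" */
		return ioctl(fd, FS_IOC_SETFLAGS, &flags);
	}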
266 266
267 267
268 #define SYNC_FILE_RANGE_WAIT_BEFORE 1 268 #define SYNC_FILE_RANGE_WAIT_BEFORE 1
269 #define SYNC_FILE_RANGE_WRITE 2 269 #define SYNC_FILE_RANGE_WRITE 2
270 #define SYNC_FILE_RANGE_WAIT_AFTER 4 270 #define SYNC_FILE_RANGE_WAIT_AFTER 4
271 271
272 #ifdef __KERNEL__ 272 #ifdef __KERNEL__
273 273
274 #include <linux/linkage.h> 274 #include <linux/linkage.h>
275 #include <linux/wait.h> 275 #include <linux/wait.h>
276 #include <linux/types.h> 276 #include <linux/types.h>
277 #include <linux/kdev_t.h> 277 #include <linux/kdev_t.h>
278 #include <linux/dcache.h> 278 #include <linux/dcache.h>
279 #include <linux/namei.h> 279 #include <linux/namei.h>
280 #include <linux/stat.h> 280 #include <linux/stat.h>
281 #include <linux/cache.h> 281 #include <linux/cache.h>
282 #include <linux/kobject.h> 282 #include <linux/kobject.h>
283 #include <linux/list.h> 283 #include <linux/list.h>
284 #include <linux/radix-tree.h> 284 #include <linux/radix-tree.h>
285 #include <linux/prio_tree.h> 285 #include <linux/prio_tree.h>
286 #include <linux/init.h> 286 #include <linux/init.h>
287 #include <linux/pid.h> 287 #include <linux/pid.h>
288 #include <linux/mutex.h> 288 #include <linux/mutex.h>
289 #include <linux/capability.h> 289 #include <linux/capability.h>
290 290
291 #include <asm/atomic.h> 291 #include <asm/atomic.h>
292 #include <asm/semaphore.h> 292 #include <asm/semaphore.h>
293 #include <asm/byteorder.h> 293 #include <asm/byteorder.h>
294 294
295 struct export_operations; 295 struct export_operations;
296 struct hd_geometry; 296 struct hd_geometry;
297 struct iovec; 297 struct iovec;
298 struct nameidata; 298 struct nameidata;
299 struct kiocb; 299 struct kiocb;
300 struct pipe_inode_info; 300 struct pipe_inode_info;
301 struct poll_table_struct; 301 struct poll_table_struct;
302 struct kstatfs; 302 struct kstatfs;
303 struct vm_area_struct; 303 struct vm_area_struct;
304 struct vfsmount; 304 struct vfsmount;
305 305
306 extern void __init inode_init(void); 306 extern void __init inode_init(void);
307 extern void __init inode_init_early(void); 307 extern void __init inode_init_early(void);
308 extern void __init mnt_init(void); 308 extern void __init mnt_init(void);
309 extern void __init files_init(unsigned long); 309 extern void __init files_init(unsigned long);
310 310
311 struct buffer_head; 311 struct buffer_head;
312 typedef int (get_block_t)(struct inode *inode, sector_t iblock, 312 typedef int (get_block_t)(struct inode *inode, sector_t iblock,
313 struct buffer_head *bh_result, int create); 313 struct buffer_head *bh_result, int create);
314 typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset, 314 typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
315 ssize_t bytes, void *private); 315 ssize_t bytes, void *private);
316 316
317 /* 317 /*
318 * Attribute flags. These should be or-ed together to figure out what 318 * Attribute flags. These should be or-ed together to figure out what
319 * has been changed! 319 * has been changed!
320 */ 320 */
321 #define ATTR_MODE 1 321 #define ATTR_MODE 1
322 #define ATTR_UID 2 322 #define ATTR_UID 2
323 #define ATTR_GID 4 323 #define ATTR_GID 4
324 #define ATTR_SIZE 8 324 #define ATTR_SIZE 8
325 #define ATTR_ATIME 16 325 #define ATTR_ATIME 16
326 #define ATTR_MTIME 32 326 #define ATTR_MTIME 32
327 #define ATTR_CTIME 64 327 #define ATTR_CTIME 64
328 #define ATTR_ATIME_SET 128 328 #define ATTR_ATIME_SET 128
329 #define ATTR_MTIME_SET 256 329 #define ATTR_MTIME_SET 256
330 #define ATTR_FORCE 512 /* Not a change in itself, but force the update */ 330 #define ATTR_FORCE 512 /* Not a change in itself, but force the update */
331 #define ATTR_ATTR_FLAG 1024 331 #define ATTR_ATTR_FLAG 1024
332 #define ATTR_KILL_SUID 2048 332 #define ATTR_KILL_SUID 2048
333 #define ATTR_KILL_SGID 4096 333 #define ATTR_KILL_SGID 4096
334 #define ATTR_FILE 8192 334 #define ATTR_FILE 8192
335 #define ATTR_KILL_PRIV 16384 335 #define ATTR_KILL_PRIV 16384
336 #define ATTR_OPEN 32768 /* Truncating from open(O_TRUNC) */ 336 #define ATTR_OPEN 32768 /* Truncating from open(O_TRUNC) */
337 337
338 /* 338 /*
339 * This is the Inode Attributes structure, used for notify_change(). It 339 * This is the Inode Attributes structure, used for notify_change(). It
340 * uses the above definitions as flags, to know which values have changed. 340 * uses the above definitions as flags, to know which values have changed.
341 * Also, in this manner, a Filesystem can look at only the values it cares 341 * Also, in this manner, a Filesystem can look at only the values it cares
342 * about. Basically, these are the attributes that the VFS layer can 342 * about. Basically, these are the attributes that the VFS layer can
343 * request to change from the FS layer. 343 * request to change from the FS layer.
344 * 344 *
345 * Derek Atkins <warlord@MIT.EDU> 94-10-20 345 * Derek Atkins <warlord@MIT.EDU> 94-10-20
346 */ 346 */
347 struct iattr { 347 struct iattr {
348 unsigned int ia_valid; 348 unsigned int ia_valid;
349 umode_t ia_mode; 349 umode_t ia_mode;
350 uid_t ia_uid; 350 uid_t ia_uid;
351 gid_t ia_gid; 351 gid_t ia_gid;
352 loff_t ia_size; 352 loff_t ia_size;
353 struct timespec ia_atime; 353 struct timespec ia_atime;
354 struct timespec ia_mtime; 354 struct timespec ia_mtime;
355 struct timespec ia_ctime; 355 struct timespec ia_ctime;
356 356
357 /* 357 /*
358 * Not an attribute, but auxiliary info for filesystems wanting to 358 * Not an attribute, but auxiliary info for filesystems wanting to
359 * implement an ftruncate() like method. NOTE: filesystem should 359 * implement an ftruncate() like method. NOTE: filesystem should
360 * check for (ia_valid & ATTR_FILE), and not for (ia_file != NULL). 360 * check for (ia_valid & ATTR_FILE), and not for (ia_file != NULL).
361 */ 361 */
362 struct file *ia_file; 362 struct file *ia_file;
363 }; 363 };
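A hedged example of the VFS-facing usage: to truncate a file to zero length, a caller fills in only the fields selected by ia_valid and hands the result to notify_change() with i_mutex held (roughly what do_truncate() does):

	struct iattr attr = {
		.ia_valid = ATTR_SIZE,
		.ia_size  = 0,
	};
	int err;

	mutex_lock(&dentry->d_inode->i_mutex);
	err = notify_change(dentry, &attr);	/* checks permissions, then calls ->setattr() */
	mutex_unlock(&dentry->d_inode->i_mutex);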
364 364
365 /* 365 /*
366 * Includes for diskquotas. 366 * Includes for diskquotas.
367 */ 367 */
368 #include <linux/quota.h> 368 #include <linux/quota.h>
369 369
370 /** 370 /**
371 * enum positive_aop_returns - aop return codes with specific semantics 371 * enum positive_aop_returns - aop return codes with specific semantics
372 * 372 *
373 * @AOP_WRITEPAGE_ACTIVATE: Informs the caller that page writeback has 373 * @AOP_WRITEPAGE_ACTIVATE: Informs the caller that page writeback has
374 * completed, that the page is still locked, and 374 * completed, that the page is still locked, and
375 * should be considered active. The VM uses this hint 375 * should be considered active. The VM uses this hint
376 * to return the page to the active list -- it won't 376 * to return the page to the active list -- it won't
377 * be a candidate for writeback again in the near 377 * be a candidate for writeback again in the near
378 * future. Other callers must be careful to unlock 378 * future. Other callers must be careful to unlock
379 * the page if they get this return. Returned by 379 * the page if they get this return. Returned by
380 * writepage(); 380 * writepage();
381 * 381 *
382 * @AOP_TRUNCATED_PAGE: The AOP method that was handed a locked page has 382 * @AOP_TRUNCATED_PAGE: The AOP method that was handed a locked page has
383 * unlocked it and the page might have been truncated. 383 * unlocked it and the page might have been truncated.
384 * The caller should back up to acquiring a new page and 384 * The caller should back up to acquiring a new page and
385 * trying again. The aop will be taking reasonable 385 * trying again. The aop will be taking reasonable
386 * precautions not to livelock. If the caller held a page 386 * precautions not to livelock. If the caller held a page
387 * reference, it should drop it before retrying. Returned 387 * reference, it should drop it before retrying. Returned
388 * by readpage(). 388 * by readpage().
389 * 389 *
390 * address_space_operation functions return these large constants to indicate 390 * address_space_operation functions return these large constants to indicate
391 * special semantics to the caller. These are much larger than the bytes in a 391 * special semantics to the caller. These are much larger than the bytes in a
392 * page to allow for functions that return the number of bytes operated on in a 392 * page to allow for functions that return the number of bytes operated on in a
393 * given page. 393 * given page.
394 */ 394 */
395 395
396 enum positive_aop_returns { 396 enum positive_aop_returns {
397 AOP_WRITEPAGE_ACTIVATE = 0x80000, 397 AOP_WRITEPAGE_ACTIVATE = 0x80000,
398 AOP_TRUNCATED_PAGE = 0x80001, 398 AOP_TRUNCATED_PAGE = 0x80001,
399 }; 399 };
400 400
401 #define AOP_FLAG_UNINTERRUPTIBLE 0x0001 /* will not do a short write */ 401 #define AOP_FLAG_UNINTERRUPTIBLE 0x0001 /* will not do a short write */
402 #define AOP_FLAG_CONT_EXPAND 0x0002 /* called from cont_expand */ 402 #define AOP_FLAG_CONT_EXPAND 0x0002 /* called from cont_expand */
403 403
404 /* 404 /*
405 * oh the beauties of C type declarations. 405 * oh the beauties of C type declarations.
406 */ 406 */
407 struct page; 407 struct page;
408 struct address_space; 408 struct address_space;
409 struct writeback_control; 409 struct writeback_control;
410 410
411 struct iov_iter { 411 struct iov_iter {
412 const struct iovec *iov; 412 const struct iovec *iov;
413 unsigned long nr_segs; 413 unsigned long nr_segs;
414 size_t iov_offset; 414 size_t iov_offset;
415 size_t count; 415 size_t count;
416 }; 416 };
417 417
418 size_t iov_iter_copy_from_user_atomic(struct page *page, 418 size_t iov_iter_copy_from_user_atomic(struct page *page,
419 struct iov_iter *i, unsigned long offset, size_t bytes); 419 struct iov_iter *i, unsigned long offset, size_t bytes);
420 size_t iov_iter_copy_from_user(struct page *page, 420 size_t iov_iter_copy_from_user(struct page *page,
421 struct iov_iter *i, unsigned long offset, size_t bytes); 421 struct iov_iter *i, unsigned long offset, size_t bytes);
422 void iov_iter_advance(struct iov_iter *i, size_t bytes); 422 void iov_iter_advance(struct iov_iter *i, size_t bytes);
423 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes); 423 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
424 size_t iov_iter_single_seg_count(struct iov_iter *i); 424 size_t iov_iter_single_seg_count(struct iov_iter *i);
425 425
426 static inline void iov_iter_init(struct iov_iter *i, 426 static inline void iov_iter_init(struct iov_iter *i,
427 const struct iovec *iov, unsigned long nr_segs, 427 const struct iovec *iov, unsigned long nr_segs,
428 size_t count, size_t written) 428 size_t count, size_t written)
429 { 429 {
430 i->iov = iov; 430 i->iov = iov;
431 i->nr_segs = nr_segs; 431 i->nr_segs = nr_segs;
432 i->iov_offset = 0; 432 i->iov_offset = 0;
433 i->count = count + written; 433 i->count = count + written;
434 434
435 iov_iter_advance(i, written); 435 iov_iter_advance(i, written);
436 } 436 }
437 437
438 static inline size_t iov_iter_count(struct iov_iter *i) 438 static inline size_t iov_iter_count(struct iov_iter *i)
439 { 439 {
440 return i->count; 440 return i->count;
441 } 441 }
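A sketch of how these helpers are meant to be driven; page, offset and bytes stand in for the pagecache page currently being filled and are not defined here:

	struct iov_iter i;
	size_t copied;

	iov_iter_init(&i, iov, nr_segs, count, 0);	/* nothing written yet */
	while (iov_iter_count(&i)) {
		copied = iov_iter_copy_from_user(page, &i, offset, bytes);
		iov_iter_advance(&i, copied);
		/* ... move on to the next page/offset/bytes chunk ... */
	}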
442 442
443 443
444 struct address_space_operations { 444 struct address_space_operations {
445 int (*writepage)(struct page *page, struct writeback_control *wbc); 445 int (*writepage)(struct page *page, struct writeback_control *wbc);
446 int (*readpage)(struct file *, struct page *); 446 int (*readpage)(struct file *, struct page *);
447 void (*sync_page)(struct page *); 447 void (*sync_page)(struct page *);
448 448
449 /* Write back some dirty pages from this mapping. */ 449 /* Write back some dirty pages from this mapping. */
450 int (*writepages)(struct address_space *, struct writeback_control *); 450 int (*writepages)(struct address_space *, struct writeback_control *);
451 451
452 /* Set a page dirty. Return true if this dirtied it */ 452 /* Set a page dirty. Return true if this dirtied it */
453 int (*set_page_dirty)(struct page *page); 453 int (*set_page_dirty)(struct page *page);
454 454
455 int (*readpages)(struct file *filp, struct address_space *mapping, 455 int (*readpages)(struct file *filp, struct address_space *mapping,
456 struct list_head *pages, unsigned nr_pages); 456 struct list_head *pages, unsigned nr_pages);
457 457
458 /* 458 /*
459 * ext3 requires that a successful prepare_write() call be followed 459 * ext3 requires that a successful prepare_write() call be followed
460 * by a commit_write() call - they must be balanced 460 * by a commit_write() call - they must be balanced
461 */ 461 */
462 int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); 462 int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
463 int (*commit_write)(struct file *, struct page *, unsigned, unsigned); 463 int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
464 464
465 int (*write_begin)(struct file *, struct address_space *mapping, 465 int (*write_begin)(struct file *, struct address_space *mapping,
466 loff_t pos, unsigned len, unsigned flags, 466 loff_t pos, unsigned len, unsigned flags,
467 struct page **pagep, void **fsdata); 467 struct page **pagep, void **fsdata);
468 int (*write_end)(struct file *, struct address_space *mapping, 468 int (*write_end)(struct file *, struct address_space *mapping,
469 loff_t pos, unsigned len, unsigned copied, 469 loff_t pos, unsigned len, unsigned copied,
470 struct page *page, void *fsdata); 470 struct page *page, void *fsdata);
471 471
472 /* Unfortunately this kludge is needed for FIBMAP. Don't use it */ 472 /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
473 sector_t (*bmap)(struct address_space *, sector_t); 473 sector_t (*bmap)(struct address_space *, sector_t);
474 void (*invalidatepage) (struct page *, unsigned long); 474 void (*invalidatepage) (struct page *, unsigned long);
475 int (*releasepage) (struct page *, gfp_t); 475 int (*releasepage) (struct page *, gfp_t);
476 ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, 476 ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
477 loff_t offset, unsigned long nr_segs); 477 loff_t offset, unsigned long nr_segs);
478 struct page* (*get_xip_page)(struct address_space *, sector_t, 478 struct page* (*get_xip_page)(struct address_space *, sector_t,
479 int); 479 int);
480 /* migrate the contents of a page to the specified target */ 480 /* migrate the contents of a page to the specified target */
481 int (*migratepage) (struct address_space *, 481 int (*migratepage) (struct address_space *,
482 struct page *, struct page *); 482 struct page *, struct page *);
483 int (*launder_page) (struct page *); 483 int (*launder_page) (struct page *);
484 }; 484 };
485 485
486 /* 486 /*
487 * pagecache_write_begin/pagecache_write_end must be used by general code 487 * pagecache_write_begin/pagecache_write_end must be used by general code
488 * to write into the pagecache. 488 * to write into the pagecache.
489 */ 489 */
490 int pagecache_write_begin(struct file *, struct address_space *mapping, 490 int pagecache_write_begin(struct file *, struct address_space *mapping,
491 loff_t pos, unsigned len, unsigned flags, 491 loff_t pos, unsigned len, unsigned flags,
492 struct page **pagep, void **fsdata); 492 struct page **pagep, void **fsdata);
493 493
494 int pagecache_write_end(struct file *, struct address_space *mapping, 494 int pagecache_write_end(struct file *, struct address_space *mapping,
495 loff_t pos, unsigned len, unsigned copied, 495 loff_t pos, unsigned len, unsigned copied,
496 struct page *page, void *fsdata); 496 struct page *page, void *fsdata);
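A hedged sketch of the begin/copy/end contract these helpers enforce; mapping, pos and len are assumed to describe the write being performed:

	struct page *page;
	void *fsdata;
	int ret;

	ret = pagecache_write_begin(file, mapping, pos, len,
				    AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
	if (ret)
		return ret;
	/* ... copy len bytes into the locked page at pos ... */
	ret = pagecache_write_end(file, mapping, pos, len, len, page, fsdata);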
497 497
498 struct backing_dev_info; 498 struct backing_dev_info;
499 struct address_space { 499 struct address_space {
500 struct inode *host; /* owner: inode, block_device */ 500 struct inode *host; /* owner: inode, block_device */
501 struct radix_tree_root page_tree; /* radix tree of all pages */ 501 struct radix_tree_root page_tree; /* radix tree of all pages */
502 rwlock_t tree_lock; /* and rwlock protecting it */ 502 rwlock_t tree_lock; /* and rwlock protecting it */
503 unsigned int i_mmap_writable;/* count VM_SHARED mappings */ 503 unsigned int i_mmap_writable;/* count VM_SHARED mappings */
504 struct prio_tree_root i_mmap; /* tree of private and shared mappings */ 504 struct prio_tree_root i_mmap; /* tree of private and shared mappings */
505 struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ 505 struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
506 spinlock_t i_mmap_lock; /* protect tree, count, list */ 506 spinlock_t i_mmap_lock; /* protect tree, count, list */
507 unsigned int truncate_count; /* Cover race condition with truncate */ 507 unsigned int truncate_count; /* Cover race condition with truncate */
508 unsigned long nrpages; /* number of total pages */ 508 unsigned long nrpages; /* number of total pages */
509 pgoff_t writeback_index;/* writeback starts here */ 509 pgoff_t writeback_index;/* writeback starts here */
510 const struct address_space_operations *a_ops; /* methods */ 510 const struct address_space_operations *a_ops; /* methods */
511 unsigned long flags; /* error bits/gfp mask */ 511 unsigned long flags; /* error bits/gfp mask */
512 struct backing_dev_info *backing_dev_info; /* device readahead, etc */ 512 struct backing_dev_info *backing_dev_info; /* device readahead, etc */
513 spinlock_t private_lock; /* for use by the address_space */ 513 spinlock_t private_lock; /* for use by the address_space */
514 struct list_head private_list; /* ditto */ 514 struct list_head private_list; /* ditto */
515 struct address_space *assoc_mapping; /* ditto */ 515 struct address_space *assoc_mapping; /* ditto */
516 } __attribute__((aligned(sizeof(long)))); 516 } __attribute__((aligned(sizeof(long))));
517 /* 517 /*
518 * On most architectures that alignment is already the case; but 518 * On most architectures that alignment is already the case; but
519 * must be enforced here for CRIS, to let the least significant bit 519 * must be enforced here for CRIS, to let the least significant bit
520 * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON. 520 * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
521 */ 521 */
522 522
523 struct block_device { 523 struct block_device {
524 dev_t bd_dev; /* not a kdev_t - it's a search key */ 524 dev_t bd_dev; /* not a kdev_t - it's a search key */
525 struct inode * bd_inode; /* will die */ 525 struct inode * bd_inode; /* will die */
526 int bd_openers; 526 int bd_openers;
527 struct mutex bd_mutex; /* open/close mutex */ 527 struct mutex bd_mutex; /* open/close mutex */
528 struct semaphore bd_mount_sem; 528 struct semaphore bd_mount_sem;
529 struct list_head bd_inodes; 529 struct list_head bd_inodes;
530 void * bd_holder; 530 void * bd_holder;
531 int bd_holders; 531 int bd_holders;
532 #ifdef CONFIG_SYSFS 532 #ifdef CONFIG_SYSFS
533 struct list_head bd_holder_list; 533 struct list_head bd_holder_list;
534 #endif 534 #endif
535 struct block_device * bd_contains; 535 struct block_device * bd_contains;
536 unsigned bd_block_size; 536 unsigned bd_block_size;
537 struct hd_struct * bd_part; 537 struct hd_struct * bd_part;
538 /* number of times partitions within this device have been opened. */ 538 /* number of times partitions within this device have been opened. */
539 unsigned bd_part_count; 539 unsigned bd_part_count;
540 int bd_invalidated; 540 int bd_invalidated;
541 struct gendisk * bd_disk; 541 struct gendisk * bd_disk;
542 struct list_head bd_list; 542 struct list_head bd_list;
543 struct backing_dev_info *bd_inode_backing_dev_info; 543 struct backing_dev_info *bd_inode_backing_dev_info;
544 /* 544 /*
545 * Private data. You must have bd_claim'ed the block_device 545 * Private data. You must have bd_claim'ed the block_device
546 * to use this. NOTE: bd_claim allows an owner to claim 546 * to use this. NOTE: bd_claim allows an owner to claim
547 * the same device multiple times, the owner must take special 547 * the same device multiple times, the owner must take special
548 * care to not mess up bd_private for that case. 548 * care to not mess up bd_private for that case.
549 */ 549 */
550 unsigned long bd_private; 550 unsigned long bd_private;
551 }; 551 };
552 552
553 /* 553 /*
554 * Radix-tree tags, for tagging dirty and writeback pages within the pagecache 554 * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
555 * radix trees 555 * radix trees
556 */ 556 */
557 #define PAGECACHE_TAG_DIRTY 0 557 #define PAGECACHE_TAG_DIRTY 0
558 #define PAGECACHE_TAG_WRITEBACK 1 558 #define PAGECACHE_TAG_WRITEBACK 1
559 559
560 int mapping_tagged(struct address_space *mapping, int tag); 560 int mapping_tagged(struct address_space *mapping, int tag);
561 561
562 /* 562 /*
563 * Might pages of this file be mapped into userspace? 563 * Might pages of this file be mapped into userspace?
564 */ 564 */
565 static inline int mapping_mapped(struct address_space *mapping) 565 static inline int mapping_mapped(struct address_space *mapping)
566 { 566 {
567 return !prio_tree_empty(&mapping->i_mmap) || 567 return !prio_tree_empty(&mapping->i_mmap) ||
568 !list_empty(&mapping->i_mmap_nonlinear); 568 !list_empty(&mapping->i_mmap_nonlinear);
569 } 569 }
570 570
571 /* 571 /*
572 * Might pages of this file have been modified in userspace? 572 * Might pages of this file have been modified in userspace?
573 * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap_pgoff 573 * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap_pgoff
574 * marks vma as VM_SHARED if it is shared, and the file was opened for 574 * marks vma as VM_SHARED if it is shared, and the file was opened for
575 * writing i.e. vma may be mprotected writable even if now readonly. 575 * writing i.e. vma may be mprotected writable even if now readonly.
576 */ 576 */
577 static inline int mapping_writably_mapped(struct address_space *mapping) 577 static inline int mapping_writably_mapped(struct address_space *mapping)
578 { 578 {
579 return mapping->i_mmap_writable != 0; 579 return mapping->i_mmap_writable != 0;
580 } 580 }
581 581
582 /* 582 /*
583 * Use sequence counter to get consistent i_size on 32-bit processors. 583 * Use sequence counter to get consistent i_size on 32-bit processors.
584 */ 584 */
585 #if BITS_PER_LONG==32 && defined(CONFIG_SMP) 585 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
586 #include <linux/seqlock.h> 586 #include <linux/seqlock.h>
587 #define __NEED_I_SIZE_ORDERED 587 #define __NEED_I_SIZE_ORDERED
588 #define i_size_ordered_init(inode) seqcount_init(&inode->i_size_seqcount) 588 #define i_size_ordered_init(inode) seqcount_init(&inode->i_size_seqcount)
589 #else 589 #else
590 #define i_size_ordered_init(inode) do { } while (0) 590 #define i_size_ordered_init(inode) do { } while (0)
591 #endif 591 #endif
592 592
593 struct inode { 593 struct inode {
594 struct hlist_node i_hash; 594 struct hlist_node i_hash;
595 struct list_head i_list; 595 struct list_head i_list;
596 struct list_head i_sb_list; 596 struct list_head i_sb_list;
597 struct list_head i_dentry; 597 struct list_head i_dentry;
598 unsigned long i_ino; 598 unsigned long i_ino;
599 atomic_t i_count; 599 atomic_t i_count;
600 unsigned int i_nlink; 600 unsigned int i_nlink;
601 uid_t i_uid; 601 uid_t i_uid;
602 gid_t i_gid; 602 gid_t i_gid;
603 dev_t i_rdev; 603 dev_t i_rdev;
604 u64 i_version; 604 u64 i_version;
605 loff_t i_size; 605 loff_t i_size;
606 #ifdef __NEED_I_SIZE_ORDERED 606 #ifdef __NEED_I_SIZE_ORDERED
607 seqcount_t i_size_seqcount; 607 seqcount_t i_size_seqcount;
608 #endif 608 #endif
609 struct timespec i_atime; 609 struct timespec i_atime;
610 struct timespec i_mtime; 610 struct timespec i_mtime;
611 struct timespec i_ctime; 611 struct timespec i_ctime;
612 unsigned int i_blkbits; 612 unsigned int i_blkbits;
613 blkcnt_t i_blocks; 613 blkcnt_t i_blocks;
614 unsigned short i_bytes; 614 unsigned short i_bytes;
615 umode_t i_mode; 615 umode_t i_mode;
616 spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ 616 spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
617 struct mutex i_mutex; 617 struct mutex i_mutex;
618 struct rw_semaphore i_alloc_sem; 618 struct rw_semaphore i_alloc_sem;
619 const struct inode_operations *i_op; 619 const struct inode_operations *i_op;
620 const struct file_operations *i_fop; /* former ->i_op->default_file_ops */ 620 const struct file_operations *i_fop; /* former ->i_op->default_file_ops */
621 struct super_block *i_sb; 621 struct super_block *i_sb;
622 struct file_lock *i_flock; 622 struct file_lock *i_flock;
623 struct address_space *i_mapping; 623 struct address_space *i_mapping;
624 struct address_space i_data; 624 struct address_space i_data;
625 #ifdef CONFIG_QUOTA 625 #ifdef CONFIG_QUOTA
626 struct dquot *i_dquot[MAXQUOTAS]; 626 struct dquot *i_dquot[MAXQUOTAS];
627 #endif 627 #endif
628 struct list_head i_devices; 628 struct list_head i_devices;
629 union { 629 union {
630 struct pipe_inode_info *i_pipe; 630 struct pipe_inode_info *i_pipe;
631 struct block_device *i_bdev; 631 struct block_device *i_bdev;
632 struct cdev *i_cdev; 632 struct cdev *i_cdev;
633 }; 633 };
634 int i_cindex; 634 int i_cindex;
635 635
636 __u32 i_generation; 636 __u32 i_generation;
637 637
638 #ifdef CONFIG_DNOTIFY 638 #ifdef CONFIG_DNOTIFY
639 unsigned long i_dnotify_mask; /* Directory notify events */ 639 unsigned long i_dnotify_mask; /* Directory notify events */
640 struct dnotify_struct *i_dnotify; /* for directory notifications */ 640 struct dnotify_struct *i_dnotify; /* for directory notifications */
641 #endif 641 #endif
642 642
643 #ifdef CONFIG_INOTIFY 643 #ifdef CONFIG_INOTIFY
644 struct list_head inotify_watches; /* watches on this inode */ 644 struct list_head inotify_watches; /* watches on this inode */
645 struct mutex inotify_mutex; /* protects the watches list */ 645 struct mutex inotify_mutex; /* protects the watches list */
646 #endif 646 #endif
647 647
648 unsigned long i_state; 648 unsigned long i_state;
649 unsigned long dirtied_when; /* jiffies of first dirtying */ 649 unsigned long dirtied_when; /* jiffies of first dirtying */
650 650
651 unsigned int i_flags; 651 unsigned int i_flags;
652 652
653 atomic_t i_writecount; 653 atomic_t i_writecount;
654 #ifdef CONFIG_SECURITY 654 #ifdef CONFIG_SECURITY
655 void *i_security; 655 void *i_security;
656 #endif 656 #endif
657 void *i_private; /* fs or device private pointer */ 657 void *i_private; /* fs or device private pointer */
658 }; 658 };
659 659
660 /* 660 /*
661 * inode->i_mutex nesting subclasses for the lock validator: 661 * inode->i_mutex nesting subclasses for the lock validator:
662 * 662 *
663 * 0: the object of the current VFS operation 663 * 0: the object of the current VFS operation
664 * 1: parent 664 * 1: parent
665 * 2: child/target 665 * 2: child/target
666 * 3: quota file 666 * 3: quota file
667 * 667 *
668 * The locking order between these classes is 668 * The locking order between these classes is
669 * parent -> child -> normal -> xattr -> quota 669 * parent -> child -> normal -> xattr -> quota
670 */ 670 */
671 enum inode_i_mutex_lock_class 671 enum inode_i_mutex_lock_class
672 { 672 {
673 I_MUTEX_NORMAL, 673 I_MUTEX_NORMAL,
674 I_MUTEX_PARENT, 674 I_MUTEX_PARENT,
675 I_MUTEX_CHILD, 675 I_MUTEX_CHILD,
676 I_MUTEX_XATTR, 676 I_MUTEX_XATTR,
677 I_MUTEX_QUOTA 677 I_MUTEX_QUOTA
678 }; 678 };
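The lock classes above exist so that lockdep can validate code that takes two different inodes' i_mutex at once. As a rough, illustrative sketch (assuming the generic mutex_lock_nested() annotation API and hypothetical dir/inode locals), the documented parent -> child order might be expressed as:

	/* take the parent directory first, then the child, and tell
	 * lockdep which nesting class each acquisition belongs to */
	mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
	mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
	/* ... operate on both inodes ... */
	mutex_unlock(&inode->i_mutex);
	mutex_unlock(&dir->i_mutex);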
679 679
680 extern void inode_double_lock(struct inode *inode1, struct inode *inode2); 680 extern void inode_double_lock(struct inode *inode1, struct inode *inode2);
681 extern void inode_double_unlock(struct inode *inode1, struct inode *inode2); 681 extern void inode_double_unlock(struct inode *inode1, struct inode *inode2);
682 682
683 /* 683 /*
684 * NOTE: in a 32bit arch with a preemptable kernel and 684 * NOTE: in a 32bit arch with a preemptable kernel and
685 * an UP compile the i_size_read/write must be atomic 685 * an UP compile the i_size_read/write must be atomic
686 * with respect to the local cpu (unlike with preempt disabled), 686 * with respect to the local cpu (unlike with preempt disabled),
687 * but they don't need to be atomic with respect to other cpus like in 687 * but they don't need to be atomic with respect to other cpus like in
688 * true SMP (so they either need to locally disable irq around 688 * true SMP (so they either need to locally disable irq around
689 * the read or, for example on x86, they can still be implemented as a 689 * the read or, for example on x86, they can still be implemented as a
690 * cmpxchg8b without the need of the lock prefix). For SMP compiles 690 * cmpxchg8b without the need of the lock prefix). For SMP compiles
691 * and 64bit archs it makes no difference if preempt is enabled or not. 691 * and 64bit archs it makes no difference if preempt is enabled or not.
692 */ 692 */
693 static inline loff_t i_size_read(const struct inode *inode) 693 static inline loff_t i_size_read(const struct inode *inode)
694 { 694 {
695 #if BITS_PER_LONG==32 && defined(CONFIG_SMP) 695 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
696 loff_t i_size; 696 loff_t i_size;
697 unsigned int seq; 697 unsigned int seq;
698 698
699 do { 699 do {
700 seq = read_seqcount_begin(&inode->i_size_seqcount); 700 seq = read_seqcount_begin(&inode->i_size_seqcount);
701 i_size = inode->i_size; 701 i_size = inode->i_size;
702 } while (read_seqcount_retry(&inode->i_size_seqcount, seq)); 702 } while (read_seqcount_retry(&inode->i_size_seqcount, seq));
703 return i_size; 703 return i_size;
704 #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT) 704 #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
705 loff_t i_size; 705 loff_t i_size;
706 706
707 preempt_disable(); 707 preempt_disable();
708 i_size = inode->i_size; 708 i_size = inode->i_size;
709 preempt_enable(); 709 preempt_enable();
710 return i_size; 710 return i_size;
711 #else 711 #else
712 return inode->i_size; 712 return inode->i_size;
713 #endif 713 #endif
714 } 714 }
715 715
716 /* 716 /*
717 * NOTE: unlike i_size_read(), i_size_write() does need locking around it 717 * NOTE: unlike i_size_read(), i_size_write() does need locking around it
718 * (normally i_mutex), otherwise on 32bit/SMP an update of i_size_seqcount 718 * (normally i_mutex), otherwise on 32bit/SMP an update of i_size_seqcount
719 * can be lost, resulting in subsequent i_size_read() calls spinning forever. 719 * can be lost, resulting in subsequent i_size_read() calls spinning forever.
720 */ 720 */
721 static inline void i_size_write(struct inode *inode, loff_t i_size) 721 static inline void i_size_write(struct inode *inode, loff_t i_size)
722 { 722 {
723 #if BITS_PER_LONG==32 && defined(CONFIG_SMP) 723 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
724 write_seqcount_begin(&inode->i_size_seqcount); 724 write_seqcount_begin(&inode->i_size_seqcount);
725 inode->i_size = i_size; 725 inode->i_size = i_size;
726 write_seqcount_end(&inode->i_size_seqcount); 726 write_seqcount_end(&inode->i_size_seqcount);
727 #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT) 727 #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
728 preempt_disable(); 728 preempt_disable();
729 inode->i_size = i_size; 729 inode->i_size = i_size;
730 preempt_enable(); 730 preempt_enable();
731 #else 731 #else
732 inode->i_size = i_size; 732 inode->i_size = i_size;
733 #endif 733 #endif
734 } 734 }
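Putting the two helpers together: updates to i_size need external serialization (normally i_mutex), while readers can stay lockless. A minimal sketch, assuming a hypothetical foofs_extend() that is called with i_mutex already held:

	static void foofs_extend(struct inode *inode, loff_t new_size)
	{
		/* caller holds inode->i_mutex (or an equivalent lock) */
		i_size_write(inode, new_size);
		mark_inode_dirty(inode);
	}

	static loff_t foofs_size(struct inode *inode)
	{
		/* lockless: i_size_read() copes with 32-bit tearing itself */
		return i_size_read(inode);
	}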
735 735
736 static inline unsigned iminor(const struct inode *inode) 736 static inline unsigned iminor(const struct inode *inode)
737 { 737 {
738 return MINOR(inode->i_rdev); 738 return MINOR(inode->i_rdev);
739 } 739 }
740 740
741 static inline unsigned imajor(const struct inode *inode) 741 static inline unsigned imajor(const struct inode *inode)
742 { 742 {
743 return MAJOR(inode->i_rdev); 743 return MAJOR(inode->i_rdev);
744 } 744 }
745 745
746 extern struct block_device *I_BDEV(struct inode *inode); 746 extern struct block_device *I_BDEV(struct inode *inode);
747 747
748 struct fown_struct { 748 struct fown_struct {
749 rwlock_t lock; /* protects pid, uid, euid fields */ 749 rwlock_t lock; /* protects pid, uid, euid fields */
750 struct pid *pid; /* pid or -pgrp where SIGIO should be sent */ 750 struct pid *pid; /* pid or -pgrp where SIGIO should be sent */
751 enum pid_type pid_type; /* Kind of process group SIGIO should be sent to */ 751 enum pid_type pid_type; /* Kind of process group SIGIO should be sent to */
752 uid_t uid, euid; /* uid/euid of process setting the owner */ 752 uid_t uid, euid; /* uid/euid of process setting the owner */
753 int signum; /* posix.1b rt signal to be delivered on IO */ 753 int signum; /* posix.1b rt signal to be delivered on IO */
754 }; 754 };
755 755
756 /* 756 /*
757 * Track a single file's readahead state 757 * Track a single file's readahead state
758 */ 758 */
759 struct file_ra_state { 759 struct file_ra_state {
760 pgoff_t start; /* where readahead started */ 760 pgoff_t start; /* where readahead started */
761 unsigned int size; /* # of readahead pages */ 761 unsigned int size; /* # of readahead pages */
762 unsigned int async_size; /* do asynchronous readahead when 762 unsigned int async_size; /* do asynchronous readahead when
763 there are only # of pages ahead */ 763 there are only # of pages ahead */
764 764
765 unsigned int ra_pages; /* Maximum readahead window */ 765 unsigned int ra_pages; /* Maximum readahead window */
766 int mmap_miss; /* Cache miss stat for mmap accesses */ 766 int mmap_miss; /* Cache miss stat for mmap accesses */
767 loff_t prev_pos; /* Cache last read() position */ 767 loff_t prev_pos; /* Cache last read() position */
768 }; 768 };
769 769
770 /* 770 /*
771 * Check if @index falls in the readahead windows. 771 * Check if @index falls in the readahead windows.
772 */ 772 */
773 static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index) 773 static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
774 { 774 {
775 return (index >= ra->start && 775 return (index >= ra->start &&
776 index < ra->start + ra->size); 776 index < ra->start + ra->size);
777 } 777 }
778 778
779 struct file { 779 struct file {
780 /* 780 /*
781 * fu_list becomes invalid after file_free is called and queued via 781 * fu_list becomes invalid after file_free is called and queued via
782 * fu_rcuhead for RCU freeing 782 * fu_rcuhead for RCU freeing
783 */ 783 */
784 union { 784 union {
785 struct list_head fu_list; 785 struct list_head fu_list;
786 struct rcu_head fu_rcuhead; 786 struct rcu_head fu_rcuhead;
787 } f_u; 787 } f_u;
788 struct path f_path; 788 struct path f_path;
789 #define f_dentry f_path.dentry 789 #define f_dentry f_path.dentry
790 #define f_vfsmnt f_path.mnt 790 #define f_vfsmnt f_path.mnt
791 const struct file_operations *f_op; 791 const struct file_operations *f_op;
792 atomic_t f_count; 792 atomic_t f_count;
793 unsigned int f_flags; 793 unsigned int f_flags;
794 mode_t f_mode; 794 mode_t f_mode;
795 loff_t f_pos; 795 loff_t f_pos;
796 struct fown_struct f_owner; 796 struct fown_struct f_owner;
797 unsigned int f_uid, f_gid; 797 unsigned int f_uid, f_gid;
798 struct file_ra_state f_ra; 798 struct file_ra_state f_ra;
799 799
800 u64 f_version; 800 u64 f_version;
801 #ifdef CONFIG_SECURITY 801 #ifdef CONFIG_SECURITY
802 void *f_security; 802 void *f_security;
803 #endif 803 #endif
804 /* needed for tty driver, and maybe others */ 804 /* needed for tty driver, and maybe others */
805 void *private_data; 805 void *private_data;
806 806
807 #ifdef CONFIG_EPOLL 807 #ifdef CONFIG_EPOLL
808 /* Used by fs/eventpoll.c to link all the hooks to this file */ 808 /* Used by fs/eventpoll.c to link all the hooks to this file */
809 struct list_head f_ep_links; 809 struct list_head f_ep_links;
810 spinlock_t f_ep_lock; 810 spinlock_t f_ep_lock;
811 #endif /* #ifdef CONFIG_EPOLL */ 811 #endif /* #ifdef CONFIG_EPOLL */
812 struct address_space *f_mapping; 812 struct address_space *f_mapping;
813 }; 813 };
814 extern spinlock_t files_lock; 814 extern spinlock_t files_lock;
815 #define file_list_lock() spin_lock(&files_lock); 815 #define file_list_lock() spin_lock(&files_lock);
816 #define file_list_unlock() spin_unlock(&files_lock); 816 #define file_list_unlock() spin_unlock(&files_lock);
817 817
818 #define get_file(x) atomic_inc(&(x)->f_count) 818 #define get_file(x) atomic_inc(&(x)->f_count)
819 #define file_count(x) atomic_read(&(x)->f_count) 819 #define file_count(x) atomic_read(&(x)->f_count)
820 820
821 #define MAX_NON_LFS ((1UL<<31) - 1) 821 #define MAX_NON_LFS ((1UL<<31) - 1)
822 822
823 /* Page cache limit. The filesystems should put that into their s_maxbytes 823 /* Page cache limit. The filesystems should put that into their s_maxbytes
824 limits, otherwise bad things can happen in VM. */ 824 limits, otherwise bad things can happen in VM. */
825 #if BITS_PER_LONG==32 825 #if BITS_PER_LONG==32
826 #define MAX_LFS_FILESIZE (((u64)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1) 826 #define MAX_LFS_FILESIZE (((u64)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1)
827 #elif BITS_PER_LONG==64 827 #elif BITS_PER_LONG==64
828 #define MAX_LFS_FILESIZE 0x7fffffffffffffffUL 828 #define MAX_LFS_FILESIZE 0x7fffffffffffffffUL
829 #endif 829 #endif
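As the comment above says, a filesystem is expected to clamp its own s_maxbytes to this page cache limit. A hedged sketch of what that looks like in a fill_super routine (foofs_fill_super and the block size choice are hypothetical):

	static int foofs_fill_super(struct super_block *sb, void *data, int silent)
	{
		sb->s_maxbytes = MAX_LFS_FILESIZE;	/* page cache imposed cap */
		sb->s_blocksize = PAGE_CACHE_SIZE;
		sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
		/* ... read the on-disk superblock, set sb->s_op,
		 *     look up the root inode and d_alloc_root() it ... */
		return 0;
	}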
830 830
831 #define FL_POSIX 1 831 #define FL_POSIX 1
832 #define FL_FLOCK 2 832 #define FL_FLOCK 2
833 #define FL_ACCESS 8 /* not trying to lock, just looking */ 833 #define FL_ACCESS 8 /* not trying to lock, just looking */
834 #define FL_EXISTS 16 /* when unlocking, test for existence */ 834 #define FL_EXISTS 16 /* when unlocking, test for existence */
835 #define FL_LEASE 32 /* lease held on this file */ 835 #define FL_LEASE 32 /* lease held on this file */
836 #define FL_CLOSE 64 /* unlock on close */ 836 #define FL_CLOSE 64 /* unlock on close */
837 #define FL_SLEEP 128 /* A blocking lock */ 837 #define FL_SLEEP 128 /* A blocking lock */
838 838
839 /* 839 /*
840 * The POSIX file lock owner is determined by 840 * The POSIX file lock owner is determined by
841 * the "struct files_struct" in the thread group 841 * the "struct files_struct" in the thread group
842 * (or NULL for no owner - BSD locks). 842 * (or NULL for no owner - BSD locks).
843 * 843 *
844 * Lockd stuffs a "host" pointer into this. 844 * Lockd stuffs a "host" pointer into this.
845 */ 845 */
846 typedef struct files_struct *fl_owner_t; 846 typedef struct files_struct *fl_owner_t;
847 847
848 struct file_lock_operations { 848 struct file_lock_operations {
849 void (*fl_insert)(struct file_lock *); /* lock insertion callback */ 849 void (*fl_insert)(struct file_lock *); /* lock insertion callback */
850 void (*fl_remove)(struct file_lock *); /* lock removal callback */ 850 void (*fl_remove)(struct file_lock *); /* lock removal callback */
851 void (*fl_copy_lock)(struct file_lock *, struct file_lock *); 851 void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
852 void (*fl_release_private)(struct file_lock *); 852 void (*fl_release_private)(struct file_lock *);
853 }; 853 };
854 854
855 struct lock_manager_operations { 855 struct lock_manager_operations {
856 int (*fl_compare_owner)(struct file_lock *, struct file_lock *); 856 int (*fl_compare_owner)(struct file_lock *, struct file_lock *);
857 void (*fl_notify)(struct file_lock *); /* unblock callback */ 857 void (*fl_notify)(struct file_lock *); /* unblock callback */
858 int (*fl_grant)(struct file_lock *, struct file_lock *, int); 858 int (*fl_grant)(struct file_lock *, struct file_lock *, int);
859 void (*fl_copy_lock)(struct file_lock *, struct file_lock *); 859 void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
860 void (*fl_release_private)(struct file_lock *); 860 void (*fl_release_private)(struct file_lock *);
861 void (*fl_break)(struct file_lock *); 861 void (*fl_break)(struct file_lock *);
862 int (*fl_mylease)(struct file_lock *, struct file_lock *); 862 int (*fl_mylease)(struct file_lock *, struct file_lock *);
863 int (*fl_change)(struct file_lock **, int); 863 int (*fl_change)(struct file_lock **, int);
864 }; 864 };
865 865
866 /* that will die - we need it for nfs_lock_info */ 866 /* that will die - we need it for nfs_lock_info */
867 #include <linux/nfs_fs_i.h> 867 #include <linux/nfs_fs_i.h>
868 868
869 struct file_lock { 869 struct file_lock {
870 struct file_lock *fl_next; /* singly linked list for this inode */ 870 struct file_lock *fl_next; /* singly linked list for this inode */
871 struct list_head fl_link; /* doubly linked list of all locks */ 871 struct list_head fl_link; /* doubly linked list of all locks */
872 struct list_head fl_block; /* circular list of blocked processes */ 872 struct list_head fl_block; /* circular list of blocked processes */
873 fl_owner_t fl_owner; 873 fl_owner_t fl_owner;
874 unsigned int fl_pid; 874 unsigned int fl_pid;
875 struct pid *fl_nspid; 875 struct pid *fl_nspid;
876 wait_queue_head_t fl_wait; 876 wait_queue_head_t fl_wait;
877 struct file *fl_file; 877 struct file *fl_file;
878 unsigned char fl_flags; 878 unsigned char fl_flags;
879 unsigned char fl_type; 879 unsigned char fl_type;
880 loff_t fl_start; 880 loff_t fl_start;
881 loff_t fl_end; 881 loff_t fl_end;
882 882
883 struct fasync_struct * fl_fasync; /* for lease break notifications */ 883 struct fasync_struct * fl_fasync; /* for lease break notifications */
884 unsigned long fl_break_time; /* for nonblocking lease breaks */ 884 unsigned long fl_break_time; /* for nonblocking lease breaks */
885 885
886 struct file_lock_operations *fl_ops; /* Callbacks for filesystems */ 886 struct file_lock_operations *fl_ops; /* Callbacks for filesystems */
887 struct lock_manager_operations *fl_lmops; /* Callbacks for lockmanagers */ 887 struct lock_manager_operations *fl_lmops; /* Callbacks for lockmanagers */
888 union { 888 union {
889 struct nfs_lock_info nfs_fl; 889 struct nfs_lock_info nfs_fl;
890 struct nfs4_lock_info nfs4_fl; 890 struct nfs4_lock_info nfs4_fl;
891 struct { 891 struct {
892 struct list_head link; /* link in AFS vnode's pending_locks list */ 892 struct list_head link; /* link in AFS vnode's pending_locks list */
893 int state; /* state of grant or error if -ve */ 893 int state; /* state of grant or error if -ve */
894 } afs; 894 } afs;
895 } fl_u; 895 } fl_u;
896 }; 896 };
897 897
898 /* The following constant reflects the upper bound of the file/locking space */ 898 /* The following constant reflects the upper bound of the file/locking space */
899 #ifndef OFFSET_MAX 899 #ifndef OFFSET_MAX
900 #define INT_LIMIT(x) (~((x)1 << (sizeof(x)*8 - 1))) 900 #define INT_LIMIT(x) (~((x)1 << (sizeof(x)*8 - 1)))
901 #define OFFSET_MAX INT_LIMIT(loff_t) 901 #define OFFSET_MAX INT_LIMIT(loff_t)
902 #define OFFT_OFFSET_MAX INT_LIMIT(off_t) 902 #define OFFT_OFFSET_MAX INT_LIMIT(off_t)
903 #endif 903 #endif
904 904
905 #include <linux/fcntl.h> 905 #include <linux/fcntl.h>
906 906
907 extern int fcntl_getlk(struct file *, struct flock __user *); 907 extern int fcntl_getlk(struct file *, struct flock __user *);
908 extern int fcntl_setlk(unsigned int, struct file *, unsigned int, 908 extern int fcntl_setlk(unsigned int, struct file *, unsigned int,
909 struct flock __user *); 909 struct flock __user *);
910 910
911 #if BITS_PER_LONG == 32 911 #if BITS_PER_LONG == 32
912 extern int fcntl_getlk64(struct file *, struct flock64 __user *); 912 extern int fcntl_getlk64(struct file *, struct flock64 __user *);
913 extern int fcntl_setlk64(unsigned int, struct file *, unsigned int, 913 extern int fcntl_setlk64(unsigned int, struct file *, unsigned int,
914 struct flock64 __user *); 914 struct flock64 __user *);
915 #endif 915 #endif
916 916
917 extern void send_sigio(struct fown_struct *fown, int fd, int band); 917 extern void send_sigio(struct fown_struct *fown, int fd, int band);
918 extern int fcntl_setlease(unsigned int fd, struct file *filp, long arg); 918 extern int fcntl_setlease(unsigned int fd, struct file *filp, long arg);
919 extern int fcntl_getlease(struct file *filp); 919 extern int fcntl_getlease(struct file *filp);
920 920
921 /* fs/sync.c */ 921 /* fs/sync.c */
922 extern int do_sync_mapping_range(struct address_space *mapping, loff_t offset, 922 extern int do_sync_mapping_range(struct address_space *mapping, loff_t offset,
923 loff_t endbyte, unsigned int flags); 923 loff_t endbyte, unsigned int flags);
924 924
925 /* fs/locks.c */ 925 /* fs/locks.c */
926 extern void locks_init_lock(struct file_lock *); 926 extern void locks_init_lock(struct file_lock *);
927 extern void locks_copy_lock(struct file_lock *, struct file_lock *); 927 extern void locks_copy_lock(struct file_lock *, struct file_lock *);
928 extern void locks_remove_posix(struct file *, fl_owner_t); 928 extern void locks_remove_posix(struct file *, fl_owner_t);
929 extern void locks_remove_flock(struct file *); 929 extern void locks_remove_flock(struct file *);
930 extern void posix_test_lock(struct file *, struct file_lock *); 930 extern void posix_test_lock(struct file *, struct file_lock *);
931 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *); 931 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
932 extern int posix_lock_file_wait(struct file *, struct file_lock *); 932 extern int posix_lock_file_wait(struct file *, struct file_lock *);
933 extern int posix_unblock_lock(struct file *, struct file_lock *); 933 extern int posix_unblock_lock(struct file *, struct file_lock *);
934 extern int vfs_test_lock(struct file *, struct file_lock *); 934 extern int vfs_test_lock(struct file *, struct file_lock *);
935 extern int vfs_lock_file(struct file *, unsigned int, struct file_lock *, struct file_lock *); 935 extern int vfs_lock_file(struct file *, unsigned int, struct file_lock *, struct file_lock *);
936 extern int vfs_cancel_lock(struct file *filp, struct file_lock *fl); 936 extern int vfs_cancel_lock(struct file *filp, struct file_lock *fl);
937 extern int flock_lock_file_wait(struct file *filp, struct file_lock *fl); 937 extern int flock_lock_file_wait(struct file *filp, struct file_lock *fl);
938 extern int __break_lease(struct inode *inode, unsigned int flags); 938 extern int __break_lease(struct inode *inode, unsigned int flags);
939 extern void lease_get_mtime(struct inode *, struct timespec *time); 939 extern void lease_get_mtime(struct inode *, struct timespec *time);
940 extern int generic_setlease(struct file *, long, struct file_lock **); 940 extern int generic_setlease(struct file *, long, struct file_lock **);
941 extern int vfs_setlease(struct file *, long, struct file_lock **); 941 extern int vfs_setlease(struct file *, long, struct file_lock **);
942 extern int lease_modify(struct file_lock **, int); 942 extern int lease_modify(struct file_lock **, int);
943 extern int lock_may_read(struct inode *, loff_t start, unsigned long count); 943 extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
944 extern int lock_may_write(struct inode *, loff_t start, unsigned long count); 944 extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
945 extern struct seq_operations locks_seq_operations; 945 extern struct seq_operations locks_seq_operations;
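These exports let a filesystem's ->lock method fall back to purely local POSIX locking when there is no remote lock server involved. A rough sketch (foofs_lock is hypothetical; IS_GETLK is the usual fcntl command test):

	static int foofs_lock(struct file *filp, int cmd, struct file_lock *fl)
	{
		if (IS_GETLK(cmd)) {
			posix_test_lock(filp, fl);	/* reports any conflict in *fl */
			return 0;
		}
		/* F_SETLK/F_SETLKW: take, or wait for, the lock locally */
		return posix_lock_file_wait(filp, fl);
	}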
946 946
947 struct fasync_struct { 947 struct fasync_struct {
948 int magic; 948 int magic;
949 int fa_fd; 949 int fa_fd;
950 struct fasync_struct *fa_next; /* singly linked list */ 950 struct fasync_struct *fa_next; /* singly linked list */
951 struct file *fa_file; 951 struct file *fa_file;
952 }; 952 };
953 953
954 #define FASYNC_MAGIC 0x4601 954 #define FASYNC_MAGIC 0x4601
955 955
956 /* SMP safe fasync helpers: */ 956 /* SMP safe fasync helpers: */
957 extern int fasync_helper(int, struct file *, int, struct fasync_struct **); 957 extern int fasync_helper(int, struct file *, int, struct fasync_struct **);
958 /* can be called from interrupts */ 958 /* can be called from interrupts */
959 extern void kill_fasync(struct fasync_struct **, int, int); 959 extern void kill_fasync(struct fasync_struct **, int, int);
960 /* only for net: no internal synchronization */ 960 /* only for net: no internal synchronization */
961 extern void __kill_fasync(struct fasync_struct *, int, int); 961 extern void __kill_fasync(struct fasync_struct *, int, int);
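A typical user of these helpers is a driver that keeps its own fasync list: the ->fasync file operation registers or unregisters the caller, and kill_fasync() is fired when new data shows up. A sketch with hypothetical foo_* names:

	static struct fasync_struct *foo_fasync_queue;

	static int foo_fasync(int fd, struct file *filp, int on)
	{
		return fasync_helper(fd, filp, on, &foo_fasync_queue);
	}

	static void foo_data_ready(void)
	{
		/* safe from interrupt context, as noted above */
		kill_fasync(&foo_fasync_queue, SIGIO, POLL_IN);
	}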
962 962
963 extern int __f_setown(struct file *filp, struct pid *, enum pid_type, int force); 963 extern int __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
964 extern int f_setown(struct file *filp, unsigned long arg, int force); 964 extern int f_setown(struct file *filp, unsigned long arg, int force);
965 extern void f_delown(struct file *filp); 965 extern void f_delown(struct file *filp);
966 extern pid_t f_getown(struct file *filp); 966 extern pid_t f_getown(struct file *filp);
967 extern int send_sigurg(struct fown_struct *fown); 967 extern int send_sigurg(struct fown_struct *fown);
968 968
969 /* 969 /*
970 * Umount options 970 * Umount options
971 */ 971 */
972 972
973 #define MNT_FORCE 0x00000001 /* Attempt to forcibly umount */ 973 #define MNT_FORCE 0x00000001 /* Attempt to forcibly umount */
974 #define MNT_DETACH 0x00000002 /* Just detach from the tree */ 974 #define MNT_DETACH 0x00000002 /* Just detach from the tree */
975 #define MNT_EXPIRE 0x00000004 /* Mark for expiry */ 975 #define MNT_EXPIRE 0x00000004 /* Mark for expiry */
976 976
977 extern struct list_head super_blocks; 977 extern struct list_head super_blocks;
978 extern spinlock_t sb_lock; 978 extern spinlock_t sb_lock;
979 979
980 #define S_BIAS (1<<30) 980 #define S_BIAS (1<<30)
981 struct super_block { 981 struct super_block {
982 struct list_head s_list; /* Keep this first */ 982 struct list_head s_list; /* Keep this first */
983 dev_t s_dev; /* search index; _not_ kdev_t */ 983 dev_t s_dev; /* search index; _not_ kdev_t */
984 unsigned long s_blocksize; 984 unsigned long s_blocksize;
985 unsigned char s_blocksize_bits; 985 unsigned char s_blocksize_bits;
986 unsigned char s_dirt; 986 unsigned char s_dirt;
987 unsigned long long s_maxbytes; /* Max file size */ 987 unsigned long long s_maxbytes; /* Max file size */
988 struct file_system_type *s_type; 988 struct file_system_type *s_type;
989 const struct super_operations *s_op; 989 const struct super_operations *s_op;
990 struct dquot_operations *dq_op; 990 struct dquot_operations *dq_op;
991 struct quotactl_ops *s_qcop; 991 struct quotactl_ops *s_qcop;
992 const struct export_operations *s_export_op; 992 const struct export_operations *s_export_op;
993 unsigned long s_flags; 993 unsigned long s_flags;
994 unsigned long s_magic; 994 unsigned long s_magic;
995 struct dentry *s_root; 995 struct dentry *s_root;
996 struct rw_semaphore s_umount; 996 struct rw_semaphore s_umount;
997 struct mutex s_lock; 997 struct mutex s_lock;
998 int s_count; 998 int s_count;
999 int s_syncing; 999 int s_syncing;
1000 int s_need_sync_fs; 1000 int s_need_sync_fs;
1001 atomic_t s_active; 1001 atomic_t s_active;
1002 #ifdef CONFIG_SECURITY 1002 #ifdef CONFIG_SECURITY
1003 void *s_security; 1003 void *s_security;
1004 #endif 1004 #endif
1005 struct xattr_handler **s_xattr; 1005 struct xattr_handler **s_xattr;
1006 1006
1007 struct list_head s_inodes; /* all inodes */ 1007 struct list_head s_inodes; /* all inodes */
1008 struct list_head s_dirty; /* dirty inodes */ 1008 struct list_head s_dirty; /* dirty inodes */
1009 struct list_head s_io; /* parked for writeback */ 1009 struct list_head s_io; /* parked for writeback */
1010 struct list_head s_more_io; /* parked for more writeback */ 1010 struct list_head s_more_io; /* parked for more writeback */
1011 struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */ 1011 struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
1012 struct list_head s_files; 1012 struct list_head s_files;
1013 1013
1014 struct block_device *s_bdev; 1014 struct block_device *s_bdev;
1015 struct mtd_info *s_mtd; 1015 struct mtd_info *s_mtd;
1016 struct list_head s_instances; 1016 struct list_head s_instances;
1017 struct quota_info s_dquot; /* Diskquota specific options */ 1017 struct quota_info s_dquot; /* Diskquota specific options */
1018 1018
1019 int s_frozen; 1019 int s_frozen;
1020 wait_queue_head_t s_wait_unfrozen; 1020 wait_queue_head_t s_wait_unfrozen;
1021 1021
1022 char s_id[32]; /* Informational name */ 1022 char s_id[32]; /* Informational name */
1023 1023
1024 void *s_fs_info; /* Filesystem private info */ 1024 void *s_fs_info; /* Filesystem private info */
1025 1025
1026 /* 1026 /*
1027 * The next field is for VFS *only*. No filesystems have any business 1027 * The next field is for VFS *only*. No filesystems have any business
1028 * even looking at it. You have been warned. 1028 * even looking at it. You have been warned.
1029 */ 1029 */
1030 struct mutex s_vfs_rename_mutex; /* Kludge */ 1030 struct mutex s_vfs_rename_mutex; /* Kludge */
1031 1031
1032 /* Granularity of c/m/atime in ns. 1032 /* Granularity of c/m/atime in ns.
1033 Cannot be worse than a second */ 1033 Cannot be worse than a second */
1034 u32 s_time_gran; 1034 u32 s_time_gran;
1035 1035
1036 /* 1036 /*
1037 * Filesystem subtype. If non-empty the filesystem type field 1037 * Filesystem subtype. If non-empty the filesystem type field
1038 * in /proc/mounts will be "type.subtype" 1038 * in /proc/mounts will be "type.subtype"
1039 */ 1039 */
1040 char *s_subtype; 1040 char *s_subtype;
1041 }; 1041 };
1042 1042
1043 extern struct timespec current_fs_time(struct super_block *sb); 1043 extern struct timespec current_fs_time(struct super_block *sb);
1044 1044
1045 /* 1045 /*
1046 * Snapshotting support. 1046 * Snapshotting support.
1047 */ 1047 */
1048 enum { 1048 enum {
1049 SB_UNFROZEN = 0, 1049 SB_UNFROZEN = 0,
1050 SB_FREEZE_WRITE = 1, 1050 SB_FREEZE_WRITE = 1,
1051 SB_FREEZE_TRANS = 2, 1051 SB_FREEZE_TRANS = 2,
1052 }; 1052 };
1053 1053
1054 #define vfs_check_frozen(sb, level) \ 1054 #define vfs_check_frozen(sb, level) \
1055 wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level))) 1055 wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
1056 1056
1057 #define get_fs_excl() atomic_inc(&current->fs_excl) 1057 #define get_fs_excl() atomic_inc(&current->fs_excl)
1058 #define put_fs_excl() atomic_dec(&current->fs_excl) 1058 #define put_fs_excl() atomic_dec(&current->fs_excl)
1059 #define has_fs_excl() atomic_read(&current->fs_excl) 1059 #define has_fs_excl() atomic_read(&current->fs_excl)
1060 1060
1061 #define is_owner_or_cap(inode) \ 1061 #define is_owner_or_cap(inode) \
1062 ((current->fsuid == (inode)->i_uid) || capable(CAP_FOWNER)) 1062 ((current->fsuid == (inode)->i_uid) || capable(CAP_FOWNER))
1063 1063
1064 /* not quite ready to be deprecated, but... */ 1064 /* not quite ready to be deprecated, but... */
1065 extern void lock_super(struct super_block *); 1065 extern void lock_super(struct super_block *);
1066 extern void unlock_super(struct super_block *); 1066 extern void unlock_super(struct super_block *);
1067 1067
1068 /* 1068 /*
1069 * VFS helper functions.. 1069 * VFS helper functions..
1070 */ 1070 */
1071 extern int vfs_permission(struct nameidata *, int); 1071 extern int vfs_permission(struct nameidata *, int);
1072 extern int vfs_create(struct inode *, struct dentry *, int, struct nameidata *); 1072 extern int vfs_create(struct inode *, struct dentry *, int, struct nameidata *);
1073 extern int vfs_mkdir(struct inode *, struct dentry *, int); 1073 extern int vfs_mkdir(struct inode *, struct dentry *, int);
1074 extern int vfs_mknod(struct inode *, struct dentry *, int, dev_t); 1074 extern int vfs_mknod(struct inode *, struct dentry *, int, dev_t);
1075 extern int vfs_symlink(struct inode *, struct dentry *, const char *, int); 1075 extern int vfs_symlink(struct inode *, struct dentry *, const char *, int);
1076 extern int vfs_link(struct dentry *, struct inode *, struct dentry *); 1076 extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
1077 extern int vfs_rmdir(struct inode *, struct dentry *); 1077 extern int vfs_rmdir(struct inode *, struct dentry *);
1078 extern int vfs_unlink(struct inode *, struct dentry *); 1078 extern int vfs_unlink(struct inode *, struct dentry *);
1079 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *); 1079 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
1080 1080
1081 /* 1081 /*
1082 * VFS dentry helper functions. 1082 * VFS dentry helper functions.
1083 */ 1083 */
1084 extern void dentry_unhash(struct dentry *dentry); 1084 extern void dentry_unhash(struct dentry *dentry);
1085 1085
1086 /* 1086 /*
1087 * VFS file helper functions. 1087 * VFS file helper functions.
1088 */ 1088 */
1089 extern int file_permission(struct file *, int); 1089 extern int file_permission(struct file *, int);
1090 1090
1091 /* 1091 /*
1092 * File types 1092 * File types
1093 * 1093 *
1094 * NOTE! These match bits 12..15 of stat.st_mode 1094 * NOTE! These match bits 12..15 of stat.st_mode
1095 * (ie "(i_mode >> 12) & 15"). 1095 * (ie "(i_mode >> 12) & 15").
1096 */ 1096 */
1097 #define DT_UNKNOWN 0 1097 #define DT_UNKNOWN 0
1098 #define DT_FIFO 1 1098 #define DT_FIFO 1
1099 #define DT_CHR 2 1099 #define DT_CHR 2
1100 #define DT_DIR 4 1100 #define DT_DIR 4
1101 #define DT_BLK 6 1101 #define DT_BLK 6
1102 #define DT_REG 8 1102 #define DT_REG 8
1103 #define DT_LNK 10 1103 #define DT_LNK 10
1104 #define DT_SOCK 12 1104 #define DT_SOCK 12
1105 #define DT_WHT 14 1105 #define DT_WHT 14
1106 1106
1107 #define OSYNC_METADATA (1<<0) 1107 #define OSYNC_METADATA (1<<0)
1108 #define OSYNC_DATA (1<<1) 1108 #define OSYNC_DATA (1<<1)
1109 #define OSYNC_INODE (1<<2) 1109 #define OSYNC_INODE (1<<2)
1110 int generic_osync_inode(struct inode *, struct address_space *, int); 1110 int generic_osync_inode(struct inode *, struct address_space *, int);
1111 1111
1112 /* 1112 /*
1113 * This is the "filldir" function type, used by readdir() to let 1113 * This is the "filldir" function type, used by readdir() to let
1114 * the kernel specify what kind of dirent layout it wants to have. 1114 * the kernel specify what kind of dirent layout it wants to have.
1115 * This allows the kernel to read directories into kernel space or 1115 * This allows the kernel to read directories into kernel space or
1116 * to have different dirent layouts depending on the binary type. 1116 * to have different dirent layouts depending on the binary type.
1117 */ 1117 */
1118 typedef int (*filldir_t)(void *, const char *, int, loff_t, u64, unsigned); 1118 typedef int (*filldir_t)(void *, const char *, int, loff_t, u64, unsigned);
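A ->readdir method calls the filldir callback once per entry and stops when it returns a negative value (the destination buffer is full). A hedged sketch of the usual opening moves, with a hypothetical foofs_readdir and assuming the parent_ino() helper is available:

	static int foofs_readdir(struct file *filp, void *dirent, filldir_t filldir)
	{
		struct dentry *dentry = filp->f_path.dentry;
		struct inode *inode = dentry->d_inode;

		if (filp->f_pos == 0) {
			if (filldir(dirent, ".", 1, filp->f_pos,
				    inode->i_ino, DT_DIR) < 0)
				return 0;
			filp->f_pos++;
		}
		if (filp->f_pos == 1) {
			if (filldir(dirent, "..", 2, filp->f_pos,
				    parent_ino(dentry), DT_DIR) < 0)
				return 0;
			filp->f_pos++;
		}
		/* ... real entries follow; d_type is (i_mode >> 12) & 15 ... */
		return 0;
	}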
1119 1119
1120 struct block_device_operations { 1120 struct block_device_operations {
1121 int (*open) (struct inode *, struct file *); 1121 int (*open) (struct inode *, struct file *);
1122 int (*release) (struct inode *, struct file *); 1122 int (*release) (struct inode *, struct file *);
1123 int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long); 1123 int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
1124 long (*unlocked_ioctl) (struct file *, unsigned, unsigned long); 1124 long (*unlocked_ioctl) (struct file *, unsigned, unsigned long);
1125 long (*compat_ioctl) (struct file *, unsigned, unsigned long); 1125 long (*compat_ioctl) (struct file *, unsigned, unsigned long);
1126 int (*direct_access) (struct block_device *, sector_t, unsigned long *); 1126 int (*direct_access) (struct block_device *, sector_t, unsigned long *);
1127 int (*media_changed) (struct gendisk *); 1127 int (*media_changed) (struct gendisk *);
1128 int (*revalidate_disk) (struct gendisk *); 1128 int (*revalidate_disk) (struct gendisk *);
1129 int (*getgeo)(struct block_device *, struct hd_geometry *); 1129 int (*getgeo)(struct block_device *, struct hd_geometry *);
1130 struct module *owner; 1130 struct module *owner;
1131 }; 1131 };
1132 1132
1133 /* 1133 /*
1134 * "descriptor" for what we're up to with a read. 1134 * "descriptor" for what we're up to with a read.
1135 * This allows us to use the same read code yet 1135 * This allows us to use the same read code yet
1136 * have multiple different users of the data that 1136 * have multiple different users of the data that
1137 * we read from a file. 1137 * we read from a file.
1138 * 1138 *
1139 * The simplest case just copies the data to user 1139 * The simplest case just copies the data to user
1140 * mode. 1140 * mode.
1141 */ 1141 */
1142 typedef struct { 1142 typedef struct {
1143 size_t written; 1143 size_t written;
1144 size_t count; 1144 size_t count;
1145 union { 1145 union {
1146 char __user * buf; 1146 char __user * buf;
1147 void *data; 1147 void *data;
1148 } arg; 1148 } arg;
1149 int error; 1149 int error;
1150 } read_descriptor_t; 1150 } read_descriptor_t;
1151 1151
1152 typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long); 1152 typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
1153 1153
1154 /* These macros are for out of kernel modules to test that 1154 /* These macros are for out of kernel modules to test that
1155 * the kernel supports the unlocked_ioctl and compat_ioctl 1155 * the kernel supports the unlocked_ioctl and compat_ioctl
1156 * fields in struct file_operations. */ 1156 * fields in struct file_operations. */
1157 #define HAVE_COMPAT_IOCTL 1 1157 #define HAVE_COMPAT_IOCTL 1
1158 #define HAVE_UNLOCKED_IOCTL 1 1158 #define HAVE_UNLOCKED_IOCTL 1
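Out-of-tree code typically keys its file_operations initializer on these macros, roughly like the fragment below (the foo_* handlers are hypothetical):

	#ifdef HAVE_UNLOCKED_IOCTL
		.unlocked_ioctl	= foo_unlocked_ioctl,
	#else
		.ioctl		= foo_ioctl,
	#endif
	#ifdef HAVE_COMPAT_IOCTL
		.compat_ioctl	= foo_compat_ioctl,
	#endif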
1159 1159
1160 /* 1160 /*
1161 * NOTE: 1161 * NOTE:
1162 * read, write, poll, fsync, readv, writev, unlocked_ioctl and compat_ioctl 1162 * read, write, poll, fsync, readv, writev, unlocked_ioctl and compat_ioctl
1163 * can be called without the big kernel lock held in all filesystems. 1163 * can be called without the big kernel lock held in all filesystems.
1164 */ 1164 */
1165 struct file_operations { 1165 struct file_operations {
1166 struct module *owner; 1166 struct module *owner;
1167 loff_t (*llseek) (struct file *, loff_t, int); 1167 loff_t (*llseek) (struct file *, loff_t, int);
1168 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); 1168 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
1169 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); 1169 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
1170 ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); 1170 ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
1171 ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); 1171 ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
1172 int (*readdir) (struct file *, void *, filldir_t); 1172 int (*readdir) (struct file *, void *, filldir_t);
1173 unsigned int (*poll) (struct file *, struct poll_table_struct *); 1173 unsigned int (*poll) (struct file *, struct poll_table_struct *);
1174 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); 1174 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
1175 long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); 1175 long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
1176 long (*compat_ioctl) (struct file *, unsigned int, unsigned long); 1176 long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
1177 int (*mmap) (struct file *, struct vm_area_struct *); 1177 int (*mmap) (struct file *, struct vm_area_struct *);
1178 int (*open) (struct inode *, struct file *); 1178 int (*open) (struct inode *, struct file *);
1179 int (*flush) (struct file *, fl_owner_t id); 1179 int (*flush) (struct file *, fl_owner_t id);
1180 int (*release) (struct inode *, struct file *); 1180 int (*release) (struct inode *, struct file *);
1181 int (*fsync) (struct file *, struct dentry *, int datasync); 1181 int (*fsync) (struct file *, struct dentry *, int datasync);
1182 int (*aio_fsync) (struct kiocb *, int datasync); 1182 int (*aio_fsync) (struct kiocb *, int datasync);
1183 int (*fasync) (int, struct file *, int); 1183 int (*fasync) (int, struct file *, int);
1184 int (*lock) (struct file *, int, struct file_lock *); 1184 int (*lock) (struct file *, int, struct file_lock *);
1185 ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); 1185 ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
1186 unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); 1186 unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
1187 int (*check_flags)(int); 1187 int (*check_flags)(int);
1188 int (*dir_notify)(struct file *filp, unsigned long arg); 1188 int (*dir_notify)(struct file *filp, unsigned long arg);
1189 int (*flock) (struct file *, int, struct file_lock *); 1189 int (*flock) (struct file *, int, struct file_lock *);
1190 ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); 1190 ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
1191 ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); 1191 ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
1192 int (*setlease)(struct file *, long, struct file_lock **); 1192 int (*setlease)(struct file *, long, struct file_lock **);
1193 }; 1193 };
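A filesystem fills in only the methods it implements; everything left NULL is simply skipped by the VFS. A minimal, illustrative initializer for a regular file, mixing common generic helpers with a hypothetical foofs_fsync:

	static const struct file_operations foofs_file_operations = {
		.owner		= THIS_MODULE,
		.llseek		= generic_file_llseek,
		.read		= do_sync_read,
		.aio_read	= generic_file_aio_read,
		.write		= do_sync_write,
		.aio_write	= generic_file_aio_write,
		.mmap		= generic_file_mmap,
		.open		= generic_file_open,
		.fsync		= foofs_fsync,		/* hypothetical */
		.splice_read	= generic_file_splice_read,
	};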
1194 1194
1195 struct inode_operations { 1195 struct inode_operations {
1196 int (*create) (struct inode *,struct dentry *,int, struct nameidata *); 1196 int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
1197 struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); 1197 struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
1198 int (*link) (struct dentry *,struct inode *,struct dentry *); 1198 int (*link) (struct dentry *,struct inode *,struct dentry *);
1199 int (*unlink) (struct inode *,struct dentry *); 1199 int (*unlink) (struct inode *,struct dentry *);
1200 int (*symlink) (struct inode *,struct dentry *,const char *); 1200 int (*symlink) (struct inode *,struct dentry *,const char *);
1201 int (*mkdir) (struct inode *,struct dentry *,int); 1201 int (*mkdir) (struct inode *,struct dentry *,int);
1202 int (*rmdir) (struct inode *,struct dentry *); 1202 int (*rmdir) (struct inode *,struct dentry *);
1203 int (*mknod) (struct inode *,struct dentry *,int,dev_t); 1203 int (*mknod) (struct inode *,struct dentry *,int,dev_t);
1204 int (*rename) (struct inode *, struct dentry *, 1204 int (*rename) (struct inode *, struct dentry *,
1205 struct inode *, struct dentry *); 1205 struct inode *, struct dentry *);
1206 int (*readlink) (struct dentry *, char __user *,int); 1206 int (*readlink) (struct dentry *, char __user *,int);
1207 void * (*follow_link) (struct dentry *, struct nameidata *); 1207 void * (*follow_link) (struct dentry *, struct nameidata *);
1208 void (*put_link) (struct dentry *, struct nameidata *, void *); 1208 void (*put_link) (struct dentry *, struct nameidata *, void *);
1209 void (*truncate) (struct inode *); 1209 void (*truncate) (struct inode *);
1210 int (*permission) (struct inode *, int, struct nameidata *); 1210 int (*permission) (struct inode *, int, struct nameidata *);
1211 int (*setattr) (struct dentry *, struct iattr *); 1211 int (*setattr) (struct dentry *, struct iattr *);
1212 int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); 1212 int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
1213 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); 1213 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
1214 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); 1214 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
1215 ssize_t (*listxattr) (struct dentry *, char *, size_t); 1215 ssize_t (*listxattr) (struct dentry *, char *, size_t);
1216 int (*removexattr) (struct dentry *, const char *); 1216 int (*removexattr) (struct dentry *, const char *);
1217 void (*truncate_range)(struct inode *, loff_t, loff_t); 1217 void (*truncate_range)(struct inode *, loff_t, loff_t);
1218 long (*fallocate)(struct inode *inode, int mode, loff_t offset, 1218 long (*fallocate)(struct inode *inode, int mode, loff_t offset,
1219 loff_t len); 1219 loff_t len);
1220 }; 1220 };
1221 1221
1222 struct seq_file; 1222 struct seq_file;
1223 1223
1224 ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector, 1224 ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
1225 unsigned long nr_segs, unsigned long fast_segs, 1225 unsigned long nr_segs, unsigned long fast_segs,
1226 struct iovec *fast_pointer, 1226 struct iovec *fast_pointer,
1227 struct iovec **ret_pointer); 1227 struct iovec **ret_pointer);
1228 1228
1229 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); 1229 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
1230 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *); 1230 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
1231 extern ssize_t vfs_readv(struct file *, const struct iovec __user *, 1231 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
1232 unsigned long, loff_t *); 1232 unsigned long, loff_t *);
1233 extern ssize_t vfs_writev(struct file *, const struct iovec __user *, 1233 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
1234 unsigned long, loff_t *); 1234 unsigned long, loff_t *);
1235 1235
1236 /* 1236 /*
1237 * NOTE: write_inode, delete_inode, clear_inode, put_inode can be called 1237 * NOTE: write_inode, delete_inode, clear_inode, put_inode can be called
1238 * without the big kernel lock held in all filesystems. 1238 * without the big kernel lock held in all filesystems.
1239 */ 1239 */
1240 struct super_operations { 1240 struct super_operations {
1241 struct inode *(*alloc_inode)(struct super_block *sb); 1241 struct inode *(*alloc_inode)(struct super_block *sb);
1242 void (*destroy_inode)(struct inode *); 1242 void (*destroy_inode)(struct inode *);
1243 1243
1244 void (*read_inode) (struct inode *);
1245
1246 void (*dirty_inode) (struct inode *); 1244 void (*dirty_inode) (struct inode *);
1247 int (*write_inode) (struct inode *, int); 1245 int (*write_inode) (struct inode *, int);
1248 void (*put_inode) (struct inode *); 1246 void (*put_inode) (struct inode *);
1249 void (*drop_inode) (struct inode *); 1247 void (*drop_inode) (struct inode *);
1250 void (*delete_inode) (struct inode *); 1248 void (*delete_inode) (struct inode *);
1251 void (*put_super) (struct super_block *); 1249 void (*put_super) (struct super_block *);
1252 void (*write_super) (struct super_block *); 1250 void (*write_super) (struct super_block *);
1253 int (*sync_fs)(struct super_block *sb, int wait); 1251 int (*sync_fs)(struct super_block *sb, int wait);
1254 void (*write_super_lockfs) (struct super_block *); 1252 void (*write_super_lockfs) (struct super_block *);
1255 void (*unlockfs) (struct super_block *); 1253 void (*unlockfs) (struct super_block *);
1256 int (*statfs) (struct dentry *, struct kstatfs *); 1254 int (*statfs) (struct dentry *, struct kstatfs *);
1257 int (*remount_fs) (struct super_block *, int *, char *); 1255 int (*remount_fs) (struct super_block *, int *, char *);
1258 void (*clear_inode) (struct inode *); 1256 void (*clear_inode) (struct inode *);
1259 void (*umount_begin) (struct vfsmount *, int); 1257 void (*umount_begin) (struct vfsmount *, int);
1260 1258
1261 int (*show_options)(struct seq_file *, struct vfsmount *); 1259 int (*show_options)(struct seq_file *, struct vfsmount *);
1262 int (*show_stats)(struct seq_file *, struct vfsmount *); 1260 int (*show_stats)(struct seq_file *, struct vfsmount *);
1263 #ifdef CONFIG_QUOTA 1261 #ifdef CONFIG_QUOTA
1264 ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); 1262 ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
1265 ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); 1263 ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
1266 #endif 1264 #endif
1267 }; 1265 };
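Note that the hunk above removes ->read_inode from this table, so a super_operations initializer now carries only the methods the filesystem actually provides, for example (all foofs_* functions hypothetical):

	static const struct super_operations foofs_super_operations = {
		.alloc_inode	= foofs_alloc_inode,
		.destroy_inode	= foofs_destroy_inode,
		.write_inode	= foofs_write_inode,
		.delete_inode	= foofs_delete_inode,
		.put_super	= foofs_put_super,
		.statfs		= foofs_statfs,
		.remount_fs	= foofs_remount,
		.show_options	= foofs_show_options,
	};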
1268 1266
1269 /* 1267 /*
1270 * Inode state bits. Protected by inode_lock. 1268 * Inode state bits. Protected by inode_lock.
1271 * 1269 *
1272 * Three bits determine the dirty state of the inode, I_DIRTY_SYNC, 1270 * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
1273 * I_DIRTY_DATASYNC and I_DIRTY_PAGES. 1271 * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
1274 * 1272 *
1275 * Four bits define the lifetime of an inode. Initially, inodes are I_NEW, 1273 * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
1276 * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at 1274 * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
1277 * various stages of removing an inode. 1275 * various stages of removing an inode.
1278 * 1276 *
1279 * Two bits are used for locking and completion notification, I_LOCK and I_SYNC. 1277 * Two bits are used for locking and completion notification, I_LOCK and I_SYNC.
1280 * 1278 *
1281 * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on 1279 * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
1282 * fdatasync(). i_atime is the usual cause. 1280 * fdatasync(). i_atime is the usual cause.
1283 * I_DIRTY_DATASYNC Inode is dirty and must be written on fdatasync(), f.e. 1281 * I_DIRTY_DATASYNC Inode is dirty and must be written on fdatasync(), f.e.
1284 * because i_size changed. 1282 * because i_size changed.
1285 * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean. 1283 * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean.
1286 * I_NEW get_new_inode() sets i_state to I_LOCK|I_NEW. Both 1284 * I_NEW get_new_inode() sets i_state to I_LOCK|I_NEW. Both
1287 * are cleared by unlock_new_inode(), called from iget(). 1285 * are cleared by unlock_new_inode(), called from iget().
1288 * I_WILL_FREE Must be set when calling write_inode_now() if i_count 1286 * I_WILL_FREE Must be set when calling write_inode_now() if i_count
1289 * is zero. I_FREEING must be set when I_WILL_FREE is 1287 * is zero. I_FREEING must be set when I_WILL_FREE is
1290 * cleared. 1288 * cleared.
1291 * I_FREEING Set when inode is about to be freed but still has dirty 1289 * I_FREEING Set when inode is about to be freed but still has dirty
1292 * pages or buffers attached or the inode itself is still 1290 * pages or buffers attached or the inode itself is still
1293 * dirty. 1291 * dirty.
1294 * I_CLEAR Set by clear_inode(). In this state the inode is clean 1292 * I_CLEAR Set by clear_inode(). In this state the inode is clean
1295 * and can be destroyed. 1293 * and can be destroyed.
1296 * 1294 *
1297 * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are 1295 * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
1298 * prohibited for many purposes. iget() must wait for 1296 * prohibited for many purposes. iget() must wait for
1299 * the inode to be completely released, then create it 1297 * the inode to be completely released, then create it
1300 * anew. Other functions will just ignore such inodes, 1298 * anew. Other functions will just ignore such inodes,
1301 * if appropriate. I_LOCK is used for waiting. 1299 * if appropriate. I_LOCK is used for waiting.
1302 * 1300 *
1303 * I_LOCK Serves as both a mutex and completion notification. 1301 * I_LOCK Serves as both a mutex and completion notification.
1304 * New inodes set I_LOCK. If two processes both create 1302 * New inodes set I_LOCK. If two processes both create
1305 * the same inode, one of them will release its inode and 1303 * the same inode, one of them will release its inode and
1306 * wait for I_LOCK to be released before returning. 1304 * wait for I_LOCK to be released before returning.
1307 * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can 1305 * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
1308 * also cause waiting on I_LOCK, without I_LOCK actually 1306 * also cause waiting on I_LOCK, without I_LOCK actually
1309 * being set. find_inode() uses this to prevent returning 1307 * being set. find_inode() uses this to prevent returning
1310 * nearly-dead inodes. 1308 * nearly-dead inodes.
1311 * I_SYNC Similar to I_LOCK, but limited in scope to writeback 1309 * I_SYNC Similar to I_LOCK, but limited in scope to writeback
1312 * of inode dirty data. Having a separate lock for this 1310 * of inode dirty data. Having a separate lock for this
1313 * purpose reduces latency and prevents some filesystem- 1311 * purpose reduces latency and prevents some filesystem-
1314 * specific deadlocks. 1312 * specific deadlocks.
1315 * 1313 *
1316 * Q: What is the difference between I_WILL_FREE and I_FREEING? 1314 * Q: What is the difference between I_WILL_FREE and I_FREEING?
1317 * Q: igrab() only checks on (I_FREEING|I_WILL_FREE). Should it also check on 1315 * Q: igrab() only checks on (I_FREEING|I_WILL_FREE). Should it also check on
1318 * I_CLEAR? If not, why? 1316 * I_CLEAR? If not, why?
1319 */ 1317 */
1320 #define I_DIRTY_SYNC 1 1318 #define I_DIRTY_SYNC 1
1321 #define I_DIRTY_DATASYNC 2 1319 #define I_DIRTY_DATASYNC 2
1322 #define I_DIRTY_PAGES 4 1320 #define I_DIRTY_PAGES 4
1323 #define I_NEW 8 1321 #define I_NEW 8
1324 #define I_WILL_FREE 16 1322 #define I_WILL_FREE 16
1325 #define I_FREEING 32 1323 #define I_FREEING 32
1326 #define I_CLEAR 64 1324 #define I_CLEAR 64
1327 #define __I_LOCK 7 1325 #define __I_LOCK 7
1328 #define I_LOCK (1 << __I_LOCK) 1326 #define I_LOCK (1 << __I_LOCK)
1329 #define __I_SYNC 8 1327 #define __I_SYNC 8
1330 #define I_SYNC (1 << __I_SYNC) 1328 #define I_SYNC (1 << __I_SYNC)
1331 1329
1332 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) 1330 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
1333 1331
1334 extern void __mark_inode_dirty(struct inode *, int); 1332 extern void __mark_inode_dirty(struct inode *, int);
1335 static inline void mark_inode_dirty(struct inode *inode) 1333 static inline void mark_inode_dirty(struct inode *inode)
1336 { 1334 {
1337 __mark_inode_dirty(inode, I_DIRTY); 1335 __mark_inode_dirty(inode, I_DIRTY);
1338 } 1336 }
1339 1337
1340 static inline void mark_inode_dirty_sync(struct inode *inode) 1338 static inline void mark_inode_dirty_sync(struct inode *inode)
1341 { 1339 {
1342 __mark_inode_dirty(inode, I_DIRTY_SYNC); 1340 __mark_inode_dirty(inode, I_DIRTY_SYNC);
1343 } 1341 }
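
For illustration, a minimal sketch of how a filesystem might drive these helpers (examplefs_touch() is hypothetical and not part of this patch): an atime-only update only needs mark_inode_dirty_sync(), whereas a size change must use mark_inode_dirty() so that I_DIRTY_DATASYNC is set and fdatasync() writes the inode back.

	#include <linux/fs.h>
	#include <linux/time.h>

	/* Sketch only; the caller is assumed to hold i_mutex. */
	static void examplefs_touch(struct inode *inode, loff_t new_size)
	{
		inode->i_atime = current_fs_time(inode->i_sb);
		if (new_size == i_size_read(inode)) {
			/* atime-only change: no fdatasync() obligation */
			mark_inode_dirty_sync(inode);		/* I_DIRTY_SYNC */
			return;
		}
		i_size_write(inode, new_size);
		inode->i_mtime = inode->i_ctime = inode->i_atime;
		/* i_size changed, so fdatasync() must write the inode */
		mark_inode_dirty(inode);			/* I_DIRTY */
	}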
1344 1342
1345 /** 1343 /**
1346 * inc_nlink - directly increment an inode's link count 1344 * inc_nlink - directly increment an inode's link count
1347 * @inode: inode 1345 * @inode: inode
1348 * 1346 *
1349 * This is a low-level filesystem helper to replace any 1347 * This is a low-level filesystem helper to replace any
1350 * direct filesystem manipulation of i_nlink. Currently, 1348 * direct filesystem manipulation of i_nlink. Currently,
1351 * it is only here for parity with dec_nlink(). 1349 * it is only here for parity with dec_nlink().
1352 */ 1350 */
1353 static inline void inc_nlink(struct inode *inode) 1351 static inline void inc_nlink(struct inode *inode)
1354 { 1352 {
1355 inode->i_nlink++; 1353 inode->i_nlink++;
1356 } 1354 }
1357 1355
1358 static inline void inode_inc_link_count(struct inode *inode) 1356 static inline void inode_inc_link_count(struct inode *inode)
1359 { 1357 {
1360 inc_nlink(inode); 1358 inc_nlink(inode);
1361 mark_inode_dirty(inode); 1359 mark_inode_dirty(inode);
1362 } 1360 }
1363 1361
1364 /** 1362 /**
1365 * drop_nlink - directly drop an inode's link count 1363 * drop_nlink - directly drop an inode's link count
1366 * @inode: inode 1364 * @inode: inode
1367 * 1365 *
1368 * This is a low-level filesystem helper to replace any 1366 * This is a low-level filesystem helper to replace any
1369 * direct filesystem manipulation of i_nlink. In cases 1367 * direct filesystem manipulation of i_nlink. In cases
1370 * where we are attempting to track writes to the 1368 * where we are attempting to track writes to the
1371 * filesystem, a decrement to zero means an imminent 1369 * filesystem, a decrement to zero means an imminent
1372 * write when the file is truncated and actually unlinked 1370 * write when the file is truncated and actually unlinked
1373 * on the filesystem. 1371 * on the filesystem.
1374 */ 1372 */
1375 static inline void drop_nlink(struct inode *inode) 1373 static inline void drop_nlink(struct inode *inode)
1376 { 1374 {
1377 inode->i_nlink--; 1375 inode->i_nlink--;
1378 } 1376 }
1379 1377
1380 /** 1378 /**
1381 * clear_nlink - directly zero an inode's link count 1379 * clear_nlink - directly zero an inode's link count
1382 * @inode: inode 1380 * @inode: inode
1383 * 1381 *
1384 * This is a low-level filesystem helper to replace any 1382 * This is a low-level filesystem helper to replace any
1385 * direct filesystem manipulation of i_nlink. See 1383 * direct filesystem manipulation of i_nlink. See
1386 * drop_nlink() for why we care about i_nlink hitting zero. 1384 * drop_nlink() for why we care about i_nlink hitting zero.
1387 */ 1385 */
1388 static inline void clear_nlink(struct inode *inode) 1386 static inline void clear_nlink(struct inode *inode)
1389 { 1387 {
1390 inode->i_nlink = 0; 1388 inode->i_nlink = 0;
1391 } 1389 }
1392 1390
1393 static inline void inode_dec_link_count(struct inode *inode) 1391 static inline void inode_dec_link_count(struct inode *inode)
1394 { 1392 {
1395 drop_nlink(inode); 1393 drop_nlink(inode);
1396 mark_inode_dirty(inode); 1394 mark_inode_dirty(inode);
1397 } 1395 }
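
As a usage illustration (examplefs_link()/examplefs_unlink() are hypothetical and the on-disk directory updates are elided), link and unlink operations are expected to go through these helpers instead of poking i_nlink directly:

	#include <linux/fs.h>
	#include <linux/dcache.h>

	static int examplefs_link(struct dentry *old_dentry, struct inode *dir,
				  struct dentry *dentry)
	{
		struct inode *inode = old_dentry->d_inode;

		/* ... insert the new directory entry in @dir ... */
		inode_inc_link_count(inode);	/* inc_nlink() + mark_inode_dirty() */
		atomic_inc(&inode->i_count);	/* the new dentry pins the inode */
		d_instantiate(dentry, inode);
		return 0;
	}

	static int examplefs_unlink(struct inode *dir, struct dentry *dentry)
	{
		/* ... remove the directory entry from @dir ... */
		inode_dec_link_count(dentry->d_inode);	/* drop_nlink() + dirty */
		return 0;
	}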
1398 1396
1399 /** 1397 /**
1400 * inode_inc_iversion - increments i_version 1398 * inode_inc_iversion - increments i_version
1401 * @inode: inode that needs to be updated 1399 * @inode: inode that needs to be updated
1402 * 1400 *
1403 * Every time the inode is modified, the i_version field will be incremented. 1401 * Every time the inode is modified, the i_version field will be incremented.
1404 * The filesystem has to be mounted with the i_version mount option. 1402 * The filesystem has to be mounted with the i_version mount option.
1405 */ 1403 */
1406 1404
1407 static inline void inode_inc_iversion(struct inode *inode) 1405 static inline void inode_inc_iversion(struct inode *inode)
1408 { 1406 {
1409 spin_lock(&inode->i_lock); 1407 spin_lock(&inode->i_lock);
1410 inode->i_version++; 1408 inode->i_version++;
1411 spin_unlock(&inode->i_lock); 1409 spin_unlock(&inode->i_lock);
1412 } 1410 }
1413 1411
1414 extern void touch_atime(struct vfsmount *mnt, struct dentry *dentry); 1412 extern void touch_atime(struct vfsmount *mnt, struct dentry *dentry);
1415 static inline void file_accessed(struct file *file) 1413 static inline void file_accessed(struct file *file)
1416 { 1414 {
1417 if (!(file->f_flags & O_NOATIME)) 1415 if (!(file->f_flags & O_NOATIME))
1418 touch_atime(file->f_path.mnt, file->f_path.dentry); 1416 touch_atime(file->f_path.mnt, file->f_path.dentry);
1419 } 1417 }
1420 1418
1421 int sync_inode(struct inode *inode, struct writeback_control *wbc); 1419 int sync_inode(struct inode *inode, struct writeback_control *wbc);
1422 1420
1423 struct file_system_type { 1421 struct file_system_type {
1424 const char *name; 1422 const char *name;
1425 int fs_flags; 1423 int fs_flags;
1426 int (*get_sb) (struct file_system_type *, int, 1424 int (*get_sb) (struct file_system_type *, int,
1427 const char *, void *, struct vfsmount *); 1425 const char *, void *, struct vfsmount *);
1428 void (*kill_sb) (struct super_block *); 1426 void (*kill_sb) (struct super_block *);
1429 struct module *owner; 1427 struct module *owner;
1430 struct file_system_type * next; 1428 struct file_system_type * next;
1431 struct list_head fs_supers; 1429 struct list_head fs_supers;
1432 1430
1433 struct lock_class_key s_lock_key; 1431 struct lock_class_key s_lock_key;
1434 struct lock_class_key s_umount_key; 1432 struct lock_class_key s_umount_key;
1435 1433
1436 struct lock_class_key i_lock_key; 1434 struct lock_class_key i_lock_key;
1437 struct lock_class_key i_mutex_key; 1435 struct lock_class_key i_mutex_key;
1438 struct lock_class_key i_mutex_dir_key; 1436 struct lock_class_key i_mutex_dir_key;
1439 struct lock_class_key i_alloc_sem_key; 1437 struct lock_class_key i_alloc_sem_key;
1440 }; 1438 };
1441 1439
1442 extern int get_sb_bdev(struct file_system_type *fs_type, 1440 extern int get_sb_bdev(struct file_system_type *fs_type,
1443 int flags, const char *dev_name, void *data, 1441 int flags, const char *dev_name, void *data,
1444 int (*fill_super)(struct super_block *, void *, int), 1442 int (*fill_super)(struct super_block *, void *, int),
1445 struct vfsmount *mnt); 1443 struct vfsmount *mnt);
1446 extern int get_sb_single(struct file_system_type *fs_type, 1444 extern int get_sb_single(struct file_system_type *fs_type,
1447 int flags, void *data, 1445 int flags, void *data,
1448 int (*fill_super)(struct super_block *, void *, int), 1446 int (*fill_super)(struct super_block *, void *, int),
1449 struct vfsmount *mnt); 1447 struct vfsmount *mnt);
1450 extern int get_sb_nodev(struct file_system_type *fs_type, 1448 extern int get_sb_nodev(struct file_system_type *fs_type,
1451 int flags, void *data, 1449 int flags, void *data,
1452 int (*fill_super)(struct super_block *, void *, int), 1450 int (*fill_super)(struct super_block *, void *, int),
1453 struct vfsmount *mnt); 1451 struct vfsmount *mnt);
1454 void generic_shutdown_super(struct super_block *sb); 1452 void generic_shutdown_super(struct super_block *sb);
1455 void kill_block_super(struct super_block *sb); 1453 void kill_block_super(struct super_block *sb);
1456 void kill_anon_super(struct super_block *sb); 1454 void kill_anon_super(struct super_block *sb);
1457 void kill_litter_super(struct super_block *sb); 1455 void kill_litter_super(struct super_block *sb);
1458 void deactivate_super(struct super_block *sb); 1456 void deactivate_super(struct super_block *sb);
1459 int set_anon_super(struct super_block *s, void *data); 1457 int set_anon_super(struct super_block *s, void *data);
1460 struct super_block *sget(struct file_system_type *type, 1458 struct super_block *sget(struct file_system_type *type,
1461 int (*test)(struct super_block *,void *), 1459 int (*test)(struct super_block *,void *),
1462 int (*set)(struct super_block *,void *), 1460 int (*set)(struct super_block *,void *),
1463 void *data); 1461 void *data);
1464 extern int get_sb_pseudo(struct file_system_type *, char *, 1462 extern int get_sb_pseudo(struct file_system_type *, char *,
1465 const struct super_operations *ops, unsigned long, 1463 const struct super_operations *ops, unsigned long,
1466 struct vfsmount *mnt); 1464 struct vfsmount *mnt);
1467 extern int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb); 1465 extern int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb);
1468 int __put_super(struct super_block *sb); 1466 int __put_super(struct super_block *sb);
1469 int __put_super_and_need_restart(struct super_block *sb); 1467 int __put_super_and_need_restart(struct super_block *sb);
1470 void unnamed_dev_init(void); 1468 void unnamed_dev_init(void);
1471 1469
1472 /* Alas, no aliases. Too much hassle with bringing module.h everywhere */ 1470 /* Alas, no aliases. Too much hassle with bringing module.h everywhere */
1473 #define fops_get(fops) \ 1471 #define fops_get(fops) \
1474 (((fops) && try_module_get((fops)->owner) ? (fops) : NULL)) 1472 (((fops) && try_module_get((fops)->owner) ? (fops) : NULL))
1475 #define fops_put(fops) \ 1473 #define fops_put(fops) \
1476 do { if (fops) module_put((fops)->owner); } while(0) 1474 do { if (fops) module_put((fops)->owner); } while(0)
1477 1475
1478 extern int register_filesystem(struct file_system_type *); 1476 extern int register_filesystem(struct file_system_type *);
1479 extern int unregister_filesystem(struct file_system_type *); 1477 extern int unregister_filesystem(struct file_system_type *);
1480 extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data); 1478 extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
1481 #define kern_mount(type) kern_mount_data(type, NULL) 1479 #define kern_mount(type) kern_mount_data(type, NULL)
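
To show how file_system_type, get_sb_nodev() and register_filesystem() fit together, here is a hypothetical module skeleton; examplefs_fill_super() is assumed to be defined elsewhere (one possible body built on libfs appears further down):

	#include <linux/fs.h>
	#include <linux/module.h>

	/* Assumed to exist; see the simple_fill_super() sketch below. */
	extern int examplefs_fill_super(struct super_block *sb, void *data, int silent);

	static int examplefs_get_sb(struct file_system_type *fs_type, int flags,
				    const char *dev_name, void *data,
				    struct vfsmount *mnt)
	{
		return get_sb_nodev(fs_type, flags, data, examplefs_fill_super, mnt);
	}

	static struct file_system_type examplefs_fs_type = {
		.owner		= THIS_MODULE,
		.name		= "examplefs",
		.get_sb		= examplefs_get_sb,
		.kill_sb	= kill_anon_super,
	};

	static int __init examplefs_init(void)
	{
		return register_filesystem(&examplefs_fs_type);
	}

	static void __exit examplefs_exit(void)
	{
		unregister_filesystem(&examplefs_fs_type);
	}

	module_init(examplefs_init);
	module_exit(examplefs_exit);
	MODULE_LICENSE("GPL");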
1482 extern int may_umount_tree(struct vfsmount *); 1480 extern int may_umount_tree(struct vfsmount *);
1483 extern int may_umount(struct vfsmount *); 1481 extern int may_umount(struct vfsmount *);
1484 extern void umount_tree(struct vfsmount *, int, struct list_head *); 1482 extern void umount_tree(struct vfsmount *, int, struct list_head *);
1485 extern void release_mounts(struct list_head *); 1483 extern void release_mounts(struct list_head *);
1486 extern long do_mount(char *, char *, char *, unsigned long, void *); 1484 extern long do_mount(char *, char *, char *, unsigned long, void *);
1487 extern struct vfsmount *copy_tree(struct vfsmount *, struct dentry *, int); 1485 extern struct vfsmount *copy_tree(struct vfsmount *, struct dentry *, int);
1488 extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *, 1486 extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *,
1489 struct vfsmount *); 1487 struct vfsmount *);
1490 extern struct vfsmount *collect_mounts(struct vfsmount *, struct dentry *); 1488 extern struct vfsmount *collect_mounts(struct vfsmount *, struct dentry *);
1491 extern void drop_collected_mounts(struct vfsmount *); 1489 extern void drop_collected_mounts(struct vfsmount *);
1492 1490
1493 extern int vfs_statfs(struct dentry *, struct kstatfs *); 1491 extern int vfs_statfs(struct dentry *, struct kstatfs *);
1494 1492
1495 /* /sys/fs */ 1493 /* /sys/fs */
1496 extern struct kobject *fs_kobj; 1494 extern struct kobject *fs_kobj;
1497 1495
1498 #define FLOCK_VERIFY_READ 1 1496 #define FLOCK_VERIFY_READ 1
1499 #define FLOCK_VERIFY_WRITE 2 1497 #define FLOCK_VERIFY_WRITE 2
1500 1498
1501 extern int locks_mandatory_locked(struct inode *); 1499 extern int locks_mandatory_locked(struct inode *);
1502 extern int locks_mandatory_area(int, struct inode *, struct file *, loff_t, size_t); 1500 extern int locks_mandatory_area(int, struct inode *, struct file *, loff_t, size_t);
1503 1501
1504 /* 1502 /*
1505 * Candidates for mandatory locking have the setgid bit set 1503 * Candidates for mandatory locking have the setgid bit set
1506 * but no group execute bit - an otherwise meaningless combination. 1504 * but no group execute bit - an otherwise meaningless combination.
1507 */ 1505 */
1508 1506
1509 static inline int __mandatory_lock(struct inode *ino) 1507 static inline int __mandatory_lock(struct inode *ino)
1510 { 1508 {
1511 return (ino->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID; 1509 return (ino->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID;
1512 } 1510 }
1513 1511
1514 /* 1512 /*
1515 * ... and these candidates should be on an MS_MANDLOCK-mounted fs, 1513 * ... and these candidates should be on an MS_MANDLOCK-mounted fs,
1516 * otherwise they will be advisory locks 1514 * otherwise they will be advisory locks
1517 */ 1515 */
1518 1516
1519 static inline int mandatory_lock(struct inode *ino) 1517 static inline int mandatory_lock(struct inode *ino)
1520 { 1518 {
1521 return IS_MANDLOCK(ino) && __mandatory_lock(ino); 1519 return IS_MANDLOCK(ino) && __mandatory_lock(ino);
1522 } 1520 }
1523 1521
1524 static inline int locks_verify_locked(struct inode *inode) 1522 static inline int locks_verify_locked(struct inode *inode)
1525 { 1523 {
1526 if (mandatory_lock(inode)) 1524 if (mandatory_lock(inode))
1527 return locks_mandatory_locked(inode); 1525 return locks_mandatory_locked(inode);
1528 return 0; 1526 return 0;
1529 } 1527 }
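
A short worked example of the rule above (a hypothetical helper mirroring what rw_verify_area() does in the VFS): mode 02644 (setgid set, group-execute clear) on an MS_MANDLOCK mount makes mandatory_lock() true, whereas 02755 is just an ordinary setgid file.

	#include <linux/fs.h>

	/* Sketch: refuse a read that would cross someone else's mandatory lock. */
	static int examplefs_check_read(struct file *filp, loff_t pos, size_t count)
	{
		struct inode *inode = filp->f_path.dentry->d_inode;

		if (mandatory_lock(inode))
			return locks_mandatory_area(FLOCK_VERIFY_READ, inode,
						    filp, pos, count);
		return 0;
	}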
1530 1528
1531 extern int rw_verify_area(int, struct file *, loff_t *, size_t); 1529 extern int rw_verify_area(int, struct file *, loff_t *, size_t);
1532 1530
1533 static inline int locks_verify_truncate(struct inode *inode, 1531 static inline int locks_verify_truncate(struct inode *inode,
1534 struct file *filp, 1532 struct file *filp,
1535 loff_t size) 1533 loff_t size)
1536 { 1534 {
1537 if (inode->i_flock && mandatory_lock(inode)) 1535 if (inode->i_flock && mandatory_lock(inode))
1538 return locks_mandatory_area( 1536 return locks_mandatory_area(
1539 FLOCK_VERIFY_WRITE, inode, filp, 1537 FLOCK_VERIFY_WRITE, inode, filp,
1540 size < inode->i_size ? size : inode->i_size, 1538 size < inode->i_size ? size : inode->i_size,
1541 (size < inode->i_size ? inode->i_size - size 1539 (size < inode->i_size ? inode->i_size - size
1542 : size - inode->i_size) 1540 : size - inode->i_size)
1543 ); 1541 );
1544 return 0; 1542 return 0;
1545 } 1543 }
1546 1544
1547 static inline int break_lease(struct inode *inode, unsigned int mode) 1545 static inline int break_lease(struct inode *inode, unsigned int mode)
1548 { 1546 {
1549 if (inode->i_flock) 1547 if (inode->i_flock)
1550 return __break_lease(inode, mode); 1548 return __break_lease(inode, mode);
1551 return 0; 1549 return 0;
1552 } 1550 }
1553 1551
1554 /* fs/open.c */ 1552 /* fs/open.c */
1555 1553
1556 extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs, 1554 extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs,
1557 struct file *filp); 1555 struct file *filp);
1558 extern long do_sys_open(int dfd, const char __user *filename, int flags, 1556 extern long do_sys_open(int dfd, const char __user *filename, int flags,
1559 int mode); 1557 int mode);
1560 extern struct file *filp_open(const char *, int, int); 1558 extern struct file *filp_open(const char *, int, int);
1561 extern struct file * dentry_open(struct dentry *, struct vfsmount *, int); 1559 extern struct file * dentry_open(struct dentry *, struct vfsmount *, int);
1562 extern int filp_close(struct file *, fl_owner_t id); 1560 extern int filp_close(struct file *, fl_owner_t id);
1563 extern char * getname(const char __user *); 1561 extern char * getname(const char __user *);
1564 1562
1565 /* fs/dcache.c */ 1563 /* fs/dcache.c */
1566 extern void __init vfs_caches_init_early(void); 1564 extern void __init vfs_caches_init_early(void);
1567 extern void __init vfs_caches_init(unsigned long); 1565 extern void __init vfs_caches_init(unsigned long);
1568 1566
1569 extern struct kmem_cache *names_cachep; 1567 extern struct kmem_cache *names_cachep;
1570 1568
1571 #define __getname() kmem_cache_alloc(names_cachep, GFP_KERNEL) 1569 #define __getname() kmem_cache_alloc(names_cachep, GFP_KERNEL)
1572 #define __putname(name) kmem_cache_free(names_cachep, (void *)(name)) 1570 #define __putname(name) kmem_cache_free(names_cachep, (void *)(name))
1573 #ifndef CONFIG_AUDITSYSCALL 1571 #ifndef CONFIG_AUDITSYSCALL
1574 #define putname(name) __putname(name) 1572 #define putname(name) __putname(name)
1575 #else 1573 #else
1576 extern void putname(const char *name); 1574 extern void putname(const char *name);
1577 #endif 1575 #endif
1578 1576
1579 #ifdef CONFIG_BLOCK 1577 #ifdef CONFIG_BLOCK
1580 extern int register_blkdev(unsigned int, const char *); 1578 extern int register_blkdev(unsigned int, const char *);
1581 extern void unregister_blkdev(unsigned int, const char *); 1579 extern void unregister_blkdev(unsigned int, const char *);
1582 extern struct block_device *bdget(dev_t); 1580 extern struct block_device *bdget(dev_t);
1583 extern void bd_set_size(struct block_device *, loff_t size); 1581 extern void bd_set_size(struct block_device *, loff_t size);
1584 extern void bd_forget(struct inode *inode); 1582 extern void bd_forget(struct inode *inode);
1585 extern void bdput(struct block_device *); 1583 extern void bdput(struct block_device *);
1586 extern struct block_device *open_by_devnum(dev_t, unsigned); 1584 extern struct block_device *open_by_devnum(dev_t, unsigned);
1587 extern const struct address_space_operations def_blk_aops; 1585 extern const struct address_space_operations def_blk_aops;
1588 #else 1586 #else
1589 static inline void bd_forget(struct inode *inode) {} 1587 static inline void bd_forget(struct inode *inode) {}
1590 #endif 1588 #endif
1591 extern const struct file_operations def_blk_fops; 1589 extern const struct file_operations def_blk_fops;
1592 extern const struct file_operations def_chr_fops; 1590 extern const struct file_operations def_chr_fops;
1593 extern const struct file_operations bad_sock_fops; 1591 extern const struct file_operations bad_sock_fops;
1594 extern const struct file_operations def_fifo_fops; 1592 extern const struct file_operations def_fifo_fops;
1595 #ifdef CONFIG_BLOCK 1593 #ifdef CONFIG_BLOCK
1596 extern int ioctl_by_bdev(struct block_device *, unsigned, unsigned long); 1594 extern int ioctl_by_bdev(struct block_device *, unsigned, unsigned long);
1597 extern int blkdev_ioctl(struct inode *, struct file *, unsigned, unsigned long); 1595 extern int blkdev_ioctl(struct inode *, struct file *, unsigned, unsigned long);
1598 extern int blkdev_driver_ioctl(struct inode *inode, struct file *file, 1596 extern int blkdev_driver_ioctl(struct inode *inode, struct file *file,
1599 struct gendisk *disk, unsigned cmd, 1597 struct gendisk *disk, unsigned cmd,
1600 unsigned long arg); 1598 unsigned long arg);
1601 extern long compat_blkdev_ioctl(struct file *, unsigned, unsigned long); 1599 extern long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
1602 extern int blkdev_get(struct block_device *, mode_t, unsigned); 1600 extern int blkdev_get(struct block_device *, mode_t, unsigned);
1603 extern int blkdev_put(struct block_device *); 1601 extern int blkdev_put(struct block_device *);
1604 extern int bd_claim(struct block_device *, void *); 1602 extern int bd_claim(struct block_device *, void *);
1605 extern void bd_release(struct block_device *); 1603 extern void bd_release(struct block_device *);
1606 #ifdef CONFIG_SYSFS 1604 #ifdef CONFIG_SYSFS
1607 extern int bd_claim_by_disk(struct block_device *, void *, struct gendisk *); 1605 extern int bd_claim_by_disk(struct block_device *, void *, struct gendisk *);
1608 extern void bd_release_from_disk(struct block_device *, struct gendisk *); 1606 extern void bd_release_from_disk(struct block_device *, struct gendisk *);
1609 #else 1607 #else
1610 #define bd_claim_by_disk(bdev, holder, disk) bd_claim(bdev, holder) 1608 #define bd_claim_by_disk(bdev, holder, disk) bd_claim(bdev, holder)
1611 #define bd_release_from_disk(bdev, disk) bd_release(bdev) 1609 #define bd_release_from_disk(bdev, disk) bd_release(bdev)
1612 #endif 1610 #endif
1613 #endif 1611 #endif
1614 1612
1615 /* fs/char_dev.c */ 1613 /* fs/char_dev.c */
1616 #define CHRDEV_MAJOR_HASH_SIZE 255 1614 #define CHRDEV_MAJOR_HASH_SIZE 255
1617 extern int alloc_chrdev_region(dev_t *, unsigned, unsigned, const char *); 1615 extern int alloc_chrdev_region(dev_t *, unsigned, unsigned, const char *);
1618 extern int register_chrdev_region(dev_t, unsigned, const char *); 1616 extern int register_chrdev_region(dev_t, unsigned, const char *);
1619 extern int register_chrdev(unsigned int, const char *, 1617 extern int register_chrdev(unsigned int, const char *,
1620 const struct file_operations *); 1618 const struct file_operations *);
1621 extern void unregister_chrdev(unsigned int, const char *); 1619 extern void unregister_chrdev(unsigned int, const char *);
1622 extern void unregister_chrdev_region(dev_t, unsigned); 1620 extern void unregister_chrdev_region(dev_t, unsigned);
1623 extern int chrdev_open(struct inode *, struct file *); 1621 extern int chrdev_open(struct inode *, struct file *);
1624 extern void chrdev_show(struct seq_file *,off_t); 1622 extern void chrdev_show(struct seq_file *,off_t);
1625 1623
1626 /* fs/block_dev.c */ 1624 /* fs/block_dev.c */
1627 #define BDEVNAME_SIZE 32 /* Largest string for a blockdev identifier */ 1625 #define BDEVNAME_SIZE 32 /* Largest string for a blockdev identifier */
1628 1626
1629 #ifdef CONFIG_BLOCK 1627 #ifdef CONFIG_BLOCK
1630 #define BLKDEV_MAJOR_HASH_SIZE 255 1628 #define BLKDEV_MAJOR_HASH_SIZE 255
1631 extern const char *__bdevname(dev_t, char *buffer); 1629 extern const char *__bdevname(dev_t, char *buffer);
1632 extern const char *bdevname(struct block_device *bdev, char *buffer); 1630 extern const char *bdevname(struct block_device *bdev, char *buffer);
1633 extern struct block_device *lookup_bdev(const char *); 1631 extern struct block_device *lookup_bdev(const char *);
1634 extern struct block_device *open_bdev_excl(const char *, int, void *); 1632 extern struct block_device *open_bdev_excl(const char *, int, void *);
1635 extern void close_bdev_excl(struct block_device *); 1633 extern void close_bdev_excl(struct block_device *);
1636 extern void blkdev_show(struct seq_file *,off_t); 1634 extern void blkdev_show(struct seq_file *,off_t);
1637 #else 1635 #else
1638 #define BLKDEV_MAJOR_HASH_SIZE 0 1636 #define BLKDEV_MAJOR_HASH_SIZE 0
1639 #endif 1637 #endif
1640 1638
1641 extern void init_special_inode(struct inode *, umode_t, dev_t); 1639 extern void init_special_inode(struct inode *, umode_t, dev_t);
1642 1640
1643 /* Invalid inode operations -- fs/bad_inode.c */ 1641 /* Invalid inode operations -- fs/bad_inode.c */
1644 extern void make_bad_inode(struct inode *); 1642 extern void make_bad_inode(struct inode *);
1645 extern int is_bad_inode(struct inode *); 1643 extern int is_bad_inode(struct inode *);
1646 1644
1647 extern const struct file_operations read_fifo_fops; 1645 extern const struct file_operations read_fifo_fops;
1648 extern const struct file_operations write_fifo_fops; 1646 extern const struct file_operations write_fifo_fops;
1649 extern const struct file_operations rdwr_fifo_fops; 1647 extern const struct file_operations rdwr_fifo_fops;
1650 1648
1651 extern int fs_may_remount_ro(struct super_block *); 1649 extern int fs_may_remount_ro(struct super_block *);
1652 1650
1653 #ifdef CONFIG_BLOCK 1651 #ifdef CONFIG_BLOCK
1654 /* 1652 /*
1655 * return READ, READA, or WRITE 1653 * return READ, READA, or WRITE
1656 */ 1654 */
1657 #define bio_rw(bio) ((bio)->bi_rw & (RW_MASK | RWA_MASK)) 1655 #define bio_rw(bio) ((bio)->bi_rw & (RW_MASK | RWA_MASK))
1658 1656
1659 /* 1657 /*
1660 * return data direction, READ or WRITE 1658 * return data direction, READ or WRITE
1661 */ 1659 */
1662 #define bio_data_dir(bio) ((bio)->bi_rw & 1) 1660 #define bio_data_dir(bio) ((bio)->bi_rw & 1)
1663 1661
1664 extern int check_disk_change(struct block_device *); 1662 extern int check_disk_change(struct block_device *);
1665 extern int __invalidate_device(struct block_device *); 1663 extern int __invalidate_device(struct block_device *);
1666 extern int invalidate_partition(struct gendisk *, int); 1664 extern int invalidate_partition(struct gendisk *, int);
1667 #endif 1665 #endif
1668 extern int invalidate_inodes(struct super_block *); 1666 extern int invalidate_inodes(struct super_block *);
1669 unsigned long __invalidate_mapping_pages(struct address_space *mapping, 1667 unsigned long __invalidate_mapping_pages(struct address_space *mapping,
1670 pgoff_t start, pgoff_t end, 1668 pgoff_t start, pgoff_t end,
1671 bool be_atomic); 1669 bool be_atomic);
1672 unsigned long invalidate_mapping_pages(struct address_space *mapping, 1670 unsigned long invalidate_mapping_pages(struct address_space *mapping,
1673 pgoff_t start, pgoff_t end); 1671 pgoff_t start, pgoff_t end);
1674 1672
1675 static inline unsigned long __deprecated 1673 static inline unsigned long __deprecated
1676 invalidate_inode_pages(struct address_space *mapping) 1674 invalidate_inode_pages(struct address_space *mapping)
1677 { 1675 {
1678 return invalidate_mapping_pages(mapping, 0, ~0UL); 1676 return invalidate_mapping_pages(mapping, 0, ~0UL);
1679 } 1677 }
1680 1678
1681 static inline void invalidate_remote_inode(struct inode *inode) 1679 static inline void invalidate_remote_inode(struct inode *inode)
1682 { 1680 {
1683 if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) || 1681 if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
1684 S_ISLNK(inode->i_mode)) 1682 S_ISLNK(inode->i_mode))
1685 invalidate_mapping_pages(inode->i_mapping, 0, -1); 1683 invalidate_mapping_pages(inode->i_mapping, 0, -1);
1686 } 1684 }
1687 extern int invalidate_inode_pages2(struct address_space *mapping); 1685 extern int invalidate_inode_pages2(struct address_space *mapping);
1688 extern int invalidate_inode_pages2_range(struct address_space *mapping, 1686 extern int invalidate_inode_pages2_range(struct address_space *mapping,
1689 pgoff_t start, pgoff_t end); 1687 pgoff_t start, pgoff_t end);
1690 extern int write_inode_now(struct inode *, int); 1688 extern int write_inode_now(struct inode *, int);
1691 extern int filemap_fdatawrite(struct address_space *); 1689 extern int filemap_fdatawrite(struct address_space *);
1692 extern int filemap_flush(struct address_space *); 1690 extern int filemap_flush(struct address_space *);
1693 extern int filemap_fdatawait(struct address_space *); 1691 extern int filemap_fdatawait(struct address_space *);
1694 extern int filemap_write_and_wait(struct address_space *mapping); 1692 extern int filemap_write_and_wait(struct address_space *mapping);
1695 extern int filemap_write_and_wait_range(struct address_space *mapping, 1693 extern int filemap_write_and_wait_range(struct address_space *mapping,
1696 loff_t lstart, loff_t lend); 1694 loff_t lstart, loff_t lend);
1697 extern int wait_on_page_writeback_range(struct address_space *mapping, 1695 extern int wait_on_page_writeback_range(struct address_space *mapping,
1698 pgoff_t start, pgoff_t end); 1696 pgoff_t start, pgoff_t end);
1699 extern int __filemap_fdatawrite_range(struct address_space *mapping, 1697 extern int __filemap_fdatawrite_range(struct address_space *mapping,
1700 loff_t start, loff_t end, int sync_mode); 1698 loff_t start, loff_t end, int sync_mode);
1701 1699
1702 extern long do_fsync(struct file *file, int datasync); 1700 extern long do_fsync(struct file *file, int datasync);
1703 extern void sync_supers(void); 1701 extern void sync_supers(void);
1704 extern void sync_filesystems(int wait); 1702 extern void sync_filesystems(int wait);
1705 extern void __fsync_super(struct super_block *sb); 1703 extern void __fsync_super(struct super_block *sb);
1706 extern void emergency_sync(void); 1704 extern void emergency_sync(void);
1707 extern void emergency_remount(void); 1705 extern void emergency_remount(void);
1708 extern int do_remount_sb(struct super_block *sb, int flags, 1706 extern int do_remount_sb(struct super_block *sb, int flags,
1709 void *data, int force); 1707 void *data, int force);
1710 #ifdef CONFIG_BLOCK 1708 #ifdef CONFIG_BLOCK
1711 extern sector_t bmap(struct inode *, sector_t); 1709 extern sector_t bmap(struct inode *, sector_t);
1712 #endif 1710 #endif
1713 extern int notify_change(struct dentry *, struct iattr *); 1711 extern int notify_change(struct dentry *, struct iattr *);
1714 extern int permission(struct inode *, int, struct nameidata *); 1712 extern int permission(struct inode *, int, struct nameidata *);
1715 extern int generic_permission(struct inode *, int, 1713 extern int generic_permission(struct inode *, int,
1716 int (*check_acl)(struct inode *, int)); 1714 int (*check_acl)(struct inode *, int));
1717 1715
1718 extern int get_write_access(struct inode *); 1716 extern int get_write_access(struct inode *);
1719 extern int deny_write_access(struct file *); 1717 extern int deny_write_access(struct file *);
1720 static inline void put_write_access(struct inode * inode) 1718 static inline void put_write_access(struct inode * inode)
1721 { 1719 {
1722 atomic_dec(&inode->i_writecount); 1720 atomic_dec(&inode->i_writecount);
1723 } 1721 }
1724 static inline void allow_write_access(struct file *file) 1722 static inline void allow_write_access(struct file *file)
1725 { 1723 {
1726 if (file) 1724 if (file)
1727 atomic_inc(&file->f_path.dentry->d_inode->i_writecount); 1725 atomic_inc(&file->f_path.dentry->d_inode->i_writecount);
1728 } 1726 }
1729 extern int do_pipe(int *); 1727 extern int do_pipe(int *);
1730 extern struct file *create_read_pipe(struct file *f); 1728 extern struct file *create_read_pipe(struct file *f);
1731 extern struct file *create_write_pipe(void); 1729 extern struct file *create_write_pipe(void);
1732 extern void free_write_pipe(struct file *); 1730 extern void free_write_pipe(struct file *);
1733 1731
1734 extern int open_namei(int dfd, const char *, int, int, struct nameidata *); 1732 extern int open_namei(int dfd, const char *, int, int, struct nameidata *);
1735 extern int may_open(struct nameidata *, int, int); 1733 extern int may_open(struct nameidata *, int, int);
1736 1734
1737 extern int kernel_read(struct file *, unsigned long, char *, unsigned long); 1735 extern int kernel_read(struct file *, unsigned long, char *, unsigned long);
1738 extern struct file * open_exec(const char *); 1736 extern struct file * open_exec(const char *);
1739 1737
1740 /* fs/dcache.c -- generic fs support functions */ 1738 /* fs/dcache.c -- generic fs support functions */
1741 extern int is_subdir(struct dentry *, struct dentry *); 1739 extern int is_subdir(struct dentry *, struct dentry *);
1742 extern ino_t find_inode_number(struct dentry *, struct qstr *); 1740 extern ino_t find_inode_number(struct dentry *, struct qstr *);
1743 1741
1744 #include <linux/err.h> 1742 #include <linux/err.h>
1745 1743
1746 /* needed for stackable file system support */ 1744 /* needed for stackable file system support */
1747 extern loff_t default_llseek(struct file *file, loff_t offset, int origin); 1745 extern loff_t default_llseek(struct file *file, loff_t offset, int origin);
1748 1746
1749 extern loff_t vfs_llseek(struct file *file, loff_t offset, int origin); 1747 extern loff_t vfs_llseek(struct file *file, loff_t offset, int origin);
1750 1748
1751 extern void inode_init_once(struct inode *); 1749 extern void inode_init_once(struct inode *);
1752 extern void iput(struct inode *); 1750 extern void iput(struct inode *);
1753 extern struct inode * igrab(struct inode *); 1751 extern struct inode * igrab(struct inode *);
1754 extern ino_t iunique(struct super_block *, ino_t); 1752 extern ino_t iunique(struct super_block *, ino_t);
1755 extern int inode_needs_sync(struct inode *inode); 1753 extern int inode_needs_sync(struct inode *inode);
1756 extern void generic_delete_inode(struct inode *inode); 1754 extern void generic_delete_inode(struct inode *inode);
1757 extern void generic_drop_inode(struct inode *inode); 1755 extern void generic_drop_inode(struct inode *inode);
1758 1756
1759 extern struct inode *ilookup5_nowait(struct super_block *sb, 1757 extern struct inode *ilookup5_nowait(struct super_block *sb,
1760 unsigned long hashval, int (*test)(struct inode *, void *), 1758 unsigned long hashval, int (*test)(struct inode *, void *),
1761 void *data); 1759 void *data);
1762 extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval, 1760 extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
1763 int (*test)(struct inode *, void *), void *data); 1761 int (*test)(struct inode *, void *), void *data);
1764 extern struct inode *ilookup(struct super_block *sb, unsigned long ino); 1762 extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
1765 1763
1766 extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *); 1764 extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
1767 extern struct inode * iget_locked(struct super_block *, unsigned long); 1765 extern struct inode * iget_locked(struct super_block *, unsigned long);
1768 extern void unlock_new_inode(struct inode *); 1766 extern void unlock_new_inode(struct inode *);
1769
1770 static inline struct inode *iget(struct super_block *sb, unsigned long ino)
1771 {
1772 struct inode *inode = iget_locked(sb, ino);
1773
1774 if (inode && (inode->i_state & I_NEW)) {
1775 sb->s_op->read_inode(inode);
1776 unlock_new_inode(inode);
1777 }
1778
1779 return inode;
1780 }
1781 1767
1782 extern void __iget(struct inode * inode); 1768 extern void __iget(struct inode * inode);
1783 extern void iget_failed(struct inode *); 1769 extern void iget_failed(struct inode *);
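
The inline helper removed above always called ->read_inode() and could not report why a lookup failed. A filesystem-local replacement is built from iget_locked(), unlock_new_inode() and iget_failed(); the sketch below assumes a hypothetical examplefs_fill_inode() that reads the on-disk inode and returns 0 or a negative error:

	#include <linux/fs.h>
	#include <linux/err.h>

	/* Hypothetical; reads the on-disk inode, defined elsewhere. */
	extern int examplefs_fill_inode(struct inode *inode);

	static struct inode *examplefs_iget(struct super_block *sb, unsigned long ino)
	{
		struct inode *inode;
		int err;

		inode = iget_locked(sb, ino);
		if (!inode)
			return ERR_PTR(-ENOMEM);
		if (!(inode->i_state & I_NEW))
			return inode;		/* already in the inode cache */

		err = examplefs_fill_inode(inode);
		if (err) {
			iget_failed(inode);	/* mark bad, unlock and release */
			return ERR_PTR(err);
		}
		unlock_new_inode(inode);
		return inode;
	}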
1784 extern void clear_inode(struct inode *); 1770 extern void clear_inode(struct inode *);
1785 extern void destroy_inode(struct inode *); 1771 extern void destroy_inode(struct inode *);
1786 extern struct inode *new_inode(struct super_block *); 1772 extern struct inode *new_inode(struct super_block *);
1787 extern int __remove_suid(struct dentry *, int); 1773 extern int __remove_suid(struct dentry *, int);
1788 extern int should_remove_suid(struct dentry *); 1774 extern int should_remove_suid(struct dentry *);
1789 extern int remove_suid(struct dentry *); 1775 extern int remove_suid(struct dentry *);
1790 1776
1791 extern void __insert_inode_hash(struct inode *, unsigned long hashval); 1777 extern void __insert_inode_hash(struct inode *, unsigned long hashval);
1792 extern void remove_inode_hash(struct inode *); 1778 extern void remove_inode_hash(struct inode *);
1793 static inline void insert_inode_hash(struct inode *inode) { 1779 static inline void insert_inode_hash(struct inode *inode) {
1794 __insert_inode_hash(inode, inode->i_ino); 1780 __insert_inode_hash(inode, inode->i_ino);
1795 } 1781 }
1796 1782
1797 extern struct file * get_empty_filp(void); 1783 extern struct file * get_empty_filp(void);
1798 extern void file_move(struct file *f, struct list_head *list); 1784 extern void file_move(struct file *f, struct list_head *list);
1799 extern void file_kill(struct file *f); 1785 extern void file_kill(struct file *f);
1800 #ifdef CONFIG_BLOCK 1786 #ifdef CONFIG_BLOCK
1801 struct bio; 1787 struct bio;
1802 extern void submit_bio(int, struct bio *); 1788 extern void submit_bio(int, struct bio *);
1803 extern int bdev_read_only(struct block_device *); 1789 extern int bdev_read_only(struct block_device *);
1804 #endif 1790 #endif
1805 extern int set_blocksize(struct block_device *, int); 1791 extern int set_blocksize(struct block_device *, int);
1806 extern int sb_set_blocksize(struct super_block *, int); 1792 extern int sb_set_blocksize(struct super_block *, int);
1807 extern int sb_min_blocksize(struct super_block *, int); 1793 extern int sb_min_blocksize(struct super_block *, int);
1808 extern int sb_has_dirty_inodes(struct super_block *); 1794 extern int sb_has_dirty_inodes(struct super_block *);
1809 1795
1810 extern int generic_file_mmap(struct file *, struct vm_area_struct *); 1796 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
1811 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); 1797 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
1812 extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size); 1798 extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
1813 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk); 1799 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
1814 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t); 1800 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
1815 extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t); 1801 extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
1816 extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *, 1802 extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
1817 unsigned long, loff_t); 1803 unsigned long, loff_t);
1818 extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *, 1804 extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
1819 unsigned long *, loff_t, loff_t *, size_t, size_t); 1805 unsigned long *, loff_t, loff_t *, size_t, size_t);
1820 extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *, 1806 extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
1821 unsigned long, loff_t, loff_t *, size_t, ssize_t); 1807 unsigned long, loff_t, loff_t *, size_t, ssize_t);
1822 extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos); 1808 extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
1823 extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos); 1809 extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
1824 extern void do_generic_mapping_read(struct address_space *mapping, 1810 extern void do_generic_mapping_read(struct address_space *mapping,
1825 struct file_ra_state *, struct file *, 1811 struct file_ra_state *, struct file *,
1826 loff_t *, read_descriptor_t *, read_actor_t); 1812 loff_t *, read_descriptor_t *, read_actor_t);
1827 extern int generic_segment_checks(const struct iovec *iov, 1813 extern int generic_segment_checks(const struct iovec *iov,
1828 unsigned long *nr_segs, size_t *count, int access_flags); 1814 unsigned long *nr_segs, size_t *count, int access_flags);
1829 1815
1830 /* fs/splice.c */ 1816 /* fs/splice.c */
1831 extern ssize_t generic_file_splice_read(struct file *, loff_t *, 1817 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
1832 struct pipe_inode_info *, size_t, unsigned int); 1818 struct pipe_inode_info *, size_t, unsigned int);
1833 extern ssize_t generic_file_splice_write(struct pipe_inode_info *, 1819 extern ssize_t generic_file_splice_write(struct pipe_inode_info *,
1834 struct file *, loff_t *, size_t, unsigned int); 1820 struct file *, loff_t *, size_t, unsigned int);
1835 extern ssize_t generic_file_splice_write_nolock(struct pipe_inode_info *, 1821 extern ssize_t generic_file_splice_write_nolock(struct pipe_inode_info *,
1836 struct file *, loff_t *, size_t, unsigned int); 1822 struct file *, loff_t *, size_t, unsigned int);
1837 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, 1823 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
1838 struct file *out, loff_t *, size_t len, unsigned int flags); 1824 struct file *out, loff_t *, size_t len, unsigned int flags);
1839 extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out, 1825 extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
1840 size_t len, unsigned int flags); 1826 size_t len, unsigned int flags);
1841 1827
1842 extern void 1828 extern void
1843 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping); 1829 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
1844 extern loff_t no_llseek(struct file *file, loff_t offset, int origin); 1830 extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
1845 extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin); 1831 extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);
1846 extern loff_t remote_llseek(struct file *file, loff_t offset, int origin); 1832 extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
1847 extern int generic_file_open(struct inode * inode, struct file * filp); 1833 extern int generic_file_open(struct inode * inode, struct file * filp);
1848 extern int nonseekable_open(struct inode * inode, struct file * filp); 1834 extern int nonseekable_open(struct inode * inode, struct file * filp);
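
As a usage note, these generic routines are designed to be plugged directly into a regular file's file_operations; a typical table (examplefs is hypothetical) built only from helpers declared above would be:

	#include <linux/fs.h>

	static const struct file_operations examplefs_file_operations = {
		.llseek		= generic_file_llseek,
		.read		= do_sync_read,
		.write		= do_sync_write,
		.aio_read	= generic_file_aio_read,
		.aio_write	= generic_file_aio_write,
		.mmap		= generic_file_mmap,
		.open		= generic_file_open,
		.splice_read	= generic_file_splice_read,
		.splice_write	= generic_file_splice_write,
	};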
1849 1835
1850 #ifdef CONFIG_FS_XIP 1836 #ifdef CONFIG_FS_XIP
1851 extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len, 1837 extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
1852 loff_t *ppos); 1838 loff_t *ppos);
1853 extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma); 1839 extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
1854 extern ssize_t xip_file_write(struct file *filp, const char __user *buf, 1840 extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
1855 size_t len, loff_t *ppos); 1841 size_t len, loff_t *ppos);
1856 extern int xip_truncate_page(struct address_space *mapping, loff_t from); 1842 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
1857 #else 1843 #else
1858 static inline int xip_truncate_page(struct address_space *mapping, loff_t from) 1844 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
1859 { 1845 {
1860 return 0; 1846 return 0;
1861 } 1847 }
1862 #endif 1848 #endif
1863 1849
1864 static inline void do_generic_file_read(struct file * filp, loff_t *ppos, 1850 static inline void do_generic_file_read(struct file * filp, loff_t *ppos,
1865 read_descriptor_t * desc, 1851 read_descriptor_t * desc,
1866 read_actor_t actor) 1852 read_actor_t actor)
1867 { 1853 {
1868 do_generic_mapping_read(filp->f_mapping, 1854 do_generic_mapping_read(filp->f_mapping,
1869 &filp->f_ra, 1855 &filp->f_ra,
1870 filp, 1856 filp,
1871 ppos, 1857 ppos,
1872 desc, 1858 desc,
1873 actor); 1859 actor);
1874 } 1860 }
1875 1861
1876 #ifdef CONFIG_BLOCK 1862 #ifdef CONFIG_BLOCK
1877 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, 1863 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
1878 struct block_device *bdev, const struct iovec *iov, loff_t offset, 1864 struct block_device *bdev, const struct iovec *iov, loff_t offset,
1879 unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, 1865 unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
1880 int lock_type); 1866 int lock_type);
1881 1867
1882 enum { 1868 enum {
1883 DIO_LOCKING = 1, /* need locking between buffered and direct access */ 1869 DIO_LOCKING = 1, /* need locking between buffered and direct access */
1884 DIO_NO_LOCKING, /* bdev; no locking at all between buffered/direct */ 1870 DIO_NO_LOCKING, /* bdev; no locking at all between buffered/direct */
1885 DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */ 1871 DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */
1886 }; 1872 };
1887 1873
1888 static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb, 1874 static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
1889 struct inode *inode, struct block_device *bdev, const struct iovec *iov, 1875 struct inode *inode, struct block_device *bdev, const struct iovec *iov,
1890 loff_t offset, unsigned long nr_segs, get_block_t get_block, 1876 loff_t offset, unsigned long nr_segs, get_block_t get_block,
1891 dio_iodone_t end_io) 1877 dio_iodone_t end_io)
1892 { 1878 {
1893 return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, 1879 return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
1894 nr_segs, get_block, end_io, DIO_LOCKING); 1880 nr_segs, get_block, end_io, DIO_LOCKING);
1895 } 1881 }
1896 1882
1897 static inline ssize_t blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb, 1883 static inline ssize_t blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb,
1898 struct inode *inode, struct block_device *bdev, const struct iovec *iov, 1884 struct inode *inode, struct block_device *bdev, const struct iovec *iov,
1899 loff_t offset, unsigned long nr_segs, get_block_t get_block, 1885 loff_t offset, unsigned long nr_segs, get_block_t get_block,
1900 dio_iodone_t end_io) 1886 dio_iodone_t end_io)
1901 { 1887 {
1902 return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, 1888 return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
1903 nr_segs, get_block, end_io, DIO_NO_LOCKING); 1889 nr_segs, get_block, end_io, DIO_NO_LOCKING);
1904 } 1890 }
1905 1891
1906 static inline ssize_t blockdev_direct_IO_own_locking(int rw, struct kiocb *iocb, 1892 static inline ssize_t blockdev_direct_IO_own_locking(int rw, struct kiocb *iocb,
1907 struct inode *inode, struct block_device *bdev, const struct iovec *iov, 1893 struct inode *inode, struct block_device *bdev, const struct iovec *iov,
1908 loff_t offset, unsigned long nr_segs, get_block_t get_block, 1894 loff_t offset, unsigned long nr_segs, get_block_t get_block,
1909 dio_iodone_t end_io) 1895 dio_iodone_t end_io)
1910 { 1896 {
1911 return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, 1897 return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
1912 nr_segs, get_block, end_io, DIO_OWN_LOCKING); 1898 nr_segs, get_block, end_io, DIO_OWN_LOCKING);
1913 } 1899 }
1914 #endif 1900 #endif
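
For illustration, a block-based filesystem's ->direct_IO address_space operation (under CONFIG_BLOCK) normally just forwards to blockdev_direct_IO() with its own block-mapping routine; examplefs_get_block() below is hypothetical:

	#include <linux/fs.h>
	#include <linux/buffer_head.h>

	/* Hypothetical get_block_t, defined elsewhere. */
	extern int examplefs_get_block(struct inode *inode, sector_t iblock,
				       struct buffer_head *bh_result, int create);

	static ssize_t examplefs_direct_IO(int rw, struct kiocb *iocb,
					   const struct iovec *iov, loff_t offset,
					   unsigned long nr_segs)
	{
		struct inode *inode = iocb->ki_filp->f_mapping->host;

		/* DIO_LOCKING: the VFS serialises against buffered I/O */
		return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
					  iov, offset, nr_segs,
					  examplefs_get_block, NULL);
	}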
1915 1901
1916 extern const struct file_operations generic_ro_fops; 1902 extern const struct file_operations generic_ro_fops;
1917 1903
1918 #define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m)) 1904 #define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m))
1919 1905
1920 extern int vfs_readlink(struct dentry *, char __user *, int, const char *); 1906 extern int vfs_readlink(struct dentry *, char __user *, int, const char *);
1921 extern int vfs_follow_link(struct nameidata *, const char *); 1907 extern int vfs_follow_link(struct nameidata *, const char *);
1922 extern int page_readlink(struct dentry *, char __user *, int); 1908 extern int page_readlink(struct dentry *, char __user *, int);
1923 extern void *page_follow_link_light(struct dentry *, struct nameidata *); 1909 extern void *page_follow_link_light(struct dentry *, struct nameidata *);
1924 extern void page_put_link(struct dentry *, struct nameidata *, void *); 1910 extern void page_put_link(struct dentry *, struct nameidata *, void *);
1925 extern int __page_symlink(struct inode *inode, const char *symname, int len, 1911 extern int __page_symlink(struct inode *inode, const char *symname, int len,
1926 gfp_t gfp_mask); 1912 gfp_t gfp_mask);
1927 extern int page_symlink(struct inode *inode, const char *symname, int len); 1913 extern int page_symlink(struct inode *inode, const char *symname, int len);
1928 extern const struct inode_operations page_symlink_inode_operations; 1914 extern const struct inode_operations page_symlink_inode_operations;
1929 extern int generic_readlink(struct dentry *, char __user *, int); 1915 extern int generic_readlink(struct dentry *, char __user *, int);
1930 extern void generic_fillattr(struct inode *, struct kstat *); 1916 extern void generic_fillattr(struct inode *, struct kstat *);
1931 extern int vfs_getattr(struct vfsmount *, struct dentry *, struct kstat *); 1917 extern int vfs_getattr(struct vfsmount *, struct dentry *, struct kstat *);
1932 void inode_add_bytes(struct inode *inode, loff_t bytes); 1918 void inode_add_bytes(struct inode *inode, loff_t bytes);
1933 void inode_sub_bytes(struct inode *inode, loff_t bytes); 1919 void inode_sub_bytes(struct inode *inode, loff_t bytes);
1934 loff_t inode_get_bytes(struct inode *inode); 1920 loff_t inode_get_bytes(struct inode *inode);
1935 void inode_set_bytes(struct inode *inode, loff_t bytes); 1921 void inode_set_bytes(struct inode *inode, loff_t bytes);
1936 1922
1937 extern int vfs_readdir(struct file *, filldir_t, void *); 1923 extern int vfs_readdir(struct file *, filldir_t, void *);
1938 1924
1939 extern int vfs_stat(char __user *, struct kstat *); 1925 extern int vfs_stat(char __user *, struct kstat *);
1940 extern int vfs_lstat(char __user *, struct kstat *); 1926 extern int vfs_lstat(char __user *, struct kstat *);
1941 extern int vfs_stat_fd(int dfd, char __user *, struct kstat *); 1927 extern int vfs_stat_fd(int dfd, char __user *, struct kstat *);
1942 extern int vfs_lstat_fd(int dfd, char __user *, struct kstat *); 1928 extern int vfs_lstat_fd(int dfd, char __user *, struct kstat *);
1943 extern int vfs_fstat(unsigned int, struct kstat *); 1929 extern int vfs_fstat(unsigned int, struct kstat *);
1944 1930
1945 extern long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg); 1931 extern long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
1946 extern int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, 1932 extern int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
1947 unsigned long arg); 1933 unsigned long arg);
1948 1934
1949 extern void get_filesystem(struct file_system_type *fs); 1935 extern void get_filesystem(struct file_system_type *fs);
1950 extern void put_filesystem(struct file_system_type *fs); 1936 extern void put_filesystem(struct file_system_type *fs);
1951 extern struct file_system_type *get_fs_type(const char *name); 1937 extern struct file_system_type *get_fs_type(const char *name);
1952 extern struct super_block *get_super(struct block_device *); 1938 extern struct super_block *get_super(struct block_device *);
1953 extern struct super_block *user_get_super(dev_t); 1939 extern struct super_block *user_get_super(dev_t);
1954 extern void drop_super(struct super_block *sb); 1940 extern void drop_super(struct super_block *sb);
1955 1941
1956 extern int dcache_dir_open(struct inode *, struct file *); 1942 extern int dcache_dir_open(struct inode *, struct file *);
1957 extern int dcache_dir_close(struct inode *, struct file *); 1943 extern int dcache_dir_close(struct inode *, struct file *);
1958 extern loff_t dcache_dir_lseek(struct file *, loff_t, int); 1944 extern loff_t dcache_dir_lseek(struct file *, loff_t, int);
1959 extern int dcache_readdir(struct file *, void *, filldir_t); 1945 extern int dcache_readdir(struct file *, void *, filldir_t);
1960 extern int simple_getattr(struct vfsmount *, struct dentry *, struct kstat *); 1946 extern int simple_getattr(struct vfsmount *, struct dentry *, struct kstat *);
1961 extern int simple_statfs(struct dentry *, struct kstatfs *); 1947 extern int simple_statfs(struct dentry *, struct kstatfs *);
1962 extern int simple_link(struct dentry *, struct inode *, struct dentry *); 1948 extern int simple_link(struct dentry *, struct inode *, struct dentry *);
1963 extern int simple_unlink(struct inode *, struct dentry *); 1949 extern int simple_unlink(struct inode *, struct dentry *);
1964 extern int simple_rmdir(struct inode *, struct dentry *); 1950 extern int simple_rmdir(struct inode *, struct dentry *);
1965 extern int simple_rename(struct inode *, struct dentry *, struct inode *, struct dentry *); 1951 extern int simple_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
1966 extern int simple_sync_file(struct file *, struct dentry *, int); 1952 extern int simple_sync_file(struct file *, struct dentry *, int);
1967 extern int simple_empty(struct dentry *); 1953 extern int simple_empty(struct dentry *);
1968 extern int simple_readpage(struct file *file, struct page *page); 1954 extern int simple_readpage(struct file *file, struct page *page);
1969 extern int simple_prepare_write(struct file *file, struct page *page, 1955 extern int simple_prepare_write(struct file *file, struct page *page,
1970 unsigned offset, unsigned to); 1956 unsigned offset, unsigned to);
1971 extern int simple_write_begin(struct file *file, struct address_space *mapping, 1957 extern int simple_write_begin(struct file *file, struct address_space *mapping,
1972 loff_t pos, unsigned len, unsigned flags, 1958 loff_t pos, unsigned len, unsigned flags,
1973 struct page **pagep, void **fsdata); 1959 struct page **pagep, void **fsdata);
1974 extern int simple_write_end(struct file *file, struct address_space *mapping, 1960 extern int simple_write_end(struct file *file, struct address_space *mapping,
1975 loff_t pos, unsigned len, unsigned copied, 1961 loff_t pos, unsigned len, unsigned copied,
1976 struct page *page, void *fsdata); 1962 struct page *page, void *fsdata);
1977 1963
1978 extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct nameidata *); 1964 extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct nameidata *);
1979 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *); 1965 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
1980 extern const struct file_operations simple_dir_operations; 1966 extern const struct file_operations simple_dir_operations;
1981 extern const struct inode_operations simple_dir_inode_operations; 1967 extern const struct inode_operations simple_dir_inode_operations;
1982 struct tree_descr { char *name; const struct file_operations *ops; int mode; }; 1968 struct tree_descr { char *name; const struct file_operations *ops; int mode; };
1983 struct dentry *d_alloc_name(struct dentry *, const char *); 1969 struct dentry *d_alloc_name(struct dentry *, const char *);
1984 extern int simple_fill_super(struct super_block *, int, struct tree_descr *); 1970 extern int simple_fill_super(struct super_block *, int, struct tree_descr *);
1985 extern int simple_pin_fs(struct file_system_type *, struct vfsmount **mount, int *count); 1971 extern int simple_pin_fs(struct file_system_type *, struct vfsmount **mount, int *count);
1986 extern void simple_release_fs(struct vfsmount **mount, int *count); 1972 extern void simple_release_fs(struct vfsmount **mount, int *count);
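For context, the tree_descr table and simple_fill_super() declared above are the usual way a pseudo-filesystem describes a fixed file tree.  The sketch below is illustrative only and is not part of this patch; the thingyfs_* names, the file_operations they point at and THINGYFS_MAGIC are all hypothetical:

	/*
	 * Hypothetical example: describe the files of a pseudo-filesystem
	 * with a tree_descr table and let simple_fill_super() create them.
	 * By convention the table starts at index [2], leaving the low
	 * inode numbers for the root directory, and an entry with an
	 * empty name terminates it.
	 */
	static struct tree_descr thingyfs_files[] = {
		[2] = { "status",  &thingyfs_status_fops,  S_IRUGO },
		[3] = { "control", &thingyfs_control_fops, S_IRUGO | S_IWUSR },
		{ "" }
	};

	static int thingyfs_fill_super(struct super_block *sb, void *data, int silent)
	{
		return simple_fill_super(sb, THINGYFS_MAGIC, thingyfs_files);
	}

Such a fill_super routine would then typically be handed to get_sb_single() or get_sb_nodev() from the filesystem's get_sb method.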
1987 1973
1988 extern ssize_t simple_read_from_buffer(void __user *, size_t, loff_t *, const void *, size_t); 1974 extern ssize_t simple_read_from_buffer(void __user *, size_t, loff_t *, const void *, size_t);
1989 1975
1990 #ifdef CONFIG_MIGRATION 1976 #ifdef CONFIG_MIGRATION
1991 extern int buffer_migrate_page(struct address_space *, 1977 extern int buffer_migrate_page(struct address_space *,
1992 struct page *, struct page *); 1978 struct page *, struct page *);
1993 #else 1979 #else
1994 #define buffer_migrate_page NULL 1980 #define buffer_migrate_page NULL
1995 #endif 1981 #endif
1996 1982
1997 extern int inode_change_ok(struct inode *, struct iattr *); 1983 extern int inode_change_ok(struct inode *, struct iattr *);
1998 extern int __must_check inode_setattr(struct inode *, struct iattr *); 1984 extern int __must_check inode_setattr(struct inode *, struct iattr *);
1999 1985
2000 extern void file_update_time(struct file *file); 1986 extern void file_update_time(struct file *file);
2001 1987
2002 static inline ino_t parent_ino(struct dentry *dentry) 1988 static inline ino_t parent_ino(struct dentry *dentry)
2003 { 1989 {
2004 ino_t res; 1990 ino_t res;
2005 1991
2006 spin_lock(&dentry->d_lock); 1992 spin_lock(&dentry->d_lock);
2007 res = dentry->d_parent->d_inode->i_ino; 1993 res = dentry->d_parent->d_inode->i_ino;
2008 spin_unlock(&dentry->d_lock); 1994 spin_unlock(&dentry->d_lock);
2009 return res; 1995 return res;
2010 } 1996 }
2011 1997
2012 /* kernel/fork.c */ 1998 /* kernel/fork.c */
2013 extern int unshare_files(void); 1999 extern int unshare_files(void);
2014 2000
2015 /* Transaction based IO helpers */ 2001 /* Transaction based IO helpers */
2016 2002
2017 /* 2003 /*
2018 * An argresp is stored in an allocated page and holds the 2004 * An argresp is stored in an allocated page and holds the
2019 * size of the argument or response, along with its content 2005 * size of the argument or response, along with its content
2020 */ 2006 */
2021 struct simple_transaction_argresp { 2007 struct simple_transaction_argresp {
2022 ssize_t size; 2008 ssize_t size;
2023 char data[0]; 2009 char data[0];
2024 }; 2010 };
2025 2011
2026 #define SIMPLE_TRANSACTION_LIMIT (PAGE_SIZE - sizeof(struct simple_transaction_argresp)) 2012 #define SIMPLE_TRANSACTION_LIMIT (PAGE_SIZE - sizeof(struct simple_transaction_argresp))
2027 2013
2028 char *simple_transaction_get(struct file *file, const char __user *buf, 2014 char *simple_transaction_get(struct file *file, const char __user *buf,
2029 size_t size); 2015 size_t size);
2030 ssize_t simple_transaction_read(struct file *file, char __user *buf, 2016 ssize_t simple_transaction_read(struct file *file, char __user *buf,
2031 size_t size, loff_t *pos); 2017 size_t size, loff_t *pos);
2032 int simple_transaction_release(struct inode *inode, struct file *file); 2018 int simple_transaction_release(struct inode *inode, struct file *file);
2033 2019
2034 static inline void simple_transaction_set(struct file *file, size_t n) 2020 static inline void simple_transaction_set(struct file *file, size_t n)
2035 { 2021 {
2036 struct simple_transaction_argresp *ar = file->private_data; 2022 struct simple_transaction_argresp *ar = file->private_data;
2037 2023
2038 BUG_ON(n > SIMPLE_TRANSACTION_LIMIT); 2024 BUG_ON(n > SIMPLE_TRANSACTION_LIMIT);
2039 2025
2040 /* 2026 /*
2041 * The barrier ensures that ar->size will really remain zero until 2027 * The barrier ensures that ar->size will really remain zero until
2042 * ar->data is ready for reading. 2028 * ar->data is ready for reading.
2043 */ 2029 */
2044 smp_mb(); 2030 smp_mb();
2045 ar->size = n; 2031 ar->size = n;
2046 } 2032 }
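As an aside, the transaction helpers above implement a write-then-read pattern: the writer hands its request to simple_transaction_get(), builds a response in the same page, and only then publishes the response size with simple_transaction_set() (the barrier above keeps a concurrent reader from seeing a non-zero size before the data is complete).  A minimal, hypothetical user might look like this; the thingy_* names and the do_thingy() handler are made up:

	static ssize_t thingy_file_write(struct file *file, const char __user *buf,
					 size_t size, loff_t *pos)
	{
		char *data;
		ssize_t rv;

		/* copy the request into the per-file transaction page */
		data = simple_transaction_get(file, buf, size);
		if (IS_ERR(data))
			return PTR_ERR(data);

		/*
		 * Hypothetical handler: overwrites data with a response no
		 * larger than SIMPLE_TRANSACTION_LIMIT and returns its
		 * length, or a negative errno on failure.
		 */
		rv = do_thingy(data, size);
		if (rv < 0)
			return rv;

		/* make the response visible to simple_transaction_read() */
		simple_transaction_set(file, rv);
		return size;
	}

	static const struct file_operations thingy_transaction_fops = {
		.write		= thingy_file_write,
		.read		= simple_transaction_read,
		.release	= simple_transaction_release,
	};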
2047 2033
2048 /* 2034 /*
2049 * simple attribute files 2035 * simple attribute files
2050 * 2036 *
2051 * These attributes behave similar to those in sysfs: 2037 * These attributes behave similar to those in sysfs:
2052 * 2038 *
2053 * Writing to an attribute immediately sets a value, an open file can be 2039 * Writing to an attribute immediately sets a value, an open file can be
2054 * written to multiple times. 2040 * written to multiple times.
2055 * 2041 *
2056 * Reading from an attribute creates a buffer from the value that might get 2042 * Reading from an attribute creates a buffer from the value that might get
2057 * read with multiple read calls. When the attribute has been read 2043 * read with multiple read calls. When the attribute has been read
2058 * completely, no further read calls are possible until the file is opened 2044 * completely, no further read calls are possible until the file is opened
2059 * again. 2045 * again.
2060 * 2046 *
2061 * All attributes contain a text representation of a numeric value 2047 * All attributes contain a text representation of a numeric value
2062 * that are accessed with the get() and set() functions. 2048 * that are accessed with the get() and set() functions.
2063 */ 2049 */
2064 #define DEFINE_SIMPLE_ATTRIBUTE(__fops, __get, __set, __fmt) \ 2050 #define DEFINE_SIMPLE_ATTRIBUTE(__fops, __get, __set, __fmt) \
2065 static int __fops ## _open(struct inode *inode, struct file *file) \ 2051 static int __fops ## _open(struct inode *inode, struct file *file) \
2066 { \ 2052 { \
2067 __simple_attr_check_format(__fmt, 0ull); \ 2053 __simple_attr_check_format(__fmt, 0ull); \
2068 return simple_attr_open(inode, file, __get, __set, __fmt); \ 2054 return simple_attr_open(inode, file, __get, __set, __fmt); \
2069 } \ 2055 } \
2070 static struct file_operations __fops = { \ 2056 static struct file_operations __fops = { \
2071 .owner = THIS_MODULE, \ 2057 .owner = THIS_MODULE, \
2072 .open = __fops ## _open, \ 2058 .open = __fops ## _open, \
2073 .release = simple_attr_close, \ 2059 .release = simple_attr_close, \
2074 .read = simple_attr_read, \ 2060 .read = simple_attr_read, \
2075 .write = simple_attr_write, \ 2061 .write = simple_attr_write, \
2076 }; 2062 };
2077 2063
2078 static inline void __attribute__((format(printf, 1, 2))) 2064 static inline void __attribute__((format(printf, 1, 2)))
2079 __simple_attr_check_format(const char *fmt, ...) 2065 __simple_attr_check_format(const char *fmt, ...)
2080 { 2066 {
2081 /* don't do anything, just let the compiler check the arguments; */ 2067 /* don't do anything, just let the compiler check the arguments; */
2082 } 2068 }
2083 2069
2084 int simple_attr_open(struct inode *inode, struct file *file, 2070 int simple_attr_open(struct inode *inode, struct file *file,
2085 u64 (*get)(void *), void (*set)(void *, u64), 2071 u64 (*get)(void *), void (*set)(void *, u64),
2086 const char *fmt); 2072 const char *fmt);
2087 int simple_attr_close(struct inode *inode, struct file *file); 2073 int simple_attr_close(struct inode *inode, struct file *file);
2088 ssize_t simple_attr_read(struct file *file, char __user *buf, 2074 ssize_t simple_attr_read(struct file *file, char __user *buf,
2089 size_t len, loff_t *ppos); 2075 size_t len, loff_t *ppos);
2090 ssize_t simple_attr_write(struct file *file, const char __user *buf, 2076 ssize_t simple_attr_write(struct file *file, const char __user *buf,
2091 size_t len, loff_t *ppos); 2077 size_t len, loff_t *ppos);
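The DEFINE_SIMPLE_ATTRIBUTE() machinery above is what debugfs-style single-value files are usually built from: the macro generates an open routine that binds the given get/set callbacks and printf format, while simple_attr_read() and simple_attr_write() handle the text conversion, and __simple_attr_check_format() exists only so the compiler can type-check the format string.  A hypothetical user, with made-up thingy_* names:

	/* hypothetical value exported as a text attribute file */
	static u64 thingy_threshold;

	static u64 thingy_threshold_get(void *data)
	{
		/* data is what the creator stashed in the inode (typically i_private) */
		return thingy_threshold;
	}

	static void thingy_threshold_set(void *data, u64 val)
	{
		thingy_threshold = val;
	}

	DEFINE_SIMPLE_ATTRIBUTE(thingy_threshold_fops, thingy_threshold_get,
				thingy_threshold_set, "%llu\n");

The resulting thingy_threshold_fops would then typically be passed to something like debugfs_create_file() to expose the value.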
2092 2078
2093 2079
2094 #ifdef CONFIG_SECURITY 2080 #ifdef CONFIG_SECURITY
2095 static inline char *alloc_secdata(void) 2081 static inline char *alloc_secdata(void)
2096 { 2082 {
2097 return (char *)get_zeroed_page(GFP_KERNEL); 2083 return (char *)get_zeroed_page(GFP_KERNEL);
2098 } 2084 }
2099 2085
2100 static inline void free_secdata(void *secdata) 2086 static inline void free_secdata(void *secdata)
2101 { 2087 {
2102 free_page((unsigned long)secdata); 2088 free_page((unsigned long)secdata);
2103 } 2089 }
2104 #else 2090 #else
2105 static inline char *alloc_secdata(void) 2091 static inline char *alloc_secdata(void)
2106 { 2092 {
2107 return (char *)1; 2093 return (char *)1;
2108 } 2094 }
2109 2095
2110 static inline void free_secdata(void *secdata) 2096 static inline void free_secdata(void *secdata)
2111 { } 2097 { }
2112 #endif /* CONFIG_SECURITY */ 2098 #endif /* CONFIG_SECURITY */
2113 2099
2114 struct ctl_table; 2100 struct ctl_table;
2115 int proc_nr_files(struct ctl_table *table, int write, struct file *filp, 2101 int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
2116 void __user *buffer, size_t *lenp, loff_t *ppos); 2102 void __user *buffer, size_t *lenp, loff_t *ppos);
2117 2103
2118 int get_filesystem_list(char * buf); 2104 int get_filesystem_list(char * buf);
2119 2105
2120 #endif /* __KERNEL__ */ 2106 #endif /* __KERNEL__ */
2121 #endif /* _LINUX_FS_H */ 2107 #endif /* _LINUX_FS_H */
2122 2108