26 Oct, 2010

40 commits

  • Instead of always assigning an increasing inode number in new_inode
    move the call to assign it into those callers that actually need it.
    For now the callers that need it are estimated conservatively, that is
    the call is added to all filesystems that do not assign an i_ino
    by themselves. For a few more filesystems we can avoid assigning
    any inode number given that they aren't user visible, and for others
    it could be done lazily when an inode number is actually needed,
    but that's left for later patches.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • new_inode() dirties a contended cache line to get increasing
    inode numbers. This limits performance on workloads that cause
    significant parallel inode allocation.

    Solve this problem by using a per_cpu variable fed by the shared
    last_ino in batches of 1024 allocations. This reduces contention on
    the shared last_ino and gives the same spread of inode numbers as before
    (i.e. same wraparound after 2^32 allocations).

    Signed-off-by: Eric Dumazet
    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Eric Dumazet
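
    The batching idea described above can be sketched in user space. This
    is only an illustration under stated assumptions, not the kernel
    code: the struct cpu_ino_cache type and the next_ino() name are
    hypothetical, and an explicit per-slot struct stands in for a real
    per_cpu variable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define INO_BATCH 1024u   /* batch size named in the commit message */

/* Shared counter; touched only once per INO_BATCH allocations. */
static atomic_uint shared_last_ino;

/* Hypothetical stand-in for one CPU's private slot. */
struct cpu_ino_cache {
    unsigned int last;    /* last number handed out from this slot */
};

unsigned int next_ino(struct cpu_ino_cache *c)
{
    assert(c != NULL);
    unsigned int res = c->last;

    /* A multiple of INO_BATCH means the local batch is empty or used up,
     * so reserve the next batch from the shared counter. */
    if ((res & (INO_BATCH - 1)) == 0)
        res = atomic_fetch_add(&shared_last_ino, INO_BATCH);

    c->last = ++res;      /* unsigned arithmetic wraps around naturally */
    return res;
}
```

    Two slots that start empty hand out numbers from disjoint batches
    (1, 2, ... for the first and 1025, 1026, ... for the second), so the
    shared counter is written once per 1024 allocations rather than on
    every call.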
     
  • Clones an existing reference to inode; caller must already hold one.

    Signed-off-by: Al Viro

    Al Viro
     
  • Split up inode_add_to_list/__inode_add_to_list. Locking for the two
    lists will be split soon so these helpers really don't buy us much
    anymore.

    The __ prefixes for the sb list helpers will go away soon, but until
    inode_lock is gone we'll need them to distinguish between the locked
    and unlocked variants.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that iunique is not abusing find_inode anymore we can move the i_ref
    increment back to where it belongs.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Stop abusing find_inode_fast for iunique and opencode the inode hash walk.
    Introduce a new iunique_lock to protect the iunique counters once inode_lock
    is removed.

    Based on a patch originally from Nick Piggin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Before replacing the inode hash locking with a more scalable
    mechanism, factor the removal of the inode from the hashes rather
    than open coding it in several places.

    Based on a patch originally from Nick Piggin.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Convert the inode LRU to use lazy updates to reduce lock and
    cacheline traffic. We avoid moving inodes around in the LRU list
    during iget/iput operations so these frequent operations don't need
    to access the LRUs. Instead, we defer the refcount checks to
    reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
    reclaim that iget has touched the inode in the past. This means that
    only reclaim should be touching the LRU with any frequency, hence
    significantly reducing lock acquisitions and the amount of contention
    on LRU updates.

    This also removes the inode_in_use list, which means we now only
    have one list for tracking the inode LRU status. This makes it much
    simpler to split out the LRU list operations under its own lock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
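
    The reclaim-time decision described above resembles a second-chance
    scheme. The sketch below is a simplified model under assumptions:
    struct toy_inode, the flag value and the function names are
    hypothetical, and the real code also handles locking and the other
    inode state bits.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_I_REFERENCED 0x1u   /* hypothetical flag value */

struct toy_inode {
    unsigned int refcount;      /* active references */
    unsigned int state;         /* state flags */
};

enum lru_action { LRU_SKIP, LRU_ROTATE, LRU_FREE };

/* Hot path: record the access in the inode itself, leave the LRU alone. */
void toy_mark_referenced(struct toy_inode *inode)
{
    assert(inode != NULL);
    inode->state |= TOY_I_REFERENCED;
}

/* Reclaim path: the refcount and flag checks are deferred to here. */
enum lru_action toy_reclaim_one(struct toy_inode *inode)
{
    assert(inode != NULL);
    if (inode->refcount > 0)
        return LRU_SKIP;                    /* still in use */
    if (inode->state & TOY_I_REFERENCED) {
        inode->state &= ~TOY_I_REFERENCED;  /* give it one more pass */
        return LRU_ROTATE;
    }
    return LRU_FREE;                        /* untouched since last scan */
}
```

    An unused inode that was touched since the last scan survives one
    reclaim pass and is freed on the next, without any list manipulation
    in the hot path.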
     
  • The number of inodes allocated does not need to be tied to the
    addition or removal of an inode to/from a list. If we are not tied
    to a list lock, we could update the counters when inodes are
    initialised or destroyed, but to do that we need to convert the
    counters to be per-cpu (i.e. independent of a lock). This means that
    we have the freedom to change the list/locking implementation
    without needing to care about the counters.

    Based on a patch originally from Eric Dumazet.

    [AV: cleaned up a bit, fixed build breakage on weird configs]

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
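
    A per-cpu counter decouples accounting from any list lock because
    each CPU updates only its own slot and readers sum all slots. The
    following is a minimal single-threaded model, assuming a fixed
    DEMO_NR_CPUS and hypothetical function names; it is not the kernel's
    percpu implementation.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4   /* assumed CPU count for the sketch */

/* One slot per CPU; a slot is only written by its own CPU. */
static long nr_inodes_pcpu[DEMO_NR_CPUS];

/* Called when an inode is initialised (+1) or destroyed (-1). */
void nr_inodes_add(int cpu, long delta)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    nr_inodes_pcpu[cpu] += delta;
}

/* Readers sum every slot; a single slot may be negative when an inode is
 * created on one CPU and destroyed on another, so clamp the total. */
long nr_inodes_sum(void)
{
    long sum = 0;
    for (int i = 0; i < DEMO_NR_CPUS; i++)
        sum += nr_inodes_pcpu[i];
    return sum < 0 ? 0 : sum;
}
```

    The total stays correct even when the increment and the decrement for
    one inode land in different slots.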
     
  • If clone_mnt() happens while mnt_make_readonly() is running, the
    cloned mount might have MNT_WRITE_HOLD flag set, which results in
    mnt_want_write() spinning forever on this mount.

    Needs CAP_SYS_ADMIN to trigger deliberately and unlikely to happen
    accidentally. But if it does happen it can hang the machine.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Make node look as if it was on hlist, with hlist_del()
    working correctly. Usable without any locking...

    Convert a couple of places where we want to do that to
    inode->i_hash.

    Signed-off-by: Al Viro

    Al Viro
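
    The trick can be shown with a small user-space re-implementation of
    an hlist node. The toy_ names are hypothetical and this is a sketch
    of the idea rather than the kernel's list.h: pointing pprev at the
    node's own next field makes deletion a harmless self-update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_hnode {
    struct toy_hnode *next;
    struct toy_hnode **pprev;   /* address of the pointer that points at us */
};

void toy_hnode_init(struct toy_hnode *n)
{
    assert(n != NULL);
    n->next = NULL;
    n->pprev = NULL;
}

bool toy_hnode_unhashed(const struct toy_hnode *n)
{
    return n->pprev == NULL;
}

/* Make the node look hashed without linking it into any list.
 * Clearing next here is this sketch's choice, so the helper is also
 * safe on a node that was never initialised. */
void toy_hnode_add_fake(struct toy_hnode *n)
{
    n->next = NULL;
    n->pprev = &n->next;
}

/* Ordinary unlink; for a fake node *pprev is n->next, so nothing
 * outside the node is written and no lock is needed. */
void toy_hnode_del_init(struct toy_hnode *n)
{
    if (toy_hnode_unhashed(n))
        return;
    struct toy_hnode *next = n->next;
    struct toy_hnode **pprev = n->pprev;
    *pprev = next;
    if (next)
        next->pprev = pprev;
    toy_hnode_init(n);
}
```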
     
  • note: for race-free uses you need inode_lock held

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • We are in fill_super(); again, no inodes with zero i_count could
    be around until we set MS_ACTIVE.

    Signed-off-by: Al Viro

    Al Viro
     
  • In fill_super() we haven't set MS_ACTIVE yet, so there won't
    be any inodes with zero i_count sitting around.

    In put_super() we already have MS_ACTIVE removed *and* we
    had called invalidate_inodes() since then. So again there
    won't be any inodes with zero i_count...

    Signed-off-by: Al Viro

    Al Viro
     
  • It's pointless - we *do* have busy inodes (root directory,
    for one), so that call will fail and attempt to change
    XIP flag will be ignored.

    Signed-off-by: Al Viro

    Al Viro
     
  • If we have the appropriate page already, call __block_write_begin()
    directly instead of releasing and regrabbing it inside of
    block_write_begin().

    Signed-off-by: Namhyung Kim
    Signed-off-by: Al Viro

    Namhyung Kim
     
  • Since inode->i_mode shares its bits for S_IFMT, S_ISDIR should be
    used to distinguish whether it is a dir or not.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Al Viro

    Namhyung Kim
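
    Why a plain bit test is not enough can be seen from user space. The
    sketch below assumes the Linux values of the file type constants,
    where the S_IFBLK pattern contains the S_IFDIR bit; the function
    names are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Buggy: the file type is a multi-bit field under S_IFMT, so testing a
 * single bit also matches other types that share that bit. */
bool naive_is_dir(mode_t mode)
{
    return (mode & S_IFDIR) != 0;
}

/* Correct: S_ISDIR() compares the whole S_IFMT field against S_IFDIR. */
bool proper_is_dir(mode_t mode)
{
    return S_ISDIR(mode);
}

/* Self-check, valid for the Linux constant values assumed above. */
void demo_block_device_is_not_a_dir(void)
{
    mode_t blk = S_IFBLK | 0600;
    assert(naive_is_dir(blk));      /* false positive */
    assert(!proper_is_dir(blk));    /* correctly rejected */
}
```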
     
  • The aio batching code is using igrab to get an extra reference on the
    inode so it can safely batch. igrab will go ahead and take the global
    inode spinlock, which can be a bottleneck on large machines doing lots
    of AIO.

    In this case, igrab isn't required because we already have a reference
    on the file handle. It is safe to just bump the i_count directly
    on the inode.

    Benchmarking shows this patch brings IOP/s on tons of flash up by about
    2.5X.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Updated Documentation/filesystems/Locking to match the code.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • bh->b_private is initialized within init_buffer(), thus the
    assignment should be redundant. Remove it.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Al Viro

    Namhyung Kim
     
  • Move the EXPORTFS kconfig symbol out of the NETWORK_FILESYSTEMS block
    since it provides a library function that can be (and is) used by other
    (non-network) filesystems.

    This also eliminates a kconfig dependency warning:

    warning: (XFS_FS && BLOCK || NFSD && NETWORK_FILESYSTEMS && INET && FILE_LOCKING && BKL) selects EXPORTFS which has unmet direct dependencies (NETWORK_FILESYSTEMS)

    Signed-off-by: Randy Dunlap
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Alex Elder
    Cc: xfs-masters@oss.sgi.com
    Signed-off-by: Al Viro

    Randy Dunlap
     
  • Use sync_dirty_buffer instead of incorrectly open-coding it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now, rw_verify_area() checks whether f_pos is negative and, if it is,
    returns -EINVAL.

    But some special files, such as /dev/(k)mem and /proc/<pid>/mem, have
    negative offsets, so we can't do any access via read/write to such a
    file (device).

    So introduce FMODE_UNSIGNED_OFFSET to allow negative file offsets.

    Signed-off-by: Wu Fengguang
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Al Viro
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    KAMEZAWA Hiroyuki
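
    The effect of the new flag on the offset check can be modelled as
    below. This is a reduced sketch under assumptions: struct toy_file,
    the flag value and toy_check_pos() are hypothetical, and the overflow
    checks on pos + count that a real implementation needs are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_FMODE_UNSIGNED_OFFSET 0x1u   /* hypothetical flag value */

struct toy_file {
    unsigned int f_mode;
};

/* Returns 0 if the position is acceptable, -EINVAL otherwise. */
int toy_check_pos(const struct toy_file *file, long long pos)
{
    assert(file != NULL);
    if (pos >= 0)
        return 0;
    /* A negative signed value is a large unsigned offset; only files
     * that opted in may use it. */
    if (file->f_mode & TOY_FMODE_UNSIGNED_OFFSET)
        return 0;
    return -EINVAL;
}
```

    Only files that set the flag accept a negative position; every other
    file keeps the previous -EINVAL behaviour.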
     
  • 365b1818 ("add f_flags to struct statfs(64)") resized f_spare within
    struct statfs which caused a UML crash. There is no need to copy f_spare.

    Signed-off-by: Richard Weinberger
    Reported-by: Toralf Förster
    Tested-by: Toralf Förster
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Richard Weinberger
     
  • Documentation: Fix trivial typo in filesystems/sharedsubtree.txt

    This typo is easy to ignore unless you have spent a great deal of time
    thinking about how to eliminate duplicate dentries in unions.

    Signed-off-by: Valerie Aurora
    Signed-off-by: Al Viro

    Valerie Aurora
     
  • The intent was to verify that bh = affs_bread_ino(...) returned a valid
    pointer. We checked "ext_bh" earlier in the function and it's valid
    here.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Al Viro

    Dan Carpenter
     
  • Andrew,

    Could you please review this patch, you probably are the right guy to
    take it, because it crosses fs and net trees.

    Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesn't
    depend on previous patch (sysctl: fix min/max handling in
    __do_proc_doulongvec_minmax())

    Thanks !

    [PATCH V4] fs: allow for more than 2^31 files

    Robin Holt tried to boot a 16TB system and found af_unix was overflowing
    a 32bit value :

    We were seeing a failure which prevented boot. The kernel was incapable
    of creating either a named pipe or unix domain socket. This comes down
    to a common kernel function called unix_create1() which does:

    atomic_inc(&unix_nr_socks);
    if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
        goto out;

    The function get_max_files() is a simple return of files_stat.max_files.
    files_stat.max_files is a signed integer and is computed in
    fs/file_table.c's files_init().

    n = (mempages * (PAGE_SIZE / 1024)) / 10;
    files_stat.max_files = n;

    In our case, mempages (total_ram_pages) is approx 3,758,096,384
    (0xe0000000). That leaves max_files at approximately 1,503,238,553.
    This causes 2 * get_max_files() to integer overflow.

    Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
    integers, and change af_unix to use an atomic_long_t instead of
    atomic_t.

    get_max_files() is changed to return an unsigned long.
    get_nr_files() is changed to return a long.

    unix_nr_socks is changed from atomic_t to atomic_long_t, while not
    strictly needed to address Robin's problem.

    Before patch (on a 64bit kernel) :
    # echo 2147483648 >/proc/sys/fs/file-max
    # cat /proc/sys/fs/file-max
    -18446744071562067968

    After patch:
    # echo 2147483648 >/proc/sys/fs/file-max
    # cat /proc/sys/fs/file-max
    2147483648
    # cat /proc/sys/fs/file-nr
    704 0 2147483648

    Reported-by: Robin Holt
    Signed-off-by: Eric Dumazet
    Acked-by: David Miller
    Reviewed-by: Robin Holt
    Tested-by: Robin Holt
    Signed-off-by: Al Viro

    Eric Dumazet
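
    The arithmetic in the report can be checked with 64-bit integers.
    The sketch assumes a 4096-byte page size, which the message does not
    state, and uses hypothetical function names; it performs the doubling
    in 64 bits so that the check itself cannot overflow.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Same formula as files_init(), evaluated in 64-bit arithmetic. */
uint64_t default_max_files(uint64_t mempages, uint64_t page_size)
{
    assert(page_size >= 1024);
    return (mempages * (page_size / 1024)) / 10;
}

/* True if 2 * max_files no longer fits in a signed 32-bit int. */
int doubling_overflows_int(uint64_t max_files)
{
    return 2 * max_files > (uint64_t)INT_MAX;
}
```

    With mempages = 0xe0000000 and 4096-byte pages the formula gives
    1,503,238,553, and twice that value (3,006,477,106) exceeds INT_MAX
    (2,147,483,647), which matches the failure described above.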
     
  • Until now isofs_get_blocks() was limited to handling only 4TB files on
    32-bit architectures because of the unnecessary use of the iblock variable,
    which was a signed long. Just remove the variable. The error messages that were using this
    variable should have rather used b_off anyway because that is the block we
    are currently mapping.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • __block_write_begin and block_prepare_write are identical except for slightly
    different calling conventions. Convert all callers to the __block_write_begin
    calling conventions and drop block_prepare_write.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Hugetlbfs used to need it, but after the destroy_inode and evict_inode
    changes it's not required anymore.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Add a new helper to write out the inode using the writeback code,
    that is including the correct dirty bit and list manipulation. A few
    filesystems already opencode this, and a lot of others should be
    using it instead of using write_inode_now which also writes out the
    data.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • The caller that didn't need it is gone.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • …t/khilman/linux-davinci

    * 'davinci-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-davinci: (50 commits)
    davinci: fix remaining board support after io_pgoffst removal
    davinci: mityomapl138: make file local data static
    arm/davinci: remove duplicated include
    davinci: Initial support for Omapl138-Hawkboard
    davinci: MityDSP-L138/MityARM-1808 read MAC address from I2C Prom
    davinci: add tnetv107x touchscreen platform device
    input: add driver for tnetv107x touchscreen controller
    davinci: add keypad config for tnetv107x evm board
    davinci: add tnetv107x keypad platform device
    input: add driver for tnetv107x on-chip keypad controller
    net: davinci_emac: cleanup unused cpdma code
    net: davinci_emac: switch to new cpdma layer
    net: davinci_emac: separate out cpdma code
    net: davinci_emac: cleanup unused mdio emac code
    omap: cleanup unused davinci mdio arch code
    davinci: cleanup mdio arch code and switch to phy_id
    net: davinci_emac: switch to new mdio
    omap: add mdio platform devices
    davinci: add mdio platform devices
    net: davinci_emac: separate out davinci mdio
    ...

    Fix up trivial conflict in drivers/input/keyboard/Kconfig (two entries
    added next to each other - one from the davinci merge, one from the
    input merge)

    Linus Torvalds
     
  • * 'for-linus' of git://git.open-osd.org/linux-open-osd:
    exofs: Remove inode->i_count manipulation in exofs_new_inode
    fs/exofs: typo fix of faild to failed
    exofs: Set i_mapping->backing_dev_info anyway
    exofs: Cleaup read path in regard with read_for_write

    Linus Torvalds
     
  • Commit b40827fa7268 ("x86-32, mm: Add an initial page table for core
    bootstrapping") added an include directive which is needless and is
    taken care of by a previous one. Remove it.

    Caught-by: Jaswinder Singh Rajput
    Signed-off-by: Borislav Petkov
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • exofs_new_inode() was incrementing the inode->i_count and
    decrementing it in create_done(), in a bad attempt to make sure
    the inode will still be there when the asynchronous create_done()
    finally arrives. This was very stupid because iput() was not called,
    and if it was actually needed, it would leak the inode.

    However all this is not needed, because at exofs_evict_inode()
    we already wait for create_done() by waiting for the
    object_created event. Therefore remove the superfluous ref counting
    and just thicken the comment at exofs_evict_inode() a bit.

    While at it change places that open coded wait_obj_created()
    to call the already available wrapper.

    CC: Dave Chinner
    CC: Christoph Hellwig
    CC: Nick Piggin
    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     
  • Signed-off-by: Joe Perches
    Signed-off-by: Boaz Harrosh

    Joe Perches