29 May, 2012

1 commit

  • Pull writeback tree from Wu Fengguang:
    "Mainly from Jan Kara to avoid iput() in the flusher threads."

    * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Avoid iput() from flusher thread
    vfs: Rename end_writeback() to clear_inode()
    vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
    writeback: Refactor writeback_single_inode()
    writeback: Remove wb->list_lock from writeback_single_inode()
    writeback: Separate inode requeueing after writeback
    writeback: Move I_DIRTY_PAGES handling
    writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
    writeback: Move clearing of I_SYNC into inode_sync_complete()
    writeback: initialize global_dirty_limit
    fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
    mm: page-writeback.c: local functions should not be exposed globally

    Linus Torvalds
     

25 May, 2012

1 commit

  • Pull more networking updates from David Miller:
    "Ok, everything from here on out will be bug fixes."

    1) One final sync of wireless and bluetooth stuff from John Linville.
    These changes have all been in his tree for more than a week, and
    therefore have had the necessary -next exposure. John was just away
    on a trip and didn't have a change to send the pull request until a
    day or two ago.

    2) Put back some defines in user exposed header file areas that were
    removed during the tokenring purge. From Stephen Hemminger and Paul
    Gortmaker.

    3) A bug fix for UDP hash table allocation got lost in the pile due to
    one of those "you got it.. no I've got it.." situations. :-)

    From Tim Bird.

    4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll
    try to coalesce overlapping frags and crash. Fix from Eric Dumazet.

    5) RCU routing table lookups can race with free_fib_info(), causing
    crashes when we deref the device pointers in the route. Fix by
    releasing the net device in the RCU callback. From Yanmin Zhang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits)
    tcp: take care of overlaps in tcp_try_coalesce()
    ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
    mm: add a low limit to alloc_large_system_hash
    ipx: restore token ring define to include/linux/ipx.h
    if: restore token ring ARP type to header
    xen: do not disable netfront in dom0
    phy/micrel: Fix ID of KSZ9021
    mISDN: Add X-Tensions USB ISDN TA XC-525
    gianfar:don't add FCB length to hard_header_len
    Bluetooth: Report proper error number in disconnection
    Bluetooth: Create flags for bt_sk()
    Bluetooth: report the right security level in getsockopt
    Bluetooth: Lock the L2CAP channel when sending
    Bluetooth: Restore locking semantics when looking up L2CAP channels
    Bluetooth: Fix a redundant and problematic incoming MTU check
    Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C
    Bluetooth: Fix EIR data generation for mgmt_device_found
    Bluetooth: Fix Inquiry with RSSI event mask
    Bluetooth: improve readability of l2cap_seq_list code
    Bluetooth: Fix skb length calculation
    ...

    Linus Torvalds
     

24 May, 2012

1 commit

  • UDP stack needs a minimum hash size value for proper operation and also
    uses alloc_large_system_hash() for proper NUMA distribution of its hash
    tables and automatic sizing depending on available system memory.

    On some low memory situations, udp_table_init() must ignore the
    alloc_large_system_hash() result and reallocs a bigger memory area.

    As we cannot easily free old hash table, we leak it and kmemleak can
    issue a warning.

    This patch adds a low limit parameter to alloc_large_system_hash() to
    solve this problem.

    We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
    allocation.

    Reported-by: Mark Asselstine
    Reported-by: Tim Bird
    Signed-off-by: Eric Dumazet
    Cc: Paul Gortmaker
    Signed-off-by: David S. Miller

    Tim Bird
     

06 May, 2012

3 commits

  • Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
    because iput() can do a lot of work - for example truncate the inode if it's
    the last iput on unlinked file. Some filesystems depend on flusher thread
    progressing (e.g. because they need to flush delay allocated blocks to reduce
    allocation uncertainty) and so flusher thread doing truncate creates
    interesting dependencies and possibilities for deadlocks.

    We get rid of iput() in flusher thread by using the fact that I_SYNC inode
    flag effectively pins the inode in memory. So if we take care to either hold
    i_lock or have I_SYNC set, we can get away without taking inode reference
    in writeback_sb_inodes().

    As a side effect of these changes, we also fix possible use-after-free in
    wb_writeback() because inode_wait_for_writeback() call could try to reacquire
    i_lock on the inode that was already free.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • After we moved inode_sync_wait() from end_writeback() it doesn't make sense
    to call the function end_writeback() anymore. Rename it to clear_inode()
    which well says what the function really does - set I_CLEAR flag.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • Currently, I_SYNC can never be set when evict_inode() (and thus
    end_writeback()) is called because flusher thread holds inode reference while
    inode is under writeback. As a result inode_sync_wait() in those places
    currently does nothing. However that is going to change and unveils problems
    with calling inode_sync_wait() from end_writeback(). Several filesystems call
    end_writeback() after they have deleted the inode (btrfs, gfs2, ...) and other
    filesystems (ext3, ext4, reiserfs, ...) can deadlock when waiting for I_SYNC
    because they call end_writeback() from within a transaction.

    To avoid these issues, we move inode_sync_wait() into evict_inode() before
    calling ->evict_inode(). That way we preserve the current property that
    ->evict_inode() and writeback never run in parallel and all filesystems are
    safe.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

03 May, 2012

1 commit

  • The conversion of all of the users is not done yet there are too many to change
    in one go and leave the code reviewable. For now I change just the header and
    a few trivial users and rely on CONFIG_UIDGID_STRICT_TYPE_CHECKS not being set
    to ensure that the code will still compile during the transition.

    Helper functions i_uid_read, i_uid_write, i_gid_read, i_gid_write are added
    so that in most cases filesystems can avoid the complexities of multiple user
    namespaces and can concentrate on moving their raw numeric values into and
    out of the vfs data structures.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

08 Apr, 2012

1 commit

  • This represents a change in strategy of how to handle user namespaces.
    Instead of tagging everything explicitly with a user namespace and bulking
    up all of the comparisons of uids and gids in the kernel, all uids and gids
    in use will have a mapping to a flat kuid and kgid spaces respectively. This
    allows much more of the existing logic to be preserved and in general
    allows for faster code.

    In this new and improved world we allow someone to utiliize capabilities
    over an inode if the inodes owner mapps into the capabilities holders user
    namespace and the user has capabilities in their user namespace. Which
    is simple and efficient.

    Moving the fs uid comparisons to be comparisons in a flat kuid space
    follows in later patches, something that is only significant if you
    are using user namespaces.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

21 Mar, 2012

3 commits


11 Mar, 2012

2 commits

  • wait_on_inode() doesn't have ->i_lock

    Signed-off-by: Al Viro

    Al Viro
     
  • 9a7aa12f3911853a introduced additional logic around setting the i_mutex
    lockdep class for directory inodes. The idea was that some filesystems
    may want their own special lockdep class for different directory
    inodes and calling unlock_new_inode() should not clobber one of
    those special classes.

    I believe that the added conditional, around the *negated* return value
    of lockdep_match_class(), caused directory inodes to be placed in the
    wrong lockdep class.

    inode_init_always() sets the i_mutex lockdep class with i_mutex_key for
    all inodes. If the filesystem did not change the class during inode
    initialization, then the conditional mentioned above was false and the
    directory inode was incorrectly left in the non-directory lockdep class.
    If the filesystem did set a special lockdep class, then the conditional
    mentioned above was true and that class was clobbered with
    i_mutex_dir_key.

    This patch removes the negation from the conditional so that the i_mutex
    lockdep class is properly set for directory inodes. Special classes are
    preserved and directory inodes with unmodified classes are set with
    i_mutex_dir_key.

    Signed-off-by: Tyler Hicks
    Reviewed-by: Jan Kara
    Signed-off-by: Al Viro

    Tyler Hicks
     

14 Feb, 2012

1 commit

  • When the number of dentry cache hash table entries gets too high
    (2147483648 entries), as happens by default on a 16TB system, use of a
    signed integer in the dcache_init() initialization loop prevents the
    dentry_hashtable from getting initialized, causing a panic in
    __d_lookup(). Fix this in dcache_init() and similar areas.

    Signed-off-by: Dimitri Sivanich
    Acked-by: David S. Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dimitri Sivanich
     

18 Jan, 2012

1 commit


11 Jan, 2012

1 commit


07 Jan, 2012

1 commit

  • Add a new counter to the superblock that keeps track of unlinked but
    not yet deleted inodes.

    Do not WARN_ON if set_nlink is called with zero count, just do a
    ratelimited printk. This happens on xfs and probably other
    filesystems after an unclean shutdown when the filesystem reads inodes
    which already have zero i_nlink. Reported by Christoph Hellwig.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Al Viro

    Miklos Szeredi
     

04 Jan, 2012

3 commits


02 Nov, 2011

1 commit

  • Prevent direct modification of i_nlink by making it const and adding a
    non-const __i_nlink alias.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Christoph Hellwig

    Miklos Szeredi
     

28 Oct, 2011

1 commit


26 Aug, 2011

1 commit

  • Purely in-memory filesystems do not use the inode hash as the dcache
    tells us if an entry already exists. As a result, they do not call
    unlock_new_inode, and thus directory inodes do not get put into a
    different lockdep class for i_sem.

    We need the different lockdep classes, because the locking order for
    i_mutex is different for directory inodes and regular inodes. Directory
    inodes can do "readdir()", which takes i_mutex *before* possibly taking
    mm->mmap_sem (due to a page fault while copying the directory entry to
    user space).

    In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
    before accessing i_mutex.

    The two cases can never happen for the same inode, so no real deadlock
    can occur, but without the different lockdep classes, lockdep cannot
    understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
    can lead to false positives from lockdep like below:

    find/645 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x5c/0xac

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#15){+.+.+.}, at: []
    vfs_readdir+0x5b/0xb4

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
    [] lock_acquire+0xbf/0x103
    [] __mutex_lock_common+0x4c/0x361
    [] mutex_lock_nested+0x40/0x45
    [] hugetlbfs_file_mmap+0x82/0x110
    [] mmap_region+0x258/0x432
    [] do_mmap_pgoff+0x2ac/0x306
    [] sys_mmap_pgoff+0x118/0x16a
    [] sys_mmap+0x22/0x24
    [] system_call_fastpath+0x16/0x1b

    -> #0 (&mm->mmap_sem){++++++}:
    [] __lock_acquire+0xa1a/0xcf7
    [] lock_acquire+0xbf/0x103
    [] might_fault+0x89/0xac
    [] filldir+0x6f/0xc7
    [] dcache_readdir+0x67/0x205
    [] vfs_readdir+0x7b/0xb4
    [] sys_getdents+0x7e/0xd1
    [] system_call_fastpath+0x16/0x1b

    This patch moves the directory vs file lockdep annotation into a helper
    function that can be called by in-memory filesystems and has hugetlbfs
    call it.

    Signed-off-by: Josh Boyer
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Josh Boyer
     

07 Aug, 2011

1 commit

  • The inode structure layout is largely random, and some of the vfs paths
    really do care. The path lookup in particular is already quite D$
    intensive, and profiles show that accessing the 'inode->i_op->xyz'
    fields is quite costly.

    We already optimized the dcache to not unnecessarily load the d_op
    structure for members that are often NULL using the DCACHE_OP_xyz bits
    in dentry->d_flags, and this does something very similar for the inode
    ops that are used during pathname lookup.

    It also re-orders the fields so that the fields accessed by 'stat' are
    together at the beginning of the inode structure, and roughly in the
    order accessed.

    The effect of this seems to be in the 1-2% range for an empty kernel
    "make -j" run (which is fairly kernel-intensive, mostly in filename
    lookup), so it's visible. The numbers are fairly noisy, though, and
    likely depend a lot on exact microarchitecture. So there's more tuning
    to be done.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Aug, 2011

3 commits


27 Jul, 2011

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    merge fchmod() and fchmodat() guts, kill ancient broken kludge
    xfs: fix misspelled S_IS...()
    xfs: get rid of open-coded S_ISREG(), etc.
    vfs: document locking requirements for d_move, __d_move and d_materialise_unique
    omfs: fix (mode & S_IFDIR) abuse
    btrfs: S_ISREG(mode) is not mode & S_IFREG...
    ima: fmode_t misspelled as mode_t...
    pci-label.c: size_t misspelled as mode_t
    jffs2: S_ISLNK(mode & S_IFMT) is pointless
    snd_msnd ->mode is fmode_t, not mode_t
    v9fs_iop_get_acl: get rid of unused variable
    vfs: dont chain pipe/anon/socket on superblock s_inodes list
    Documentation: Exporting: update description of d_splice_alias
    fs: add missing unlock in default_llseek()

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     
  • Workloads using pipes and sockets hit inode_sb_list_lock contention.

    superblock s_inodes list is needed for quota, dirty, pagecache and
    fsnotify management. pipe/anon/socket fs are clearly not candidates for
    these.

    Signed-off-by: Eric Dumazet
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Eric Dumazet
     

21 Jul, 2011

3 commits

  • i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and it's write side is always mirrored by
    real exclusion. It's intended use it to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
    simply fall way
    - the reader side is replaced by an i_dio_count member in struct inode
    that counts the number of pending direct I/O requests. Truncate can't
    proceed as long as it's non-zero
    - when i_dio_count reaches non-zero we wake up a pending truncate using
    wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
    it to read zero because the direct I/O count always needs i_mutex
    (or an equivalent like XFS's i_iolock) for starting a new operation.

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that we have per-sb shrinkers with a lifecycle that is a subset
    of the superblock lifecycle and can reliably detect a filesystem
    being unmounted, there is not longer any race condition for the
    iprune_sem to protect against. Hence we can remove it.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • With context based shrinkers, we can implement a per-superblock
    shrinker that shrinks the caches attached to the superblock. We
    currently have global shrinkers for the inode and dentry caches that
    split up into per-superblock operations via a coarse proportioning
    method that does not batch very well. The global shrinkers also
    have a dependency - dentries pin inodes - so we have to be very
    careful about how we register the global shrinkers so that the
    implicit call order is always correct.

    With a per-sb shrinker callout, we can encode this dependency
    directly into the per-sb shrinker, hence avoiding the need for
    strictly ordering shrinker registrations. We also have no need for
    any proportioning code for the shrinker subsystem already provides
    this functionality across all shrinkers. Allowing the shrinker to
    operate on a single superblock at a time means that we do less
    superblock list traversals and locking and reclaim should batch more
    effectively. This should result in less CPU overhead for reclaim and
    potentially faster reclaim of items from each filesystem.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

20 Jul, 2011

4 commits

  • With the inode LRUs moving to per-sb structures, there is no longer
    a need for a global inode_lru_lock. The locking can be made more
    fine-grained by moving to a per-sb LRU lock, isolating the LRU
    operations of different filesytsems completely from each other.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • The inode unused list is currently a global LRU. This does not match
    the other global filesystem cache - the dentry cache - which uses
    per-superblock LRU lists. Hence we have related filesystem object
    types using different LRU reclaimation schemes.

    To enable a per-superblock filesystem cache shrinker, both of these
    caches need to have per-sb unused object LRU lists. Hence this patch
    converts the global inode LRU to per-sb LRUs.

    The patch only does rudimentary per-sb propotioning in the shrinker
    infrastructure, as this gets removed when the per-sb shrinker
    callouts are introduced later on.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Before we split up the inode_lru_lock, the unused inode counter
    needs to be made independent of the global inode_lru_lock. Convert
    it to per-cpu counters to do this.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • never is...

    Signed-off-by: Al Viro

    Al Viro
     

28 Jun, 2011

1 commit

  • Under heavy memory and filesystem load, users observe the assertion
    mapping->nrpages == 0 in end_writeback() trigger. This can be caused by
    page reclaim reclaiming the last page from a mapping in the following
    race:

    CPU0 CPU1
    ...
    shrink_page_list()
    __remove_mapping()
    __delete_from_page_cache()
    radix_tree_delete()
    evict_inode()
    truncate_inode_pages()
    truncate_inode_pages_range()
    pagevec_lookup() - finds nothing
    end_writeback()
    mapping->nrpages != 0 -> BUG
    page->mapping = NULL
    mapping->nrpages--

    Fix the problem by doing a reliable check of mapping->nrpages under
    mapping->tree_lock in end_writeback().

    Analyzed by Jay , lost in LKML, and dug out
    by Miklos Szeredi .

    Cc: Jay
    Cc: Miklos Szeredi
    Signed-off-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

08 Jun, 2011

1 commit

  • Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contentions to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    vanilla 2.6.39-rc3:
    inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:
    &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
    ------------------------
    &(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
    &(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
    &(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
    ------------------------
    &(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
    &(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
    &(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

    hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
    akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Wu Fengguang

    Christoph Hellwig
     

27 May, 2011

1 commit

  • Move the lock order description after all the includes, remove several
    fairly outdated and/or incorrect comments, move Andrea's
    copyright/changelog to the top where it belongs, remove the pointless
    filename in the top of the file comment, and remove to useless macros.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig