12 Dec, 2012

1 commit

  • Overhaul struct address_space.assoc_mapping renaming it to
    address_space.private_data and its type is redefined to void*. By this
    approach we consistently name the .private_* elements from struct
    address_space as well as allow extended usage for address_space
    association with other data structures through ->private_data.

    Also, all users of old ->assoc_mapping element are converted to reflect
    its new name and type change (->private_data).

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

27 Nov, 2012

1 commit

  • Commit 169ebd90131b ("writeback: Avoid iput() from flusher thread")
    removed iget-iput pair from inode writeback. As a side effect, inodes
    that are dirty during iput_final() call won't be ever added to inode LRU
    (iput_final() doesn't add dirty inodes to LRU and later when the inode
    is cleaned there's noone to add the inode there). Thus inodes are
    effectively unreclaimable until someone looks them up again.

    The practical effect of this bug is limited by the fact that inodes are
    pinned by a dentry for long enough that the inode gets cleaned. But
    still the bug can have nasty consequences leading up to OOM conditions
    under certain circumstances. Following can easily reproduce the
    problem:

    for (( i = 0; i < 1000; i++ )); do
    mkdir $i
    for (( j = 0; j < 1000; j++ )); do
    touch $i/$j
    echo 2 > /proc/sys/vm/drop_caches
    done
    done

    then one needs to run 'sync; ls -lR' to make inodes reclaimable again.

    We fix the issue by inserting unused clean inodes into the LRU after
    writeback finishes in inode_sync_complete().

    Signed-off-by: Jan Kara
    Reported-by: OGAWA Hirofumi
    Cc: Al Viro
    Cc: OGAWA Hirofumi
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

09 Oct, 2012

1 commit

  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

31 Jul, 2012

2 commits

  • It is unexpected to block reading of frozen filesystem because of atime update.
    Also handling blocking on frozen filesystem because of atime update would make
    locking more complex than it already is. So just skip atime update when
    filesystem is frozen like we skip it when filesystem is remounted read-only.

    BugLink: https://bugs.launchpad.net/bugs/897421
    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Most of places where we want freeze protection coincides with the places where
    we also have remount-ro protection. So make mnt_want_write() and
    mnt_drop_write() (and their _file alternative) prevent freezing as well.
    For the few cases that are really interested only in remount-ro protection
    provide new function variants.

    BugLink: https://bugs.launchpad.net/bugs/897421
    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

27 Jul, 2012

1 commit

  • Pull large btrfs update from Chris Mason:
    "This pull request is very large, and the two main features in here
    have been under testing/devel for quite a while.

    We have subvolume quotas from the strato developers. This enables
    full tracking of how many blocks are allocated to each subvolume (and
    all snapshots) and you can set limits on a per-subvolume basis. You
    can also create quota groups and toss multiple subvolumes into a big
    group. It's everything you need to be a web hosting company and give
    each user their own subvolume.

    The userland side of the quotas is being refreshed, they'll send out
    details on where to grab it soon.

    Next is the kernel side of btrfs send/receive from Alexander Block.
    This leverages the same infrastructure as the quota code to figure out
    relationships between blocks and their owners. It can then compute
    the difference between two snapshots and sends the diffs in a neutral
    format into userland.

    The basic model:

    create a snapshot
    send that snapshot as the initial backup
    make changes
    create a second snapshot
    send the incremental as a backup
    delete the first snapshot
    (use the second snapshot for the next incremental)

    The receive portion is all in userland, and in the 'next' branch of my
    btrfs-progs repo.

    There's still some work to do in terms of optimizing the send side
    from kernel to userland. The really important part is figuring out
    how two snapshots are different, and this is where we are
    concentrating right now. The initial send of a dataset is a little
    slower than tar, but the incremental sends are dramatically faster
    than what rsync can do.

    On top of all of that, we have a nice queue of fixes, cleanups and
    optimizations."

    Fix up trivial modify/del conflict in fs/btrfs/ioctl.c

    Also fix up semantic conflict in fs/btrfs/send.c: the interface to
    dentry_open() changed in commit 765927b2d508 ("switch dentry_open() to
    struct path, make it grab references itself"), and since it now grabs
    whatever references it needs, we should no longer do the mntget() on the
    mnt (and we need to dput() the dentry reference we took).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (65 commits)
    Btrfs: uninit variable fixes in send/receive
    Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive
    Btrfs: add btrfs_compare_trees function
    Btrfs: introduce subvol uuids and times
    Btrfs: make iref_to_path non static
    Btrfs: add a barrier before a waitqueue_active check
    Btrfs: call the ordered free operation without any locks held
    Btrfs: Check INCOMPAT flags on remount and add helper function
    Btrfs: add helper for tree enumeration
    btrfs: allow cross-subvolume file clone
    Btrfs: improve multi-thread buffer read
    Btrfs: make btrfs's allocation smoothly with preallocation
    Btrfs: lock the transition from dirty to writeback for an eb
    Btrfs: fix potential race in extent buffer freeing
    Btrfs: don't return true in releasepage unless we actually freed the eb
    Btrfs: suppress printk() if all device I/O stats are zero
    Btrfs: remove unwanted printk() for btrfs device I/O stats
    Btrfs: rewrite BTRFS_SETGET_FUNCS
    Btrfs: zero unused bytes in inode item
    Btrfs: kill free_space pointer from inode structure
    ...

    Conflicts:
    fs/btrfs/ioctl.c

    Linus Torvalds
     

24 Jul, 2012

1 commit

  • Before the update_time inode operation was indroduced, it was
    not possible to prevent updates of atime on RO subvolumes. VFS
    was only able to check for RO on the mount, but did not know
    anything about btrfs subvolumes.

    btrfs_update_time does now check if the root is RO and skip
    updating of times.

    Signed-off-by: Alexander Block

    Alexander Block
     

14 Jul, 2012

1 commit


02 Jun, 2012

2 commits

  • Pull vfs changes from Al Viro.
    "A lot of misc stuff. The obvious groups:
    * Miklos' atomic_open series; kills the damn abuse of
    ->d_revalidate() by NFS, which was the major stumbling block for
    all work in that area.
    * ripping security_file_mmap() and dealing with deadlocks in the
    area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
    general.
    * ->encode_fh() switched to saner API; insane fake dentry in
    mm/cleancache.c gone.
    * assorted annotations in fs (endianness, __user)
    * parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
    * ->update_time() work from Josef.
    * other bits and pieces all over the place.

    Normally it would've been in two or three pull requests, but
    signal.git stuff had eaten a lot of time during this cycle ;-/"

    Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
    'truncate_range' inode method was removed by the VM changes, the VFS
    update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
    to sparse fix added twice, with other changes nearby).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
    nfs: don't open in ->d_revalidate
    vfs: retry last component if opening stale dentry
    vfs: nameidata_to_filp(): don't throw away file on error
    vfs: nameidata_to_filp(): inline __dentry_open()
    vfs: do_dentry_open(): don't put filp
    vfs: split __dentry_open()
    vfs: do_last() common post lookup
    vfs: do_last(): add audit_inode before open
    vfs: do_last(): only return EISDIR for O_CREAT
    vfs: do_last(): check LOOKUP_DIRECTORY
    vfs: do_last(): make ENOENT exit RCU safe
    vfs: make follow_link check RCU safe
    vfs: do_last(): use inode variable
    vfs: do_last(): inline walk_component()
    vfs: do_last(): make exit RCU safe
    vfs: split do_lookup()
    Btrfs: move over to use ->update_time
    fs: introduce inode operation ->update_time
    reiserfs: get rid of resierfs_sync_super
    reiserfs: mark the superblock as dirty a bit later
    ...

    Linus Torvalds
     
  • Btrfs has to make sure we have space to allocate new blocks in order to modify
    the inode, so updating time can fail. We've gotten around this by having our
    own file_update_time but this is kind of a pain, and Christoph has indicated he
    would like to make xfs do something different with atime updates. So introduce
    ->update_time, where we will deal with i_version an a/m/c time updates and
    indicate which changes need to be made. The normal version just does what it
    has always done, updates the time and marks the inode dirty, and then
    filesystems can choose to do something different.

    I've gone through all of the users of file_update_time and made them check for
    errors with the exception of the fault code since it's complicated and I wasn't
    quite sure what to do there, also Jan is going to be pushing the file time
    updates into page_mkwrite for those who have it so that should satisfy btrfs and
    make it not a big deal to check the file_update_time() return code in the
    generic fault path. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

01 Jun, 2012

1 commit


31 May, 2012

1 commit


30 May, 2012

1 commit

  • Fix kernel-doc warnings in fs/inode.c:

    Warning(fs/inode.c:1493): No description found for parameter 'path'
    Warning(fs/inode.c:1493): Excess function parameter 'mnt' description in 'touch_atime'
    Warning(fs/inode.c:1493): Excess function parameter 'dentry' description in 'touch_atime'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Al Viro

    Randy Dunlap
     

29 May, 2012

1 commit

  • Pull writeback tree from Wu Fengguang:
    "Mainly from Jan Kara to avoid iput() in the flusher threads."

    * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Avoid iput() from flusher thread
    vfs: Rename end_writeback() to clear_inode()
    vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
    writeback: Refactor writeback_single_inode()
    writeback: Remove wb->list_lock from writeback_single_inode()
    writeback: Separate inode requeueing after writeback
    writeback: Move I_DIRTY_PAGES handling
    writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
    writeback: Move clearing of I_SYNC into inode_sync_complete()
    writeback: initialize global_dirty_limit
    fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
    mm: page-writeback.c: local functions should not be exposed globally

    Linus Torvalds
     

25 May, 2012

1 commit

  • Pull more networking updates from David Miller:
    "Ok, everything from here on out will be bug fixes."

    1) One final sync of wireless and bluetooth stuff from John Linville.
    These changes have all been in his tree for more than a week, and
    therefore have had the necessary -next exposure. John was just away
    on a trip and didn't have a change to send the pull request until a
    day or two ago.

    2) Put back some defines in user exposed header file areas that were
    removed during the tokenring purge. From Stephen Hemminger and Paul
    Gortmaker.

    3) A bug fix for UDP hash table allocation got lost in the pile due to
    one of those "you got it.. no I've got it.." situations. :-)

    From Tim Bird.

    4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll
    try to coalesce overlapping frags and crash. Fix from Eric Dumazet.

    5) RCU routing table lookups can race with free_fib_info(), causing
    crashes when we deref the device pointers in the route. Fix by
    releasing the net device in the RCU callback. From Yanmin Zhang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits)
    tcp: take care of overlaps in tcp_try_coalesce()
    ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
    mm: add a low limit to alloc_large_system_hash
    ipx: restore token ring define to include/linux/ipx.h
    if: restore token ring ARP type to header
    xen: do not disable netfront in dom0
    phy/micrel: Fix ID of KSZ9021
    mISDN: Add X-Tensions USB ISDN TA XC-525
    gianfar:don't add FCB length to hard_header_len
    Bluetooth: Report proper error number in disconnection
    Bluetooth: Create flags for bt_sk()
    Bluetooth: report the right security level in getsockopt
    Bluetooth: Lock the L2CAP channel when sending
    Bluetooth: Restore locking semantics when looking up L2CAP channels
    Bluetooth: Fix a redundant and problematic incoming MTU check
    Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C
    Bluetooth: Fix EIR data generation for mgmt_device_found
    Bluetooth: Fix Inquiry with RSSI event mask
    Bluetooth: improve readability of l2cap_seq_list code
    Bluetooth: Fix skb length calculation
    ...

    Linus Torvalds
     

24 May, 2012

1 commit

  • UDP stack needs a minimum hash size value for proper operation and also
    uses alloc_large_system_hash() for proper NUMA distribution of its hash
    tables and automatic sizing depending on available system memory.

    On some low memory situations, udp_table_init() must ignore the
    alloc_large_system_hash() result and reallocs a bigger memory area.

    As we cannot easily free old hash table, we leak it and kmemleak can
    issue a warning.

    This patch adds a low limit parameter to alloc_large_system_hash() to
    solve this problem.

    We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
    allocation.

    Reported-by: Mark Asselstine
    Reported-by: Tim Bird
    Signed-off-by: Eric Dumazet
    Cc: Paul Gortmaker
    Signed-off-by: David S. Miller

    Tim Bird
     

06 May, 2012

3 commits

  • Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
    because iput() can do a lot of work - for example truncate the inode if it's
    the last iput on unlinked file. Some filesystems depend on flusher thread
    progressing (e.g. because they need to flush delay allocated blocks to reduce
    allocation uncertainty) and so flusher thread doing truncate creates
    interesting dependencies and possibilities for deadlocks.

    We get rid of iput() in flusher thread by using the fact that I_SYNC inode
    flag effectively pins the inode in memory. So if we take care to either hold
    i_lock or have I_SYNC set, we can get away without taking inode reference
    in writeback_sb_inodes().

    As a side effect of these changes, we also fix possible use-after-free in
    wb_writeback() because inode_wait_for_writeback() call could try to reacquire
    i_lock on the inode that was already free.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • After we moved inode_sync_wait() from end_writeback() it doesn't make sense
    to call the function end_writeback() anymore. Rename it to clear_inode()
    which well says what the function really does - set I_CLEAR flag.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • Currently, I_SYNC can never be set when evict_inode() (and thus
    end_writeback()) is called because flusher thread holds inode reference while
    inode is under writeback. As a result inode_sync_wait() in those places
    currently does nothing. However that is going to change and unveils problems
    with calling inode_sync_wait() from end_writeback(). Several filesystems call
    end_writeback() after they have deleted the inode (btrfs, gfs2, ...) and other
    filesystems (ext3, ext4, reiserfs, ...) can deadlock when waiting for I_SYNC
    because they call end_writeback() from within a transaction.

    To avoid these issues, we move inode_sync_wait() into evict_inode() before
    calling ->evict_inode(). That way we preserve the current property that
    ->evict_inode() and writeback never run in parallel and all filesystems are
    safe.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

03 May, 2012

1 commit

  • The conversion of all of the users is not done yet there are too many to change
    in one go and leave the code reviewable. For now I change just the header and
    a few trivial users and rely on CONFIG_UIDGID_STRICT_TYPE_CHECKS not being set
    to ensure that the code will still compile during the transition.

    Helper functions i_uid_read, i_uid_write, i_gid_read, i_gid_write are added
    so that in most cases filesystems can avoid the complexities of multiple user
    namespaces and can concentrate on moving their raw numeric values into and
    out of the vfs data structures.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

08 Apr, 2012

1 commit

  • This represents a change in strategy of how to handle user namespaces.
    Instead of tagging everything explicitly with a user namespace and bulking
    up all of the comparisons of uids and gids in the kernel, all uids and gids
    in use will have a mapping to a flat kuid and kgid spaces respectively. This
    allows much more of the existing logic to be preserved and in general
    allows for faster code.

    In this new and improved world we allow someone to utiliize capabilities
    over an inode if the inodes owner mapps into the capabilities holders user
    namespace and the user has capabilities in their user namespace. Which
    is simple and efficient.

    Moving the fs uid comparisons to be comparisons in a flat kuid space
    follows in later patches, something that is only significant if you
    are using user namespaces.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

21 Mar, 2012

3 commits


11 Mar, 2012

2 commits

  • wait_on_inode() doesn't have ->i_lock

    Signed-off-by: Al Viro

    Al Viro
     
  • 9a7aa12f3911853a introduced additional logic around setting the i_mutex
    lockdep class for directory inodes. The idea was that some filesystems
    may want their own special lockdep class for different directory
    inodes and calling unlock_new_inode() should not clobber one of
    those special classes.

    I believe that the added conditional, around the *negated* return value
    of lockdep_match_class(), caused directory inodes to be placed in the
    wrong lockdep class.

    inode_init_always() sets the i_mutex lockdep class with i_mutex_key for
    all inodes. If the filesystem did not change the class during inode
    initialization, then the conditional mentioned above was false and the
    directory inode was incorrectly left in the non-directory lockdep class.
    If the filesystem did set a special lockdep class, then the conditional
    mentioned above was true and that class was clobbered with
    i_mutex_dir_key.

    This patch removes the negation from the conditional so that the i_mutex
    lockdep class is properly set for directory inodes. Special classes are
    preserved and directory inodes with unmodified classes are set with
    i_mutex_dir_key.

    Signed-off-by: Tyler Hicks
    Reviewed-by: Jan Kara
    Signed-off-by: Al Viro

    Tyler Hicks
     

14 Feb, 2012

1 commit

  • When the number of dentry cache hash table entries gets too high
    (2147483648 entries), as happens by default on a 16TB system, use of a
    signed integer in the dcache_init() initialization loop prevents the
    dentry_hashtable from getting initialized, causing a panic in
    __d_lookup(). Fix this in dcache_init() and similar areas.

    Signed-off-by: Dimitri Sivanich
    Acked-by: David S. Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dimitri Sivanich
     

18 Jan, 2012

1 commit


11 Jan, 2012

1 commit


07 Jan, 2012

1 commit

  • Add a new counter to the superblock that keeps track of unlinked but
    not yet deleted inodes.

    Do not WARN_ON if set_nlink is called with zero count, just do a
    ratelimited printk. This happens on xfs and probably other
    filesystems after an unclean shutdown when the filesystem reads inodes
    which already have zero i_nlink. Reported by Christoph Hellwig.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Al Viro

    Miklos Szeredi
     

04 Jan, 2012

3 commits


02 Nov, 2011

1 commit

  • Prevent direct modification of i_nlink by making it const and adding a
    non-const __i_nlink alias.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Christoph Hellwig

    Miklos Szeredi
     

28 Oct, 2011

1 commit


26 Aug, 2011

1 commit

  • Purely in-memory filesystems do not use the inode hash as the dcache
    tells us if an entry already exists. As a result, they do not call
    unlock_new_inode, and thus directory inodes do not get put into a
    different lockdep class for i_sem.

    We need the different lockdep classes, because the locking order for
    i_mutex is different for directory inodes and regular inodes. Directory
    inodes can do "readdir()", which takes i_mutex *before* possibly taking
    mm->mmap_sem (due to a page fault while copying the directory entry to
    user space).

    In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
    before accessing i_mutex.

    The two cases can never happen for the same inode, so no real deadlock
    can occur, but without the different lockdep classes, lockdep cannot
    understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
    can lead to false positives from lockdep like below:

    find/645 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x5c/0xac

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#15){+.+.+.}, at: []
    vfs_readdir+0x5b/0xb4

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
    [] lock_acquire+0xbf/0x103
    [] __mutex_lock_common+0x4c/0x361
    [] mutex_lock_nested+0x40/0x45
    [] hugetlbfs_file_mmap+0x82/0x110
    [] mmap_region+0x258/0x432
    [] do_mmap_pgoff+0x2ac/0x306
    [] sys_mmap_pgoff+0x118/0x16a
    [] sys_mmap+0x22/0x24
    [] system_call_fastpath+0x16/0x1b

    -> #0 (&mm->mmap_sem){++++++}:
    [] __lock_acquire+0xa1a/0xcf7
    [] lock_acquire+0xbf/0x103
    [] might_fault+0x89/0xac
    [] filldir+0x6f/0xc7
    [] dcache_readdir+0x67/0x205
    [] vfs_readdir+0x7b/0xb4
    [] sys_getdents+0x7e/0xd1
    [] system_call_fastpath+0x16/0x1b

    This patch moves the directory vs file lockdep annotation into a helper
    function that can be called by in-memory filesystems and has hugetlbfs
    call it.

    Signed-off-by: Josh Boyer
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Josh Boyer
     

07 Aug, 2011

1 commit

  • The inode structure layout is largely random, and some of the vfs paths
    really do care. The path lookup in particular is already quite D$
    intensive, and profiles show that accessing the 'inode->i_op->xyz'
    fields is quite costly.

    We already optimized the dcache to not unnecessarily load the d_op
    structure for members that are often NULL using the DCACHE_OP_xyz bits
    in dentry->d_flags, and this does something very similar for the inode
    ops that are used during pathname lookup.

    It also re-orders the fields so that the fields accessed by 'stat' are
    together at the beginning of the inode structure, and roughly in the
    order accessed.

    The effect of this seems to be in the 1-2% range for an empty kernel
    "make -j" run (which is fairly kernel-intensive, mostly in filename
    lookup), so it's visible. The numbers are fairly noisy, though, and
    likely depend a lot on exact microarchitecture. So there's more tuning
    to be done.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Aug, 2011

2 commits