17 Feb, 2016

1 commit

  • Inode struct members that track cgroup writeback information
    should be reinitialized when an inode is allocated from the
    kmem_cache. Otherwise, stale values left by the previous owner
    remain and get used by the new inode.

    Signed-off-by: Tahsin Erdogan
    Acked-by: Tejun Heo
    Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

24 Jan, 2016

1 commit

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     

23 Jan, 2016

2 commits

  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.
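
    As an editorial illustration (not code from this patch), "exceptional"
    entries are distinguished by a low tag bit, so a page-cache lookup has
    to check for them before treating the result as a struct page pointer:

    void *entry = radix_tree_lookup(&mapping->page_tree, index);

    if (radix_tree_exceptional_entry(entry)) {
            /* shadow entry, shmem swap entry, or (with this patch) a DAX entry */
    } else {
            struct page *page = entry;      /* a regular page cache page */
    }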

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add wrappers parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}:
    inode_foo(inode) is mutex_foo(&inode->i_mutex).

    Please use these for access to ->i_mutex; over the coming cycle
    ->i_mutex will become a rwsem, with ->lookup() done with it held
    only shared.
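
    As a rough sketch, the wrappers amount to the following (the patch also
    adds inode_trylock(), inode_is_locked() and inode_lock_nested() in the
    same style):

    static inline void inode_lock(struct inode *inode)
    {
            mutex_lock(&inode->i_mutex);
    }

    static inline void inode_unlock(struct inode *inode)
    {
            mutex_unlock(&inode->i_mutex);
    }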

    Signed-off-by: Al Viro

    Al Viro
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems override the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to the "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not, in fact, account
    everything).
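
    For illustration (a sketch; the "foo" names are placeholders), marking an
    allocation as accounted is a one-flag change:

    /* slab cache whose objects get charged to the allocating task's memcg */
    foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
                                   SLAB_ACCOUNT, NULL);

    /* one-off allocation charged the same way */
    obj = kmalloc(sizeof(*obj), GFP_KERNEL | __GFP_ACCOUNT);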

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Jan, 2016

1 commit

  • Pull file locking updates from Jeff Layton:
    "File locking related changes for v4.5 (pile #1)

    Highlights:
    - new Kconfig option to allow disabling mandatory locking (which is
    racy anyway)
    - new tracepoints for setlk and close codepaths
    - fix for a long-standing bug in code that handles races between
    setting a POSIX lock and close()"

    * tag 'locks-v4.5-1' of git://git.samba.org/jlayton/linux:
    locks: rename __posix_lock_file to posix_lock_inode
    locks: prink more detail when there are leaked locks
    locks: pass inode pointer to locks_free_lock_context
    locks: sprinkle some tracepoints around the file locking code
    locks: don't check for race with close when setting OFD lock
    locks: fix unlock when fcntl_setlk races with a close
    fs: make locks.c explicitly non-modular
    locks: use list_first_entry_or_null()
    locks: Don't allow mounts in user namespaces to enable mandatory locking
    locks: Allow disabling mandatory locking at compile time

    Linus Torvalds
     

09 Dec, 2015

1 commit

  • kmap() in page_follow_link_light() needed to go - allowing an
    arbitrary number of kmaps to be held for a long time is a great way
    to deadlock the system.

    A new helper, inode_nohighmem(inode), needs to be used for pagecache
    symlink inodes; this is done for all in-tree cases, and
    page_follow_link_light() is instrumented to yell about anything missed.
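
    In a filesystem that creates pagecache symlinks, the call site looks
    roughly like this (sketch; "foo_aops" is a placeholder):

    inode->i_op = &page_symlink_inode_operations;
    inode_nohighmem(inode);         /* keep the symlink body out of highmem */
    inode->i_mapping->a_ops = &foo_aops;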

    Signed-off-by: Al Viro

    Al Viro
     

19 Aug, 2015

1 commit

  • On a box with a lot of RAM (148GB) I can make the box soft lockup after running
    an fs_mark job that creates hundreds of millions of empty files. This is
    because we never generate enough memory pressure to keep the number of inodes on
    our unused list low, so when we go to unmount we have to evict ~100 million
    inodes. This makes one processor a very unhappy person, so add a cond_resched()
    in dispose_list(), and if we need a resched while processing the s_inodes list,
    do that and then run dispose_list() on what we have culled so far.
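
    The shape of the dispose_list() part of the fix is roughly (sketch):

    while (!list_empty(head)) {
            struct inode *inode = list_first_entry(head, struct inode, i_lru);

            list_del_init(&inode->i_lru);
            evict(inode);
            cond_resched();         /* keep one CPU from grinding through millions of inodes */
    }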

    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara

    Josef Bacik
     

18 Aug, 2015

2 commits

  • There's a small consistency problem between the inode and writeback
    naming. Writeback calls the "for IO" inode queues b_io and
    b_more_io, but the inode calls these the "writeback list" or
    i_wb_list. This makes it hard to add a new "under writeback" list to
    the inode, or call it an "under IO" list on the bdi because either
    way we'll have writeback on IO and IO on writeback and it'll just be
    confusing. I'm getting confused just writing this!

    So, rename the inode "for IO" list variable to i_io_list so we can
    add a new "writeback list" in a subsequent patch.

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     
  • The process of reducing contention on per-superblock inode lists
    starts with moving the locking to match the per-superblock inode
    list. This takes the global lock out of the picture and reduces the
    contention problems to within a single filesystem. This doesn't get
    rid of contention as the locks still have global CPU scope, but it
    does isolate operations on different superblocks from each other.

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

01 Jul, 2015

1 commit

  • Currently, get_next_ino() is able to create inodes with inode number 0.
    This has a bad impact on the filesystems relying on this function to generate
    inode numbers.

    While there is no problem at all in having inodes with number 0, userspace tools
    which handle file management tasks can have problems handling these files -
    for example, users cannot delete them, since glibc will ignore them. So, I
    believe the best way is for the kernel to avoid creating them.

    This problem has been raised previously, but the old thread didn't have any
    update for over a year, and I've seen too many users hitting the same issue
    regarding the impossibility of deleting files while using filesystems relying on
    this function. So, I'm starting the thread again, with the same patch
    that I believe is enough to address this problem.
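
    The fix boils down to never handing out 0, roughly (sketch; the real
    get_next_ino() batches numbers per cpu, and "next_ino" here is illustrative):

    static unsigned int next_ino;               /* illustrative shared counter */

    unsigned int get_next_ino_sketch(void)
    {
            unsigned int res = ++next_ino;

            if (unlikely(res == 0))             /* never hand out inode number 0 */
                    res = ++next_ino;
            return res;
    }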

    Signed-off-by: Carlos Maiolino
    Signed-off-by: Al Viro

    Carlos Maiolino
     

26 Jun, 2015

1 commit

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     

24 Jun, 2015

4 commits

  • The comment in include/linux/security.h says that ->inode_killpriv() should
    be called when the setuid bit is being removed and that similar security
    labels (in fact this applies only to file capabilities) should be
    removed at this time as well. However, we don't call ->inode_killpriv()
    when we remove the suid bit on truncate.

    We fix the problem by calling ->inode_need_killpriv() and subsequently
    ->inode_killpriv() on truncate, the same way we do on file write.

    After this patch there's only one user of should_remove_suid() - ocfs2 -
    and indeed it's buggy because it doesn't call ->inode_killpriv() on
    write. However fixing it is difficult because of special locking
    constraints.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Provide function telling whether file_remove_privs() will do anything.
    Currently we only have should_remove_suid() and that does something
    slightly different.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • file_remove_suid() is a misnomer since it also removes file capabilities
    stored in xattrs and sets the S_NOSEC flag. Also, should_remove_suid() tells
    something other than whether a file_remove_suid() call is necessary, which
    leads to bugs.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • file_remove_suid() could mistakenly set the S_NOSEC inode bit when root was
    modifying the file. As a result, subsequent writes to the file by an ordinary
    user would not clear the suid or sgid bits.

    Fix the bug by checking actual mode bits before setting S_NOSEC.
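
    A simplified sketch of the idea (not the literal patch): only cache
    S_NOSEC when the mode really has nothing left to strip:

    umode_t mode = inode->i_mode;

    if (!(mode & S_ISUID) &&
        (mode & (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
            inode->i_flags |= S_NOSEC;  /* future writes may skip the killpriv work */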

    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

02 Jun, 2015

1 commit

  • For the planned cgroup writeback support, on each bdi
    (backing_dev_info), each memcg will be served by a separate wb
    (bdi_writeback). This patch updates bdi so that a bdi can host
    multiple wbs (bdi_writebacks).

    On the default hierarchy, blkcg implicitly enables memcg. This allows
    using memcg's page ownership for attributing writeback IOs, and every
    memcg - blkcg combination can be served by its own wb by assigning a
    dedicated wb to each memcg. This means that there may be multiple
    wb's of a bdi mapped to the same blkcg. As congested state is per
    blkcg - bdi combination, those wb's should share the same congested
    state. This is achieved by tracking congested state via
    bdi_writeback_congested structs which are keyed by blkcg.

    bdi->wb remains unchanged and will keep serving the root cgroup.
    cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
    looked up while dirtying an inode according to the memcg of the page
    being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
    by its memcg id. Once an inode is associated with its wb, it can be
    retrieved using inode_to_wb().

    Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
    pages will keep being associated with bdi->wb.

    v3: inode_attach_wb() in account_page_dirtied() moved inside
    mapping_cap_account_dirty() block where it's known to be !NULL.
    Also, an unnecessary NULL check before kfree() removed. Both
    detected by the kbuild bot.

    v2: Updated so that wb association is per inode and wb is per memcg
    rather than blkcg.

    Signed-off-by: Tejun Heo
    Cc: kbuild test robot
    Cc: Dan Carpenter
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 May, 2015

1 commit

  • touch_atime is not RCU-safe, and so cannot be called on an RCU walk.
    However, in situations where RCU-walk makes a difference, the symlink
    will likely be accessed much more often than it is useful to update
    the atime.

    So split out the test of "Does the atime actually need to be updated"
    into atime_needs_update(), and have get_link() unlazy if it finds that
    it will need to do that update.

    Signed-off-by: NeilBrown
    Signed-off-by: Al Viro

    NeilBrown
     

25 Apr, 2015

1 commit

  • do_blockdev_direct_IO() increments and decrements the inode
    ->i_dio_count for each IO operation. It does this to protect against
    truncate of a file. Block devices don't need this sort of protection.

    For a capable multiqueue setup, this atomic int is the only shared
    state between applications accessing the device for O_DIRECT, and it
    presents a scaling wall for that. In my testing, as much as 30% of
    system time is spent incrementing and decrementing this value. A mixed
    read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
    better latencies too. Before:

    clat percentiles (usec):
    | 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
    | 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
    | 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
    | 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
    | 99.99th=[ 165]

    After:

    clat percentiles (usec):
    | 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
    | 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
    | 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
    | 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
    | 99.99th=[ 438]

    In other setups, Robert Elliott reported seeing good performance
    improvements:

    https://lkml.org/lkml/2015/4/3/557

    The more applications accessing the device, the worse it gets.

    Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
    do_blockdev_direct_IO() that it need not worry about incrementing
    or decrementing the inode i_dio_count for this caller.
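
    Conceptually the opt-out looks like this (sketch, not the literal diff):

    /* in do_blockdev_direct_IO(), before issuing the IO */
    if (!(dio->flags & DIO_SKIP_DIO_COUNT))
            atomic_inc(&inode->i_dio_count);    /* filesystems keep the truncate protection */

    /* ... the matching decrement on completion is guarded the same way */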

    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Theodore Ts'o
    Cc: Elliott, Robert (Server Storage)
    Cc: Al Viro
    Signed-off-by: Jens Axboe
    Signed-off-by: Al Viro

    Jens Axboe
     

13 Feb, 2015

4 commits

  • Merge third set of updates from Andrew Morton:

    - the rest of MM

    [ This includes getting rid of the numa hinting bits, in favor of
    just generic protnone logic. Yay. - Linus ]

    - core kernel

    - procfs

    - some of lib/ (lots of lib/ material this time)

    * emailed patches from Andrew Morton: (104 commits)
    lib/lcm.c: replace include
    lib/percpu_ida.c: remove redundant includes
    lib/strncpy_from_user.c: replace module.h include
    lib/stmp_device.c: replace module.h include
    lib/sort.c: move include inside #if 0
    lib/show_mem.c: remove redundant include
    lib/radix-tree.c: change to simpler include
    lib/plist.c: remove redundant include
    lib/nlattr.c: remove redundant include
    lib/kobject_uevent.c: remove redundant include
    lib/llist.c: remove redundant include
    lib/md5.c: simplify include
    lib/list_sort.c: rearrange includes
    lib/genalloc.c: remove redundant include
    lib/idr.c: remove redundant include
    lib/halfmd4.c: simplify includes
    lib/dynamic_queue_limits.c: simplify includes
    lib/sort.c: use simpler includes
    lib/interval_tree.c: simplify includes
    hexdump: make it return number of bytes placed in buffer
    ...

    Linus Torvalds
     
  • Currently, the isolate callback passed to the list_lru_walk family of
    functions is supposed to just delete an item from the list upon returning
    LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
    __list_lru_walk_one after the callback returns. Since the callback is
    allowed to drop the lock after removing an item (it has to return
    LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
    of elements on the list even if we check them under the lock. This makes
    it difficult to move items from one list_lru_one to another, which is
    required for per-memcg list_lru reparenting - we can't just splice the
    lists, we have to move entries one by one.

    This patch therefore introduces helpers that must be used by callback
    functions to isolate items instead of raw list_del/list_move. These are
    list_lru_isolate and list_lru_isolate_move. They not only remove the
    entry from the list, but also fix the nr_items counter, making sure
    nr_items always reflects the actual number of elements on the list if
    checked under the appropriate lock.
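
    An isolate callback then looks roughly like this (sketch; "foo",
    "lru_node" and can_free() are placeholders):

    static enum lru_status foo_isolate(struct list_head *item,
                                       struct list_lru_one *lru,
                                       spinlock_t *lru_lock, void *arg)
    {
            struct foo *obj = container_of(item, struct foo, lru_node);

            if (!can_free(obj))
                    return LRU_SKIP;

            /* removes the entry AND keeps lru->nr_items accurate, unlike raw list_del() */
            list_lru_isolate(lru, item);
            return LRU_REMOVED;
    }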

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Kmem accounting of memcg is unusable now, because it lacks slab shrinker
    support. That means when we hit the limit we will get ENOMEM w/o any
    chance to recover. What we should do then is to call shrink_slab, which
    would reclaim old inode/dentry caches from this cgroup. This is what
    this patch set is intended to do.

    Basically, it does two things. First, it introduces the notion of
    per-memcg slab shrinker. A shrinker that wants to reclaim objects per
    cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be
    passed the memory cgroup to scan from in shrink_control->memcg. For
    such shrinkers shrink_slab iterates over the whole cgroup subtree under
    the target cgroup and calls the shrinker for each kmem-active memory
    cgroup.

    Secondly, this patch set makes the list_lru structure per-memcg. It's
    done transparently to list_lru users - everything they have to do is to
    tell list_lru_init that they want memcg-aware list_lru. Then the
    list_lru will automatically distribute objects among per-memcg lists
    based on which cgroup the object is accounted to. This way, to make FS
    shrinkers (icache, dcache) memcg-aware we only need to make them use
    memcg-aware list_lru, and this is what this patch set does.

    As before, this patch set only enables per-memcg kmem reclaim when the
    pressure goes from memory.limit, not from memory.kmem.limit. Handling
    memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
    it is still unclear whether we will have this knob in the unified
    hierarchy.

    This patch (of 9):

    NUMA aware slab shrinkers use the list_lru structure to distribute
    objects coming from different NUMA nodes to different lists. Whenever
    such a shrinker needs to count or scan objects from a particular node,
    it issues commands like this:

    count = list_lru_count_node(lru, sc->nid);
    freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                               isolate_arg, &sc->nr_to_scan);

    where sc is an instance of the shrink_control structure passed to it
    from vmscan.

    To simplify this, let's add special list_lru functions to be used by
    shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
    consolidate the nid and nr_to_scan arguments in the shrink_control
    structure.

    This will also allow us to avoid patching shrinkers that use list_lru
    when we make shrink_slab() per-memcg - all we will have to do is extend
    the shrink_control structure to include the target memcg and make
    list_lru_shrink_{count,walk} handle this appropriately.
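
    With the new helpers, the snippet above becomes simply:

    count = list_lru_shrink_count(lru, sc);
    freed = list_lru_shrink_walk(lru, sc, isolate_func, isolate_arg);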

    Signed-off-by: Vladimir Davydov
    Suggested-by: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Pull backing device changes from Jens Axboe:
    "This contains a cleanup of how the backing device is handled, in
    preparation for a rework of the life time rules. In this part, the
    most important change is to split the unrelated nommu mmap flags from
    it, but also removing a backing_dev_info pointer from the
    address_space (and inode), and a cleanup of other various minor bits.

    Christoph did all the work here, I just fixed an oops with pages that
    have a swap backing. Arnd fixed a missing export, and Oleg killed the
    lustre backing_dev_info from staging. Last patch was from Al,
    unexporting parts that are now no longer needed outside"

    * 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
    Make super_blocks and sb_lock static
    mtd: export new mtd_mmap_capabilities
    fs: make inode_to_bdi() handle NULL inode
    staging/lustre/llite: get rid of backing_dev_info
    fs: remove default_backing_dev_info
    fs: don't reassign dirty inodes to default_backing_dev_info
    nfs: don't call bdi_unregister
    ceph: remove call to bdi_unregister
    fs: remove mapping->backing_dev_info
    fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
    nilfs2: set up s_bdi like the generic mount_bdev code
    block_dev: get bdev inode bdi directly from the block device
    block_dev: only write bdev inode on close
    fs: introduce f_op->mmap_capabilities for nommu mmap support
    fs: kill BDI_CAP_SWAP_BACKED
    fs: deduplicate noop_backing_dev_info

    Linus Torvalds
     

11 Feb, 2015

2 commits

  • Merge misc updates from Andrew Morton:
    "Bite-sized chunks this time, to avoid the MTA ratelimiting woes.

    - fs/notify updates

    - ocfs2

    - some of MM"

    That laconic "some MM" is mainly the removal of remap_file_pages(),
    which is a big simplification of the VM, and which gets rid of a *lot*
    of random cruft and special cases because we no longer support the
    non-linear mappings that it used.

    From a user interface perspective, nothing has changed, because the
    remap_file_pages() syscall still exists, it's just done by emulating the
    old behavior by creating a lot of individual small mappings instead of
    one non-linear one.

    The emulation is slower than the old "native" non-linear mappings, but
    nobody really uses or cares about remap_file_pages(), and simplifying
    the VM is a big advantage.

    * emailed patches from Andrew Morton: (78 commits)
    memcg: zap memcg_slab_caches and memcg_slab_mutex
    memcg: zap memcg_name argument of memcg_create_kmem_cache
    memcg: zap __memcg_{charge,uncharge}_slab
    mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check
    mm: hugetlb: fix type of hugetlb_treat_as_movable variable
    mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?
    mm: memory: merge shared-writable dirtying branches in do_wp_page()
    mm: memory: remove ->vm_file check on shared writable vmas
    xtensa: drop _PAGE_FILE and pte_file()-related helpers
    x86: drop _PAGE_FILE and pte_file()-related helpers
    unicore32: drop pte_file()-related helpers
    um: drop _PAGE_FILE and pte_file()-related helpers
    tile: drop pte_file()-related helpers
    sparc: drop pte_file()-related helpers
    sh: drop _PAGE_FILE and pte_file()-related helpers
    score: drop _PAGE_FILE and pte_file()-related helpers
    s390: drop pte_file()-related helpers
    parisc: drop _PAGE_FILE and pte_file()-related helpers
    openrisc: drop _PAGE_FILE and pte_file()-related helpers
    nios2: drop _PAGE_FILE and pte_file()-related helpers
    ...

    Linus Torvalds
     
  • We don't create non-linear mappings anymore. Let's drop code which
    handles them in rmap.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Feb, 2015

2 commits

  • Add a new function find_inode_nowait() which is an even more general
    version of ilookup5_nowait(). It is designed for callers which need
    very fine grained control over when the function is allowed to block
    or increment the inode's reference count.
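
    Its prototype is roughly the following (sketch; see fs/inode.c for the
    authoritative form):

    struct inode *find_inode_nowait(struct super_block *sb,
                                    unsigned long hashval,
                                    int (*match)(struct inode *inode,
                                                 unsigned long hashval,
                                                 void *data),
                                    void *data);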

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Al Viro

    Theodore Ts'o
     
  • Add a new mount option which enables a new "lazytime" mode. This mode
    causes atime, mtime, and ctime updates to only be made to the
    in-memory version of the inode. The on-disk times will only get
    updated when (a) the inode needs to be updated for some non-time-related
    change, (b) userspace calls fsync(), syncfs() or sync(), or (c) just
    before an undeleted inode is evicted from memory.

    This is OK according to POSIX because there are no guarantees after a
    crash unless userspace explicitly requests it via an fsync(2) call.

    For workloads which feature a large number of random writes to a
    preallocated file, the lazytime mount option significantly reduces
    writes to the inode table. The repeated 4k writes to a single block
    will result in undesirable stress on flash devices and SMR disk
    drives. Even on conventional HDDs, the repeated writes to the inode
    table block will trigger Adjacent Track Interference (ATI) remediation
    latencies, which very negatively impact long-tail latencies; that is
    a very big deal for web serving tiers, for example.

    Google-Bug-Id: 18297052

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Al Viro

    Theodore Ts'o
     

21 Jan, 2015

1 commit

  • Now that we never use the backing_dev_info pointer in struct address_space
    we can simply remove it and save 4 to 8 bytes in every inode.

    Signed-off-by: Christoph Hellwig
    Acked-by: Ryusuke Konishi
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

17 Jan, 2015

1 commit

  • The current scheme of using the i_flock list is really difficult to
    manage. There is also a legitimate desire for a per-inode spinlock to
    manage these lists that isn't the i_lock.

    Start conversion to a new scheme to eventually replace the old i_flock
    list with a new "file_lock_context" object.

    We start by adding a new i_flctx to struct inode. For now, it lives in
    parallel with the i_flock list, but will eventually replace it. The idea is
    to allocate a structure to sit in that pointer and act as a locus for
    all things file locking.

    We allocate a file_lock_context for an inode when the first lock is
    added to it, and it's only freed when the inode is freed. We use the
    i_lock to protect the assignment, but afterward it should mostly be
    accessed locklessly.
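
    A simplified sketch of the lazy, i_lock-protected assignment described
    above ("flctx_cache" here stands for the lock-context kmem_cache):

    struct file_lock_context *ctx = inode->i_flctx;

    if (likely(ctx))
            return ctx;

    ctx = kmem_cache_alloc(flctx_cache, GFP_KERNEL);
    if (!ctx)
            return NULL;

    spin_lock(&inode->i_lock);
    if (likely(!inode->i_flctx)) {
            inode->i_flctx = ctx;                   /* first lock on this inode wins */
            ctx = NULL;
    }
    spin_unlock(&inode->i_lock);

    if (ctx)
            kmem_cache_free(flctx_cache, ctx);      /* we lost the race */
    return inode->i_flctx;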

    Signed-off-by: Jeff Layton
    Acked-by: Christoph Hellwig

    Jeff Layton
     

17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     

14 Dec, 2014

1 commit

  • The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
    similar data, one for file backed pages and the other for anon memory. To
    this end, this lock can also be a rwsem. In addition, there are some
    important opportunities to share the lock when there are no tree
    modifications.

    This conversion is straightforward. For now, all users take the write
    lock.

    [sfr@canb.auug.org.au: update fremap.c]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

11 Dec, 2014

1 commit

  • As it is, default ->i_fop has NULL ->open() (along with all other methods).
    The only case where it matters is reopening (via procfs symlink) a file that
    didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned
    to something sane (default would fail on read/write/ioctl/etc.).

    Unfortunately, such a case exists - alloc_file() users, especially
    anon_inode_getfile() ones. There we have tons of opened files of very different
    kinds sharing the same inode. As a result, an attempt to reopen those via
    procfs succeeds and you get a descriptor you can't do anything with.

    Moreover, in case of sockets we set ->i_fop that will only be used
    on such reopen attempts - and put a failing ->open() into it to make sure
    those do not succeed.

    It would be simpler to put such ->open() into default ->i_fop and leave
    it unchanged both for anon inode (as we do anyway) and for socket ones. Result:
    * everything going through do_dentry_open() works as it used to
    * sock_no_open() kludge is gone
    * attempts to reopen anon-inode files fail as they really ought to
    * ditto for aio_private_file()
    * ditto for perfmon - this one actually tried to imitate sock_no_open()
    trick, but failed to set ->i_fop, so in the current tree reopens succeed and
    yield completely useless descriptor. Intent clearly had been to fail with
    -ENXIO on such reopens; now it actually does.
    * everything else that used alloc_file() keeps working - it has ->i_fop
    set for its inodes anyway
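
    The essence of the change, roughly:

    static int no_open(struct inode *inode, struct file *file)
    {
            return -ENXIO;
    }

    const struct file_operations no_open_fops = {
            .open = no_open,
    };

    /* ... and inode initialization sets inode->i_fop = &no_open_fops; */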

    Signed-off-by: Al Viro

    Al Viro
     

09 Aug, 2014

1 commit

  • This patch (of 6):

    The i_mmap_writable field counts existing writable mappings of an
    address_space. To allow drivers to prevent new writable mappings, make
    this counter signed and prevent new writable mappings if it is negative.
    This is modelled after i_writecount and DENYWRITE.

    This will be required by the shmem-sealing infrastructure to prevent any
    new writable mappings after the WRITE seal has been set. In case there
    exists a writable mapping, this operation will fail with EBUSY.

    Note that we rely on the fact that if you already own a writable mapping,
    you can increase the counter without using the helpers. This is the same
    as what we do for i_writecount.
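
    A sketch of the counting scheme (helper names and error codes here are
    illustrative, modelled on the description above):

    /* a new writable mapping fails once the count has been forced negative */
    static inline int mapping_map_writable(struct address_space *mapping)
    {
            return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
                    0 : -EPERM;
    }

    /* denying writable mappings fails while any writable mapping exists */
    static inline int mapping_deny_writable(struct address_space *mapping)
    {
            return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
                    0 : -EBUSY;
    }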

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann