11 Jul, 2015

1 commit

  • commit 2426f3910069ed47c0cc58559a6d088af7920201 upstream.

    file_remove_suid() could mistakenly set the S_NOSEC inode bit when root was
    modifying the file. As a result, subsequent writes to the file by an
    ordinary user would avoid clearing the suid or sgid bits.

    Fix the bug by checking actual mode bits before setting S_NOSEC.
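
    A minimal sketch of the mode-bit test the fix relies on (illustrative
    userspace C, not the kernel code; the in-kernel equivalent is the
    is_sxid()-style check applied before caching S_NOSEC):

    #include <stdbool.h>
    #include <sys/stat.h>

    /* S_NOSEC may only be cached when a later write could never need to
     * clear set-uid/set-gid bits, i.e. when the mode carries neither. */
    static bool mode_is_sxid(mode_t mode)
    {
            return (mode & S_ISUID) ||
                   ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP));
    }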

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

25 Apr, 2015

1 commit

  • do_blockdev_direct_IO() increments and decrements the inode
    ->i_dio_count for each IO operation. It does this to protect against
    truncate of a file. Block devices don't need this sort of protection.

    For a capable multiqueue setup, this atomic int is the only shared
    state between applications accessing the device for O_DIRECT, and it
    presents a scaling wall for that. In my testing, as much as 30% of the
    system time is spent incrementing and decrementing this value. With this
    change, a mixed read/write workload improved from ~2.5M IOPS to ~9.6M
    IOPS, with better latencies too. Before:

    clat percentiles (usec):
    | 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
    | 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
    | 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
    | 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
    | 99.99th=[ 165]

    After:

    clat percentiles (usec):
    | 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
    | 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
    | 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
    | 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
    | 99.99th=[ 438]

    In other setups, Robert Elliott reported seeing good performance
    improvements:

    https://lkml.org/lkml/2015/4/3/557

    The more applications accessing the device, the worse it gets.

    Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
    do_blockdev_direct_IO() that it need not worry about incrementing
    or decrementing the inode i_dio_count for this caller.
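
    A hedged sketch of the guard this adds around the counter updates (the
    flag name is from this patch; the surrounding lines are illustrative of
    fs/direct-io.c rather than quoted from it):

    if (!(dio->flags & DIO_SKIP_DIO_COUNT))
            atomic_inc(&inode->i_dio_count);        /* skipped for block devices */

    /* ... I/O is built and submitted ... */

    if (!(dio->flags & DIO_SKIP_DIO_COUNT))
            atomic_dec(&inode->i_dio_count);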

    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Theodore Ts'o
    Cc: Elliott, Robert (Server Storage)
    Cc: Al Viro
    Signed-off-by: Jens Axboe
    Signed-off-by: Al Viro

    Jens Axboe
     

13 Feb, 2015

4 commits

  • Merge third set of updates from Andrew Morton:

    - the rest of MM

    [ This includes getting rid of the numa hinting bits, in favor of
    just generic protnone logic. Yay. - Linus ]

    - core kernel

    - procfs

    - some of lib/ (lots of lib/ material this time)

    * emailed patches from Andrew Morton: (104 commits)
    lib/lcm.c: replace include
    lib/percpu_ida.c: remove redundant includes
    lib/strncpy_from_user.c: replace module.h include
    lib/stmp_device.c: replace module.h include
    lib/sort.c: move include inside #if 0
    lib/show_mem.c: remove redundant include
    lib/radix-tree.c: change to simpler include
    lib/plist.c: remove redundant include
    lib/nlattr.c: remove redundant include
    lib/kobject_uevent.c: remove redundant include
    lib/llist.c: remove redundant include
    lib/md5.c: simplify include
    lib/list_sort.c: rearrange includes
    lib/genalloc.c: remove redundant include
    lib/idr.c: remove redundant include
    lib/halfmd4.c: simplify includes
    lib/dynamic_queue_limits.c: simplify includes
    lib/sort.c: use simpler includes
    lib/interval_tree.c: simplify includes
    hexdump: make it return number of bytes placed in buffer
    ...

    Linus Torvalds
     
  • Currently, the isolate callback passed to the list_lru_walk family of
    functions is supposed to just delete an item from the list upon returning
    LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
    __list_lru_walk_one after the callback returns. Since the callback is
    allowed to drop the lock after removing an item (it has to return
    LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
    of elements on the list even if we check them under the lock. This makes
    it difficult to move items from one list_lru_one to another, which is
    required for per-memcg list_lru reparenting - we can't just splice the
    lists, we have to move entries one by one.

    This patch therefore introduces helpers that must be used by callback
    functions to isolate items instead of raw list_del/list_move. These are
    list_lru_isolate and list_lru_isolate_move. They not only remove the
    entry from the list, but also fix the nr_items counter, making sure
    nr_items always reflects the actual number of elements on the list if
    checked under the appropriate lock.
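
    A hedged sketch of an isolate callback using the new helpers (the
    callback shape follows this description, but see include/linux/list_lru.h
    for the authoritative prototype; want_to_free() and the dispose list
    passed via arg are illustrative, not part of the API):

    static enum lru_status my_isolate(struct list_head *item,
                                      struct list_lru_one *lru,
                                      spinlock_t *lru_lock, void *arg)
    {
            struct list_head *dispose = arg;

            /* Remove the entry and fix lru->nr_items in one step; a raw
             * list_del()/list_move() would leave the counter stale. */
            if (want_to_free(item))
                    list_lru_isolate(lru, item);
            else
                    list_lru_isolate_move(lru, item, dispose);
            return LRU_REMOVED;
    }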

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Kmem accounting of memcg is unusable now, because it lacks slab shrinker
    support. That means when we hit the limit we will get ENOMEM w/o any
    chance to recover. What we should do then is to call shrink_slab, which
    would reclaim old inode/dentry caches from this cgroup. This is what
    this patch set is intended to do.

    Basically, it does two things. First, it introduces the notion of
    per-memcg slab shrinker. A shrinker that wants to reclaim objects per
    cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be
    passed the memory cgroup to scan from in shrink_control->memcg. For
    such shrinkers shrink_slab iterates over the whole cgroup subtree under
    the target cgroup and calls the shrinker for each kmem-active memory
    cgroup.

    Secondly, this patch set makes the list_lru structure per-memcg. It's
    done transparently to list_lru users - everything they have to do is to
    tell list_lru_init that they want memcg-aware list_lru. Then the
    list_lru will automatically distribute objects among per-memcg lists
    based on which cgroup the object is accounted to. This way, to make FS
    shrinkers (icache, dcache) memcg-aware we only need to make them use a
    memcg-aware list_lru, and this is what this patch set does.

    As before, this patch set only enables per-memcg kmem reclaim when the
    pressure goes from memory.limit, not from memory.kmem.limit. Handling
    memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
    it is still unclear whether we will have this knob in the unified
    hierarchy.

    This patch (of 9):

    NUMA aware slab shrinkers use the list_lru structure to distribute
    objects coming from different NUMA nodes to different lists. Whenever
    such a shrinker needs to count or scan objects from a particular node,
    it issues commands like this:

    count = list_lru_count_node(lru, sc->nid);
    freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                               isolate_arg, &sc->nr_to_scan);

    where sc is an instance of the shrink_control structure passed to it
    from vmscan.

    To simplify this, let's add special list_lru functions to be used by
    shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
    consolidate the nid and nr_to_scan arguments in the shrink_control
    structure.

    This will also allow us to avoid patching shrinkers that use list_lru
    when we make shrink_slab() per-memcg - all we will have to do is extend
    the shrink_control structure to include the target memcg and make
    list_lru_shrink_{count,walk} handle this appropriately.
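
    With the new helpers the calls above reduce to (a sketch of the
    consolidated form this patch describes):

    count = list_lru_shrink_count(lru, sc);
    freed = list_lru_shrink_walk(lru, sc, isolate_func, isolate_arg);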

    Signed-off-by: Vladimir Davydov
    Suggested-by: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Pull backing device changes from Jens Axboe:
    "This contains a cleanup of how the backing device is handled, in
    preparation for a rework of the lifetime rules. In this part, the
    most important change is to split the unrelated nommu mmap flags from
    it, but also removing a backing_dev_info pointer from the
    address_space (and inode), and a cleanup of other various minor bits.

    Christoph did all the work here, I just fixed an oops with pages that
    have a swap backing. Arnd fixed a missing export, and Oleg killed the
    lustre backing_dev_info from staging. Last patch was from Al,
    unexporting parts that are now no longer needed outside"

    * 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
    Make super_blocks and sb_lock static
    mtd: export new mtd_mmap_capabilities
    fs: make inode_to_bdi() handle NULL inode
    staging/lustre/llite: get rid of backing_dev_info
    fs: remove default_backing_dev_info
    fs: don't reassign dirty inodes to default_backing_dev_info
    nfs: don't call bdi_unregister
    ceph: remove call to bdi_unregister
    fs: remove mapping->backing_dev_info
    fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
    nilfs2: set up s_bdi like the generic mount_bdev code
    block_dev: get bdev inode bdi directly from the block device
    block_dev: only write bdev inode on close
    fs: introduce f_op->mmap_capabilities for nommu mmap support
    fs: kill BDI_CAP_SWAP_BACKED
    fs: deduplicate noop_backing_dev_info

    Linus Torvalds
     

11 Feb, 2015

2 commits

  • Merge misc updates from Andrew Morton:
    "Bite-sized chunks this time, to avoid the MTA ratelimiting woes.

    - fs/notify updates

    - ocfs2

    - some of MM"

    That laconic "some MM" is mainly the removal of remap_file_pages(),
    which is a big simplification of the VM, and which gets rid of a *lot*
    of random cruft and special cases because we no longer support the
    non-linear mappings that it used.

    From a user interface perspective, nothing has changed, because the
    remap_file_pages() syscall still exists, it's just done by emulating the
    old behavior by creating a lot of individual small mappings instead of
    one non-linear one.

    The emulation is slower than the old "native" non-linear mappings, but
    nobody really uses or cares about remap_file_pages(), and simplifying
    the VM is a big advantage.

    * emailed patches from Andrew Morton: (78 commits)
    memcg: zap memcg_slab_caches and memcg_slab_mutex
    memcg: zap memcg_name argument of memcg_create_kmem_cache
    memcg: zap __memcg_{charge,uncharge}_slab
    mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check
    mm: hugetlb: fix type of hugetlb_treat_as_movable variable
    mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?
    mm: memory: merge shared-writable dirtying branches in do_wp_page()
    mm: memory: remove ->vm_file check on shared writable vmas
    xtensa: drop _PAGE_FILE and pte_file()-related helpers
    x86: drop _PAGE_FILE and pte_file()-related helpers
    unicore32: drop pte_file()-related helpers
    um: drop _PAGE_FILE and pte_file()-related helpers
    tile: drop pte_file()-related helpers
    sparc: drop pte_file()-related helpers
    sh: drop _PAGE_FILE and pte_file()-related helpers
    score: drop _PAGE_FILE and pte_file()-related helpers
    s390: drop pte_file()-related helpers
    parisc: drop _PAGE_FILE and pte_file()-related helpers
    openrisc: drop _PAGE_FILE and pte_file()-related helpers
    nios2: drop _PAGE_FILE and pte_file()-related helpers
    ...

    Linus Torvalds
     
  • We don't create non-linear mappings anymore. Let's drop code which
    handles them in rmap.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Feb, 2015

2 commits

  • Add a new function find_inode_nowait() which is an even more general
    version of ilookup5_nowait(). It is designed for callers which need
    very fine grained control over when the function is allowed to block
    or increment the inode's reference count.
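
    A hedged sketch of the shape of the new interface (see fs/inode.c for
    the authoritative prototype and the exact callback contract):

    struct inode *find_inode_nowait(struct super_block *sb,
                                    unsigned long hashval,
                                    int (*match)(struct inode *inode,
                                                 unsigned long hashval,
                                                 void *data),
                                    void *data);

    /* The caller-supplied match() runs with the hash lock held, so it must
     * not block, and it decides whether a reference is taken at all. */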

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Al Viro

    Theodore Ts'o
     
  • Add a new mount option which enables a new "lazytime" mode. This mode
    causes atime, mtime, and ctime updates to only be made to the
    in-memory version of the inode. The on-disk times will only get
    updated when (a) the inode needs to be updated for some non-time-related
    change, (b) userspace calls fsync(), syncfs() or sync(), or (c) just
    before an undeleted inode is evicted from memory.

    This is OK according to POSIX because there are no guarantees after a
    crash unless userspace explicitly requests them via an fsync(2) call.

    For workloads which feature a large number of random writes to a
    preallocated file, the lazytime mount option significantly reduces
    writes to the inode table. The repeated 4k writes to a single block
    will result in undesirable stress on flash devices and SMR disk
    drives. Even on conventional HDDs, the repeated writes to the inode
    table block will trigger Adjacent Track Interference (ATI) remediation
    latencies, which very negatively impact long tail latencies --- which
    is a very big deal for web serving tiers (for example).
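
    A hedged userspace example of enabling the new mode (the MS_LAZYTIME
    mount(2) flag corresponds to the "lazytime" option added here; the
    mount point is illustrative, and older libc headers may need the flag
    definition from <linux/fs.h> instead):

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            if (mount(NULL, "/mnt/data", NULL, MS_REMOUNT | MS_LAZYTIME, NULL))
                    perror("remount with lazytime");
            return 0;
    }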

    Google-Bug-Id: 18297052

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Al Viro

    Theodore Ts'o
     

21 Jan, 2015

1 commit

  • Now that we never use the backing_dev_info pointer in struct address_space
    we can simply remove it and save 4 to 8 bytes in every inode.

    Signed-off-by: Christoph Hellwig
    Acked-by: Ryusuke Konishi
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

17 Jan, 2015

1 commit

  • The current scheme of using the i_flock list is really difficult to
    manage. There is also a legitimate desire for a per-inode spinlock to
    manage these lists that isn't the i_lock.

    Start conversion to a new scheme to eventually replace the old i_flock
    list with a new "file_lock_context" object.

    We start by adding a new i_flctx to struct inode. For now, it lives in
    parallel with i_flock list, but will eventually replace it. The idea is
    to allocate a structure to sit in that pointer and act as a locus for
    all things file locking.

    We allocate a file_lock_context for an inode when the first lock is
    added to it, and it's only freed when the inode is freed. We use the
    i_lock to protect the assignment, but afterward it should mostly be
    accessed locklessly.
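
    A hedged sketch of the allocation scheme described above (illustrative,
    not the fs/locks.c code; the alloc/free helpers are assumed names):

    struct file_lock_context *ctx = inode->i_flctx;

    if (!ctx) {
            struct file_lock_context *new = alloc_flctx();  /* assumed helper */

            spin_lock(&inode->i_lock);
            if (!inode->i_flctx) {          /* publish only if still unset */
                    inode->i_flctx = new;
                    new = NULL;
            }
            spin_unlock(&inode->i_lock);
            if (new)
                    free_flctx(new);        /* lost the race, drop our copy */
            ctx = inode->i_flctx;           /* later readers go lockless */
    }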

    Signed-off-by: Jeff Layton
    Acked-by: Christoph Hellwig

    Jeff Layton
     

17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows us to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     

14 Dec, 2014

1 commit

  • The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
    similar data, one for file backed pages and the other for anon memory. To
    this end, this lock can also be a rwsem. In addition, there are some
    important opportunities to share the lock when there are no tree
    modifications.

    This conversion is straightforward. For now, all users take the write
    lock.
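
    The conversion at each call site then has this shape (a sketch; the
    i_mmap_lock_write()/i_mmap_unlock_write() wrappers introduced alongside
    this conversion simply hide the rwsem):

    /* before */
    mutex_lock(&mapping->i_mmap_mutex);
    /* ... tree modification ... */
    mutex_unlock(&mapping->i_mmap_mutex);

    /* after - every user takes the write side for now */
    i_mmap_lock_write(mapping);             /* down_write(&mapping->i_mmap_rwsem) */
    /* ... tree modification ... */
    i_mmap_unlock_write(mapping);           /* up_write(&mapping->i_mmap_rwsem) */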

    [sfr@canb.auug.org.au: update fremap.c]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

11 Dec, 2014

1 commit

  • As it is, default ->i_fop has NULL ->open() (along with all other methods).
    The only case where it matters is reopening (via procfs symlink) a file that
    didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned
    to something sane (default would fail on read/write/ioctl/etc.).

    Unfortunately, such a case exists - alloc_file() users, especially
    anon_get_file() ones. There we have tons of opened files of very different
    kinds sharing the same inode. As a result, an attempt to reopen those via
    procfs succeeds and you get a descriptor you can't do anything with.

    Moreover, in case of sockets we set ->i_fop that will only be used
    on such reopen attempts - and put a failing ->open() into it to make sure
    those do not succeed.

    It would be simpler to put such ->open() into default ->i_fop and leave
    it unchanged both for anon inode (as we do anyway) and for socket ones. Result:
    * everything going through do_dentry_open() works as it used to
    * sock_no_open() kludge is gone
    * attempts to reopen anon-inode files fail as they really ought to
    * ditto for aio_private_file()
    * ditto for perfmon - this one actually tried to imitate the sock_no_open()
    trick, but failed to set ->i_fop, so in the current tree reopens succeed and
    yield a completely useless descriptor. The intent clearly had been to fail
    with -ENXIO on such reopens; now it actually does.
    * everything else that used alloc_file() keeps working - it has ->i_fop
    set for its inodes anyway
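
    A hedged sketch of the failing default described above (names follow
    the commit text; see fs/inode.c for the authoritative version):

    static int no_open(struct inode *inode, struct file *file)
    {
            return -ENXIO;
    }

    /* Installed as the default inode->i_fop, so re-opening (via procfs) a
     * file whose inode never had ->i_fop set now fails cleanly instead of
     * handing back an unusable descriptor. */
    const struct file_operations no_open_fops = {
            .open = no_open,
    };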

    Signed-off-by: Al Viro

    Al Viro
     

09 Aug, 2014

1 commit

  • This patch (of 6):

    The i_mmap_writable field counts existing writable mappings of an
    address_space. To allow drivers to prevent new writable mappings, make
    this counter signed and prevent new writable mappings if it is negative.
    This is modelled after i_writecount and DENYWRITE.

    This will be required by the shmem-sealing infrastructure to prevent any
    new writable mappings after the WRITE seal has been set. In case there
    exists a writable mapping, this operation will fail with EBUSY.

    Note that we rely on the fact that iff you already own a writable mapping,
    you can increase the counter without using the helpers. This is the same
    as we do for i_writecount.
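
    A hedged sketch of the helpers this implies, modelled on the
    i_writecount/deny-write pattern (consult include/linux/fs.h for the
    authoritative versions):

    /* new writable mapping: fails if writable mappings are currently denied */
    static inline int mapping_map_writable(struct address_space *mapping)
    {
            return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
                    0 : -EPERM;
    }

    /* deny new writable mappings: fails if one already exists */
    static inline int mapping_deny_writable(struct address_space *mapping)
    {
            return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
                    0 : -EBUSY;
    }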

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     

16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.
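
    A typical conversion then looks like this (a sketch; the word, bit
    number and old action function are placeholders):

    /* before: caller had to supply an action function */
    err = wait_on_bit(&obj->flags, MY_BIT, my_wait_action, TASK_KILLABLE);

    /* after: the standard schedule()-based wait is implied, and the mode
     * alone decides whether a pending signal aborts the wait */
    err = wait_on_bit(&obj->flags, MY_BIT, TASK_KILLABLE);

    /* I/O-bound waiters use the io_schedule() flavour instead */
    err = wait_on_bit_io(&obj->flags, MY_BIT, TASK_UNINTERRUPTIBLE);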

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible"

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack. So
    the distinction will still be visible, only with different
    function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since the first version of this patch (against 3.15) two new action
    functions have appeared, one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

11 Jun, 2014

1 commit

  • The kernel has no concept of capabilities with respect to inodes; inodes
    exist independently of namespaces. For example, inode_capable(inode,
    CAP_LINUX_IMMUTABLE) would be nonsense.

    This patch changes inode_capable to check for uid and gid mappings and
    renames it to capable_wrt_inode_uidgid, which should make it more
    obvious what it does.
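
    A hedged sketch of the resulting check (close to, but not quoted from,
    the kernel/capability.c implementation):

    bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
    {
            struct user_namespace *ns = current_user_ns();

            /* the capability only applies if the inode's uid and gid both
             * map into the caller's user namespace */
            return ns_capable(ns, cap) &&
                   kuid_has_mapping(ns, inode->i_uid) &&
                   kgid_has_mapping(ns, inode->i_gid);
    }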

    Fixes CVE-2014-4014.

    Cc: Theodore Ts'o
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Chinner
    Cc: stable@vger.kernel.org
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

05 Apr, 2014

2 commits

  • Pull ext4 updates from Ted Ts'o:
    "Major changes for 3.14 include support for the newly added ZERO_RANGE
    and COLLAPSE_RANGE fallocate operations, and scalability improvements
    in the jbd2 layer and in xattr handling when the extended attributes
    spill over into an external block.

    Other than that, the usual clean ups and minor bug fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
    ext4: fix premature freeing of partial clusters split across leaf blocks
    ext4: remove unneeded test of ret variable
    ext4: fix comment typo
    ext4: make ext4_block_zero_page_range static
    ext4: atomically set inode->i_flags in ext4_set_inode_flags()
    ext4: optimize Hurd tests when reading/writing inodes
    ext4: kill i_version support for Hurd-castrated file systems
    ext4: each filesystem creates and uses its own mb_cache
    fs/mbcache.c: doucple the locking of local from global data
    fs/mbcache.c: change block and index hash chain to hlist_bl_node
    ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
    ext4: refactor ext4_fallocate code
    ext4: Update inode i_size after the preallocation
    ext4: fix partial cluster handling for bigalloc file systems
    ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
    ext4: only call sync_filesystem() when remounting read-only
    fs: push sync_filesystem() down to the file system's remount_fs()
    jbd2: improve error messages for inconsistent journal heads
    jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
    jbd2: minimize region locked by j_list_lock in journal_get_create_access()
    ...

    Linus Torvalds
     
  • Pull renameat2 system call from Miklos Szeredi:
    "This adds a new syscall, renameat2(), which is the same as renameat()
    but with a flags argument.

    The purpose of extending rename is to add cross-rename, a symmetric
    variant of rename, which exchanges the two files. This allows
    interesting things, which were not possible before, for example
    atomically replacing a directory tree with a symlink, etc... This
    also allows overlayfs and friends to operate on whiteouts atomically.

    Andy Lutomirski also suggested a "noreplace" flag, which disables the
    overwriting behavior of rename.

    These two flags, RENAME_EXCHANGE and RENAME_NOREPLACE are only
    implemented for ext4 as an example and for testing"
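
    A hedged userspace example of the new call (glibc had no wrapper at the
    time, so it goes through syscall(2); the paths are illustrative and the
    headers assume a kernel/libc that already exposes the definitions):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/fs.h>          /* RENAME_NOREPLACE, RENAME_EXCHANGE */
    #include <sys/syscall.h>
    #include <unistd.h>

    /* atomically swap two directory entries */
    int swapped = syscall(SYS_renameat2, AT_FDCWD, "live",
                          AT_FDCWD, "staging", RENAME_EXCHANGE);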

    * 'cross-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ext4: add cross rename support
    ext4: rename: split out helper functions
    ext4: rename: move EMLINK check up
    ext4: rename: create ext4_renament structure for local vars
    vfs: add cross-rename
    vfs: lock_two_nondirectories: allow directory args
    security: add flags to rename hooks
    vfs: add RENAME_NOREPLACE flag
    vfs: add renameat2 syscall
    vfs: rename: use common code for dir and non-dir
    vfs: rename: move d_move() up
    vfs: add d_is_dir()

    Linus Torvalds
     

04 Apr, 2014

1 commit

  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.
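
    A hedged sketch of the handshake described above (helper names follow
    this series; the surrounding code is illustrative):

    /* inode freeing side, before the final truncate: */
    mapping_set_exiting(mapping);           /* sets AS_EXITING under the tree lock */
    truncate_inode_pages_final(mapping);

    /* reclaim side, before installing a shadow entry: */
    if (mapping_exiting(mapping))
            return;                         /* inode is going away, keep the tree empty */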

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

25 Mar, 2014

1 commit

  • Use cmpxchg() to atomically set i_flags instead of clearing out the
    S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
    EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
    where an immutable file has the immutable flag cleared for a brief
    window of time.
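
    The update then takes the usual compare-and-exchange shape (a sketch of
    the loop; the actual code uses a common helper in fs/inode.c):

    unsigned int old_flags, new_flags;

    do {
            old_flags = ACCESS_ONCE(inode->i_flags);
            new_flags = (old_flags & ~mask) | flags;
    } while (cmpxchg(&inode->i_flags, old_flags, new_flags) != old_flags);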

    Reported-by: John Sullivan
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Theodore Ts'o
     

11 Sep, 2013

5 commits

  • Now that the shrinker is passing a node in the scan control structure, we
    can pass this to the generic LRU list code to isolate reclaim to the
    lists on matching nodes.

    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Today, removal of an element from the lru is done after the lock is
    released. This is a clear mistake, although we are not sure whether the
    bugs we are seeing are related to it. All list manipulations are done
    inside the lock, and so should this one be.

    Signed-off-by: Glauber Costa
    Tested-by: Michal Hocko
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     
  • [glommer@openvz.org: adapted for new LRU return codes]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • Convert superblock shrinker to use the new count/scan API, and propagate
    the API changes through to the filesystem callouts. The filesystem
    callouts already use a count/scan API, so it's just changing counters to
    longs to match the VM API.

    This requires the dentry and inode shrinker callouts to be converted to
    the count/scan API. This is mainly a mechanical change.
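
    The converted registration then has this shape (a sketch; the callback
    names follow fs/super.c after this change, but the struct below is
    illustrative rather than quoted):

    static struct shrinker sb_shrinker = {
            .count_objects  = super_cache_count,    /* how many objects could be freed */
            .scan_objects   = super_cache_scan,     /* free up to sc->nr_to_scan, return freed */
            .seeks          = DEFAULT_SEEKS,
    };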

    [glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • This series reworks our current object cache shrinking infrastructure in
    two main ways:

    * Noticing that a lot of users copy and paste their own version of LRU
    lists for objects, we put some effort into providing a generic version.
    It is modeled after the filesystem users: dentries, inodes, and xfs
    (for various tasks), but we expect that other users could benefit in
    the near future with little or no modification. Let us know if you
    have any issues.

    * The underlying list_lru being proposed automatically and
    transparently keeps the elements in per-node lists, and is able to
    manipulate the node lists individually. Given this infrastructure, we
    are able to modify the up-to-now hammer called shrink_slab to proceed
    with node-reclaim instead of always searching memory from all over like
    it has been doing.

    Per-node lru lists are also expected to lead to less contention in the lru
    locks on multi-node scans, since we are now no longer fighting for a
    global lock. The locks usually disappear from the profilers with this
    change.

    Although we have no official benchmarks for this version - be our guest to
    independently evaluate this - earlier versions of this series were
    performance tested (details at
    http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
    visible performance regressions while yielding a better qualitative
    behavior in NUMA machines.

    With this infrastructure in place, we can use the list_lru entry point to
    provide memcg isolation and per-memcg targeted reclaim. Historically,
    those two pieces of work have been posted together. This version presents
    only the infrastructure work, deferring the memcg work for a later time,
    so we can focus on getting this part tested. You can see more about the
    history of such work at http://lwn.net/Articles/552769/

    Dave Chinner (18):
    dcache: convert dentry_stat.nr_unused to per-cpu counters
    dentry: move to per-sb LRU locks
    dcache: remove dentries from LRU before putting on dispose list
    mm: new shrinker API
    shrinker: convert superblock shrinkers to new API
    list: add a new LRU list type
    inode: convert inode lru list to generic lru list code.
    dcache: convert to use new lru list infrastructure
    list_lru: per-node list infrastructure
    shrinker: add node awareness
    fs: convert inode and dentry shrinking to be node aware
    xfs: convert buftarg LRU to generic code
    xfs: rework buffer dispose list tracking
    xfs: convert dquot cache lru to list_lru
    fs: convert fs shrinkers to new scan/count API
    drivers: convert shrinkers to new count/scan API
    shrinker: convert remaining shrinkers to count/scan API
    shrinker: Kill old ->shrink API.

    Glauber Costa (7):
    fs: bump inode and dentry counters to long
    super: fix calculation of shrinkable objects for small numbers
    list_lru: per-node API
    vmscan: per-node deferred work
    i915: bail out earlier when shrinker cannot acquire mutex
    hugepage: convert huge zero page shrinker to new shrinker API
    list_lru: dynamically adjust node arrays

    This patch:

    There are situations in very large machines in which we can have a large
    quantity of dirty inodes, unused dentries, etc. This is particularly true
    when umounting a filesystem, since every live object will eventually be
    discarded.

    Dave Chinner reported a problem with this while experimenting with the
    shrinker revamp patchset. So we believe it is time for a change. This
    patch just converts the ints to longs. Machines where it matters should
    have a big long anyway.

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: Dave Chinner
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     

02 May, 2013

1 commit

  • Pull VFS updates from Al Viro:

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds