21 Aug, 2015

2 commits

  • i_lock is only needed until __d_find_any_alias calls dget on the alias
    dentry. After that the reference to new ensures that dentry_kill and
    d_delete will not remove the inode from the dentry, and remove the
    dentry from the inode->d_entry list.

    The inode i_lock came to be held over the __d_move calls in
    d_splice_alias through a series of introductions of locks with
    increasingly smaller scope. First it was the dcache_lock, then
    it was the dcache_inode_lock, and finally inode->i_lock.

    Furthermore inode->i_lock is not held over any other calls
    to d_move or __d_move so it can not provide any meaningful
    rename protection.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • A rename can result in a dentry that by walking up d_parent
    will never reach its mnt_root. For lack of a better term
    I call this an escaped path.

    prepend_path is called by four different functions __d_path,
    d_absolute_path, d_path, and getcwd.

    __d_path only wants to see paths that are connected to the root it
    passes in. So __d_path needs prepend_path to return an error.

    d_absolute_path similarly wants to see paths that are connected to
    some root. Escaped paths are not connected to any mnt_root so
    d_absolute_path needs prepend_path to return an error greater
    than 1. So escaped paths will be treated like paths on lazily
    unmounted mounts.

    getcwd needs to prepend "(unreachable)" so getcwd also needs
    prepend_path to return an error.

    d_path is the interesting hold out. d_path just wants to print
    something, and does not care about the weird cases. Which raises
    the question what should be printed?

    Given that <escaped_path>/<anything> should result in -ENOENT I
    believe it is desirable for escaped paths to be printed as empty
    paths. As there are not really any meaningful path components when
    considered from the perspective of a mount tree.

    So tweak prepend_path to return an empty path with a new error
    code of 3 when it encounters an escaped path.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
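
The escaped-path rule described in this commit can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_dentry`, `toy_prepend_path`, and the `-1` "buffer too small" return are hypothetical names and values invented for illustration, not the kernel's `prepend_path`. Only the idea (walk up the parents, and return an empty path with code 3 if the filesystem root is reached without meeting `mnt_root`) comes from the message above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a dentry: the filesystem root is its own parent. */
struct toy_dentry {
    struct toy_dentry *parent;
    const char *name;
};

#define TOY_ESCAPED 3   /* the new error code named in the commit message */

/*
 * Build the path of d relative to mnt_root into buf.
 * Returns 0 on success, TOY_ESCAPED (with an empty path) when walking
 * up the parents reaches the filesystem root without meeting mnt_root,
 * and -1 (a hypothetical value) when buf is too small.
 */
int toy_prepend_path(const struct toy_dentry *d,
                     const struct toy_dentry *mnt_root,
                     char *buf, size_t len)
{
    char *end, *p;

    assert(d && mnt_root && buf && len > 0);
    end = buf + len - 1;
    p = end;
    *p = '\0';

    while (d != mnt_root) {
        size_t n;

        if (d->parent == d) {       /* hit the fs root: escaped path */
            buf[0] = '\0';
            return TOY_ESCAPED;
        }
        n = strlen(d->name);
        if ((size_t)(p - buf) < n + 1)
            return -1;
        p -= n;
        memcpy(p, d->name, n);      /* prepend the component ... */
        *--p = '/';                 /* ... and its separator */
        d = d->parent;
    }
    if (p == end) {                 /* d was mnt_root itself */
        if (p == buf)
            return -1;
        *--p = '/';
    }
    memmove(buf, p, (size_t)(end - p) + 1);
    return 0;
}
```

In this model, renaming a dentry to a parent outside the mount root (by changing its `parent` pointer) is what turns a previously printable path into an escaped one.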
     

07 Aug, 2015

1 commit

  • Dave Hansen reported the following:

    My laptop has been behaving strangely with 4.2-rc2. Once I log
    in to my X session, I start getting all kinds of strange errors
    from applications and see this in my dmesg:

    VFS: file-max limit 8192 reached

    The problem is that the file-max is calculated before memory is fully
    initialised and miscalculates how much memory the kernel is using. This
    patch recalculates file-max after deferred memory initialisation. Note
    that using memory hotplug infrastructure would not have avoided this
    problem as the value is not recalculated after memory hot-add.

    4.1: files_stat.max_files = 6582781
    4.2-rc2: files_stat.max_files = 8192
    4.2-rc2 patched: files_stat.max_files = 6562467

    Small differences with the patch applied and 4.1 but not enough to matter.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Hansen
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Cc: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
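
Why an early calculation collapses to exactly 8192 can be sketched with a toy formula. The heuristic below (about 10% of memory at roughly 1 KiB per open file, with a floor of 8192) is an assumption made for illustration. `toy_max_files` and `TOY_NR_FILE` are hypothetical names, and the real kernel calculation may differ in detail.

```c
#include <assert.h>

#define TOY_NR_FILE 8192UL  /* assumed floor; matches the limit in the report */

/*
 * Assumed heuristic: allow about 10% of memory for open files at
 * roughly 1 KiB each, but never go below TOY_NR_FILE.
 */
unsigned long toy_max_files(unsigned long nr_pages, unsigned long page_size)
{
    unsigned long n;

    assert(page_size >= 1024);
    n = nr_pages * (page_size / 1024) / 10;
    return n > TOY_NR_FILE ? n : TOY_NR_FILE;
}
```

If only a small fraction of pages has been initialised when the function runs, `n` falls below the floor and the result is 8192 regardless of how much memory the machine really has. Running it again once all pages are accounted for gives a limit in the millions.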
     

12 Jul, 2015

1 commit

  • Normally opening a file, unlinking it and then closing will have
    the inode freed upon close() (provided that it's not otherwise busy and
    has no remaining links, of course). However, there's one case where that
    does *not* happen. Namely, if you open it by fhandle with cold dcache,
    then unlink() and close().

    In the normal case, d_delete() in unlink(2) notices that the dentry
    is busy and unhashes it; on the final dput() it will be forcibly evicted from
    dcache, triggering iput() and inode removal. In this case, though, we end
    up with *two* dentries - a disconnected one (created by open-by-fhandle) and
    a regular one (used by unlink()). The latter will have its reference to the
    inode dropped just fine, but the former will not - it's considered hashed (it
    is on the ->s_anon list), so it will stay around until memory pressure
    finally does it in. As a result, the final iput() is delayed
    indefinitely. It's trivial to reproduce -

    #define _GNU_SOURCE     /* name_to_handle_at(), open_by_handle_at() */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* remount to get the cached dentries evicted */
    void flush_dcache(void)
    {
            system("mount -o remount,rw /");
    }

    static char buf[20 * 1024 * 1024];

    int main(void)
    {
            int fd;
            union {
                    struct file_handle f;
                    char buf[MAX_HANDLE_SZ];
            } x;
            int m;

            x.f.handle_bytes = sizeof(x);
            chdir("/root");
            mkdir("foo", 0700);
            fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
            close(fd);
            name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
            flush_dcache();         /* make the dcache cold */
            fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
            unlink("foo/bar");
            write(fd, buf, sizeof(buf));
            system("df .");         /* 20Mb eaten */
            close(fd);
            system("df .");         /* should've freed those 20Mb */
            flush_dcache();
            system("df .");         /* should be the same as #2 */
            return 0;
    }

    will spit out something like
    Filesystem 1K-blocks Used Available Use% Mounted on
    /dev/root 322023 303843 1131 100% /
    Filesystem 1K-blocks Used Available Use% Mounted on
    /dev/root 322023 303843 1131 100% /
    Filesystem 1K-blocks Used Available Use% Mounted on
    /dev/root 322023 283282 21692 93% /
    - the inode gets freed only when the dentry is finally evicted (here we
    trigger that by remount; normally it would've happened in response to memory
    pressure hell knows when).

    Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/
    Acked-by: J. Bruce Fields
    Signed-off-by: Al Viro

    Al Viro
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

04 Jul, 2015

1 commit

  • Pull user namespace updates from Eric Biederman:
    "Long ago and far away when user namespaces were young it was realized
    that allowing fresh mounts of proc and sysfs with only user namespace
    permissions could violate the basic rule that only root gets to decide
    if proc or sysfs should be mounted at all.

    Some hacks were put in place to reduce the worst of the damage that
    could be done, and the common sense rule was adopted that fresh mounts of
    proc and sysfs should allow no more than bind mounts of proc and
    sysfs. Unfortunately that rule has not been fully enforced.

    There are two kinds of gaps in that enforcement. Only filesystems
    mounted on empty directories of proc and sysfs should be ignored but
    the test for empty directories was insufficient. So in my tree
    directories on proc, sysctl and sysfs that will always be empty are
    created specially. Every other technique is imperfect as an ordinary
    directory can have entries added even after a readdir returns and
    shows that the directory is empty. Special creation of directories
    for mount points makes the code in the kernel a smidge clearer about
    its purpose. I asked container developers from the various container
    projects to help test this and no holes were found in the set of mount
    points on proc and sysfs that are created specially.

    This set of changes also starts enforcing the mount flags of fresh
    mounts of proc and sysfs are consistent with the existing mount of
    proc and sysfs. I expected this to be the boring part of the work but
    unfortunately unprivileged userspace winds up mounting fresh copies of
    proc and sysfs with noexec and nosuid clear when root set those flags
    on the previous mount of proc and sysfs. So for now only the atime,
    read-only and nodev attributes which userspace happens to keep
    consistent are enforced. Dealing with the noexec and nosuid
    attributes remains for another time.

    This set of changes also addresses an issue with how open file
    descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of
    /proc/<pid>/fd has been triggering a WARN_ON that has not been
    meaningful since it was added (as all of the code in the kernel was
    converted) and is now actively wrong.

    There is also a short list of issues that have not been fixed yet that
    I will mention briefly.

    It is possible to rename a directory from below to above a bind mount.
    At which point any directory pointers below the renamed directory can
    be walked up to the root directory of the filesystem. With user
    namespaces enabled a bind mount of the bind mount can be created
    allowing the user to pick a directory whose children they can rename
    to outside of the bind mount. This is challenging to fix and doubly
    so because all obvious solutions must touch code that is in the
    performance part of pathname resolution.

    As mentioned above there is also a question of how to ensure that
    developers by accident or with purpose do not introduce executable
    files on sysfs and proc and in doing so introduce security regressions
    in the current userspace that will not be immediately obvious and as
    such are likely to require breaking userspace in painful ways once
    they are recognized"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Remove incorrect debugging WARN in prepend_path
    mnt: Update fs_fully_visible to test for permanently empty directories
    sysfs: Create mountpoints with sysfs_create_mount_point
    sysfs: Add support for permanently empty directories to serve as mount points.
    kernfs: Add support for always empty directories.
    proc: Allow creating permanently empty directories that serve as mount points
    sysctl: Allow creating permanently empty directories that serve as mountpoints.
    fs: Add helper functions for permanently empty directories.
    vfs: Ignore unlocked mounts in fs_fully_visible
    mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
    mnt: Refactor the logic for mounting sysfs and proc in a user namespace

    Linus Torvalds
     

01 Jul, 2015

1 commit

  • The warning message in prepend_path is unclear and outdated. It was
    added as a warning that the mechanism for generating names of pseudo
    files had been removed from prepend_path and d_dname should be used
    instead. Unfortunately the warning reads like a general warning,
    making it unclear what to do with it.

    Remove the warning. The transition it was added to warn about is long
    over, and I added code several years ago which in rare cases causes
    the warning to fire on legitimate code, and the warning is now firing
    and scaring people for no good reason.

    Cc: stable@vger.kernel.org
    Reported-by: Ivan Delalande
    Reported-by: Omar Sandoval
    Fixes: f48cfddc6729e ("vfs: In d_path don't call d_dname on a mount point")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 Jun, 2015

1 commit

  • Pull timer updates from Thomas Gleixner:
    "A rather largish update for everything time and timer related:

    - Cache footprint optimizations for both hrtimers and timer wheel

    - Lower the NOHZ impact on systems which have NOHZ or timer migration
    disabled at runtime.

    - Optimize run time overhead of hrtimer interrupt by making the clock
    offset updates smarter

    - hrtimer cleanups and removal of restrictions to tackle some
    problems in sched/perf

    - Some more leap second tweaks

    - Another round of changes addressing the 2038 problem

    - First step to change the internals of clock event devices by
    introducing the necessary infrastructure

    - Allow constant folding for usecs/msecs_to_jiffies()

    - The usual pile of clockevent/clocksource driver updates

    The hrtimer changes contain updates to sched, perf and x86 as they
    depend on them plus changes all over the tree to cleanup API changes
    and redundant code, which got copied all over the place. The y2038
    changes touch s390 to remove the last non 2038 safe code related to
    boot/persistent clock"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    clocksource: Increase dependencies of timer-stm32 to limit build wreckage
    timer: Minimize nohz off overhead
    timer: Reduce timer migration overhead if disabled
    timer: Stats: Simplify the flags handling
    timer: Replace timer base by a cpu index
    timer: Use hlist for the timer wheel hash buckets
    timer: Remove FIFO "guarantee"
    timers: Sanitize catchup_timer_jiffies() usage
    hrtimer: Allow hrtimer::function() to free the timer
    seqcount: Introduce raw_write_seqcount_barrier()
    seqcount: Rename write_seqcount_barrier()
    hrtimer: Fix hrtimer_is_queued() hole
    hrtimer: Remove HRTIMER_STATE_MIGRATE
    selftest: Timers: Avoid signal deadlock in leap-a-day
    timekeeping: Copy the shadow-timekeeper over the real timekeeper last
    clockevents: Check state instead of mode in suspend/resume path
    selftests: timers: Add leap-second timer edge testing to leap-a-day.c
    ntp: Do leapsecond adjustment in adjtimex read path
    time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
    ntp: Introduce and use SECS_PER_DAY macro instead of 86400
    ...

    Linus Torvalds
     

19 Jun, 2015

2 commits

  • Make file->f_path always point to the overlay dentry so that the path in
    /proc/pid/fd is correct and to ensure that label-based LSMs have access to the
    overlay as well as the underlay (path-based LSMs probably don't need it).

    Using my union testsuite to set things up, before the patch I see:

    [root@andromeda union-testsuite]# bash 5 /a/foo107
    [root@andromeda union-testsuite]# stat /mnt/a/foo107
    ...
    Device: 23h/35d Inode: 13381 Links: 1
    ...
    [root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
    ...
    Device: 23h/35d Inode: 13381 Links: 1
    ...

    After the patch:

    [root@andromeda union-testsuite]# bash 5 /mnt/a/foo107
    [root@andromeda union-testsuite]# stat /mnt/a/foo107
    ...
    Device: 23h/35d Inode: 40346 Links: 1
    ...
    [root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
    ...
    Device: 23h/35d Inode: 40346 Links: 1
    ...

    Note the change in where /proc/$$/fd/5 points to in the ls command. It was
    pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
    (which is correct).

    The inode accessed, however, is the lower layer. The union layer is on device
    25h/37d and the upper layer on 24h/36d.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • I'll shortly be introducing another seqcount primitive that's useful
    to provide ordering semantics and would like to use the
    write_seqcount_barrier() name for that.

    Seeing how there's only one user of the current primitive, let's rename
    it to invalidate, as that appears to be what it's doing.

    While there, employ lockdep_assert_held() instead of
    assert_spin_locked() to not generate debug code for regular kernels.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: ktkhai@parallels.com
    Cc: rostedt@goodmis.org
    Cc: juri.lelli@gmail.com
    Cc: pang.xunlei@linaro.org
    Cc: Oleg Nesterov
    Cc: wanpeng.li@linux.intel.com
    Cc: Paul McKenney
    Cc: Al Viro
    Cc: Linus Torvalds
    Cc: umgwanakikbuti@gmail.com
    Link: http://lkml.kernel.org/r/20150611124743.279926217@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

29 May, 2015

1 commit

  • when we find that a child has died while we'd been trying to ascend,
    we should go into the first live sibling itself, rather than its sibling.

    Off-by-one in question had been introduced in "deal with deadlock in
    d_walk()" and the fix needs to be backported to all branches this one
    has been backported to.

    Cc: stable@vger.kernel.org # 3.2 and later
    Signed-off-by: Al Viro

    Al Viro
     

16 Apr, 2015

1 commit

  • Impose ordering on accesses of d_inode and d_flags to avoid the need to do
    this:

    if (!dentry->d_inode || d_is_negative(dentry)) {

    when this:

    if (d_is_negative(dentry)) {

    should suffice.

    This check is especially problematic if a dentry can have its type field set
    to something other than DENTRY_MISS_TYPE when d_inode is NULL (as in
    unionmount).

    What we really need to do is stick a write barrier between setting d_inode and
    setting d_flags and a read barrier between reading d_flags and reading
    d_inode.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
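
The ordering requirement in this commit can be modelled in userspace with C11 atomics. This is a sketch under assumptions: release/acquire ordering stands in for the kernel's write and read barriers, and `toy_dentry`, `toy_instantiate`, and the type constants are hypothetical names. The point is only that once the type in the flags is published after the inode pointer, a reader that sees a non-miss type may rely on the inode pointer being set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_MISS_TYPE    0u  /* hypothetical: negative dentry */
#define TOY_REGULAR_TYPE 1u  /* hypothetical: some positive type */

struct toy_inode { int ino; };

struct toy_dentry {
    _Atomic(struct toy_inode *) d_inode;
    atomic_uint d_flags;
};

void toy_dentry_init(struct toy_dentry *d)
{
    atomic_init(&d->d_inode, (struct toy_inode *)NULL);
    atomic_init(&d->d_flags, TOY_MISS_TYPE);
}

/* Writer: set the inode first, then publish the type (release). */
void toy_instantiate(struct toy_dentry *d, struct toy_inode *inode,
                     unsigned int type)
{
    assert(inode != NULL && type != TOY_MISS_TYPE);
    atomic_store_explicit(&d->d_inode, inode, memory_order_relaxed);
    atomic_store_explicit(&d->d_flags, type, memory_order_release);
}

/* Reader: the flags alone decide whether the dentry is negative. */
int toy_is_negative(struct toy_dentry *d)
{
    return atomic_load_explicit(&d->d_flags, memory_order_acquire)
           == TOY_MISS_TYPE;
}

/* Safe without a separate d_inode NULL check thanks to the ordering. */
struct toy_inode *toy_inode_if_positive(struct toy_dentry *d)
{
    if (toy_is_negative(d))
        return NULL;
    return atomic_load_explicit(&d->d_inode, memory_order_relaxed);
}
```

A single-threaded run cannot demonstrate the ordering guarantee itself. It only shows the intended call pattern, in which callers test the type and never look at the pointer first.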
     

12 Apr, 2015

1 commit

  • On a distributed filesystem it's possible for lookup to discover that a
    directory it just found is already cached elsewhere in the directory
    hierarchy. The dcache won't let us keep the directory in both places,
    so we have to move the dentry to the new location from the place we
    previously had it cached.

    If the parent has changed, then this requires all the same locks as we'd
    need to do a cross-directory rename. But we're already in lookup
    holding one parent's i_mutex, so it's too late to acquire those locks in
    the right order.

    The (unreliable) solution in __d_unalias is to trylock() the required
    locks and return -EBUSY if it fails.

    I see no particular reason for returning -EBUSY, and -ESTALE is already
    the result of some other lookup races on NFS. I think -ESTALE is the
    more helpful error return. It also allows us to take advantage of the
    logic Jeff Layton added in c6a9428401c0 "vfs: fix renameat to retry on
    ESTALE errors" and ancestors, which hopefully resolves some of these
    errors before they're returned to userspace.

    I can reproduce these cases using NFS with:

    ssh root@$client '
    mount -olookupcache=pos '$server':'$export' /mnt/
    mkdir /mnt/TO
    mkdir /mnt/DIR
    touch /mnt/DIR/test.txt
    while true; do
    strace -e open cat /mnt/DIR/test.txt 2>&1 | grep EBUSY
    done
    '
    ssh root@$server '
    while true; do
    mv $export/DIR $export/TO/DIR
    mv $export/TO/DIR $export/DIR
    done
    '

    It also helps to add some other concurrent use of the directory on the
    client (e.g., "ls /mnt/TO"). And you can replace the server-side mv's
    by client-side mv's that are repeatedly killed. (If the client is
    interrupted while waiting for the RENAME response then it's left with a
    dentry that has to go under one parent or the other, but it doesn't yet
    know which.)

    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
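
The "trylock and give up" pattern described for __d_unalias can be sketched as follows. Everything here is a userspace assumption: `toy_lock` is a hypothetical try-lock built on a C11 `atomic_flag`, and `toy_unalias` only mimics the shape of the logic (take both locks without blocking, back out and return -ESTALE if either is contended).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal try-lock; not a kernel mutex. */
struct toy_lock { atomic_flag held; };

static bool toy_trylock(struct toy_lock *l)
{
    /* test_and_set returns the previous value: false means we got it */
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

static void toy_unlock(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * We already hold one parent's lock, so blocking on the others could
 * violate lock ordering. Try each lock once; on failure release what
 * was taken and report -ESTALE so the caller can retry the lookup.
 */
int toy_unalias(struct toy_lock *rename_lock, struct toy_lock *old_parent)
{
    assert(rename_lock && old_parent);
    if (!toy_trylock(rename_lock))
        return -ESTALE;
    if (!toy_trylock(old_parent)) {
        toy_unlock(rename_lock);
        return -ESTALE;
    }
    /* ... the dentry would be moved to its new location here ... */
    toy_unlock(old_parent);
    toy_unlock(rename_lock);
    return 0;
}
```

Returning an error that callers already know how to retry is the design point of the commit. The sketch makes no attempt to model the retry itself.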
     

23 Feb, 2015

2 commits

  • Split DCACHE_FILE_TYPE into DCACHE_REGULAR_TYPE (dentries representing regular
    files) and DCACHE_SPECIAL_TYPE (representing blockdev, chardev, FIFO and
    socket files).

    d_is_reg() and d_is_special() are added to detect these subtypes and
    d_is_file() is left as the union of the two.

    This allows a number of places that use S_ISREG(dentry->d_inode->i_mode) to
    use d_is_reg(dentry) instead.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
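
The relationship between the three helpers can be shown with a small model. The numeric type values and the `toy_` names below are hypothetical (the kernel's flag encoding is not given in the message). The sketch only demonstrates that the "file" test is the union of the "regular" and "special" tests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding: the type lives in the low bits of d_flags. */
#define TOY_ENTRY_TYPE_MASK 0x7u

enum toy_dentry_type {
    TOY_MISS_TYPE      = 0,  /* negative dentry */
    TOY_DIRECTORY_TYPE = 1,
    TOY_REGULAR_TYPE   = 2,  /* regular file */
    TOY_SPECIAL_TYPE   = 3,  /* blockdev, chardev, FIFO or socket */
    TOY_SYMLINK_TYPE   = 4
};

struct toy_dentry { unsigned int d_flags; };

void toy_set_type(struct toy_dentry *d, enum toy_dentry_type type)
{
    assert(((unsigned int)type & ~TOY_ENTRY_TYPE_MASK) == 0);
    d->d_flags = (d->d_flags & ~TOY_ENTRY_TYPE_MASK) | (unsigned int)type;
}

static unsigned int toy_type(const struct toy_dentry *d)
{
    return d->d_flags & TOY_ENTRY_TYPE_MASK;
}

bool toy_d_is_reg(const struct toy_dentry *d)
{
    return toy_type(d) == TOY_REGULAR_TYPE;
}

bool toy_d_is_special(const struct toy_dentry *d)
{
    return toy_type(d) == TOY_SPECIAL_TYPE;
}

/* "file" stays the union of the two subtypes. */
bool toy_d_is_file(const struct toy_dentry *d)
{
    return toy_d_is_reg(d) || toy_d_is_special(d);
}
```

With the type cached in the flags, a caller that only needs "is this a regular file?" does not have to dereference the inode and test its mode.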
     
  • Add a DCACHE_FALLTHRU flag to indicate that, in a layered filesystem, this is
    a virtual dentry that covers another one in a lower layer that should be used
    instead. This may be recorded on medium if directory integration is stored
    there.

    The flag can be set with d_set_fallthru() and tested with d_is_fallthru().

    Original-author: Valerie Aurora
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

18 Feb, 2015

1 commit

  • Pull misc VFS updates from Al Viro:
    "This cycle a lot of stuff sits on topical branches, so I'll be sending
    more or less one pull request per branch.

    This is the first pile; more to follow in a few. In this one are
    several misc commits from early in the cycle (before I went for
    separate branches), plus the rework of mntput/dput ordering on umount,
    switching to use of fs_pin instead of convoluted games in
    namespace_unlock()"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch the IO-triggering parts of umount to fs_pin
    new fs_pin killing logics
    allow attaching fs_pin to a group not associated with some superblock
    get rid of the second argument of acct_kill()
    take count and rcu_head out of fs_pin
    dcache: let the dentry count go down to zero without taking d_lock
    pull bumping refcount into ->kill()
    kill pin_put()
    mode_t whack-a-mole: chelsio
    file->f_path.dentry is pinned down for as long as the file is open...
    get rid of lustre_dump_dentry()
    gut proc_register() a bit
    kill d_validate()
    ncpfs: get rid of d_validate() nonsense
    selinuxfs: don't open-code d_genocide()

    Linus Torvalds
     

14 Feb, 2015

1 commit

  • We need to manually unpoison the rounded-up allocation size for dname to avoid
    kasan's reports in dentry_string_cmp(). When CONFIG_DCACHE_WORD_ACCESS=y,
    dentry_string_cmp() may access a few bytes beyond the size requested from
    kmalloc().

    dentry_string_cmp() relies on the fact that the dname is allocated using
    kmalloc() and that kmalloc() internally rounds up the allocation size. So this
    is not a bug, but it makes kasan complain about such accesses. To avoid such
    reports we mark the rounded-up allocation size in shadow as accessible.

    Signed-off-by: Andrey Ryabinin
    Reported-by: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

13 Feb, 2015

2 commits

  • Currently, the isolate callback passed to the list_lru_walk family of
    functions is supposed to just delete an item from the list upon returning
    LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
    __list_lru_walk_one after the callback returns. Since the callback is
    allowed to drop the lock after removing an item (it has to return
    LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
    of elements on the list even if we check them under the lock. This makes
    it difficult to move items from one list_lru_one to another, which is
    required for per-memcg list_lru reparenting - we can't just splice the
    lists, we have to move entries one by one.

    This patch therefore introduces helpers that must be used by callback
    functions to isolate items instead of raw list_del/list_move. These are
    list_lru_isolate and list_lru_isolate_move. They not only remove the
    entry from the list, but also fix the nr_items counter, making sure
    nr_items always reflects the actual number of elements on the list if
    checked under the appropriate lock.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
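
The idea of isolate helpers that keep the counter in step with the list can be sketched in userspace. This is a toy model under assumptions: `toy_lru`, `toy_node`, and the `toy_lru_*` functions are hypothetical and do no locking. They only mirror the described behaviour, where the helper both unlinks the entry and adjusts `nr_items`, so the walker has nothing to fix up afterwards.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node { struct toy_node *prev, *next; };

struct toy_lru {
    struct toy_node head;
    long nr_items;
};

enum toy_lru_status { TOY_LRU_SKIP, TOY_LRU_REMOVED };

typedef enum toy_lru_status (*toy_isolate_fn)(struct toy_lru *lru,
                                              struct toy_node *item,
                                              void *arg);

void toy_node_init(struct toy_node *n) { n->prev = n->next = n; }

static void toy_unlink(struct toy_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static void toy_link_tail(struct toy_node *n, struct toy_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

void toy_lru_init(struct toy_lru *lru)
{
    toy_node_init(&lru->head);
    lru->nr_items = 0;
}

void toy_lru_add(struct toy_lru *lru, struct toy_node *item)
{
    toy_link_tail(item, &lru->head);
    lru->nr_items++;
}

/* Remove item from the lru and fix the counter in the same step. */
void toy_lru_isolate(struct toy_lru *lru, struct toy_node *item)
{
    assert(lru->nr_items > 0);
    toy_unlink(item);
    toy_node_init(item);
    lru->nr_items--;
}

/* Same, but park the item on another list (e.g. a dispose list). */
void toy_lru_isolate_move(struct toy_lru *lru, struct toy_node *item,
                          struct toy_node *dst)
{
    assert(lru->nr_items > 0);
    toy_unlink(item);
    toy_link_tail(item, dst);
    lru->nr_items--;
}

/* The walker only counts removals; it never touches nr_items. */
long toy_lru_walk(struct toy_lru *lru, toy_isolate_fn isolate, void *arg)
{
    long removed = 0;
    struct toy_node *it = lru->head.next;

    while (it != &lru->head) {
        struct toy_node *next = it->next;  /* callback may unlink it */

        if (isolate(lru, it, arg) == TOY_LRU_REMOVED)
            removed++;
        it = next;
    }
    return removed;
}

/* Example callback: move every item onto the dispose list in arg. */
enum toy_lru_status toy_dispose_all(struct toy_lru *lru,
                                    struct toy_node *item, void *arg)
{
    toy_lru_isolate_move(lru, item, (struct toy_node *)arg);
    return TOY_LRU_REMOVED;
}
```

Because the counter is updated at the moment of removal, it matches the list contents at every point where a lock would be held in a real implementation. The commit message names this as the property that per-memcg reparenting depends on.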
     
  • Kmem accounting of memcg is unusable now, because it lacks slab shrinker
    support. That means when we hit the limit we will get ENOMEM w/o any
    chance to recover. What we should do then is to call shrink_slab, which
    would reclaim old inode/dentry caches from this cgroup. This is what
    this patch set is intended to do.

    Basically, it does two things. First, it introduces the notion of
    per-memcg slab shrinker. A shrinker that wants to reclaim objects per
    cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be
    passed the memory cgroup to scan from in shrink_control->memcg. For
    such shrinkers shrink_slab iterates over the whole cgroup subtree under
    the target cgroup and calls the shrinker for each kmem-active memory
    cgroup.

    Secondly, this patch set makes the list_lru structure per-memcg. It's
    done transparently to list_lru users - all they have to do is
    tell list_lru_init that they want memcg-aware list_lru. Then the
    list_lru will automatically distribute objects among per-memcg lists
    based on which cgroup the object is accounted to. This way to make FS
    shrinkers (icache, dcache) memcg-aware we only need to make them use
    memcg-aware list_lru, and this is what this patch set does.

    As before, this patch set only enables per-memcg kmem reclaim when the
    pressure goes from memory.limit, not from memory.kmem.limit. Handling
    memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
    it is still unclear whether we will have this knob in the unified
    hierarchy.

    This patch (of 9):

    NUMA aware slab shrinkers use the list_lru structure to distribute
    objects coming from different NUMA nodes to different lists. Whenever
    such a shrinker needs to count or scan objects from a particular node,
    it issues commands like this:

    count = list_lru_count_node(lru, sc->nid);
    freed = list_lru_walk_node(lru, sc->nid, isolate_func,
    isolate_arg, &sc->nr_to_scan);

    where sc is an instance of the shrink_control structure passed to it
    from vmscan.

    To simplify this, let's add special list_lru functions to be used by
    shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
    consolidate the nid and nr_to_scan arguments in the shrink_control
    structure.

    This will also allow us to avoid patching shrinkers that use list_lru
    when we make shrink_slab() per-memcg - all we will have to do is extend
    the shrink_control structure to include the target memcg and make
    list_lru_shrink_{count,walk} handle this appropriately.

    Signed-off-by: Vladimir Davydov
    Suggested-by: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
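
The wrapper idea in "This patch (of 9)" can be sketched with a toy per-node counter. All `toy_` names are hypothetical, and the "walk" here simply drops items rather than calling an isolate callback. What the sketch illustrates is the consolidation: the shrinker passes one control structure and the wrappers pull the node id and scan budget out of it.

```c
#include <assert.h>

#define TOY_NODES 2

/* Toy lru: only tracks how many items sit on each node's list. */
struct toy_lru { unsigned long nr_items[TOY_NODES]; };

/* Toy counterpart of the control structure handed to a shrinker. */
struct toy_shrink_control {
    int nid;                    /* node to operate on */
    unsigned long nr_to_scan;   /* remaining scan budget */
};

unsigned long toy_lru_count_node(const struct toy_lru *lru, int nid)
{
    assert(nid >= 0 && nid < TOY_NODES);
    return lru->nr_items[nid];
}

/* Drops up to *nr_to_walk items from one node, consuming the budget. */
unsigned long toy_lru_walk_node(struct toy_lru *lru, int nid,
                                unsigned long *nr_to_walk)
{
    unsigned long freed = 0;

    assert(nid >= 0 && nid < TOY_NODES);
    while (*nr_to_walk > 0 && lru->nr_items[nid] > 0) {
        lru->nr_items[nid]--;
        (*nr_to_walk)--;
        freed++;
    }
    return freed;
}

/* Wrappers: callers no longer unpack nid and nr_to_scan themselves. */
unsigned long toy_lru_shrink_count(const struct toy_lru *lru,
                                   const struct toy_shrink_control *sc)
{
    return toy_lru_count_node(lru, sc->nid);
}

unsigned long toy_lru_shrink_walk(struct toy_lru *lru,
                                  struct toy_shrink_control *sc)
{
    return toy_lru_walk_node(lru, sc->nid, &sc->nr_to_scan);
}
```

If the control structure later grows another field (the message mentions a target memcg), only the wrappers need to learn about it. The shrinkers that call them stay unchanged.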
     

26 Jan, 2015

2 commits


09 Dec, 2014

1 commit


20 Nov, 2014

4 commits


04 Nov, 2014

2 commits


24 Oct, 2014

1 commit

  • d_splice_alias() callers expect it to either stash the inode reference
    into a new alias, or drop the inode reference. That makes it possible
    to just return d_splice_alias() result from ->lookup() instance, without
    any extra housekeeping required.

    Unfortunately, that should include the failure exits. If d_splice_alias()
    returns an error, it leaves the dentry it has been given negative and
    thus it *must* drop the inode reference. Easily fixed, but it goes way
    back and will need backporting.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

13 Oct, 2014

1 commit


09 Oct, 2014

9 commits

  • Fixed coding style in dcache.c

    Signed-off-by: Daeseok Youn
    Signed-off-by: Al Viro

    Daeseok Youn
     
  • the only in-tree instance checks d_unhashed() anyway,
    out-of-tree code can preserve the current behaviour by
    adding such check if they want it and we get an ability
    to use it in cases where we *want* to be notified of
    killing being inevitable before ->d_lock is dropped,
    whether it's unhashed or not. In particular, autofs
    would benefit from that.

    Signed-off-by: Al Viro

    Al Viro
     
  • The only reason for games with ->d_prune() was __d_drop(), which
    was needed only to force dput() into killing the sucker off.

    Note that lock_parent() can be called under ->i_lock and won't
    drop it, so dentry is safe from somebody managing to kill it
    under us - it won't happen while we are holding ->i_lock.

    __dentry_kill() is called only with ->d_lockref.count being 0
    (here and when picked from shrink list) or 1 (dput() and dropping
    the ancestors in shrink_dentry_list()), so it will never be called
    twice - the first thing it's doing is making ->d_lockref.count
    negative and once that happens, nothing will increment it.

    Signed-off-by: Al Viro

    Al Viro
     
  • Now that d_invalidate can no longer fail, stop returning a useless
    return code. For the few callers that checked the return code,
    remove the handling of d_invalidate failure.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • Now that d_invalidate is the only caller of check_submounts_and_drop,
    expand check_submounts_and_drop inline in d_invalidate.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • With the introduction of mount namespaces and bind mounts it became
    possible to access files and directories that on some paths are mount
    points but are not mount points on other paths. It is very confusing
    when rm -rf somedir returns -EBUSY simply because somedir is mounted
    somewhere else. With the addition of user namespaces allowing
    unprivileged mounts this condition has gone from annoying to allowing
    a DoS attack on other users in the system.

    The possibility for mischief is removed by updating the vfs to support
    rename, unlink and rmdir on a dentry that is a mountpoint and by
    lazily unmounting mountpoints on deleted dentries.

    In particular this change allows rename, unlink and rmdir system calls
    on a dentry without a mountpoint in the current mount namespace to
    succeed, and it allows rename, unlink, and rmdir performed on a
    distributed filesystem to update the vfs cache even when there is a
    mount in some namespace on the original dentry.

    There are two common patterns of maintaining mounts: Mounts on trusted
    paths with the parent directory of the mount point and all ancestor
    directories up to / owned by root and modifiable only by root
    (i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
    cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories
    maintained by fusermount.

    In the case of mounts in trusted directories owned by root and
    modifiable only by root the current parent directory permissions are
    sufficient to ensure a mount point on a trusted path is not removed
    or renamed by anyone other than root, even if there is a context
    where there are no mount points to prevent this.

    In the case of mounts in directories owned by less privileged users
    races with users modifying the path of a mount point are already a
    danger. fusermount already uses a combination of chdir,
    /proc//fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The
    removal of global rename, unlink, and rmdir protection adds
    nothing new to consider, only a widening of the attack window, and
    fusermount is already safe against unprivileged users modifying the
    directory simultaneously.

    In principle, for perfect userspace programs, returning -EBUSY for
    unlink, rmdir, and rename of dentries that have mounts in the local
    namespace is actually unnecessary. Unfortunately not all userspace
    programs are perfect so retaining -EBUSY for unlink, rmdir and rename
    of dentries that have mounts in the current mount namespace plays an
    important role of maintaining consistency with historical behavior and
    making imperfect userspace applications hard to exploit.
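
    The policy described above can be sketched as a small userspace
    model. Everything here (ns_dentry, ns_unlink, the integer namespace
    ids) is hypothetical and only illustrates the decision, not the
    actual vfs code: -EBUSY is kept when the caller's own mount
    namespace has a mount on the dentry; otherwise the operation
    succeeds and mounts held by other namespaces are detached.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NS_MAX_MOUNTS 8

/* Hypothetical stand-in for a dentry: records which mount namespaces
 * currently have something mounted on it. */
struct ns_dentry {
    int mounted_in_ns[NS_MAX_MOUNTS]; /* ids of namespaces with a mount here */
    size_t nr_mounts;
    bool deleted;
};

/* True if namespace ns_id has a mount on this dentry. */
static bool ns_is_local_mountpoint(const struct ns_dentry *d, int ns_id)
{
    for (size_t i = 0; i < d->nr_mounts; i++)
        if (d->mounted_in_ns[i] == ns_id)
            return true;
    return false;
}

/*
 * Model of unlink/rmdir: refuse with -EBUSY only when the caller's own
 * namespace has a mount on the dentry. Otherwise delete it and drop the
 * mounts other namespaces had there (standing in for the lazy unmount).
 */
static int ns_unlink(struct ns_dentry *d, int caller_ns)
{
    if (ns_is_local_mountpoint(d, caller_ns))
        return -EBUSY;
    d->nr_mounts = 0;
    d->deleted = true;
    return 0;
}
```

    In this model a caller in namespace 2 gets -EBUSY for a dentry that
    is mounted on in namespace 2, while a caller in namespace 1 can
    remove the same dentry.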

    v2: Remove spurious old_dentry.
    v3: Optimized shrink_submounts_and_drop
    Removed unused afs label
    v4: Simplified the changes to check_submounts_and_drop
    Do not rename check_submounts_and_drop shrink_submounts_and_drop
    Document why we need atomicity in check_submounts_and_drop
    Rely on the parent inode mutex to make d_revalidate and d_invalidate
    an atomic unit.
    v5: Refcount the mountpoint to detach in case of simultaneous
    renames.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • The current comments in d_invalidate about what and why it is doing
    what it is doing are wildly off-base. Which is not surprising as
    the comments date back to a last-minute bug fix of the 2.2 kernel.

    The big fat lie of a comment said: If it's a directory, we can't drop
    it for fear of somebody re-populating it with children (even though
    dropping it would make it unreachable from that root, we still might
    repopulate it if it was a working directory or similar).

    [AV] What we really need to avoid is multiple dentry aliases of the
    same directory inode; on all filesystems that have ->d_revalidate()
    we either declare all positive dentries always valid (and thus never
    fed to d_invalidate()) or use d_materialise_unique() and/or d_splice_alias(),
    which take care of alias prevention.

    The current rules are:
    - To prevent mount point leaks dentries that are mount points or that
    have children that are mount points may not be unhashed.
    - All other dentries may be unhashed.
    - Directories may be rehashed with d_materialise_unique

    check_submounts_and_drop implements this already for well maintained
    remote filesystems so implement the current rules in d_invalidate
    by just calling check_submounts_and_drop.

    The one difference between d_invalidate and check_submounts_and_drop
    is that d_invalidate must respect it when a d_revalidate method has
    earlier called d_drop, so preserve the d_unhashed check in
    d_invalidate.
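
    The resulting shape of d_invalidate, under the rules as they stood
    in this commit, can be sketched in userspace as follows. The types
    and function names (model_dentry, model_d_invalidate, ...) are
    hypothetical stand-ins, locking is omitted, and treating an already
    unhashed dentry as success is an assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dentry. */
struct model_dentry {
    bool hashed;        /* still reachable through the dentry hash */
    bool has_submounts; /* is a mount point or has mount-point children */
};

/* Model of check_submounts_and_drop: refuse to unhash anything that
 * would leak a mount point, otherwise unhash the dentry. */
static int model_check_submounts_and_drop(struct model_dentry *d)
{
    if (d->has_submounts)
        return -EBUSY;
    d->hashed = false;
    return 0;
}

/* Model of d_invalidate: respect an earlier d_drop (for example one done
 * by a d_revalidate method), otherwise defer to the helper above. */
static int model_d_invalidate(struct model_dentry *d)
{
    if (!d->hashed)
        return 0; /* assumption: an already dropped dentry counts as success */
    return model_check_submounts_and_drop(d);
}
```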

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • d_drop or check_submounts_and_drop called from d_revalidate can result
    in renamed directories with child dentries being unhashed. These
    renamed and dropped directory dentries can be rehashed after
    d_materialise_unique uses d_find_alias to find them.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • * external dentry names get a small structure prepended to them
    (struct external_name).
    * it contains an atomic refcount, matching the number of struct dentry
    instances that have ->d_name.name pointing to that external name. The
    first thing free_dentry() does is decrementing refcount of external name,
    so the instances that are between the call of free_dentry() and
    RCU-delayed actual freeing do not contribute.
    * __d_move(x, y, false) makes the name of x equal to the name of y,
    external or not. If y has an external name, extra reference is grabbed
    and put into x->d_name.name. If x used to have an external name, the
    reference to the old name is dropped and, should it reach zero, freeing
    is scheduled via kfree_rcu().
    * free_dentry() on a dentry with an external name decrements the refcount
    of that name and, should it reach zero, does an RCU-delayed call that will
    free both the dentry and external name. Otherwise it does what it
    used to do, except that __d_free() doesn't even look at ->d_name.name;
    it simply frees the dentry.

    All non-RCU accesses to dentry external name are safe wrt freeing since they
    all should happen before free_dentry() is called. RCU accesses might run
    into a dentry seen by free_dentry() or into an old name that got already
    dropped by __d_move(); however, in both cases dentry must have been
    alive and refer to that name at some point after we'd done rcu_read_lock(),
    which means that any freeing must be still pending.
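
    The refcounting scheme can be illustrated with a userspace sketch
    using C11 atomics. The names (ext_name, toy_dentry, toy_copy_name)
    are hypothetical, inline names are not modelled, and free() is
    called immediately where the kernel would defer freeing through RCU.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for struct external_name. */
struct ext_name {
    atomic_int count; /* number of dentries whose name points here */
    char name[];      /* NUL-terminated name stored after the header */
};

/* Hypothetical dentry that only ever uses an external name. */
struct toy_dentry {
    struct ext_name *ext;
};

/* Allocate header and name together; the creator holds one reference. */
static struct ext_name *ext_name_alloc(const char *s)
{
    size_t len = strlen(s) + 1;
    struct ext_name *p = malloc(sizeof(*p) + len);

    if (!p)
        return NULL;
    atomic_init(&p->count, 1);
    memcpy(p->name, s, len);
    return p;
}

/* Drop one reference. Returns 1 if it was the last one and the name was
 * freed (the kernel would schedule this via kfree_rcu()), 0 otherwise. */
static int ext_name_put(struct ext_name *p)
{
    if (atomic_fetch_sub(&p->count, 1) == 1) {
        free(p);
        return 1;
    }
    return 0;
}

/* Model of the name handling in __d_move(x, y, false): x ends up sharing
 * y's external name, and x's old name loses one reference. */
static void toy_copy_name(struct toy_dentry *x, struct toy_dentry *y)
{
    struct ext_name *old = x->ext;

    if (y->ext)
        atomic_fetch_add(&y->ext->count, 1); /* extra reference for x */
    x->ext = y->ext;
    if (old)
        ext_name_put(old);
}
```

    Taking the new reference before dropping the old one keeps the
    sketch safe even when x and y already share the same name.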

    Signed-off-by: Al Viro

    Al Viro
     

30 Sep, 2014

1 commit

  • AFAICS, prepend_name() is broken on SMP alpha. Disclaimer: I don't have
    SMP alpha boxen to reproduce it on. However, it really looks like the race
    is real.

    CPU1: d_path() on /mnt/ramfs//foo
    CPU2: mv /mnt/ramfs/ /mnt/ramfs/

    CPU2 does d_alloc(), which allocates an external name, stores the name there
    including terminating NUL, does smp_wmb() and stores its address in
    dentry->d_name.name. It proceeds to d_add(dentry, NULL) and d_move()
    old dentry over to that. ->d_name.name value ends up in that dentry.

    In the meanwhile, CPU1 gets to prepend_name() for that dentry. It fetches
    ->d_name.name and ->d_name.len; the former ends up pointing to new name
    (64-byte kmalloc'ed array), the latter - 255 (length of the old name).
    Nothing to force the ordering there, and normally that would be OK, since we'd
    run into the terminating NUL and stop. Except that it's alpha, and we'd need
    a data dependency barrier to guarantee that we see that store of NUL
    __d_alloc() has done. In a similar situation dentry_cmp() would survive; it
    does explicit smp_read_barrier_depends() after fetching ->d_name.name.
    prepend_name() doesn't, and it risks walking past the end of the kmalloc'ed
    object and possibly oopsing due to taking a page fault in kernel mode.
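
    A userspace analogue of the hazard and of a defensive reader is
    sketched below with C11 atomics. It is not the actual patch: the
    names (toy_qstr, toy_publish_name, toy_read_name) are hypothetical,
    and an acquire load is used as a portable, stronger substitute for
    the data dependency barrier discussed above. The reader tolerates a
    stale length by stopping at the terminating NUL, which is only safe
    because the ordered load guarantees the NUL is visible.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the name pointer and length in a dentry. */
struct toy_qstr {
    _Atomic(const char *) name;
    atomic_uint len;
};

/* Writer side: the bytes of new_name, including the terminating NUL, must
 * be written before this call. The release store publishes them together
 * with the pointer. The length is deliberately left untouched to model a
 * reader seeing a new pointer with an old length. */
static void toy_publish_name(struct toy_qstr *q, const char *new_name)
{
    atomic_store_explicit(&q->name, new_name, memory_order_release);
}

/* Reader side: copy at most len bytes into out, stopping early at the NUL.
 * Returns the number of name bytes copied; out is always NUL-terminated
 * when outsz is non-zero. */
static size_t toy_read_name(struct toy_qstr *q, char *out, size_t outsz)
{
    const char *s = atomic_load_explicit(&q->name, memory_order_acquire);
    unsigned int len = atomic_load_explicit(&q->len, memory_order_relaxed);
    size_t n = 0;

    if (outsz == 0)
        return 0;
    while (n < len && n + 1 < outsz && s[n] != '\0') {
        out[n] = s[n];
        n++;
    }
    out[n] = '\0';
    return n;
}
```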

    Cc: stable@vger.kernel.org # 3.12+
    Signed-off-by: Al Viro

    Al Viro