28 May, 2016

1 commit

  • smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
    we'd hashed the new dentry and attached it to inode, we need ->setxattr()
    instances getting the inode as an explicit argument rather than obtaining
    it from dentry.

    Similar change for ->getxattr() had been done in commit ce23e64. Unlike
    ->getxattr() (which is used by both selinux and smack instances of
    ->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
    it got missed back then.

    Reported-by: Seung-Woo Kim
    Tested-by: Casey Schaufler
    Signed-off-by: Al Viro

    Al Viro
     

21 May, 2016

1 commit

  • Pull driver core updates from Greg KH:
    "Here's the "big" driver core update for 4.7-rc1.

    Mostly just debugfs changes, the long-known and messy races with
    removing debugfs files should be fixed thanks to the great work of
    Nicolai Stange. We also have some isa updates in here (the x86
    maintainers told me to take it through this tree), a new warning when
    we run out of dynamic char major numbers, and a few other assorted
    changes, details in the shortlog.

    All have been in linux-next for some time with no reported issues"

    * tag 'driver-core-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits)
    Revert "base: dd: don't remove driver_data in -EPROBE_DEFER case"
    gpio: ws16c48: Utilize the ISA bus driver
    gpio: 104-idio-16: Utilize the ISA bus driver
    gpio: 104-idi-48: Utilize the ISA bus driver
    gpio: 104-dio-48e: Utilize the ISA bus driver
    watchdog: ebc-c384_wdt: Utilize the ISA bus driver
    iio: stx104: Utilize the module_isa_driver and max_num_isa_dev macros
    iio: stx104: Add X86 dependency to STX104 Kconfig option
    Documentation: Add ISA bus driver documentation
    isa: Implement the max_num_isa_dev macro
    isa: Implement the module_isa_driver macro
    pnp: pnpbios: Add explicit X86_32 dependency to PNPBIOS
    isa: Decouple X86_32 dependency from the ISA Kconfig option
    driver-core: use 'dev' argument in dev_dbg_ratelimited stub
    base: dd: don't remove driver_data in -EPROBE_DEFER case
    kernfs: Move faulting copy_user operations outside of the mutex
    devcoredump: add scatterlist support
    debugfs: unproxify files created through debugfs_create_u32_array()
    debugfs: unproxify files created through debugfs_create_blob()
    debugfs: unproxify files created through debugfs_create_bool()
    ...

    Linus Torvalds
     

18 May, 2016

1 commit

  • Pull parallel filesystem directory handling update from Al Viro.

    This is the main parallel directory work by Al that makes the vfs layer
    able to do lookup and readdir in parallel within a single directory.
    That's a big change, since this used to be all protected by the
    directory inode mutex.

    The inode mutex is replaced by an rwsem, and serialization of lookups of
    a single name is done by an "in-progress" dentry marker.

    The series begins with xattr cleanups, and then ends with switching
    filesystems over to actually doing the readdir in parallel (switching to
    the "iterate_shared()" that only takes the read lock).

    A more detailed explanation of the process from Al Viro:
    "The xattr work starts with some acl fixes, then switches ->getxattr to
    passing inode and dentry separately. This is the point where the
    things start to get tricky - that got merged into the very beginning
    of the -rc3-based #work.lookups, to allow untangling the
    security_d_instantiate() mess. The xattr work itself proceeds to
    switch a lot of filesystems to generic_...xattr(); no complications
    there.

    After that initial xattr work, the series then does the following:

    - untangle security_d_instantiate()

    - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
    that thing; one such place (in overlayfs) actually yields a trivial
    conflict with overlayfs fixes later in the cycle - overlayfs ended
    up switching to a variant of lookup_one_len_unlocked() sans the
    permission checks. I would've dropped that commit (it gets
    overridden on merge from #ovl-fixes in #for-next; proper resolution
    is to use the variant in mainline fs/overlayfs/super.c), but I
    didn't want to rebase the damn thing - it was fairly late in the
    cycle...

    - some filesystems had managed to depend on lookup/lookup exclusion
    for *fs-internal* data structures in a way that would break if we
    relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.

    - core of that series - parallel lookup machinery, replacing
    ->i_mutex with rwsem, making lookup_slow() take it only shared. At
    that point lookups happen in parallel; lookups on the same name
    wait for the in-progress one to be done with that dentry.

    Surprisingly little code, at that - almost all of it is in
    fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
    making it use the new primitive and actually switching to locking
    shared.

    - parallel readdir stuff - first of all, we provide the exclusion on
    per-struct file basis, same as we do for read() vs lseek() for
    regular files. That takes care of most of the needed exclusion in
    readdir/readdir; however, these guys are trickier than lookups, so
    I went for switching them one-by-one. To do that, a new method
    '->iterate_shared()' is added and filesystems are switched to it
    as they are either confirmed to be OK with shared lock on directory
    or fixed to be OK with that. I hope to kill the original method
    come next cycle (almost all in-tree filesystems are switched
    already), but it's still not quite finished.

    - several filesystems get switched to parallel readdir. The
    interesting part here is dealing with dcache preseeding by readdir;
    that needs minor adjustment to be safe with directory locked only
    shared.

    Most of the filesystems doing that got switched to in those
    commits. Important exception: NFS. Turns out that NFS folks, with
    their, er, insistence on VFS getting the fuck out of the way of the
    Smart Filesystem Code That Knows How And What To Lock(tm) have
    grown the locking of their own. They had their own homegrown
    rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
    is the reader there). Of course, with VFS getting the fuck out of
    the way, as requested, the actual smarts of the smart filesystem
    code etc. had become exposed...

    - do_last/lookup_open/atomic_open cleanups. As the result, open()
    without O_CREAT locks the directory only shared. Including the
    ->atomic_open() case. Backmerge from #for-linus in the middle of
    that - atomic_open() fix got brought in.

    - then comes NFS switch to saner (VFS-based ;-) locking, killing the
    homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
    exclusion for sillyunlink/lookup is done by the parallel lookups
    mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
    now - rmdir being the writer.

    Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
    now.

    - the rest of the series consists of switching a lot of filesystems
    to parallel readdir; in a lot of cases ->llseek() gets simplified
    as well. One backmerge in there (again, #for-linus - rockridge
    fix)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
    ext4: switch to ->iterate_shared()
    hfs: switch to ->iterate_shared()
    hfsplus: switch to ->iterate_shared()
    hostfs: switch to ->iterate_shared()
    hpfs: switch to ->iterate_shared()
    hpfs: handle allocation failures in hpfs_add_pos()
    gfs2: switch to ->iterate_shared()
    f2fs: switch to ->iterate_shared()
    afs: switch to ->iterate_shared()
    befs: switch to ->iterate_shared()
    befs: constify stuff a bit
    isofs: switch to ->iterate_shared()
    get_acorn_filename(): deobfuscate a bit
    btrfs: switch to ->iterate_shared()
    logfs: no need to lock directory in lseek
    switch ecryptfs to ->iterate_shared
    9p: switch to ->iterate_shared()
    fat: switch to ->iterate_shared()
    romfs, squashfs: switch to ->iterate_shared()
    more trivial ->iterate_shared conversions
    ...

    Linus Torvalds
     

12 May, 2016

1 commit


10 May, 2016

1 commit

  • Patch summary:

    When showing a cgroupfs entry in mountinfo, show the path of the mount
    root dentry relative to the reader's cgroup namespace root.

    Short explanation (courtesy of mkerrisk):

    If we create a new cgroup namespace, then we want both /proc/self/cgroup
    and /proc/self/mountinfo to show cgroup paths that are correctly
    virtualized with respect to the cgroup mount point. Previous to this
    patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
    does not.

    Long version:

    When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
    namespace, and then mounts a new instance of the freezer cgroup, the new
    mount will be rooted at /a/b. The root dentry field of the mountinfo
    entry will show '/a/b'.

    cat > /tmp/do1 << EOF
    mount -t cgroup -o freezer freezer /mnt
    grep freezer /proc/self/mountinfo
    EOF

    unshare -Gm bash /tmp/do1
    > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
    > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

    The task's freezer cgroup entry in /proc/self/cgroup will simply show
    '/':

    grep freezer /proc/self/cgroup
    9:freezer:/

    If instead the same task simply bind mounts the /a/b cgroup directory,
    the resulting mountinfo entry will again show /a/b for the dentry root.
    However in this case the task will find its own cgroup at /mnt/a/b,
    not at /mnt:

    mount --bind /sys/fs/cgroup/freezer/a/b /mnt
    130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

    In other words, there is no way for the task to know, based on what is
    in mountinfo, which cgroup directory is its own.

    Example (by mkerrisk):

    First, a little script to save some typing and verbiage:

    echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
    cat /proc/self/mountinfo | grep freezer |
    awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

    Create cgroup, place this shell into the cgroup, and look at the state
    of the /proc files:

    2653
    2653 # Our shell
    14254 # cat(1)
    /proc/self/cgroup: 10:freezer:/a/b
    mountinfo: / /sys/fs/cgroup/freezer

    Create a shell in new cgroup and mount namespaces. The act of creating
    a new cgroup namespace causes the process's current cgroups directories
    to become its cgroup root directories. (Here, I'm using my own version
    of the "unshare" utility, which takes the same options as the util-linux
    version):

    Look at the state of the /proc files:

    /proc/self/cgroup: 10:freezer:/
    mountinfo: / /sys/fs/cgroup/freezer

    The third entry in /proc/self/cgroup (the pathname of the cgroup inside
    the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
    is rooted at /a/b in the outer namespace.

    However, the info in /proc/self/mountinfo is not for this cgroup
    namespace, since we are seeing a duplicate of the mount from the
    old mount namespace, and the info there does not correspond to the
    new cgroup namespace. However, trying to create a new mount still
    doesn't show us the right information in mountinfo:

    # propagating to other mountns
    /proc/self/cgroup: 7:freezer:/
    mountinfo: /a/b /mnt/freezer

    The act of creating a new cgroup namespace caused the process's
    current freezer directory, "/a/b", to become its cgroup freezer root
    directory. In other words, the pathname of the directory
    within the newly mounted cgroup filesystem should be "/",
    but mountinfo wrongly shows us "/a/b". The consequence of this is
    that the process in the cgroup namespace cannot correctly construct
    the pathname of its cgroup root directory from the information in
    /proc/PID/mountinfo.

    With this patch, the dentry root field in mountinfo is shown relative
    to the reader's cgroup namespace. So the same steps as above:

    /proc/self/cgroup: 10:freezer:/a/b
    mountinfo: / /sys/fs/cgroup/freezer
    /proc/self/cgroup: 10:freezer:/
    mountinfo: /../.. /sys/fs/cgroup/freezer
    /proc/self/cgroup: 10:freezer:/
    mountinfo: / /mnt/freezer

    cgroup.clone_children freezer.parent_freezing freezer.state tasks
    cgroup.procs freezer.self_freezing notify_on_release
    3164
    2653 # First shell that placed in this cgroup
    3164 # Shell started by 'unshare'
    14197 # cat(1)

    Signed-off-by: Serge Hallyn
    Tested-by: Michael Kerrisk
    Acked-by: Michael Kerrisk
    Signed-off-by: Tejun Heo

    Serge E. Hallyn
     

09 May, 2016

1 commit


03 May, 2016

3 commits


01 May, 2016

1 commit

  • A fault in a user provided buffer may lead anywhere, and lockdep warns
    that we have a potential deadlock between the mm->mmap_sem and the
    kernfs file mutex:

    [ 82.811702] ======================================================
    [ 82.811705] [ INFO: possible circular locking dependency detected ]
    [ 82.811709] 4.5.0-rc4-gfxbench+ #1 Not tainted
    [ 82.811711] -------------------------------------------------------
    [ 82.811714] kms_setmode/5859 is trying to acquire lock:
    [ 82.811717] (&dev->struct_mutex){+.+.+.}, at: [] drm_gem_mmap+0x1a1/0x270
    [ 82.811731]
    but task is already holding lock:
    [ 82.811734] (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x44/0xa0
    [ 82.811745]
    which lock already depends on the new lock.

    [ 82.811749]
    the existing dependency chain (in reverse order) is:
    [ 82.811752]
    -> #3 (&mm->mmap_sem){++++++}:
    [ 82.811761] [] lock_acquire+0xc3/0x1d0
    [ 82.811766] [] __might_fault+0x75/0xa0
    [ 82.811771] [] kernfs_fop_write+0x8a/0x180
    [ 82.811787] [] __vfs_write+0x23/0xe0
    [ 82.811792] [] vfs_write+0xa4/0x190
    [ 82.811797] [] SyS_write+0x44/0xb0
    [ 82.811801] [] entry_SYSCALL_64_fastpath+0x16/0x73
    [ 82.811807]
    -> #2 (s_active#6){++++.+}:
    [ 82.811814] [] lock_acquire+0xc3/0x1d0
    [ 82.811819] [] __kernfs_remove+0x210/0x2f0
    [ 82.811823] [] kernfs_remove_by_name_ns+0x40/0xa0
    [ 82.811828] [] sysfs_remove_file_ns+0x10/0x20
    [ 82.811832] [] device_del+0x124/0x250
    [ 82.811837] [] device_unregister+0x19/0x60
    [ 82.811841] [] cpu_cache_sysfs_exit+0x51/0xb0
    [ 82.811846] [] cacheinfo_cpu_callback+0x38/0x70
    [ 82.811851] [] notifier_call_chain+0x39/0xa0
    [ 82.811856] [] __raw_notifier_call_chain+0x9/0x10
    [ 82.811860] [] cpu_notify+0x1e/0x40
    [ 82.811865] [] cpu_notify_nofail+0x9/0x20
    [ 82.811869] [] _cpu_down+0x233/0x340
    [ 82.811874] [] disable_nonboot_cpus+0xc9/0x350
    [ 82.811878] [] suspend_devices_and_enter+0x5a1/0xb50
    [ 82.811883] [] pm_suspend+0x543/0x8d0
    [ 82.811888] [] state_store+0x77/0xe0
    [ 82.811892] [] kobj_attr_store+0xf/0x20
    [ 82.811897] [] sysfs_kf_write+0x40/0x50
    [ 82.811902] [] kernfs_fop_write+0x13c/0x180
    [ 82.811906] [] __vfs_write+0x23/0xe0
    [ 82.811910] [] vfs_write+0xa4/0x190
    [ 82.811914] [] SyS_write+0x44/0xb0
    [ 82.811918] [] entry_SYSCALL_64_fastpath+0x16/0x73
    [ 82.811923]
    -> #1 (cpu_hotplug.lock){+.+.+.}:
    [ 82.811929] [] lock_acquire+0xc3/0x1d0
    [ 82.811933] [] mutex_lock_nested+0x62/0x3b0
    [ 82.811940] [] get_online_cpus+0x61/0x80
    [ 82.811944] [] stop_machine+0x1b/0xe0
    [ 82.811949] [] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
    [ 82.812009] [] ggtt_bind_vma+0x46/0x70 [i915]
    [ 82.812045] [] i915_vma_bind+0x140/0x290 [i915]
    [ 82.812081] [] i915_gem_object_do_pin+0x899/0xb00 [i915]
    [ 82.812117] [] i915_gem_object_pin+0x35/0x40 [i915]
    [ 82.812154] [] intel_init_pipe_control+0xbe/0x210 [i915]
    [ 82.812192] [] intel_logical_rings_init+0xe2/0xde0 [i915]
    [ 82.812232] [] i915_gem_init+0xf3/0x130 [i915]
    [ 82.812278] [] i915_driver_load+0xf2d/0x1770 [i915]
    [ 82.812318] [] drm_dev_register+0xa4/0xb0
    [ 82.812323] [] drm_get_pci_dev+0xce/0x1e0
    [ 82.812328] [] i915_pci_probe+0x2f/0x50 [i915]
    [ 82.812360] [] pci_device_probe+0x87/0xf0
    [ 82.812366] [] driver_probe_device+0x229/0x450
    [ 82.812371] [] __driver_attach+0x83/0x90
    [ 82.812375] [] bus_for_each_dev+0x61/0xa0
    [ 82.812380] [] driver_attach+0x19/0x20
    [ 82.812384] [] bus_add_driver+0x1ef/0x290
    [ 82.812388] [] driver_register+0x5b/0xe0
    [ 82.812393] [] __pci_register_driver+0x5b/0x60
    [ 82.812398] [] drm_pci_init+0xd6/0x100
    [ 82.812402] [] 0xffffffffa027c094
    [ 82.812406] [] do_one_initcall+0xae/0x1d0
    [ 82.812412] [] do_init_module+0x5b/0x1cb
    [ 82.812417] [] load_module+0x1c20/0x2480
    [ 82.812422] [] SyS_finit_module+0x7e/0xa0
    [ 82.812428] [] entry_SYSCALL_64_fastpath+0x16/0x73
    [ 82.812433]
    -> #0 (&dev->struct_mutex){+.+.+.}:
    [ 82.812439] [] __lock_acquire+0x1fc9/0x20f0
    [ 82.812443] [] lock_acquire+0xc3/0x1d0
    [ 82.812456] [] drm_gem_mmap+0x1c7/0x270
    [ 82.812460] [] mmap_region+0x334/0x580
    [ 82.812466] [] do_mmap+0x364/0x410
    [ 82.812470] [] vm_mmap_pgoff+0x6d/0xa0
    [ 82.812474] [] SyS_mmap_pgoff+0x184/0x220
    [ 82.812479] [] SyS_mmap+0x1d/0x20
    [ 82.812484] [] entry_SYSCALL_64_fastpath+0x16/0x73
    [ 82.812489]
    other info that might help us debug this:

    [ 82.812493] Chain exists of:
    &dev->struct_mutex --> s_active#6 --> &mm->mmap_sem

    [ 82.812502] Possible unsafe locking scenario:

    [ 82.812506] CPU0 CPU1
    [ 82.812508] ---- ----
    [ 82.812510] lock(&mm->mmap_sem);
    [ 82.812514] lock(s_active#6);
    [ 82.812519] lock(&mm->mmap_sem);
    [ 82.812522] lock(&dev->struct_mutex);
    [ 82.812526]
    *** DEADLOCK ***

    [ 82.812531] 1 lock held by kms_setmode/5859:
    [ 82.812533] #0: (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x44/0xa0
    [ 82.812541]
    stack backtrace:
    [ 82.812547] CPU: 0 PID: 5859 Comm: kms_setmode Not tainted 4.5.0-rc4-gfxbench+ #1
    [ 82.812550] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
    [ 82.812553] 0000000000000000 ffff880079407bf0 ffffffff813f8505 ffffffff825fb270
    [ 82.812560] ffffffff825c4190 ffff880079407c30 ffffffff810c84ac ffff880079407c90
    [ 82.812566] ffff8800797ed328 ffff8800797ecb00 0000000000000001 ffff8800797ed350
    [ 82.812573] Call Trace:
    [ 82.812578] [] dump_stack+0x67/0x92
    [ 82.812582] [] print_circular_bug+0x1fc/0x310
    [ 82.812586] [] __lock_acquire+0x1fc9/0x20f0
    [ 82.812590] [] lock_acquire+0xc3/0x1d0
    [ 82.812594] [] ? drm_gem_mmap+0x1a1/0x270
    [ 82.812599] [] drm_gem_mmap+0x1c7/0x270
    [ 82.812603] [] ? drm_gem_mmap+0x1a1/0x270
    [ 82.812608] [] mmap_region+0x334/0x580
    [ 82.812612] [] do_mmap+0x364/0x410
    [ 82.812616] [] vm_mmap_pgoff+0x6d/0xa0
    [ 82.812629] [] SyS_mmap_pgoff+0x184/0x220
    [ 82.812633] [] SyS_mmap+0x1d/0x20
    [ 82.812637] [] entry_SYSCALL_64_fastpath+0x16/0x73

    Highly unlikely though this scenario is, we can avoid the issue entirely
    by moving the copy operation out from under the kernfs_get_active()
    tracking and assigning the preallocated buffer its own mutex. The
    temporary buffer allocation doesn't require mutex locking as it is
    entirely local.

    The locked section was extended by the addition of the preallocated buf
    to speed up md user operations in

    commit 2b75869bba676c248d8d25ae6d2bd9221dfffdb6
    Author: NeilBrown
    Date: Mon Oct 13 16:41:28 2014 +1100

    sysfs/kernfs: allow attributes to request write buffer be pre-allocated.

    Reported-by: Ville Syrjälä
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
    Signed-off-by: Chris Wilson
    Reviewed-by: Joonas Lahtinen
    Cc: Ville Syrjälä
    Cc: Joonas Lahtinen
    Cc: NeilBrown
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Chris Wilson
     

19 Apr, 2016

1 commit


11 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

30 Mar, 2016

1 commit

  • This is in preparation for the series that transitions
    filesystem timestamps to use 64 bit time and hence make
    them y2038 safe.

    CURRENT_TIME macro will be deleted before merging the
    aforementioned series.

    Use current_fs_time() instead of CURRENT_TIME for inode
    timestamps.

    struct kernfs_node is associated with a sysfs file/directory.
    Truncate the values to appropriate time granularity when
    writing to inode timestamps of the files.

    ktime_get_real_ts() is used to obtain times for
    struct kernfs_iattrs. Since these times are later assigned to
    inode times using timespec_truncate() for all filesystem based
    operations, we can save the supers list traversal time here by
    using ktime_get_real_ts() directly.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Greg Kroah-Hartman

    Deepa Dinamani
     

22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix and restructure error handling in copy_cgroup_ns()
    cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
    Add FS_USERNS_FLAG to cgroup fs
    cgroup: Add documentation for cgroup namespaces
    cgroup: mount cgroupns-root when inside non-init cgroupns
    kernfs: define kernfs_node_dentry
    cgroup: cgroup namespace setns support
    cgroup: introduce cgroup namespaces
    sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
    kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

17 Feb, 2016

2 commits


08 Feb, 2016

1 commit

  • kernfs_walk_ns() uses a static path_buf[PATH_MAX] to separate out path
    components. Keeping around the 4k buffer just for kernfs_walk_ns() is
    wasteful. This patch makes it piggyback on kernfs_pr_cont_buf[]
    instead. This requires kernfs_walk_ns() to hold kernfs_rename_lock.

    Signed-off-by: Tejun Heo
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

15 Jan, 2016

1 commit

  • Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
    alloc_kmem_pages call) are accounted to memory cgroup automatically.
    Callers have to explicitly opt out if they don't want/need accounting
    for some reason. Such a design decision leads to several problems:

    - kmalloc users are highly sensitive to failures, many of them
    implicitly rely on the fact that kmalloc never fails, while memcg
    makes failures quite plausible.

    - A lot of objects are shared among different containers by design.
    Accounting such objects to one of the containers is just unfair.
    Moreover, it might lead to pinning a dead memcg along with its kmem
    caches, which aren't tiny, which might result in noticeable increase
    in memory consumption for no apparent reason in the long run.

    - There are tons of short-lived objects. Accounting them to memcg will
    only result in slight noise and won't change the overall picture, but
    we still have to pay accounting overhead.

    For more info, see

    - http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
    - http://lkml.kernel.org/r/20151106090555.GK29259@esperanza

    Therefore this patchset switches to the white list policy. Now kmalloc
    users have to explicitly opt in by passing __GFP_ACCOUNT flag.

    Currently, the list of accounted objects is quite limited and only
    includes those allocations that (1) are known to be easily triggered
    from userspace and (2) can fail gracefully (for the full list see patch
    no. 6) and it still misses many object types. However, accounting only
    those objects should be a satisfactory approximation of the behavior we
    used to have for most sane workloads.

    This patch (of 6):

    Revert 499611ed451508a42d1d7d ("kernfs: do not account ino_ida allocations
    to memcg").

    Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
    fragile and difficult to maintain, because there seem to be many more
    allocations that should not be accounted than those that should be.
    Besides, falsely accounting an allocation might result in much worse
    consequences than not accounting at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So it was decided to switch to the white-list policy. This patch reverts
    bits introducing the black-list policy. The white-list policy will be
    introduced later in the series.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Jan, 2016

1 commit

  • Pull networking updates from David Miller:

    1) Support busy polling generically, for all NAPI drivers. From Eric
    Dumazet.

    2) Add byte/packet counter support to nft_ct, from Florian Westphal.

    3) Add RSS/XPS support to mvneta driver, from Gregory Clement.

    4) Implement IPV6_HDRINCL socket option for raw sockets, from Hannes
    Frederic Sowa.

    5) Add support for T6 adapter to cxgb4 driver, from Hariprasad Shenai.

    6) Add support for VLAN device bridging to mlxsw switch driver, from
    Ido Schimmel.

    7) Add driver for Netronome NFP4000/NFP6000, from Jakub Kicinski.

    8) Provide hwmon interface to mlxsw switch driver, from Jiri Pirko.

    9) Reorganize wireless drivers into per-vendor directories just like we
    do for ethernet drivers. From Kalle Valo.

    10) Provide a way for administrators to "destroy" connected sockets via the
    SOCK_DESTROY socket netlink diag operation. From Lorenzo Colitti.

    11) Add support to add/remove multicast routes via netlink, from Nikolay
    Aleksandrov.

    12) Make TCP keepalive settings per-namespace, from Nikolay Borisov.

    13) Add forwarding and packet duplication facilities to nf_tables, from
    Pablo Neira Ayuso.

    14) Dead route support in MPLS, from Roopa Prabhu.

    15) TSO support for thunderx chips, from Sunil Goutham.

    16) Add driver for IBM's System i/p VNIC protocol, from Thomas Falcon.

    17) Rationalize, consolidate, and more completely document the checksum
    offloading facilities in the networking stack. From Tom Herbert.

    18) Support aborting an ongoing scan in mac80211/cfg80211, from
    Vidyullatha Kanchanapally.

    19) Use per-bucket spinlock for bpf hash facility, from Tom Leiming.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1375 commits)
    net: bnxt: always return values from _bnxt_get_max_rings
    net: bpf: reject invalid shifts
    phonet: properly unshare skbs in phonet_rcv()
    dwc_eth_qos: Fix dma address for multi-fragment skbs
    phy: remove an unneeded condition
    mdio: remove an unneed condition
    mdio_bus: NULL dereference on allocation error
    net: Fix typo in netdev_intersect_features
    net: freescale: mac-fec: Fix build error from phy_device API change
    net: freescale: ucc_geth: Fix build error from phy_device API change
    bonding: Prevent IPv6 link local address on enslaved devices
    IB/mlx5: Add flow steering support
    net/mlx5_core: Export flow steering API
    net/mlx5_core: Make ipv4/ipv6 location more clear
    net/mlx5_core: Enable flow steering support for the IB driver
    net/mlx5_core: Initialize namespaces only when supported by device
    net/mlx5_core: Set priority attributes
    net/mlx5_core: Connect flow tables
    net/mlx5_core: Introduce modify flow table command
    net/mlx5_core: Managing root flow table
    ...

    Linus Torvalds
     

12 Jan, 2016

1 commit

  • Pull vfs xattr updates from Al Viro:
    "Andreas' xattr cleanup series.

    It's a followup to his xattr work that went in last cycle; -0.5KLoC"

    * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    xattr handlers: Simplify list operation
    ocfs2: Replace list xattr handler operations
    nfs: Move call to security_inode_listsecurity into nfs_listxattr
    xfs: Change how listxattr generates synthetic attributes
    tmpfs: listxattr should include POSIX ACL xattrs
    tmpfs: Use xattr handler infrastructure
    btrfs: Use xattr handler infrastructure
    vfs: Distinguish between full xattr names and proper prefixes
    posix acls: Remove duplicate xattr name definitions
    gfs2: Remove gfs2_xattr_acl_chmod
    vfs: Remove vfs_xattr_cmp

    Linus Torvalds
     

31 Dec, 2015

1 commit


30 Dec, 2015

1 commit


09 Dec, 2015

1 commit

  • new method: ->get_link(); replacement of ->follow_link(). The differences
    are:
    * inode and dentry are passed separately
    * might be called both in RCU and non-RCU mode;
    the former is indicated by passing it a NULL dentry.
    * when called that way it isn't allowed to block
    and should return ERR_PTR(-ECHILD) if it needs to be called
    in non-RCU mode.

    It's a flagday change - the old method is gone, all in-tree instances
    converted. Conversion isn't hard; said that, so far very few instances
    do not immediately bail out when called in RCU mode. That'll change
    in the next commits.
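
    The calling convention described above can be sketched in userspace C.
    This is only an illustration under stated assumptions: toy_inode,
    toy_dentry, toy_get_link() and toy_follow() are hypothetical names, the
    ERR_PTR helpers are local stand-ins for the kernel macros, and the real
    method's full argument list is not reproduced here.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR helpers (assumption). */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical toy objects; not the kernel's struct inode / struct dentry. */
struct toy_inode {
    const char *cached_target; /* available without blocking */
    const char *slow_target;   /* would require blocking to fetch */
};
struct toy_dentry {
    struct toy_inode *inode;
};

/* Inode and dentry are passed separately; a NULL dentry means RCU mode. */
static const char *toy_get_link(struct toy_dentry *dentry,
                                struct toy_inode *inode)
{
    if (inode->cached_target)
        return inode->cached_target;   /* fine in either mode */
    if (!dentry)
        return ERR_PTR(-ECHILD);       /* RCU mode: not allowed to block */
    return inode->slow_target;         /* non-RCU mode: pretend we blocked */
}

/* Caller side: try RCU mode first, retry in non-RCU mode on -ECHILD. */
static const char *toy_follow(struct toy_dentry *dentry)
{
    const char *res = toy_get_link(NULL, dentry->inode);

    if (IS_ERR(res) && PTR_ERR(res) == -ECHILD)
        res = toy_get_link(dentry, dentry->inode);
    return res;
}
```

    An instance that cannot answer without blocking simply returns
    ERR_PTR(-ECHILD) when the dentry is NULL, which is the "immediately
    bail out" behaviour mentioned above.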

    Signed-off-by: Al Viro

    Al Viro
     

07 Dec, 2015

2 commits

  • When a file on tmpfs has an ACL or a Default ACL, listxattr should include the
    corresponding xattr name.

    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: James Morris
    Cc: Hugh Dickins
    Cc: linux-mm@kvack.org
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     
  • Use the VFS xattr handler infrastructure and get rid of similar code in
    the filesystem. For implementing shmem_xattr_handler_set, we need a
    version of simple_xattr_set which removes the attribute when value is
    NULL. Use this to implement kernfs_iop_removexattr as well.
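
    The "set with a NULL value removes the attribute" behaviour can be
    sketched as follows. This is a userspace toy, not the kernel's
    simple_xattr_set: toy_xattr, toy_xattr_set() and toy_xattr_get() are
    hypothetical names, and returning -ENODATA when removing a missing
    attribute is a choice made for this sketch only.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical singly linked list of name/value pairs. */
struct toy_xattr {
    struct toy_xattr *next;
    char *name;
    char *value;
};

static char *toy_strdup(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);

    if (p)
        memcpy(p, s, n);
    return p;
}

/* Set name to value; a NULL value removes the attribute instead. */
static int toy_xattr_set(struct toy_xattr **head, const char *name,
                         const char *value)
{
    struct toy_xattr **pp, *x;

    /* Find the entry; pp tracks the link that points at it. */
    for (pp = head; (x = *pp) != NULL; pp = &x->next)
        if (strcmp(x->name, name) == 0)
            break;

    if (!value) {                      /* removal path */
        if (!x)
            return -ENODATA;
        *pp = x->next;
        free(x->name);
        free(x->value);
        free(x);
        return 0;
    }

    if (x) {                           /* replace existing value */
        char *v = toy_strdup(value);

        if (!v)
            return -ENOMEM;
        free(x->value);
        x->value = v;
        return 0;
    }

    x = malloc(sizeof(*x));            /* create new entry */
    if (!x)
        return -ENOMEM;
    x->name = toy_strdup(name);
    x->value = toy_strdup(value);
    if (!x->name || !x->value) {
        free(x->name);
        free(x->value);
        free(x);
        return -ENOMEM;
    }
    x->next = *head;
    *head = x;
    return 0;
}

/* Look up an attribute; returns NULL when it is absent. */
static const char *toy_xattr_get(struct toy_xattr *head, const char *name)
{
    for (; head; head = head->next)
        if (strcmp(head->name, name) == 0)
            return head->value;
    return NULL;
}
```

    With one entry point covering both cases, a remove operation can be
    expressed as a set with a NULL value, which is the reuse the commit
    message describes.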

    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: James Morris
    Cc: Hugh Dickins
    Cc: linux-mm@kvack.org
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     

21 Nov, 2015

1 commit

  • Implement kernfs_walk_and_get() which is similar to
    kernfs_find_and_get() but can walk a path instead of just a name.

    v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
    David.
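
    The difference between finding a single name and walking a path can be
    sketched on a toy tree. Everything here is hypothetical (toy_node,
    toy_find(), toy_walk(), TOY_NAME_MAX); the sketch copies one component
    at a time with a bounds check rather than using strlcpy(), which is not
    available in every userspace C library.

```c
#include <stddef.h>
#include <string.h>

#define TOY_NAME_MAX 64   /* arbitrary per-component limit for this sketch */

/* Hypothetical tree node: first child plus next sibling. */
struct toy_node {
    const char *name;
    struct toy_node *children;
    struct toy_node *sibling;
};

/* Look up one name directly under parent. */
static struct toy_node *toy_find(struct toy_node *parent, const char *name)
{
    struct toy_node *n;

    for (n = parent->children; n; n = n->sibling)
        if (strcmp(n->name, name) == 0)
            return n;
    return NULL;
}

/* Walk a '/'-separated path starting at parent, one component at a time. */
static struct toy_node *toy_walk(struct toy_node *parent, const char *path)
{
    char name[TOY_NAME_MAX];

    while (parent && *path) {
        size_t len = strcspn(path, "/");

        if (len == 0) {                /* skip empty components ("//") */
            path++;
            continue;
        }
        if (len >= sizeof(name))       /* component too long for buffer */
            return NULL;
        memcpy(name, path, len);
        name[len] = '\0';
        parent = toy_find(parent, name);
        path += len;
    }
    return parent;
}
```

    toy_walk() returns NULL as soon as any component is missing, and
    returns the starting node itself for an empty path.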

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Cc: David Miller

    Tejun Heo
     

19 Aug, 2015

1 commit


04 Jul, 2015

1 commit

  • Pull user namespace updates from Eric Biederman:
    "Long ago and far away when user namespaces were young it was realized
    that allowing fresh mounts of proc and sysfs with only user namespace
    permissions could violate the basic rule that only root gets to decide
    if proc or sysfs should be mounted at all.

    Some hacks were put in place to reduce the worst of the damage that could
    be done, and the common sense rule was adopted that fresh mounts of
    proc and sysfs should allow no more than bind mounts of proc and
    sysfs. Unfortunately that rule has not been fully enforced.

    There are two kinds of gaps in that enforcement. Only filesystems
    mounted on empty directories of proc and sysfs should be ignored but
    the test for empty directories was insufficient. So in my tree
    directories on proc, sysctl and sysfs that will always be empty are
    created specially. Every other technique is imperfect as an ordinary
    directory can have entries added even after a readdir returns and
    shows that the directory is empty. Special creation of directories
    for mount points makes the code in the kernel a smidge clearer about
    its purpose. I asked container developers from the various container
    projects to help test this and no holes were found in the set of mount
    points on proc and sysfs that are created specially.

    This set of changes also starts enforcing that the mount flags of fresh
    mounts of proc and sysfs are consistent with the existing mount of
    proc and sysfs. I expected this to be the boring part of the work but
    unfortunately unprivileged userspace winds up mounting fresh copies of
    proc and sysfs with noexec and nosuid clear when root set those flags
    on the previous mount of proc and sysfs. So for now only the atime,
    read-only and nodev attributes which userspace happens to keep
    consistent are enforced. Dealing with the noexec and nosuid
    attributes remains for another time.

    This set of changes also addresses an issue with how open file
    descriptors from /proc//ns/* are displayed. Recently readlink of
    /proc//fd has been triggering a WARN_ON that has not been
    meaningful since it was added (as all of the code in the kernel was
    converted) and is now actively wrong.

    There is also a short list of issues that have not been fixed yet that
    I will mention briefly.

    It is possible to rename a directory from below to above a bind mount.
    At which point any directory pointers below the renamed directory can
    be walked up to the root directory of the filesystem. With user
    namespaces enabled a bind mount of the bind mount can be created
    allowing the user to pick a directory whose children they can rename
    to outside of the bind mount. This is challenging to fix and doubly
    so because all obvious solutions must touch code that is in the
    performance part of pathname resolution.

    As mentioned above there is also a question of how to ensure that
    developers by accident or with purpose do not introduce executable
    files on sysfs and proc and in doing so introduce security regressions
    in the current userspace that will not be immediately obvious and as
    such are likely to require breaking userspace in painful ways once
    they are recognized"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Remove incorrect debugging WARN in prepend_path
    mnt: Update fs_fully_visible to test for permanently empty directories
    sysfs: Create mountpoints with sysfs_create_mount_point
    sysfs: Add support for permanently empty directories to serve as mount points.
    kernfs: Add support for always empty directories.
    proc: Allow creating permanently empty directories that serve as mount points
    sysctl: Allow creating permanently empty directories that serve as mountpoints.
    fs: Add helper functions for permanently empty directories.
    vfs: Ignore unlocked mounts in fs_fully_visible
    mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
    mnt: Refactor the logic for mounting sysfs and proc in a user namespace

    Linus Torvalds
     

01 Jul, 2015

1 commit

  • Add a new function kernfs_create_empty_dir that can be used to create
    a directory that cannot be modified.

    Update the code to use make_empty_dir_inode when reporting a
    permanently empty directory to the vfs.

    Update the code to not allow adding to permanently empty directories.
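
    The "refuse additions to a permanently empty directory" rule can be
    sketched with a flag check. This is a userspace illustration only:
    toy_dir, TOY_EMPTY_DIR, toy_init_dir() and toy_add_child() are
    hypothetical, and the -EPERM return value is an assumption of the
    sketch, not something taken from the commit.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_EMPTY_DIR 0x1u   /* hypothetical "permanently empty" flag */

struct toy_dir {
    unsigned int flags;
    unsigned int nr_children;
};

/* Initialise a directory, optionally marking it permanently empty. */
static void toy_init_dir(struct toy_dir *d, bool permanently_empty)
{
    d->flags = permanently_empty ? TOY_EMPTY_DIR : 0;
    d->nr_children = 0;
}

/* Adding a child fails up front when the parent is permanently empty. */
static int toy_add_child(struct toy_dir *parent)
{
    if (parent->flags & TOY_EMPTY_DIR)
        return -EPERM;       /* error code chosen for this sketch */
    parent->nr_children++;
    return 0;
}
```

    Because the flag is set at creation time and checked on every add, the
    directory is guaranteed to stay empty, unlike a check that only looks
    at the current contents.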

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 Jun, 2015

2 commits

  • Pull cgroup updates from Tejun Heo:

    - threadgroup_lock got reorganized so that its users can pick the
    actual locking mechanism to use. Its only user - cgroups - is
    updated to use a percpu_rwsem instead of per-process rwsem.

    This makes things a bit lighter on hot paths and allows cgroups to
    perform and fail multi-task (a process) migrations atomically.
    Multi-task migrations are used in several places including the
    unified hierarchy.

    - Delegation rule and documentation added to unified hierarchy. This
    will likely be the last interface update from the cgroup core side
    for unified hierarchy before lifting the devel mask.

    - Some groundwork for the pids controller which is scheduled to be
    merged in the coming devel cycle.

    * 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: add delegation section to unified hierarchy documentation
    cgroup: require write perm on common ancestor when moving processes on the default hierarchy
    cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
    kernfs: make kernfs_get_inode() public
    MAINTAINERS: add a cgroup core co-maintainer
    cgroup: fix uninitialised iterator in for_each_subsys_which
    cgroup: replace explicit ss_mask checking with for_each_subsys_which
    cgroup: use bitmask to filter for_each_subsys
    cgroup: add seq_file forward declaration for struct cftype
    cgroup: simplify threadgroup locking
    sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
    sched, cgroup: reorganize threadgroup locking
    cgroup: switch to unsigned long for bitmasks
    cgroup: reorganize include/linux/cgroup.h
    cgroup: separate out include/linux/cgroup-defs.h
    cgroup: fix some comment typos

    Linus Torvalds
     
  • Pull driver core updates from Greg KH:
    "Here are the driver core / firmware changes for 4.2-rc1.

    A number of small changes all over the place in the driver core, and
    in the firmware subsystem. Nothing really major, full details in the
    shortlog. Some of it is a bit of churn, given that the platform
    driver probing changes were found to not work well, so they were
    reverted.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'driver-core-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (31 commits)
    Revert "base/platform: Only insert MEM and IO resources"
    Revert "base/platform: Continue on insert_resource() error"
    Revert "of/platform: Use platform_device interface"
    Revert "base/platform: Remove code duplication"
    firmware: add missing kfree for work on async call
    fs: sysfs: don't pass count == 0 to bin file readers
    base:dd - Fix for typo in comment to function driver_deferred_probe_trigger().
    base/platform: Remove code duplication
    of/platform: Use platform_device interface
    base/platform: Continue on insert_resource() error
    base/platform: Only insert MEM and IO resources
    firmware: use const for remaining firmware names
    firmware: fix possible use after free on name on asynchronous request
    firmware: check for file truncation on direct firmware loading
    firmware: fix __getname() missing failure check
    drivers: of/base: move of_init to driver_init
    drivers/base: cacheinfo: fix annoying typo when DT nodes are absent
    sysfs: disambiguate between "error code" and "failure" in comments
    driver-core: fix build for !CONFIG_MODULES
    driver-core: make __device_attach() static
    ...

    Linus Torvalds
     

23 Jun, 2015

1 commit

  • Pull vfs updates from Al Viro:
    "In this pile: pathname resolution rewrite.

    - recursion in link_path_walk() is gone.

    - nesting limits on symlinks are gone (the only limit remaining is
    that the total number of symlinks is no more than 40, no matter how
    nested).

    - "fast" (inline) symlinks are handled without leaving rcuwalk mode.

    - stack footprint (independent of the nesting) is below a kilobyte now,
    about on par with what it used to be with one level of nested
    symlinks and ~2.8 times lower than it used to be in the worst case.

    - struct nameidata is entirely private to fs/namei.c now (not even
    opaque pointers are being passed around).

    - ->follow_link() and ->put_link() calling conventions had been
    changed; all in-tree filesystems converted, out-of-tree should be
    able to follow reasonably easily.

    For out-of-tree conversions, see Documentation/filesystems/porting
    for details (and in-tree filesystems for examples of conversion).

    That has sat in -next since mid-May, seems to survive all testing
    without regressions and merges clean with v4.1"
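
    The idea of resolving symlinks iteratively with a single total limit,
    rather than recursively with a nesting limit, can be sketched as below.
    This is a toy model under stated assumptions: toy_link, toy_resolve()
    and TOY_MAXSYMLINKS are hypothetical, names map directly to targets,
    and nothing here reflects the actual code in fs/namei.c.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAXSYMLINKS 40   /* total links allowed per resolution */

/* Hypothetical symlink table entry: name -> target. */
struct toy_link {
    const char *name;
    const char *target;
};

/*
 * Follow name through the table in a loop (no recursion). On success the
 * final non-link name is stored in *out and 0 is returned; -ELOOP is
 * returned once more than TOY_MAXSYMLINKS links have been followed.
 */
static int toy_resolve(const struct toy_link *tab, size_t n,
                       const char *name, const char **out)
{
    int followed = 0;

    for (;;) {
        const char *target = NULL;
        size_t i;

        for (i = 0; i < n; i++) {
            if (strcmp(tab[i].name, name) == 0) {
                target = tab[i].target;
                break;
            }
        }
        if (!target) {                 /* not a link: resolution is done */
            *out = name;
            return 0;
        }
        if (++followed > TOY_MAXSYMLINKS)
            return -ELOOP;             /* total count exceeded */
        name = target;
    }
}
```

    Since the loop keeps only a counter and the current name, its stack use
    does not grow with the number of links followed.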

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (131 commits)
    turn user_{path_at,path,lpath,path_dir}() into static inlines
    namei: move saved_nd pointer into struct nameidata
    inline user_path_create()
    inline user_path_parent()
    namei: trim do_last() arguments
    namei: stash dfd and name into nameidata
    namei: fold path_cleanup() into terminate_walk()
    namei: saner calling conventions for filename_parentat()
    namei: saner calling conventions for filename_create()
    namei: shift nameidata down into filename_parentat()
    namei: make filename_lookup() reject ERR_PTR() passed as name
    namei: shift nameidata inside filename_lookup()
    namei: move putname() call into filename_lookup()
    namei: pass the struct path to store the result down into path_lookupat()
    namei: uninline set_root{,_rcu}()
    namei: be careful with mountpoint crossings in follow_dotdot_rcu()
    Documentation: remove outdated information from automount-support.txt
    get rid of assorted nameidata-related debris
    lustre: kill unused helper
    lustre: kill unused macro (LOOKUP_CONTINUE)
    ...

    Linus Torvalds
     

19 Jun, 2015

1 commit

  • Move kernfs_get_inode() prototype from fs/kernfs/kernfs-internal.h to
    include/linux/kernfs.h. It obtains the matching inode for a
    kernfs_node.

    It will be used by cgroup for inode based permission checks for now
    but is generally useful.

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman

    Tejun Heo
     

25 May, 2015

1 commit


15 May, 2015

1 commit

  • root->ino_ida is used for kernfs inode number allocations. Since IDA has
    a layered structure, different IDs can reside on the same layer, which
    is currently accounted to some memory cgroup. The problem is that each
    kmem cache of a memory cgroup has its own directory on sysfs (under
    /sys/fs/kernel//cgroup). If the inode number of such a
    directory or any file in it gets allocated from a layer accounted to the
    cgroup which the cache is created for, the cgroup will get pinned for
    good, because one has to free all kmem allocations accounted to a cgroup
    in order to release it and destroy all its kmem caches. That said we
    must not account layers of ino_ida to any memory cgroup.

    Since per net init operations may create new sysfs entries directly
    (e.g. lo device) or indirectly (nf_conntrack creates a new kmem cache
    per each namespace, which, in turn, creates new sysfs entries), an easy
    way to reproduce this issue is by creating network namespace(s) from
    inside a kmem-active memory cgroup.

    Signed-off-by: Vladimir Davydov
    Acked-by: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Greg Thelen
    Cc: Greg Kroah-Hartman
    Cc: [4.0.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

11 May, 2015

3 commits