05 Jul, 2013

1 commit


04 Jul, 2013

2 commits

  • The cp_inodes_count and cp_blocks_count are represented as __le64 type in
    on-disk structure (struct nilfs_checkpoint). But analogous fields in
    in-core structure (struct nilfs_root) are represented by atomic_t type.

    This patch replaces atomic_t on atomic64_t type in representation of
    inodes_count and blocks_count fields in struct nilfs_root.

    Signed-off-by: Vyacheslav Dubeyko
    Acked-by: Ryusuke Konishi
    Acked-by: Joern Engel
    Cc: Clemens Eisserer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
     
  • Currently, NILFS2 returns 0 as free inodes count (f_ffree) and current
    used inodes count as total file nodes in file system (f_files):

    df -i
    Filesystem Inodes IUsed IFree IUse% Mounted on
    /dev/loop0 2 2 0 100% /mnt/nilfs2

    This patch implements real calculation of free inodes count. First of
    all, it is calculated total file nodes in file system as
    (desc_blocks_count * groups_per_desc_block * entries_per_group). Then, it
    is calculated free inodes count as difference the total file nodes and
    used inodes count. As a result, we have such output for NILFS2:

    df -i
    Filesystem Inodes IUsed IFree IUse% Mounted on
    /dev/loop0 4194304 2114701 2079603 51% /mnt/nilfs2

    Reported-by: Clemens Eisserer
    Signed-off-by: Vyacheslav Dubeyko
    Signed-off-by: Ryusuke Konishi
    Tested-by: Vyacheslav Dubeyko
    Cc: Joern Engel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
     

29 Jun, 2013

1 commit


25 May, 2013

1 commit

  • nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary

    DESCRIPTION:
    There are use-cases when NILFS2 file system (formatted with block size
    lesser than 4 KB) can be remounted in RO mode because of encountering of
    "broken bmap" issue.

    The issue was reported by Anthony Doggett :
    "The machine I've been trialling nilfs on is running Debian Testing,
    Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
    version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
    also reproduced it (identically) with Debian Unstable amd64 and Debian
    Experimental (using the 3.8-trunk kernel). The problematic partitions
    were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."

    SYMPTOMS:
    (1) System log contains error messages likewise:

    [63102.496756] nilfs_direct_assign: invalid pointer: 0
    [63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
    [63102.496798]
    [63102.524403] Remounting filesystem read-only

    (2) The NILFS2 file system is remounted in RO mode.

    REPRODUSING PATH:
    (1) Create volume group with name "unencrypted" by means of vgcreate utility.
    (2) Run script (prepared by Anthony Doggett ):

    ----------------[BEGIN SCRIPT]--------------------

    VG=unencrypted
    lvcreate --size 2G --name ntest $VG
    mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
    mkdir /var/tmp/n
    mkdir /var/tmp/n/ntest
    mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
    mkdir /var/tmp/n/ntest/thedir
    cd /var/tmp/n/ntest/thedir
    sleep 2
    date
    darcs init
    sleep 2
    dmesg|tail -n 5
    date
    darcs whatsnew || true
    date
    sleep 2
    dmesg|tail -n 5
    ----------------[END SCRIPT]--------------------

    REPRODUCIBILITY: 100%

    INVESTIGATION:
    As it was discovered, the issue takes place during segment
    construction after executing such sequence of user-space operations:

    open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
    fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
    ftruncate(7, 60)

    The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
    bmap (inode number=28)" takes place because of trying to get block
    number for third block of the file with logical offset #3072 bytes. As
    it is possible to see from above output, the file has 60 bytes of the
    whole size. So, it is enough one block (1 KB in size) allocation for
    the whole file. Trying to operate with several blocks instead of one
    takes place because of discovering several dirty buffers for this file
    in nilfs_segctor_scan_file() method.

    The root cause of this issue is in nilfs_set_page_dirty function which
    is called just before writing to an mmapped page.

    When nilfs_page_mkwrite function handles a page at EOF boundary, it
    fills hole blocks only inside EOF through __block_page_mkwrite().

    The __block_page_mkwrite() function calls set_page_dirty() after filling
    hole blocks, thus nilfs_set_page_dirty function (=
    a_ops->set_page_dirty) is called. However, the current implementation
    of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
    at EOF boundary.

    As a result, buffers outside EOF are inconsistently marked dirty and
    queued for write even though they are not mapped with nilfs_get_block
    function.

    FIX:
    This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.

    Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
    for this issue.

    Signed-off-by: Ryusuke Konishi
    Reported-by: Anthony Doggett
    Reported-by: Vyacheslav Dubeyko
    Cc: Vyacheslav Dubeyko
    Tested-by: Ryusuke Konishi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryusuke Konishi
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

01 May, 2013

3 commits

  • page->mapping->host cannot be NULL in nilfs_writepage(), so remove the
    unneeded test.

    The fixes the smatch warning: "fs/nilfs2/inode.c:211 nilfs_writepage()
    error: we previously assumed 'inode' could be null (see line 195)".

    Reported-by: Dan Carpenter
    Signed-off-by: Vyacheslav Dubeyko
    Cc: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
     
  • Change test_bit(PG_locked, &page->flags) to PageLocked().

    Signed-off-by: Vyacheslav Dubeyko
    Cc: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
     
  • …river's internal error or metadata corruption

    The NILFS2 driver remounts itself in RO mode in the case of discovering
    metadata corruption (for example, discovering a broken bmap). But
    usually, this takes place when there have been file system operations
    before remounting in RO mode.

    Thereby, NILFS2 driver can be in RO mode with presence of dirty pages in
    modified inodes' address spaces. It results in flush kernel thread's
    infinite trying to flush dirty pages in RO mode. As a result, it is
    possible to see such side effects as: (1) flush kernel thread occupies
    50% - 99% of CPU time; (2) system can't be shutdowned without manual
    power switch off.

    SYMPTOMS:
    (1) System log contains error message: "Remounting filesystem read-only".
    (2) The flush kernel thread occupies 50% - 99% of CPU time.
    (3) The system can't be shutdowned without manual power switch off.

    REPRODUCTION PATH:
    (1) Create volume group with name "unencrypted" by means of vgcreate utility.
    (2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):

    ----------------[BEGIN SCRIPT]--------------------
    #!/bin/bash

    VG=unencrypted
    #apt-get install nilfs-tools darcs
    lvcreate --size 2G --name ntest $VG
    mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
    mkdir /var/tmp/n
    mkdir /var/tmp/n/ntest
    mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
    mkdir /var/tmp/n/ntest/thedir
    cd /var/tmp/n/ntest/thedir
    sleep 2
    date
    darcs init
    sleep 2
    dmesg|tail -n 5
    date
    darcs whatsnew || true
    date
    sleep 2
    dmesg|tail -n 5
    ----------------[END SCRIPT]--------------------

    (3) Try to shutdown the system.

    REPRODUCIBILITY: 100%

    FIX:

    This patch implements checking mount state of NILFS2 driver in
    nilfs_writepage(), nilfs_writepages() and nilfs_mdt_write_page()
    methods. If it is detected the RO mount state then all dirty pages are
    simply discarded with warning messages is written in system log.

    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
    Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    Cc: Anthony Doggett <Anthony2486@interfaces.org.uk>
    Cc: ARAI Shun-ichi <hermes@ceres.dti.ne.jp>
    Cc: Piotr Szymaniak <szarpaj@grubelek.pl>
    Cc: Zahid Chowdhury <zahid.chowdhury@starsolutions.com>
    Cc: Elmer Zhang <freeboy6716@gmail.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Vyacheslav Dubeyko
     

04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.

    A common practice is to build all of the kernel code and leave code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with a fs- prefix limits the pool of possible
    modules that can be loaded by mount to just filesystems trivially
    making things safer with no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives. Allowing simple, safe,
    well understood work-arounds to known problematic software.

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as it's module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases. The most significant being autofs that lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the request
    module in get_fs_type() without having any special permissions, and
    people get uncomfortable when a user specified string (in this case
    the filesystem type) goes all of the way to request_module.

    After having looked at this issue I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regards to the users permissions. In general all a filesystem
    module does once loaded is call register_filesystem() and go to sleep.
    Which means there is not much attack surface exposed by loading a
    filesytem module unless the filesystem is mounted. In a user
    namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
    which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

26 Feb, 2013

1 commit


23 Feb, 2013

1 commit


22 Feb, 2013

3 commits

  • Merge misc patches from Andrew Morton:

    - Florian has vanished so I appear to have become fbdev maintainer
    again :(

    - Joel and Mark are distracted to welcome to the new OCFS2 maintainer

    - The backlight queue

    - Small core kernel changes

    - lib/ updates

    - The rtc queue

    - Various random bits

    * akpm: (164 commits)
    rtc: rtc-davinci: use devm_*() functions
    rtc: rtc-max8997: use devm_request_threaded_irq()
    rtc: rtc-max8907: use devm_request_threaded_irq()
    rtc: rtc-da9052: use devm_request_threaded_irq()
    rtc: rtc-wm831x: use devm_request_threaded_irq()
    rtc: rtc-tps80031: use devm_request_threaded_irq()
    rtc: rtc-lp8788: use devm_request_threaded_irq()
    rtc: rtc-coh901331: use devm_clk_get()
    rtc: rtc-vt8500: use devm_*() functions
    rtc: rtc-tps6586x: use devm_request_threaded_irq()
    rtc: rtc-imxdi: use devm_clk_get()
    rtc: rtc-cmos: use dev_warn()/dev_dbg() instead of printk()/pr_debug()
    rtc: rtc-pcf8583: use dev_warn() instead of printk()
    rtc: rtc-sun4v: use pr_warn() instead of printk()
    rtc: rtc-vr41xx: use dev_info() instead of printk()
    rtc: rtc-rs5c313: use pr_err() instead of printk()
    rtc: rtc-at91rm9200: use dev_dbg()/dev_err() instead of printk()/pr_debug()
    rtc: rtc-rs5c372: use dev_dbg()/dev_warn() instead of printk()/pr_debug()
    rtc: rtc-ds2404: use dev_err() instead of printk()
    rtc: rtc-efi: use dev_err()/dev_warn()/pr_err() instead of printk()
    ...

    Linus Torvalds
     
  • Create a helper function to check if a backing device requires stable
    page writes and, if so, performs the necessary wait. Then, make it so
    that all points in the memory manager that handle making pages writable
    use the helper function. This should provide stable page write support
    to most filesystems, while eliminating unnecessary waiting for devices
    that don't require the feature.

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own stable page guarantees or they don't block at all.
    The blocking behavior is back to what it was before 3.0 if you don't
    have a disk requiring stable page writes.

    Here's the result of using dbench to test latency on ext2:

    3.8.0-rc3:
    Operation Count AvgLat MaxLat
    ----------------------------------------
    WriteX 109347 0.028 59.817
    ReadX 347180 0.004 3.391
    Flush 15514 29.828 287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    WriteX 105556 0.029 4.273
    ReadX 335004 0.005 4.112
    Flush 14982 30.540 298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, the maximum write latency drops considerably with this
    patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
    similarly, but see the cover letter for those results.

    Signed-off-by: Darrick J. Wong
    Acked-by: Steven Whitehouse
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • Pull driver core patches from Greg Kroah-Hartman:
    "Here is the big driver core merge for 3.9-rc1

    There are two major series here, both of which touch lots of drivers
    all over the kernel, and will cause you some merge conflicts:

    - add a new function called devm_ioremap_resource() to properly be
    able to check return values.

    - remove CONFIG_EXPERIMENTAL

    Other than those patches, there's not much here, some minor fixes and
    updates"

    Fix up trivial conflicts

    * tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
    base: memory: fix soft/hard_offline_page permissions
    drivercore: Fix ordering between deferred_probe and exiting initcalls
    backlight: fix class_find_device() arguments
    TTY: mark tty_get_device call with the proper const values
    driver-core: constify data for class_find_device()
    firmware: Ignore abort check when no user-helper is used
    firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
    firmware: Make user-mode helper optional
    firmware: Refactoring for splitting user-mode helper code
    Driver core: treat unregistered bus_types as having no devices
    watchdog: Convert to devm_ioremap_resource()
    thermal: Convert to devm_ioremap_resource()
    spi: Convert to devm_ioremap_resource()
    power: Convert to devm_ioremap_resource()
    mtd: Convert to devm_ioremap_resource()
    mmc: Convert to devm_ioremap_resource()
    mfd: Convert to devm_ioremap_resource()
    media: Convert to devm_ioremap_resource()
    iommu: Convert to devm_ioremap_resource()
    drm: Convert to devm_ioremap_resource()
    ...

    Linus Torvalds
     

05 Feb, 2013

1 commit

  • There exists a situation when GC can work in background alone without
    any other filesystem activity during significant time.

    The nilfs_clean_segments() method calls nilfs_segctor_construct() that
    updates superblocks in the case of NILFS_SC_SUPER_ROOT and
    THE_NILFS_DISCONTINUED flags are set. But when GC is working alone the
    nilfs_clean_segments() is called with unset THE_NILFS_DISCONTINUED flag.
    As a result, the update of superblocks doesn't occurred all this time
    and in the case of SPOR superblocks keep very old values of last super
    root placement.

    SYMPTOMS:

    Trying to mount a NILFS2 volume after SPOR in such environment ends with
    very long mounting time (it can achieve about several hours in some
    cases).

    REPRODUCING PATH:

    1. It needs to use external USB HDD, disable automount and doesn't
    make any additional filesystem activity on the NILFS2 volume.

    2. Generate temporary file with size about 100 - 500 GB (for example,
    dd if=/dev/zero of= bs=1073741824 count=200). The size of
    file defines duration of GC working.

    3. Then it needs to delete file.

    4. Start GC manually by means of command "nilfs-clean -p 0". When you
    start GC by means of such way then, at the end, superblocks is updated
    by once. So, for simulation of SPOR, it needs to wait sometime (15 -
    40 minutes) and simply switch off USB HDD manually.

    5. Switch on USB HDD again and try to mount NILFS2 volume. As a
    result, NILFS2 volume will mount during very long time.

    REPRODUCIBILITY: 100%

    FIX:

    This patch adds checking that superblocks need to update and set
    THE_NILFS_DISCONTINUED flag before nilfs_clean_segments() call.

    Reported-by: Sergey Alexandrov
    Signed-off-by: Vyacheslav Dubeyko
    Tested-by: Vyacheslav Dubeyko
    Acked-by: Ryusuke Konishi
    Tested-by: Ryusuke Konishi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
     

12 Jan, 2013

1 commit

  • The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
    while now and is almost always enabled by default. As agreed during the
    Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

    CC: KONISHI Ryusuke
    Signed-off-by: Kees Cook
    Acked-by: Ryusuke Konishi

    Kees Cook
     

21 Dec, 2012

1 commit


12 Dec, 2012

1 commit

  • Overhaul struct address_space.assoc_mapping renaming it to
    address_space.private_data and its type is redefined to void*. By this
    approach we consistently name the .private_* elements from struct
    address_space as well as allow extended usage for address_space
    association with other data structures through ->private_data.

    Also, all users of old ->assoc_mapping element are converted to reflect
    its new name and type change (->private_data).

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

09 Oct, 2012

1 commit

  • Move actual pte filling for non-linear file mappings into the new special
    vma operation: ->remap_pages().

    Filesystems must implement this method to get non-linear mapping support,
    if it uses filemap_fault() then generic_file_remap_pages() can be used.

    Now device drivers can implement this method and obtain nonlinear vma support.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

08 Oct, 2012

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "The big new feature added this time is supporting online resizing
    using the meta_bg feature. This allows us to resize file systems
    which are greater than 16TB. In addition, the speed of online
    resizing has been improved in general.

    We also fix a number of races, some of which could lead to deadlocks,
    in ext4's Asynchronous I/O and online defrag support, thanks to good
    work by Dmitry Monakhov.

    There are also a large number of more minor bug fixes and cleanups
    from a number of other ext4 contributors, quite of few of which have
    submitted fixes for the first time."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (69 commits)
    ext4: fix ext4_flush_completed_IO wait semantics
    ext4: fix mtime update in nodelalloc mode
    ext4: fix ext_remove_space for punch_hole case
    ext4: punch_hole should wait for DIO writers
    ext4: serialize truncate with owerwrite DIO workers
    ext4: endless truncate due to nonlocked dio readers
    ext4: serialize unlocked dio reads with truncate
    ext4: serialize dio nonlocked reads with defrag workers
    ext4: completed_io locking cleanup
    ext4: fix unwritten counter leakage
    ext4: give i_aiodio_unwritten a more appropriate name
    ext4: ext4_inode_info diet
    ext4: convert to use leXX_add_cpu()
    ext4: ext4_bread usage audit
    fs: reserve fallocate flag codepoint
    ext4: remove redundant offset check in mext_check_arguments()
    ext4: don't clear orphan list on ro mount with errors
    jbd2: fix assertion failure in commit code due to lacking transaction credits
    ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails
    ext4: enable FITRIM ioctl on bigalloc file system
    ...

    Linus Torvalds
     

03 Oct, 2012

3 commits

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c spiltoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • There's no reason to call rcu_barrier() on every
    deactivate_locked_super(). We only need to make sure that all delayed rcu
    free inodes are flushed before we destroy related cache.

    Removing rcu_barrier() from deactivate_locked_super() affects some fast
    paths. E.g. on my machine exit_group() of a last process in IPC
    namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

    Signed-off-by: Kirill A. Shutemov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kirill A. Shutemov
     
  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values are far down into
    subsystems and filesystems as reasonable. Leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    Letting compile type incompatible compile errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privlige checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
    root in a user names to do those things that today we only forbid to
    non-root users because it will confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/git logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid wherwe appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
     

01 Oct, 2012

1 commit

  • Commits 5e8830dc85d0 and 41c4d25f78c0 introduced a regression into
    v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not
    take place for files modified via mmap if the page was already in the
    page cache. This would also affect ext3 file systems mounted using
    the ext4 file system driver.

    The problem was that ext4_page_mkwrite() had a shortcut which would
    avoid calling __block_page_mkwrite() under some circumstances, and the
    above two commit transferred the responsibility of calling
    file_update_time() to __block_page_mkwrite --- which woudln't get
    called in some circumstances.

    Since __block_page_mkwrite() only has three callers,
    block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the
    best way to solve this is to move the responsibility for calling
    file_update_time() to its caller.

    This problem was found via xfstests #215 with a file system mounted
    with -o nodelalloc.

    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara
    Cc: KONISHI Ryusuke
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

21 Sep, 2012

1 commit


04 Aug, 2012

1 commit


02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

31 Jul, 2012

6 commits

  • We change nilfs_page_mkwrite() to provide proper freeze protection for
    writeable page faults (we must wait for frozen filesystem even if the
    page is fully mapped).

    We remove all vfs_check_frozen() checks since they are now handled by
    the generic code.

    CC: linux-nilfs@vger.kernel.org
    CC: KONISHI Ryusuke
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Add omitted comments for different structures in driver implementation.

    Signed-off-by: Vyacheslav Dubeyko
    Signed-off-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
     
  • An fs-thaw ioctl causes deadlock with a chcp or mkcp -s command:

    chcp D ffff88013870f3d0 0 1325 1324 0x00000004
    ...
    Call Trace:
    nilfs_transaction_begin+0x11c/0x1a0 [nilfs2]
    wake_up_bit+0x20/0x20
    copy_from_user+0x18/0x30 [nilfs2]
    nilfs_ioctl_change_cpmode+0x7d/0xcf [nilfs2]
    nilfs_ioctl+0x252/0x61a [nilfs2]
    do_page_fault+0x311/0x34c
    get_unmapped_area+0x132/0x14e
    do_vfs_ioctl+0x44b/0x490
    __set_task_blocked+0x5a/0x61
    vm_mmap_pgoff+0x76/0x87
    __set_current_blocked+0x30/0x4a
    sys_ioctl+0x4b/0x6f
    system_call_fastpath+0x16/0x1b
    thaw D ffff88013870d890 0 1352 1351 0x00000004
    ...
    Call Trace:
    rwsem_down_failed_common+0xdb/0x10f
    call_rwsem_down_write_failed+0x13/0x20
    down_write+0x25/0x27
    thaw_super+0x13/0x9e
    do_vfs_ioctl+0x1f5/0x490
    vm_mmap_pgoff+0x76/0x87
    sys_ioctl+0x4b/0x6f
    filp_close+0x64/0x6c
    system_call_fastpath+0x16/0x1b

    where the thaw ioctl deadlocked at thaw_super() when called while chcp was
    waiting at nilfs_transaction_begin() called from
    nilfs_ioctl_change_cpmode(). This deadlock is 100% reproducible.

    This is because nilfs_ioctl_change_cpmode() first locks sb->s_umount in
    read mode and then waits for unfreezing in nilfs_transaction_begin(),
    whereas thaw_super() locks sb->s_umount in write mode. The locking of
    sb->s_umount here was intended to make snapshot mounts and the downgrade
    of snapshots to checkpoints exclusive.

    This fixes the deadlock issue by replacing the sb->s_umount usage in
    nilfs_ioctl_change_cpmode() with a dedicated mutex which protects snapshot
    mounts.

    Signed-off-by: Ryusuke Konishi
    Cc: Fernando Luis Vazquez Cao
    Tested-by: Ryusuke Konishi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryusuke Konishi
     
  • The checkpoint deletion ioctl (rmcp ioctl) has potential for breaking
    snapshot because it is not fully exclusive with checkpoint mode change
    ioctl (chcp ioctl).

    The rmcp ioctl first tests if the specified checkpoint is a snapshot or
    not within nilfs_cpfile_delete_checkpoint function, and then calls
    nilfs_cpfile_delete_checkpoints function to actually invalidate the
    checkpoint only if it's not a snapshot. However, the checkpoint can be
    changed into a snapshot by the chcp ioctl between these two operations.
    In that case, calling nilfs_cpfile_delete_checkpoints() wrongly
    invalidates the snapshot, which leads to snapshot list corruption and
    snapshot count mismatch.

    This fixes the issue by changing nilfs_cpfile_delete_checkpoints() so
    that it reconfirms the target checkpoints are snapshot or not.

    This second check is exclusive with the chcp operation since it is
    protected by an existing semaphore.

    Signed-off-by: Ryusuke Konishi
    Cc: Fernando Luis Vazquez Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryusuke Konishi
     
  • ->delete_inode(), ->write_super_lockfs(), ->unlockfs() are gone so remove
    references to them in the NTFS code. Noticed while cleaning up the
    fsfreeze mess.

    Signed-off-by: Fernando Luis Vazquez Cao
    Signed-off-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fernando Luis Vazquez Cao
     
  • Add omitted comment for ns_mount_state field of the_nilfs structure.

    Signed-off-by: Vyacheslav Dubeyko
    Signed-off-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
     

14 Jul, 2012

3 commits

  • Pass mount flags to sget() so that it can use them in initialising a new
    superblock before the set function is called. They could also be passed to the
    compare function.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • boolean "does it have to be exclusive?" flag is passed instead;
    Local filesystem should just ignore it - the object is guaranteed
    not to be there yet.

    Signed-off-by: Al Viro

    Al Viro
     
  • Just the flags; only NFS cares even about that, but there are
    legitimate uses for such argument. And getting rid of that
    completely would require splitting ->lookup() into a couple
    of methods (at least), so let's leave that alone for now...

    Signed-off-by: Al Viro

    Al Viro
     

21 Jun, 2012

1 commit

  • A gc-inode is a pseudo inode used to buffer the blocks to be moved by
    garbage collection.

    Block caches of gc-inodes must be cleared every time a garbage collection
    function (nilfs_clean_segments) completes. Otherwise, stale blocks
    buffered in the caches may be wrongly reused in successive calls of the GC
    function.

    For user files, this is not a problem because their gc-inodes are
    distinguished by a checkpoint number as well as an inode number. They
    never buffer different blocks if either an inode number, a checkpoint
    number, or a block offset differs.

    However, gc-inodes of sufile, cpfile and DAT file can store different data
    for the same block offset. Thus, the nilfs_clean_segments function can
    move incorrect block for these meta-data files if an old block is cached.
    I found this is really causing meta-data corruption in nilfs.

    This fixes the issue by ensuring cache clear of gc-inodes and resolves
    reported GC problems including checkpoint file corruption, b-tree
    corruption, and the following warning during GC.

    nilfs_palloc_freev: entry number 307234 already freed.
    ...

    Signed-off-by: Ryusuke Konishi
    Tested-by: Ryusuke Konishi
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryusuke Konishi
     

02 Jun, 2012

1 commit

  • Pull vfs changes from Al Viro.
    "A lot of misc stuff. The obvious groups:
    * Miklos' atomic_open series; kills the damn abuse of
    ->d_revalidate() by NFS, which was the major stumbling block for
    all work in that area.
    * ripping security_file_mmap() and dealing with deadlocks in the
    area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
    general.
    * ->encode_fh() switched to saner API; insane fake dentry in
    mm/cleancache.c gone.
    * assorted annotations in fs (endianness, __user)
    * parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
    * ->update_time() work from Josef.
    * other bits and pieces all over the place.

    Normally it would've been in two or three pull requests, but
    signal.git stuff had eaten a lot of time during this cycle ;-/"

    Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
    'truncate_range' inode method was removed by the VM changes, the VFS
    update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
    to sparse fix added twice, with other changes nearby).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
    nfs: don't open in ->d_revalidate
    vfs: retry last component if opening stale dentry
    vfs: nameidata_to_filp(): don't throw away file on error
    vfs: nameidata_to_filp(): inline __dentry_open()
    vfs: do_dentry_open(): don't put filp
    vfs: split __dentry_open()
    vfs: do_last() common post lookup
    vfs: do_last(): add audit_inode before open
    vfs: do_last(): only return EISDIR for O_CREAT
    vfs: do_last(): check LOOKUP_DIRECTORY
    vfs: do_last(): make ENOENT exit RCU safe
    vfs: make follow_link check RCU safe
    vfs: do_last(): use inode variable
    vfs: do_last(): inline walk_component()
    vfs: do_last(): make exit RCU safe
    vfs: split do_lookup()
    Btrfs: move over to use ->update_time
    fs: introduce inode operation ->update_time
    reiserfs: get rid of resierfs_sync_super
    reiserfs: mark the superblock as dirty a bit later
    ...

    Linus Torvalds