11 Apr, 2014

1 commit


04 Apr, 2014

1 commit

  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Apr, 2014

2 commits

  • Mark functions as static in exofs/ore_raid.c because they are not used
    outside this file.

    This also eliminates the following warning in exofs/ore_raid.c:
    fs/exofs/ore_raid.c:24:14: warning: no previous prototype for _raid_page_alloc [-Wmissing-prototypes]
    fs/exofs/ore_raid.c:29:6: warning: no previous prototype for _raid_page_free [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Signed-off-by: Boaz Harrosh

    Rashika Kheria
     
  • Mark function as static in exofs/super.c because it is not used outside
    this file.

    This also eliminates the following warning in exofs/super.c:
    fs/exofs/super.c:546:5: warning: no previous prototype \
    for __alloc_dev_table[-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Signed-off-by: Boaz Harrosh

    Rashika Kheria
     

24 Jan, 2014

4 commits

  • In debug mode exofs is too verbose. Hiding the real problems
    remove some trivial stuff.

    Also fix some other prints.

    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     
  • If there was an error in fetching an object or extracting
    inode info from attributes. Which means corrupted storage.
    Let it be an empty ZERO dated directory entry so it can be
    deleted. Otherwise the all directory will be inaccessible.

    This does not loose data, because if there is an orphan object
    somewhere it will be recovered by fschk. But usually this only
    means corrupted dir entry. The object was never generated and
    only its link exist. This way we can delete the bad entry.

    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     
  • With this minimal do nothing patch an application can open O_DIRECT
    and then actually do buffered sync IO instead. But the aio API is
    supported which is a good thing

    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     
  • In the case of target returning OSD_ERR_PRI_CLEAR_PAGES when we
    only sent for attributes don't crash on NULL bio.
    This is an osd-target bug but don't crash regardless

    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     

23 Jan, 2014

1 commit

  • At IO preparation we calculate the max pages at each device and
    allocate a BIO per device of that size. The calculation was wrong
    on some unaligned corner cases offset/length combination and would
    make prepare return with -ENOMEM. This would be bad for pnfs-objects
    that would in that case IO through MDS. And fatal for exofs were it
    would fail writes with EIO.

    Fix it by doing the proper math, that will work in all cases. (I
    ran a test with all possible offset/length combinations this time
    round).

    Also when reading we do not need to allocate for the parity units
    since we jump over them.

    Also lower the max_io_length to take into account the parity pages
    so not to allocate BIOs bigger than PAGE_SIZE

    CC: Stable Kernel
    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     

13 Sep, 2013

1 commit


03 Jul, 2013

1 commit

  • Pull ext4 update from Ted Ts'o:
    "Lots of bug fixes, cleanups and optimizations. In the bug fixes
    category, of note is a fix for on-line resizing file systems where the
    block size is smaller than the page size (i.e., file systems 1k blocks
    on x86, or more interestingly file systems with 4k blocks on Power or
    ia64 systems.)

    In the cleanup category, the ext4's punch hole implementation was
    significantly improved by Lukas Czerner, and now supports bigalloc
    file systems. In addition, Jan Kara significantly cleaned up the
    write submission code path. We also improved error checking and added
    a few sanity checks.

    In the optimizations category, two major optimizations deserve
    mention. The first is that ext4_writepages() is now used for
    nodelalloc and ext3 compatibility mode. This allows writes to be
    submitted much more efficiently as a single bio request, instead of
    being sent as individual 4k writes into the block layer (which then
    relied on the elevator code to coalesce the requests in the block
    queue). Secondly, the extent cache shrink mechanism, which was
    introduce in 3.9, no longer has a scalability bottleneck caused by the
    i_es_lru spinlock. Other optimizations include some changes to reduce
    CPU usage and to avoid issuing empty commits unnecessarily."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
    ext4: optimize starting extent in ext4_ext_rm_leaf()
    jbd2: invalidate handle if jbd2_journal_restart() fails
    ext4: translate flag bits to strings in tracepoints
    ext4: fix up error handling for mpage_map_and_submit_extent()
    jbd2: fix theoretical race in jbd2__journal_restart
    ext4: only zero partial blocks in ext4_zero_partial_blocks()
    ext4: check error return from ext4_write_inline_data_end()
    ext4: delete unnecessary C statements
    ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
    jbd2: move superblock checksum calculation to jbd2_write_superblock()
    ext4: pass inode pointer instead of file pointer to punch hole
    ext4: improve free space calculation for inline_data
    ext4: reduce object size when !CONFIG_PRINTK
    ext4: improve extent cache shrink mechanism to avoid to burn CPU time
    ext4: implement error handling of ext4_mb_new_preallocation()
    ext4: fix corruption when online resizing a fs with 1K block size
    ext4: delete unused variables
    ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
    jbd2: remove debug dependency on debug_fs and update Kconfig help text
    jbd2: use a single printk for jbd_debug()
    ...

    Linus Torvalds
     

29 Jun, 2013

1 commit


22 May, 2013

1 commit

  • Currently there is no way to truncate partial page where the end
    truncate point is not at the end of the page. This is because it was not
    needed and the functionality was enough for file system truncate
    operation to work properly. However more file systems now support punch
    hole feature and it can benefit from mm supporting truncating page just
    up to the certain point.

    Specifically, with this functionality truncate_inode_pages_range() can
    be changed so it supports truncating partial page at the end of the
    range (currently it will BUG_ON() if 'end' is not at the end of the
    page).

    This commit changes the invalidatepage() address space operation
    prototype to accept range to be invalidated and update all the instances
    for it.

    We also change the block_invalidatepage() in the same way and actually
    make a use of the new length argument implementing range invalidation.

    Actual file system implementations will follow except the file systems
    where the changes are really simple and should not change the behaviour
    in any way .Implementation for truncate_page_range() which will be able
    to accept page unaligned ranges will follow as well.

    Signed-off-by: Lukas Czerner
    Cc: Andrew Morton
    Cc: Hugh Dickins

    Lukas Czerner
     

24 Mar, 2013

1 commit

  • __bio_for_each_segment() iterates bvecs from the specified index
    instead of bio->bv_idx. Currently, the only usage is to walk all the
    bvecs after the bio has been advanced by specifying 0 index.

    For immutable bvecs, we need to split these apart;
    bio_for_each_segment() is going to have a different implementation.
    This will also help document the intent of code that's using it -
    bio_for_each_segment_all() is only legal to use for code that owns the
    bio.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Neil Brown
    CC: Boaz Harrosh

    Kent Overstreet
     

04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.

    A common practice is to build all of the kernel code and leave code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with a fs- prefix limits the pool of possible
    modules that can be loaded by mount to just filesystems trivially
    making things safer with no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives. Allowing simple, safe,
    well understood work-arounds to known problematic software.

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as it's module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases. The most significant being autofs that lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the request
    module in get_fs_type() without having any special permissions, and
    people get uncomfortable when a user specified string (in this case
    the filesystem type) goes all of the way to request_module.

    After having looked at this issue I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regards to the users permissions. In general all a filesystem
    module does once loaded is call register_filesystem() and go to sleep.
    Which means there is not much attack surface exposed by loading a
    filesytem module unless the filesystem is mounted. In a user
    namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
    which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 Feb, 2013

1 commit


14 Dec, 2012

1 commit

  • Same bug as fixed by Idan for write_exec was in read_exec.
    Fix the io_state leak and pages state on read error.

    Also while at it:
    The if (!pcol->read_4_write) at the error path is redundant
    because all goto err; are after the if (pcol->read_4_write)
    bale out.

    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     

12 Dec, 2012

1 commit

  • if ore_write() fails, we would unlock the pages of pcol, which is now
    empty, rather than pcol_copy which owns the pages when ore_write() is
    called. this means that no pages will actually be unlocked
    (pcol.nr_pages == 0) and the writing process (more accurately, the
    syncing process) will hang waiting for a writeback notification that
    never comes.

    moreover, if ore_write() fails, pcol_free() is called for pcol, whereas
    pcol_copy is the object owning the ore_io_state, thus leaking the
    ore_io_state.

    [Boaz]
    I have simplified Idan's original patch a bit, everything else still
    holds

    Signed-off-by: Idan Kedar
    Signed-off-by: Boaz Harrosh

    Idan Kedar
     

12 Oct, 2012

1 commit

  • Pull pile 2 of vfs updates from Al Viro:
    "Stuff in this one - assorted fixes, lglock tidy-up, death to
    lock_super().

    There'll be a VFS pile tomorrow (with patches from Jeff Layton,
    sanitizing getname() and related parts of audit and preparing for
    ESTALE fixes), but I'd rather push the stuff in this one ASAP - some
    of the bugs closed here are quite unpleasant."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: bogus warnings in fs/namei.c
    consitify do_mount() arguments
    lglock: add DEFINE_STATIC_LGLOCK()
    lglock: make the per_cpu locks static
    lglock: remove unused DEFINE_LGLOCK_LOCKDEP()
    MAX_LFS_FILESIZE definition for 64bit needs LL...
    tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking
    vfs: drop lock/unlock super
    ufs: drop lock/unlock super
    sysv: drop lock/unlock super
    hpfs: drop lock/unlock super
    fat: drop lock/unlock super
    ext3: drop lock/unlock super
    exofs: drop lock/unlock super
    dup3: Return an error when oldfd == newfd.
    fs: handle failed audit_log_start properly
    fs: prevent use after free in auditing when symlink following was denied

    Linus Torvalds
     

11 Oct, 2012

1 commit

  • Pull block IO update from Jens Axboe:
    "Core block IO bits for 3.7. Not a huge round this time, it contains:

    - First series from Kent cleaning up and generalizing bio allocation
    and freeing.

    - WRITE_SAME support from Martin.

    - Mikulas patches to prevent O_DIRECT crashes when someone changes
    the block size of a device.

    - Make bio_split() work on data-less bio's (like trim/discards).

    - A few other minor fixups."

    Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
    Morton. It is due to the VM no longer using a prio-tree (see commit
    6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

    So make set_blocksize() use mapping_mapped() instead of open-coding the
    internal VM knowledge that has changed.

    * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
    block: makes bio_split support bio without data
    scatterlist: refactor the sg_nents
    scatterlist: add sg_nents
    fs: fix include/percpu-rwsem.h export error
    percpu-rw-semaphore: fix documentation typos
    fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
    blockdev: turn a rw semaphore into a percpu rw semaphore
    Fix a crash when block device is read and block size is changed at the same time
    block: fix request_queue->flags initialization
    block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
    block: ioctl to zero block ranges
    block: Make blkdev_issue_zeroout use WRITE SAME
    block: Implement support for WRITE SAME
    block: Consolidate command flag and queue limit checks for merges
    block: Clean up special command handling logic
    block/blk-tag.c: Remove useless kfree
    block: remove the duplicated setting for congestion_threshold
    block: reject invalid queue attribute values
    block: Add bio_clone_bioset(), bio_clone_kmalloc()
    block: Consolidate bio_alloc_bioset(), bio_kmalloc()
    ...

    Linus Torvalds
     

10 Oct, 2012

1 commit


09 Oct, 2012

1 commit


04 Oct, 2012

1 commit


03 Oct, 2012

3 commits

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c spiltoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • There's no reason to call rcu_barrier() on every
    deactivate_locked_super(). We only need to make sure that all delayed rcu
    free inodes are flushed before we destroy related cache.

    Removing rcu_barrier() from deactivate_locked_super() affects some fast
    paths. E.g. on my machine exit_group() of a last process in IPC
    namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

    Signed-off-by: Kirill A. Shutemov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kirill A. Shutemov
     
  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values are far down into
    subsystems and filesystems as reasonable. Leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    Letting compile type incompatible compile errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privlige checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
    root in a user names to do those things that today we only forbid to
    non-root users because it will confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/git logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid wherwe appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
     

21 Sep, 2012

1 commit


09 Sep, 2012

1 commit

  • Previously, there was bio_clone() but it only allocated from the fs bio
    set; as a result various users were open coding it and using
    __bio_clone().

    This changes bio_clone() to become bio_clone_bioset(), and then we add
    bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
    the functionality the last patch adedd.

    This will also help in a later patch changing how bio cloning works.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    CC: Alasdair Kergon
    CC: Boaz Harrosh
    CC: Jeff Garzik
    Acked-by: Jeff Garzik
    Signed-off-by: Jens Axboe

    Kent Overstreet
     

13 Aug, 2012

1 commit


04 Aug, 2012

1 commit

  • Pull exofs update from Boaz Harrosh:
    "They are all mostly fixes, except the most important patch by Artem
    Bityutskiy which removes the use of s_dirt. After this patch s_dirt
    can be completely removed from the tree."

    * 'for-linus' of git://git.open-osd.org/linux-open-osd:
    ore: Fix out-of-bounds access in _ios_obj()
    exofs: Use proper max_IO calculations from ore
    exofs: Fix __r4w_get_page when offset is beyond i_size
    exofs: stop using s_dirt
    exofs: readpage_strip: Add a BUG_ON to check for PageLocked(page)

    Linus Torvalds
     

02 Aug, 2012

5 commits

  • _ios_obj() is accessed by group_index not device_table index.

    The oc->comps array is only a group_full of devices at a time
    it is not like ore_comp_dev() which is indexed by a global
    device_table index.

    This did not BUG until now because exofs only uses a single
    COMP for all devices. But with other FSs like PanFS this is
    not true.

    This bug was only in the write_path, all other users were
    using it correctly

    [This is a bug since 3.2 Kernel]
    CC: Stable Tree

    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     
  • exofs_max_io_pages should just use the ORE's
    calculated layout->max_io_length,

    And avoid unnecessary BUGs, calculations made here were
    also a layering violation.

    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     
  • It is very common for the end of the file to be unaligned on
    stripe size. But since we know it's beyond file's end then
    the XOR should be preformed with all zeros.

    Old code used to just read zeros out of the OSD devices, which is a great
    waist. But what scares me more about this situation is that, we now have
    pages attached to the file's mapping that are beyond i_size. I don't
    like the kind of bugs this calls for.

    Fix both birds, by returning a global ZERO_PAGE, if offset is beyond
    i_size.

    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     
  • Exofs has the '->write_super()' handler and makes some use of the '->s_dirt'
    superblock flag, but it really needs neither of them because it never sets
    's_dirt' to one which means the VFS never calls its '->write_super()' handler.
    Thus, remove both.

    Note, I am trying to remove both 's_dirt' and 'write_super()' from VFS
    altogether once all users are gone.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Boaz Harrosh

    Artem Bityutskiy
     
  • readpage_strip can be called from several code paths all of which
    require that the page be locked before any operations are carried
    out.

    Since we export the exofs_readpage callback to the VFS, add a
    BUG_ON to check for PageLocked(page) to make sure that this
    understanding is never compromised.

    Signed-off-by: Kautuk Consul
    Signed-off-by: Boaz Harrosh

    Kautuk Consul
     

24 Jul, 2012

1 commit

  • Pull the big VFS changes from Al Viro:
    "This one is *big* and changes quite a few things around VFS. What's in there:

    - the first of two really major architecture changes - death to open
    intents.

    The former is finally there; it was very long in making, but with
    Miklos getting through really hard and messy final push in
    fs/namei.c, we finally have it. Unlike his variant, this one
    doesn't introduce struct opendata; what we have instead is
    ->atomic_open() taking preallocated struct file * and passing
    everything via its fields.

    Instead of returning struct file *, it returns -E... on error, 0
    on success and 1 in "deal with it yourself" case (e.g. symlink
    found on server, etc.).

    See comments before fs/namei.c:atomic_open(). That made a lot of
    goodies finally possible and quite a few are in that pile:
    ->lookup(), ->d_revalidate() and ->create() do not get struct
    nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
    flags instead, ->create() gets "do we want it exclusive" flag.

    With the introduction of new helper (kern_path_locked()) we are rid
    of all struct nameidata instances outside of fs/namei.c; it's still
    visible in namei.h, but not for long. Come the next cycle,
    declaration will move either to fs/internal.h or to fs/namei.c
    itself. [me, miklos, hch]

    - The second major change: behaviour of final fput(). Now we have
    __fput() done without any locks held by caller *and* not from deep
    in call stack.

    That obviously lifts a lot of constraints on the locking in there.
    Moreover, it's legal now to call fput() from atomic contexts (which
    has immediately simplified life for aio.c). We also don't need
    anti-recursion logics in __scm_destroy() anymore.

    There is a price, though - the damn thing has become partially
    asynchronous. For fput() from normal process we are guaranteed
    that pending __fput() will be done before the caller returns to
    userland, exits or gets stopped for ptrace.

    For kernel threads and atomic contexts it's done via
    schedule_work(), so theoretically we might need a way to make sure
    it's finished; so far only one such place had been found, but there
    might be more.

    There's flush_delayed_fput() (do all pending __fput()) and there's
    __fput_sync() (fput() analog doing __fput() immediately). I hope
    we won't need them often; see warnings in fs/file_table.c for
    details. [me, based on task_work series from Oleg merged last
    cycle]

    - sync series from Jan

    - large part of "death to sync_supers()" work from Artem; the only
    bits missing here are exofs and ext4 ones. As far as I understand,
    those are going via the exofs and ext4 trees resp.; once they are
    in, we can put ->write_super() to the rest, along with the thread
    calling it.

    - preparatory bits from unionmount series (from dhowells).

    - assorted cleanups and fixes all over the place, as usual.

    This is not the last pile for this cycle; there's at least jlayton's
    ESTALE work and fsfreeze series (the latter - in dire need of fixes,
    so I'm not sure it'll make the cut this cycle). I'll probably throw
    symlink/hardlink restrictions stuff from Kees into the next pile, too.
    Plus there's a lot of misc patches I hadn't thrown into that one -
    it's large enough as it is..."

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
    ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
    btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
    switch dentry_open() to struct path, make it grab references itself
    spufs: shift dget/mntget towards dentry_open()
    zoran: don't bother with struct file * in zoran_map
    ecryptfs: don't reinvent the wheels, please - use struct completion
    don't expose I_NEW inodes via dentry->d_inode
    tidy up namei.c a bit
    unobfuscate follow_up() a bit
    ext3: pass custom EOF to generic_file_llseek_size()
    ext4: use core vfs llseek code for dir seeks
    vfs: allow custom EOF in generic_file_llseek code
    vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
    vfs: Remove unnecessary flushing of block devices
    vfs: Make sys_sync writeout also block device inodes
    vfs: Create function for iterating over block devices
    vfs: Reorder operations during sys_sync
    quota: Move quota syncing to ->sync_fs method
    quota: Split dquot_quota_sync() to writeback and cache flushing part
    vfs: Move noop_backing_dev_info check from sync into writeback
    ...

    Linus Torvalds
     

20 Jul, 2012

3 commits

  • The read-4-write pages are locked in address ascending order.
    But where unlocked in a way easiest for coding. Fix that,
    locks should be released in opposite order of locking, .i.e
    descending address order.

    I have not hit this dead-lock. It was found by inspecting the
    dbug print-outs. I suspect there is an higher lock at caller that
    protects us, but fix it regardless.

    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     
  • Do to OOM situations the ore might fail to allocate all resources
    needed for IO of the full request. If some progress was possible
    it would proceed with a partial/short request, for the sake of
    forward progress.

    Since this crashes NFS-core and exofs is just fine without it just
    remove this contraption, and fail.

    TODO:
    Support real forward progress with some reserved allocations
    of resources, such as mem pools and/or bio_sets

    [Bug since 3.2 Kernel]
    CC: Stable Tree
    CC: Benny Halevy
    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     
  • In RAID_5/6 We used to not permit an IO that it's end
    byte is not stripe_size aligned and spans more than one stripe.
    .i.e the caller must check if after submission the actual
    transferred bytes is shorter, and would need to resubmit
    a new IO with the remainder.

    Exofs supports this, and NFS was supposed to support this
    as well with it's short write mechanism. But late testing has
    exposed a CRASH when this is used with none-RPC layout-drivers.

    The change at NFS is deep and risky, in it's place the fix
    at ORE to lift the limitation is actually clean and simple.
    So here it is below.

    The principal here is that in the case of unaligned IO on
    both ends, beginning and end, we will send two read requests
    one like old code, before the calculation of the first stripe,
    and also a new site, before the calculation of the last stripe.
    If any "boundary" is aligned or the complete IO is within a single
    stripe. we do a single read like before.

    The code is clean and simple by splitting the old _read_4_write
    into 3 even parts:
    1._read_4_write_first_stripe
    2. _read_4_write_last_stripe
    3. _read_4_write_execute

    And calling 1+3 at the same place as before. 2+3 before last
    stripe, and in the case of all in a single stripe then 1+2+3
    is preformed additively.

    Why did I not think of it before. Well I had a strike of
    genius because I have stared at this code for 2 years, and did
    not find this simple solution, til today. Not that I did not try.

    This solution is much better for NFS than the previous supposedly
    solution because the short write was dealt with out-of-band after
    IO_done, which would cause for a seeky IO pattern where as in here
    we execute in order. At both solutions we do 2 separate reads, only
    here we do it within a single IO request. (And actually combine two
    writes into a single submission)

    NFS/exofs code need not change since the ORE API communicates the new
    shorter length on return, what will happen is that this case would not
    occur anymore.

    hurray!!

    [Stable this is an NFS bug since 3.2 Kernel should apply cleanly]
    CC: Stable Tree
    Signed-off-by: Boaz Harrosh

    Boaz Harrosh
     

14 Jul, 2012

1 commit