09 Dec, 2012

1 commit

  • The direct-IO write path already had the i_size checks in mm/filemap.c,
    but it turns out the read path did not, and removing the block size
    checks in fs/block_dev.c (commit bbec0270bdd8: "blkdev_max_block: make
    private to fs/buffer.c") removed the magic "shrink IO to past the end of
    the device" code there.

    Fix it by truncating the IO to the size of the block device, like the
    write path already does.

    NOTE! I suspect the write path would be *much* better off doing it this
    way in fs/block_dev.c, rather than hidden deep in mm/filemap.c. The
    mm/filemap.c code is extremely hard to follow, and has various
    conditionals on the target being a block device (ie the flag passed in
    to 'generic_write_checks()', along with a conditional update of the
    inode timestamp etc).

    It is also quite possible that we should treat this whole block device
    size as a "s_maxbytes" issue, and try to make the logic even more
    generic. However, in the meantime this is the fairly minimal targeted
    fix.
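    The truncation the fix describes can be sketched in userspace like
    this (illustrative names and standalone setting, not the actual
    fs/block_dev.c code):

```c
#include <stdint.h>

/* Illustrative model: clamp a direct-IO read so it never extends past the
 * end of the block device, mirroring what the write path already did. */
static int64_t clamp_dio_to_bdev(int64_t pos, int64_t count, int64_t bdev_size)
{
    if (pos >= bdev_size)
        return 0;                    /* starts past the end: nothing to read */
    if (pos + count > bdev_size)
        count = bdev_size - pos;     /* truncate IO to the device size */
    return count;
}
```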

    Noted by Milan Broz thanks to a regression test for the cryptsetup
    reencrypt tool.

    Reported-and-tested-by: Milan Broz
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 Dec, 2012

1 commit

  • The block device access simplification that avoided accessing the (racy)
    block size information (commit bbec0270bdd8: "blkdev_max_block: make
    private to fs/buffer.c") no longer checks the maximum block size in the
    block mapping path.

    That was _almost_ as simple as just removing the code entirely, because
    the readers and writers all check the size of the device anyway, so
    under normal circumstances it "just worked".

    However, the block size may be such that the end of the device may
    straddle one single buffer_head. At which point we may still want to
    access the end of the device, but the buffer we use to access it
    partially extends past the end.

    The 'bd_set_size()' function intentionally sets the block size to avoid
    this, but mounting the device - or setting the block size by hand to
    some other value - can modify that block size.

    So instead, teach 'submit_bh()' about the special case of a buffer
    head straddling the end of the device, turning such an access into a
    smaller IO access and avoiding the problem.

    This, btw, also means that unlike before, we can now access the whole
    device regardless of device block size setting. So now, even if the
    device size is only 512-byte aligned, we can read and write even the
    last sector even when having a much bigger block size for accessing the
    rest of the device.

    So with this, we could now get rid of the 'bd_set_size()' block size
    code entirely - resulting in faster IO for the common case - but that
    would be a separate patch.
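    The effect on a straddling buffer head can be modeled like this (a
    userspace sketch with sizes in bytes, not the actual submit_bh()
    code):

```c
#include <stdint.h>

/* Illustrative model: a buffer head starting at bh_pos covering bh_size
 * bytes of a dev_size-byte device; return how many bytes may actually be
 * submitted, shrinking the IO when the buffer straddles the end. */
static uint64_t straddling_io_bytes(uint64_t bh_pos, uint64_t bh_size,
                                    uint64_t dev_size)
{
    if (bh_pos >= dev_size)
        return 0;                    /* buffer entirely past the end */
    if (bh_pos + bh_size > dev_size)
        return dev_size - bh_pos;    /* straddles: smaller IO access */
    return bh_size;                  /* normal case: full buffer */
}
```

    With a 4k block size on a 4608-byte (512-byte aligned) device, the
    last buffer shrinks to a single 512-byte sector instead of failing.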

    Reported-and-tested-by: Romain Francoise
    Reported-and-tested-by: Meelis Roos
    Reported-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 Dec, 2012

1 commit

  • Merge 'block-dev' branch.

    I was going to just mark everything here for stable and leave it to the
    3.8 merge window, but having decided on doing another -rc, I might as
    well merge it now.

    This removes the bd_block_size_semaphore semaphore that was added in
    this release to fix a race condition between block size changes and
    block IO, and replaces it with atomicity guarantees in fs/buffer.c
    instead, along with simplifying fs/block_dev.c.

    This removes more lines than it adds, makes the code generally simpler,
    and avoids the latency/rt issues that the block size semaphore
    introduced for mount.

    I'm not happy with the timing, but it wouldn't be much better doing this
    during the merge window and then having some delayed back-port of it
    into stable.

    * block-dev:
    blkdev_max_block: make private to fs/buffer.c
    direct-io: don't read inode->i_blkbits multiple times
    blockdev: remove bd_block_size_semaphore again
    fs/buffer.c: make block-size be per-page and protected by the page lock

    Linus Torvalds
     

28 Nov, 2012

1 commit

  • Commit eddb079deb4 created a regression in the writepages codepath.
    Previously, whenever it needed to check the size of the file, it did so
    by consulting the inode->i_size field directly. With that patch, the
    i_size was fetched once on entry into the writepages code and that value
    was used henceforth.

    If the file is changing size though (for instance, if someone is writing
    to it or has truncated it), then that value is likely to be wrong. This
    can lead to data corruption. Pages past the EOF at the time that the
    writepages call was issued may be silently dropped and ignored because
    cifs_writepages wrongly assumes that the file must have been truncated
    in the interim.

    Fix cifs_writepages to fetch the size from the inode->i_size field
    instead, to properly account for this possibility.
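    A minimal userspace model of the bug (illustrative names, not the
    actual cifs code): with a cached size snapshot, pages a concurrent
    writer added past the stale EOF are silently skipped; re-reading
    i_size per page writes them all.

```c
/* Model: compare writing against a stale size snapshot vs. re-reading
 * the live inode size for each page, as the fix does. */
struct inode_model { long i_size; };

static int pages_written(const struct inode_model *inode, long stale_isize,
                         long npages, long page_size, int reread_size)
{
    int written = 0;
    for (long i = 0; i < npages; i++) {
        long isize = reread_size ? inode->i_size : stale_isize;
        if (i * page_size < isize)   /* page starts before EOF: write it */
            written++;
    }
    return written;
}
```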

    Original bug report is here:

    https://bugzilla.kernel.org/show_bug.cgi?id=50991

    Reported-and-Tested-by: Maxim Britov
    Reviewed-by: Suresh Jayaraman
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French

    Jeff Layton
     

27 Nov, 2012

4 commits

  • Merge misc fixes from Andrew Morton:
    "8 fixes"

    * emailed patches from Andrew Morton : (8 patches)
    futex: avoid wake_futex() for a PI futex_q
    watchdog: using u64 in get_sample_period()
    writeback: put unused inodes to LRU after writeback completion
    mm: vmscan: check for fatal signals iff the process was throttled
    Revert "mm: remove __GFP_NO_KSWAPD"
    proc: check vma->vm_file before dereferencing
    UAPI: strip the _UAPI prefix from header guards during header installation
    include/linux/bug.h: fix sparse warning related to BUILD_BUG_ON_INVALID

    Linus Torvalds
     
  • Pull ext3 regression fix from Jan Kara:
    "Fix an ext3 regression introduced during 3.7 merge window. It leads
    to deadlock if you stress the filesystem in the right way (luckily
    only if blocksize < pagesize)."

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    jbd: Fix lock ordering bug in journal_unmap_buffer()

    Linus Torvalds
     
  • Commit 169ebd90131b ("writeback: Avoid iput() from flusher thread")
    removed iget-iput pair from inode writeback. As a side effect, inodes
    that are dirty during iput_final() call won't be ever added to inode LRU
    (iput_final() doesn't add dirty inodes to LRU and later when the inode
    is cleaned there's no one to add the inode there). Thus inodes are
    effectively unreclaimable until someone looks them up again.

    The practical effect of this bug is limited by the fact that inodes are
    pinned by a dentry for long enough that the inode gets cleaned. But
    still the bug can have nasty consequences leading up to OOM conditions
    under certain circumstances. Following can easily reproduce the
    problem:

    for (( i = 0; i < 1000; i++ )); do
        mkdir $i
        for (( j = 0; j < 1000; j++ )); do
            touch $i/$j
            echo 2 > /proc/sys/vm/drop_caches
        done
    done

    then one needs to run 'sync; ls -lR' to make inodes reclaimable again.

    We fix the issue by inserting unused clean inodes into the LRU after
    writeback finishes in inode_sync_complete().
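    The lifecycle can be modeled in a few lines (an illustrative sketch,
    not the actual VFS code): iput_final() skips dirty inodes, and only
    the fixed writeback-completion path puts them on the LRU later.

```c
/* Model of why a dirty-at-iput inode became unreclaimable, and of the
 * fix in inode_sync_complete(). */
struct inode_model { int dirty; int in_use; int on_lru; };

static void iput_final_model(struct inode_model *inode)
{
    inode->in_use = 0;
    if (!inode->dirty)
        inode->on_lru = 1;      /* dirty inodes are skipped here */
}

static void inode_sync_complete_model(struct inode_model *inode, int fixed)
{
    inode->dirty = 0;           /* writeback has cleaned the inode */
    if (fixed && !inode->in_use)
        inode->on_lru = 1;      /* the fix: clean and unused -> LRU */
}
```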

    Signed-off-by: Jan Kara
    Reported-by: OGAWA Hirofumi
    Cc: Al Viro
    Cc: OGAWA Hirofumi
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Commit 7b540d0646ce ("proc_map_files_readdir(): don't bother with
    grabbing files") switched proc_map_files_readdir() to use @f_mode
    directly instead of grabbing @file reference, but same time the test for
    @vm_file presence was lost, leading to a NULL dereference. The patch brings
    the test back.

    The whole proc_map_files feature is wrapped in
    CONFIG_CHECKPOINT_RESTORE (which is set to 'n' by default), so the
    bug doesn't affect regular kernels.

    The regression is 3.7-rc1 only as far as I can tell.
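    The restored test amounts to the usual guard before a pointer
    dereference, sketched here with made-up struct names:

```c
#include <stddef.h>

/* Illustrative model of the restored check: bail out on VMAs with no
 * backing file before touching vma->vm_file fields. */
struct file_model { int f_mode; };
struct vma_model  { struct file_model *vm_file; };

static int vma_file_mode(const struct vma_model *vma)
{
    if (!vma->vm_file)          /* the check that was lost */
        return -1;              /* anonymous mapping: skip it */
    return vma->vm_file->f_mode;
}
```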

    [gorcunov@openvz.org: provided changelog]
    Signed-off-by: Stanislav Kinsbursky
    Acked-by: Cyrill Gorcunov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     

24 Nov, 2012

1 commit

  • Pull MTD fixes from David Woodhouse:
    "The most important part of this is that it fixes a regression in
    Samsung NAND chip detection, introduced by some rework which went into
    3.7. The initial fix wasn't quite complete, so it's in two parts. In
    fact the first part is committed twice (Artem committed his own copy
    of the same patch) and I've merged Artem's tree into mine which
    already had that fix.

    I'd have recommitted that to make it somewhat cleaner, but figured by
    this point in the release cycle it was better to merge *exactly* the
    commits which have been in linux-next.

    If I'd recommitted, I'd also omit the sparse warning fix. But it's
    there, and it's harmless — just marking one function as 'static' in
    onenand code.

    This also includes a couple more fixes for stable: an AB-BA deadlock
    in JFFS2, and an invalid range check in slram."

    * tag 'for-linus-20121123' of git://git.infradead.org/mtd-2.6:
    mtd: nand: fix Samsung SLC detection regression
    mtd: nand: fix Samsung SLC NAND identification regression
    jffs2: Fix lock acquisition order bug in jffs2_write_begin
    mtd: onenand: Make flexonenand_set_boundary static
    mtd: slram: invalid checking of absolute end address
    mtd: ofpart: Fix incorrect NULL check in parse_ofoldpart_partitions()
    mtd: nand: fix Samsung SLC NAND identification regression

    Linus Torvalds
     

23 Nov, 2012

1 commit

  • Commit 09e05d48 introduced a wait for transaction commit into
    journal_unmap_buffer() in the case we are truncating a buffer undergoing commit
    in the page straddling i_size on a filesystem with blocksize < pagesize. Sadly
    we forgot to drop buffer lock before waiting for transaction commit and thus
    deadlock is possible when kjournald wants to lock the buffer.

    Fix the problem by dropping the buffer lock before waiting for transaction
    commit. Since we are still holding page lock (and that is OK), buffer cannot
    disappear under us.
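    The ordering fix can be modeled sequentially (an illustrative sketch,
    not the real jbd code): kjournald needs the buffer lock to finish the
    commit, so the caller must drop it before blocking.

```c
/* Model: waiting for commit while holding the buffer lock deadlocks,
 * because the committing side needs that lock to make progress. */
struct buf_model { int locked; };

static int commit_completes(const struct buf_model *bh)
{
    return bh->locked ? -1 : 0;  /* -1: kjournald blocks -> deadlock */
}

static int journal_unmap_buffer_fixed(struct buf_model *bh)
{
    bh->locked = 0;              /* drop the buffer lock first (the fix)... */
    return commit_completes(bh); /* ...then wait; the page lock we still
                                  * hold keeps the buffer from vanishing */
}
```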

    CC: stable@vger.kernel.org # Wherever commit 09e05d48 was taken
    Signed-off-by: Jan Kara

    Jan Kara
     

21 Nov, 2012

1 commit

  • Pull reiserfs and ext3 fixes from Jan Kara:
    "Fixes of reiserfs deadlocks when quotas are enabled (locking there was
    completely busted by BKL conversion) and also one small ext3 fix in
    the trim interface."

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    ext3: Avoid underflow of in ext3_trim_fs()
    reiserfs: Move quota calls out of write lock
    reiserfs: Protect reiserfs_quota_write() with write lock
    reiserfs: Protect reiserfs_quota_on() with write lock
    reiserfs: Fix lock ordering during remount

    Linus Torvalds
     

19 Nov, 2012

3 commits

  • If the FAN_Q_OVERFLOW bit set in event->mask, the fanotify event
    metadata will not contain a valid file descriptor, but
    copy_event_to_user() didn't check for that, and unconditionally does a
    fd_install() on the file descriptor.

    Which in turn will cause a BUG_ON() in __fd_install().

    Introduced by commit 352e3b249284 ("fanotify: sanitize failure exits in
    copy_event_to_user()")

    Mea culpa - missed that path ;-/

    Reported-by: Alex Shi
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Pull misc VFS fixes from Al Viro:
    "Remove a bogus BUG_ON() that can trigger spuriously + alpha bits of
    do_mount() constification I'd missed during the merge window."

    This pull request came in a week ago; I missed it for some reason.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    kill bogus BUG_ON() in do_close_on_exec()
    missing const in alpha callers of do_mount()

    Linus Torvalds
     
  • Pull xfs bugfixes from Ben Myers:

    - fix attr tree double split corruption

    - fix broken error handling in xfs_vm_writepage

    - drop buffer io reference when a bad bio is built

    * tag 'for-linus-v3.7-rc7' of git://oss.sgi.com/xfs/xfs:
    xfs: drop buffer io reference when a bad bio is built
    xfs: fix broken error handling in xfs_vm_writepage
    xfs: fix attr tree double split corruption

    Linus Torvalds
     

17 Nov, 2012

4 commits

  • Error handling in xfs_buf_ioapply_map() does not handle IO reference
    counts correctly. We increment the b_io_remaining count before
    building the bio, but then fail to decrement it in the failure case.
    This leads to the buffer never running IO completion and releasing
    the reference that the IO holds, so at unmount we can leak the
    buffer. This leak is captured by this assert failure during unmount:

    XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273

    This is not a new bug - the b_io_remaining accounting has had this
    problem for a long, long time - it's just very hard to get a
    zero length bio being built by this code...

    Further, the buffer IO error can be overwritten on a multi-segment
    buffer by subsequent bio completions for partial sections of the
    buffer. Hence we should only set the buffer error status if the
    buffer is not already carrying an error status. This ensures that a
    partial IO error on a multi-segment buffer will not be lost. This
    part of the problem is a regression, however.
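    Both fixes fit in a few lines of model code (illustrative, not the
    actual xfs_buf completion handler):

```c
/* Model: always release the b_io_remaining reference, and never let a
 * later bio completion overwrite the first recorded error on a
 * multi-segment buffer. */
struct xbuf_model { int b_io_remaining; int b_error; };

static void bio_done_model(struct xbuf_model *bp, int error)
{
    if (error && !bp->b_error)
        bp->b_error = error;     /* keep only the first error */
    bp->b_io_remaining--;        /* drop the IO reference in all cases */
}
```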

    cc:
    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • When we shut down the filesystem, it might first be detected in
    writeback when we are allocating an inode size transaction. This
    happens after we have moved all the pages into the writeback state
    and unlocked them. Unfortunately, if we fail to set up the
    transaction we then abort writeback and try to invalidate the
    current page. This then triggers a BUG() in block_invalidatepage()
    because we are trying to invalidate an unlocked page.

    Fixing this is a bit of a chicken and egg problem - we can't
    allocate the transaction until we've clustered all the pages into
    the IO and we know the size of it (i.e. whether the last block of
    the IO is beyond the current EOF or not). However, we don't want to
    hold pages locked for long periods of time, especially while we lock
    other pages to cluster them into the write.

    To fix this, we need to make a clear delineation in writeback where
    errors can only be handled by IO completion processing. That is,
    once we have marked a page for writeback and unlocked it, we have to
    report errors via IO completion because we've already started the
    IO. We may not have submitted any IO, but we've changed the page
    state to indicate that it is under IO so we must now use the IO
    completion path to report errors.

    To do this, add an error field to xfs_submit_ioend() to pass it the
    error that occurred during the building on the ioend chain. When
    this is non-zero, mark each ioend with the error and call
    xfs_finish_ioend() directly rather than building bios. This will
    immediately push the ioends through completion processing with the
    error that has occurred.
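    The changed submission path can be sketched as follows (an
    illustrative model of the ioend chain, not the actual xfs code):

```c
#include <stddef.h>

/* Model: when building the ioend chain failed, mark every ioend with the
 * error and run completion directly instead of building bios. */
struct ioend_model { int io_error; int completed; struct ioend_model *next; };

static void submit_ioend_model(struct ioend_model *head, int error)
{
    for (struct ioend_model *io = head; io; io = io->next) {
        if (error) {
            io->io_error = error;   /* propagate the build error */
            io->completed = 1;      /* straight to IO completion, no bio */
        }
        /* error == 0: the normal path would build and submit bios here */
    }
}
```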

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • In certain circumstances, a double split of an attribute tree is
    needed to insert or replace an attribute. In rare situations, this
    can go wrong, leaving the attribute tree corrupted. In this case,
    the attr being replaced is the last attr in a leaf node, and the
    replacement is larger so doesn't fit in the same leaf node.

    We have the initial condition of a node format attribute btree with
    two leaves at index 1 and 2. Call them L1 and L2. The
    leaf L1 is completely full, there is not a single byte of free space
    in it. L2 is mostly empty. The attribute being replaced - call it X
    - is the last attribute in L1.

    The way an attribute replace is executed is that the replacement
    attribute - call it Y - is first inserted into the tree, but has an
    INCOMPLETE flag set on it so that list traversals ignore it. Once
    this transaction is committed, a second transaction is run to
    atomically mark Y as COMPLETE and X as INCOMPLETE, so that a
    traversal will now find Y and skip X. Once that transaction is
    committed, attribute X is then removed.

    So, the initial condition is:

        +--------+     +--------+
        | L1     |     | L2     |
        | fwd: 2 |---->| fwd: 0 |
        | bwd: 0 |     | bwd: 1 |
        +--------+     +--------+
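    The INCOMPLETE/COMPLETE protocol described above can be modeled in a
    few lines (illustrative structs, not the on-disk attr format):
    traversals only see attrs that are present and not flagged
    INCOMPLETE, so exactly one of X and Y is visible at every committed
    point.

```c
/* Model of the two-phase attribute replace visibility rule. */
enum { INCOMPLETE = 1 };
struct attr_model { int present; int flags; };

static int visible(const struct attr_model *a)
{
    return a->present && !(a->flags & INCOMPLETE);
}
```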
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • This is mostly a revert of 01dc52ebdf47 ("oom: remove deprecated oom_adj")
    from Davidlohr Bueso.

    It reintroduces /proc/pid/oom_adj for backwards compatibility with earlier
    kernels. It simply scales the value linearly when /proc/pid/oom_score_adj
    is written.

    The major difference is that its scheduled removal is no longer included
    in Documentation/feature-removal-schedule.txt. We do warn users with a
    single printk, though, to suggest the more powerful and supported
    /proc/pid/oom_score_adj interface.
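    The linear scaling is approximately the following (a sketch mapping
    the legacy [-16, 15] range onto [-1000, 1000], not a verbatim copy of
    the proc handler; the rounding here is an assumption):

```c
#define OOM_DISABLE        (-17)
#define OOM_ADJUST_MAX       15
#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX   1000

/* Approximate model: scale oom_adj to oom_score_adj, with -17 keeping
 * its "disable OOM killing" meaning. */
static int oom_adj_to_score_adj(int oom_adj)
{
    if (oom_adj == OOM_DISABLE)
        return OOM_SCORE_ADJ_MIN;
    if (oom_adj == OOM_ADJUST_MAX)
        return OOM_SCORE_ADJ_MAX;
    return oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE;
}
```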

    Reported-by: Artem S. Tashkinov
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

16 Nov, 2012

1 commit

  • Pull UBIFS fixes from Artem Bityutskiy:
    "Two patches which fix a problem reported by several people in the
    past, but only fixed now because no one gave enough material for
    debugging.

    Anyway, these fix the problem that sometimes after a power cut the
    file-system is not mountable with the following symptom:

    grab_empty_leb: could not find an empty LEB

    The fixes make the file-system mountable again."

    * tag 'upstream-3.7-rc6' of git://git.infradead.org/linux-ubifs:
    UBIFS: fix mounting problems after power cuts
    UBIFS: introduce categorized lprops counter

    Linus Torvalds
     

15 Nov, 2012

1 commit

  • Passing a NULL id causes a NULL pointer deference in writers such as
    erst_writer and efi_pstore_write because they expect to update this id.
    Pass a dummy id instead.

    This avoids a cascade of oopses caused when the initial
    pstore_console_write passes a null which in turn causes writes to the
    console causing further oopses in subsequent pstore_console_write calls.
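    The shape of the fix is simply this (an illustrative userspace model;
    the function names are made up):

```c
#include <stdint.h>

/* Model: backend writers unconditionally store a record id through *id,
 * so the console path must pass a dummy variable rather than NULL. */
static void backend_write_model(uint64_t *id)
{
    *id = 42;                  /* would oops if id were NULL */
}

static uint64_t console_write_model(void)
{
    uint64_t id;               /* dummy id, value ignored by the caller */
    backend_write_model(&id);
    return id;
}
```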

    Signed-off-by: Colin Ian King
    Acked-by: Kees Cook
    Cc: stable@vger.kernel.org
    Signed-off-by: Anton Vorontsov

    Colin Ian King
     

12 Nov, 2012

1 commit

  • It can be legitimately triggered via procfs access. Now, at least
    2 of 3 of get_files_struct() callers in procfs are useless, but
    when and if we get rid of those we can always add WARN_ON() here.
    BUG_ON() at that spot is simply wrong.

    Signed-off-by: Al Viro

    Al Viro
     
