13 Mar, 2012

4 commits

  • commit 5bccda0ebc7c0331b81ac47d39e4b920b198b2cd upstream.

    The cifs code will attempt to open files on lookup under certain
    circumstances. What happens though if we find that the file we opened
    was actually a FIFO or other special file?

    Currently, the open filehandle just ends up being leaked, leading to
    a dentry refcount mismatch and an oops on umount. Fix this by having the
    code close the filehandle on the server if it turns out not to be a
    regular file. While we're at it, change this spaghetti if statement
    into a switch too.
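
    A minimal sketch of the not-a-regular-file case (variable names here
    are illustrative; CIFSSMBClose is the existing close call):

    if (newinode && !S_ISREG(newinode->i_mode)) {
            /* The server let us open a FIFO or other special file, but
             * the client can't use that handle; close it on the server
             * so the reference isn't leaked. */
            CIFSSMBClose(xid, tcon, fileid);
    }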

    Reported-by: CAI Qian
    Tested-by: CAI Qian
    Reviewed-by: Shirish Pargaonkar
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 880641bb9da2473e9ecf6c708d993b29928c1b3c upstream.

    Bart Van Assche reported a hung fio process when either hot-removing
    storage or when interrupting the fio process itself. The (pruned) call
    trace for the latter looks like so:

    fio D 0000000000000001 0 6849 6848 0x00000004
    ffff880092541b88 0000000000000046 ffff880000000000 ffff88012fa11dc0
    ffff88012404be70 ffff880092541fd8 ffff880092541fd8 ffff880092541fd8
    ffff880128b894d0 ffff88012404be70 ffff880092541b88 000000018106f24d
    Call Trace:
    schedule+0x3f/0x60
    io_schedule+0x8f/0xd0
    wait_for_all_aios+0xc0/0x100
    exit_aio+0x55/0xc0
    mmput+0x2d/0x110
    exit_mm+0x10d/0x130
    do_exit+0x671/0x860
    do_group_exit+0x44/0xb0
    get_signal_to_deliver+0x218/0x5a0
    do_signal+0x65/0x700
    do_notify_resume+0x65/0x80
    int_signal+0x12/0x17

    The problem lies with the allocation batching code. It will
    opportunistically allocate kiocbs, and then trim back the list of iocbs
    when there is not enough room in the completion ring to hold all of the
    events.

    In the case above, what happens is that the pruning back of events ends
    up freeing the last active request; the context is marked as dead, so
    the pruner is then responsible for waking up waiters. Unfortunately, the
    code does not check for this condition, so we end up with a hung task.
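
    A sketch of the missing check in the batch-pruning path (field names
    assumed, modeled on the description above):

    /* After trimming the batch: if we just freed the last active request
     * of a dead context, we inherited the duty to wake anyone sleeping
     * in wait_for_all_aios(). */
    if (!ctx->reqs_active && ctx->dead && waitqueue_active(&ctx->wait))
            wake_up_all(&ctx->wait);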

    Signed-off-by: Jeff Moyer
    Reported-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
     
  • commit c8e252586f8d5de906385d8cf6385fee289a825e upstream.

    The regset common infrastructure assumed that regsets would always
    have .get and .set methods, but not necessarily .active methods.
    Unfortunately people have since written regsets without .set methods.

    Rather than putting in stub functions everywhere, handle regsets with
    null .get or .set methods explicitly.
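
    A sketch of the explicit handling (hypothetical call site; the hook
    signature and error code are assumptions):

    /* check that the hook exists before invoking it */
    if (!regset->get)
            return -EOPNOTSUPP;
    return regset->get(task, regset, 0, size, kbuf, ubuf);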

    Signed-off-by: H. Peter Anvin
    Reviewed-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    H. Peter Anvin
     
  • commit a32744d4abae24572eff7269bc17895c41bd0085 upstream.

    When the autofs protocol version 5 packet type was added in commit
    5c0a32fc2cd0 ("autofs4: add new packet type for v5 communications"), it
    obviously tried quite hard to be word-size agnostic, and uses explicitly
    sized fields that are all correctly aligned.

    However, with the final "char name[NAME_MAX+1]" array at the end, the
    actual size of the structure ends up being not very well defined:
    because the struct isn't marked 'packed', doing a "sizeof()" on it will
    align the size of the struct up to the biggest alignment of the members
    it has.

    And despite all the members being the same, the alignment of them is
    different: a "__u64" has 4-byte alignment on x86-32, but native 8-byte
    alignment on x86-64. And while 'NAME_MAX+1' ends up being a nice round
    number (256), the name[] array starts out only 4-byte aligned.

    End result: the "packed" size of the structure is 300 bytes: 4-byte
    aligned, but not 8-byte aligned.

    As a result, despite all the fields being in the same place on all
    architectures, sizeof() will round up that size to 304 bytes on
    architectures that have 8-byte alignment for u64.

    Note that this is *not* a problem for 32-bit compat mode on POWER, since
    there __u64 is 8-byte aligned even in 32-bit mode. But on x86, 32-bit
    and 64-bit alignment is different for 64-bit entities, and as a result
    the structure that has exactly the same layout has different sizes.

    So on x86-64, but no other architecture, we will just subtract 4 from
    the size of the structure when running in a compat task. That way we
    will write the properly sized packet that user mode expects.

    Not pretty. Sadly, this very subtle, and unnecessary, size difference
    has been encoded in user space that wants to read packets of *exactly*
    the right size, and will refuse to touch anything else.
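
    A simplified stand-in (not the real autofs_v5_packet layout)
    reproduces the 300-vs-304 arithmetic:

    struct demo_packet {                /* same field offsets everywhere */
            unsigned long long token;   /* __u64: 4-byte aligned on x86-32,
                                           8-byte aligned on x86-64 */
            unsigned int words[9];      /* 36 bytes of 32-bit fields */
            char name[256];             /* NAME_MAX+1, ends at offset 300 */
    };
    /* sizeof() rounds 300 up to the struct's alignment:
     *   x86-32: alignof(__u64) == 4  ->  sizeof == 300
     *   x86-64: alignof(__u64) == 8  ->  sizeof == 304
     * hence the compat fix: send sizeof() - 4 bytes on x86-64. */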

    Reported-and-tested-by: Thomas Meyer
    Signed-off-by: Ian Kent
    Signed-off-by: Linus Torvalds
    Cc: Jonathan Nieder
    Signed-off-by: Greg Kroah-Hartman

    Ian Kent
     

01 Mar, 2012

8 commits

  • commit 28d82dc1c4edbc352129f97f4ca22624d1fe61de upstream.

    The current epoll code can be tickled to run basically indefinitely in
    both the loop detection path check (on ep_insert()) and in the wakeup
    paths. The programs that tickle this behavior set up deeply linked
    networks of epoll file descriptors that cause the epoll algorithms to
    traverse them indefinitely. A couple of these sample programs have been
    previously posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop detection path check algorithms, I simply keep track of
    the epoll nodes that have already been visited. Thus, the loop detection
    becomes proportional to the number of epoll file descriptors and links.
    This dramatically decreases the run-time of the loop check algorithm. In
    one diabolical case I tried, it reduced the run-time from 15 minutes (all
    in kernel time) to 0.3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar manner
    by keeping track of nodes that have already been visited, but the
    complexity is harder, since there can be multiple wakeups on different
    CPUs. Thus, I've opted to limit the number of possible wakeup paths when
    the paths are created.

    This is accomplished by noting that the endpoint file descriptors found
    during the loop detection pass (from the newly added link) are actually
    the sources for wakeup events. I keep a list of these file
    descriptors and limit the number and length of the paths that emanate
    from these 'source file descriptors'. In the current implementation I
    allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
    length 4 and 10 of length 5. Note that it is sufficient to check the
    'source file descriptors' reachable from the newly added link, since no
    other 'source file descriptors' will have newly added links. This allows
    us to check only the wakeup paths that may have gotten too long, and not
    re-check all possible wakeup paths on the system.
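
    A sketch of how such per-depth accounting might look (names assumed;
    the limits are the ones quoted above):

    static const int path_limits[5] = { 1000, 500, 100, 50, 10 };
    static int path_count[5];

    static int path_count_inc(int path_length)      /* 1..5 */
    {
            if (++path_count[path_length - 1] > path_limits[path_length - 1])
                    return -1;  /* caller fails the add with -EINVAL */
            return 0;
    }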

    In terms of the path limit selection, I think it's first worth noting
    that the most common case for epoll is probably the model where you have
    one epoll file descriptor that is monitoring n 'source file
    descriptors'. In this case, each 'source file descriptor' has one path
    of length 1. Thus, I believe that the limits I'm proposing are quite
    reasonable and in fact may be too generous. I'm hoping that the
    proposed limits will not cause any workloads that currently work to
    fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently it's only used in a subset
    of the add paths. I need to hold the epmutex so that we can correctly
    traverse a coherent graph to check the number of paths. I believe that
    this additional locking is probably ok, since it's in the setup/teardown
    paths and doesn't affect the running paths, but it certainly is going to
    add some extra overhead. Also worth noting is that the epmutex was
    recently added to the epoll_ctl add operations in the initial path loop
    detection code, using the argument that it was not on a critical path.

    Another thing to note here is the maximum epoll chain length that is allowed.
    Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced during the loop
    detection check, and only when the epoll file descriptors are added
    in a certain order. Thus, this limit is currently easily bypassed. The
    newly added check for wakeup paths strictly limits the wakeup paths to a
    length of 5, regardless of the order in which ep's are linked together.
    Thus, a side-effect of the new code is a more consistent enforcement of
    the graph depth.

    Thus far, I've tested this using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    tested using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expected.

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jason Baron
     
  • commit 971316f0503a5c50633d07b83b6db2f15a3a5b00 upstream.

    signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
    this is not enough. eppoll_entry->whead still points to the memory
    we are going to free, so ep_unregister_pollwait()->remove_wait_queue()
    is obviously unsafe.

    Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
    change ep_unregister_pollwait() to check pwq->whead != NULL under
    rcu_read_lock() before remove_wait_queue(). We add the new helper,
    ep_remove_wait_queue(), for this.

    This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
    ->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
    ep_unregister_pollwait()->remove_wait_queue() can play with already
    freed and potentially reused ->sighand, but this is fine. This memory
    must have a valid ->signalfd_wqh until rcu_read_unlock().
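
    A sketch of the new helper described above (field names assumed):

    static void ep_remove_wait_queue(struct eppoll_entry *pwq)
    {
            wait_queue_head_t *whead;

            rcu_read_lock();
            /* ep_poll_callback(POLLFREE) may clear ->whead under us */
            whead = rcu_dereference(pwq->whead);
            if (whead)
                    remove_wait_queue(whead, &pwq->wait);
            rcu_read_unlock();
    }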

    Reported-by: Maxime Bizon
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit d80e731ecab420ddcb79ee9d0ac427acbc187b4b upstream.

    This patch is intentionally incomplete to simplify the review.
    It ignores ep_unregister_pollwait() which plays with the same wqh.
    See the next change.

    epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
    f_op->poll() needs. In particular it assumes that the wait queue
    can't go away until eventpoll_release(). This is not true in case
    of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
    which is not connected to the file.

    This patch adds the special event, POLLFREE, currently only for
    epoll. It expects that init_poll_funcptr()'ed hook should do the
    necessary cleanup. Perhaps it should be defined as EPOLLFREE in
    eventpoll.

    __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
    ->signalfd_wqh is not empty; we add the new signalfd_cleanup()
    helper.
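
    A sketch of that helper (assumed shape, following the description):

    void signalfd_cleanup(struct sighand_struct *sighand)
    {
            wait_queue_head_t *wqh = &sighand->signalfd_wqh;

            if (likely(!waitqueue_active(wqh)))
                    return;

            /* POLLFREE tells the epoll hook to disconnect itself;
             * POLLHUP covers ordinary poll/select waiters. */
            wake_up_poll(wqh, POLLHUP | POLLFREE);
    }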

    ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
    This makes the poll entry inconsistent, but we don't care. If you
    share an epoll fd which contains our sigfd with another process, you
    should blame yourself. signalfd is "really special". I simply do
    not know how we can define the "right" semantics if it used with
    epoll.

    The main problem is, epoll calls signalfd_poll() once to establish
    the connection with the wait queue; after that, signalfd_poll(NULL)
    returns different/inconsistent results depending on who does
    EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
    has nothing to do with the file, it works with the current thread.

    In short: this patch is the hack which tries to fix the symptoms.
    It also assumes that nobody can take tasklist_lock under epoll
    locks; this seems to be true.

    Note:

    - we do not have wake_up_all_poll() but wake_up_poll()
    is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

    - signalfd_cleanup() uses POLLHUP along with POLLFREE,
    we need a couple of simple changes in eventpoll.c to
    make sure it can't be "lost".

    Reported-by: Maxime Bizon
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit abe9a6d57b4544ac208401f9c0a4262814db2be4 upstream.

    server_scope would never be freed if nfs4_check_cl_exchange_flags()
    returned non-zero.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Weston Andros Adamson
     
  • commit b9f9a03150969e4bd9967c20bce67c4de769058f upstream.

    To ensure that we don't just reuse the bad delegation when we attempt to
    recover the nfs4_state that received the bad stateid error.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 331818f1c468a24e581aedcbe52af799366a9dfe upstream.

    Commit bf118a342f10dafe44b14451a1392c3254629a1f (NFSv4: include bitmap
    in nfsv4 get acl data) introduces the 'acl_scratch' page for the case
    where we may need to decode multi-page data. However it fails to take
    into account the fact that the variable may be NULL (for the case where
    we're not doing multi-page decode), and it also attaches it to the
    encoding xdr_stream rather than the decoding one.

    The immediate result is an Oops in nfs4_xdr_enc_getacl due to the
    call to page_address() with a NULL page pointer.

    Signed-off-by: Trond Myklebust
    Cc: Andy Adamson
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit e188dc02d3a9c911be56eca5aa114fe7e9822d53 upstream.

    d_inode_lookup() leaks a dentry reference on IS_DEADDIR().

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit 545d680938be1e86a6c5250701ce9abaf360c495 upstream.

    After passing through a ->setxattr() call, eCryptfs needs to copy the
    inode attributes from the lower inode to the eCryptfs inode, as they
    may have changed in the lower filesystem's ->setxattr() path.

    One example is if an extended attribute containing a POSIX Access
    Control List is being set. The new ACL may cause the lower filesystem to
    modify the mode of the lower inode and the eCryptfs inode would need to
    be updated to reflect the new mode.
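
    A sketch of the fix (fsstack_copy_attr_all is the usual stacking
    helper; its use here is an assumption):

    rc = vfs_setxattr(lower_dentry, name, value, size, flags);
    if (!rc)
            /* pick up any mode change the lower ->setxattr() made,
             * e.g. from a POSIX ACL */
            fsstack_copy_attr_all(dentry->d_inode, lower_dentry->d_inode);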

    https://launchpad.net/bugs/926292

    Signed-off-by: Tyler Hicks
    Reported-by: Sebastien Bacher
    Cc: John Johansen
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     

21 Feb, 2012

3 commits

  • commit ff4fa4a25a33f92b5653bb43add0c63bea98d464 upstream.

    standard_receive3 will check the validity of the response from the
    server (via checkSMB). It'll pass the result of that check to handle_mid
    which will dequeue it and mark it with a status of
    MID_RESPONSE_MALFORMED if checkSMB returned an error. At that point,
    standard_receive3 will also return an error, which will make the
    demultiplex thread skip doing the callback for the mid.

    This is wrong -- if we were able to identify the request and the
    response is marked malformed, then we want the demultiplex thread to do
    the callback. Fix this by making standard_receive3 return 0 in this
    situation.

    Reported-and-Tested-by: Mark Moseley
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 8b0192a5f478da1c1ae906bf3ffff53f26204f56 upstream.

    Currently, it's always set to 0 (no oplock requested).

    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 15eb77a07c714ac80201abd0a9568888bcee6276 upstream.

    bdi_prune_sb() resets sb->s_bdi to default_backing_dev_info when
    tearing down the original bdi. Fix trace_writeback_single_inode to
    use sb->s_bdi=default_backing_dev_info rather than bdi->dev=NULL for a
    torn-down bdi.

    Reported-by: Rabin Vincent
    Tested-by: Rabin Vincent
    Signed-off-by: Wu Fengguang
    Signed-off-by: Greg Kroah-Hartman

    Wu Fengguang
     

14 Feb, 2012

6 commits

  • commit de47a4176c532ef5961b8a46a2d541a3517412d3 upstream.

    For null user mounts, do not invoke string length function
    during session setup.
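
    A sketch of the guard (the field name ses->user_name is an
    assumption):

    /* a NULL user_name means an anonymous ("null user") session */
    if (ses->user_name)
            len = strlen(ses->user_name);
    else
            len = 0;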

    Reported-and-Tested-by: Chris Clayton
    Acked-by: Jeff Layton
    Signed-off-by: Shirish Pargaonkar
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Shirish Pargaonkar
     
  • commit 684a3ff7e69acc7c678d1a1394fe9e757993fd34 upstream.

    ecryptfs_write() can enter an infinite loop when truncating a file to a
    size larger than 4G. This only happens on architectures where size_t is
    represented by 32 bits.

    This was caused by a size_t overflow due to it incorrectly being used to
    store the result of a calculation which uses potentially large values of
    type loff_t.
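
    A userspace illustration of this overflow class (hypothetical, not
    the eCryptfs code):

    #include <stdio.h>

    int main(void)
    {
            long long target = 5LL << 30;   /* truncate target: 5 GiB */
            long long pos = 0;
            /* plays the role of a 32-bit size_t */
            unsigned int remaining = (unsigned int)(target - pos);

            /* prints 1073741824, not 5368709120: the loop's notion of
             * "bytes left" wrapped, so it can never reach the target */
            printf("remaining = %u (expected %lld)\n",
                   remaining, target - pos);
            return 0;
    }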

    [tyhicks@canonical.com: rewrite subject and commit message]
    Signed-off-by: Li Wang
    Signed-off-by: Yunchuan Wen
    Reviewed-by: Cong Wang
    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Li Wang
     
  • commit 853a0c25baf96b028de1654bea1e0c8857eadf3d upstream.

    When we hit EIO while writing LVID, the buffer uptodate bit is cleared.
    This then results in an annoying warning from mark_buffer_dirty() when we
    write the buffer again. So just set the uptodate flag unconditionally.
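
    The fix is essentially a one-liner before the buffer is redirtied
    (sketch):

    set_buffer_uptodate(bh);    /* EIO may have cleared it */
    mark_buffer_dirty(bh);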

    Reviewed-by: Namjae Jeon
    Signed-off-by: Jan Kara
    Cc: Dave Jones
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 6d08f2c7139790c268820a2e590795cb8333181a upstream.

    Once /proc/pid/mem is opened, the memory can't be released until
    mem_release() even if its owner exits.

    Change mem_open() to do atomic_inc(mm_count) + mmput(); this only
    pins the mm_struct. Change mem_rw() to do atomic_inc_not_zero(mm_users)
    before access_remote_vm(); this verifies that the mm is still alive.
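
    A sketch of the two halves (assuming the mm is stashed in
    file->private_data):

    /* mem_open(): keep the mm_struct allocated, but not its memory */
    atomic_inc(&mm->mm_count);      /* pin the struct itself */
    mmput(mm);                      /* drop the address space */
    file->private_data = mm;

    /* mem_rw(): revalidate before touching the address space */
    if (!atomic_inc_not_zero(&mm->mm_users))
            return 0;               /* owner exited: treat as mm == NULL */
    copied = access_remote_vm(mm, addr, page, this_len, write);
    mmput(mm);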

    I am not sure what mem_rw() should return if atomic_inc_not_zero()
    fails. With this patch it returns zero to match the "mm == NULL" case;
    maybe it should return -EINVAL like it did before e268337d.

    Perhaps it makes sense to add the additional fatal_signal_pending()
    check into the main loop, to ensure we do not hold this memory if
    the target task was oom-killed.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 572d34b946bae070debd42db1143034d9687e13f upstream.

    No functional changes, cleanup and preparation.

    mem_read() and mem_write() are very similar. Move this code into the
    new common helper, mem_rw(), which takes the additional "int write"
    argument.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 71879d3cb3dd8f2dfdefb252775c1b3ea04a3dd4 upstream.

    mem_release() can hit mm == NULL, add the necessary check.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     

04 Feb, 2012

8 commits

  • commit ce597919361dcec97341151690e780eade2a9cf4 upstream.

    Recently an OOPS was observed from the usb serial io_ti driver when it tried to remove
    sysfs directories. Upon investigation it turns out this driver was always buggy
    and that a recent sysfs change had stopped guarding itself against removing attributes
    from sysfs directories that had already been removed. :(

    Historically we have been silent about attempting to remove files from nonexistent sysfs
    directories and have politely returned error codes. That has resulted in people writing
    broken code that ignores the error codes.

    Issue a kernel WARNING and a stack backtrace to make it clear in no uncertain
    terms that abusing sysfs is not ok, and the callers need to fix their code.

    This change transforms the io_ti OOPS into a more comprehensible error message
    and stack backtrace.

    Signed-off-by: Eric W. Biederman
    Reported-by: Wolfgang Frisch
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 353b67d8ced4dc53281c88150ad295e24bc4b4c5 upstream.

    When we reach cleanup_journal_tail(), there is no guarantee that
    checkpointed buffers are on stable storage - especially if buffers were
    written out by log_do_checkpoint(), they are likely to be only in the
    disk's caches. Thus when we update the journal superblock, effectively
    removing the old transaction from the journal, this write of the
    superblock can get to stable storage before those checkpointed buffers,
    which can result in filesystem corruption after a crash.

    A similar problem can happen if we replay the journal and wipe it before
    flushing disk's caches.

    Thus we must unconditionally issue a cache flush before we update journal
    superblock in these cases. The fix is slightly complicated by the fact that we
    have to get the log tail before we issue the cache flush, but we can store it in
    the journal superblock only after the cache flush. Otherwise we risk races where
    the new tail is written before the corresponding cache flush is finished.
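
    The resulting ordering, sketched with assumed helper names:

    tail = journal_get_log_tail(journal);       /* 1. sample the new tail */
    blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
                                                /* 2. checkpointed data out */
    journal_update_sb_log_tail(journal, tail);  /* 3. only now publish it */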

    I managed to reproduce the corruption using somewhat tweaked Chris Mason's
    barrier-test scheduler. This should also fix occasional reports of 'Bit already
    freed' filesystem errors, which are totally unreproducible, but inspection of
    several fs images I've gathered over time points to a problem like this.

    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 9b025eb3a89e041bab6698e3858706be2385d692 upstream.

    Commit b52a360b forgot to call xfs_iunlock() when it detected a corrupted
    symlink and bailed out. Fix it by jumping to 'out' instead of returning directly.

    CC: Carlos Maiolino
    Signed-off-by: Jan Kara
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 58ded24f0fcb85bddb665baba75892f6ad0f4b8a upstream.

    If pages passed to the eCryptfs extent-based crypto functions are not
    mapped and the module parameter ecryptfs_verbosity=1 was specified at
    loading time, a NULL pointer dereference will occur.

    Note that this wouldn't happen on a production system, as you wouldn't
    pass ecryptfs_verbosity=1 on a production system. It leaks private
    information to the system logs and is for debugging only.

    The debugging info printed in these messages is no longer very useful
    and rather than doing a kmap() in these debugging paths, it will be
    better to simply remove the debugging paths completely.

    https://launchpad.net/bugs/913651

    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit a261a03904849c3df50bd0300efb7fb3f865137d upstream.

    Most filesystems call inode_change_ok() very early in ->setattr(), but
    eCryptfs didn't call it at all. It allowed the lower filesystem to make
    the call in its ->setattr() function. Then, eCryptfs would copy the
    appropriate inode attributes from the lower inode to the eCryptfs inode.

    This patch changes that and actually calls inode_change_ok() on the
    eCryptfs inode, fairly early in ecryptfs_setattr(). Ideally, the call
    would happen earlier in ecryptfs_setattr(), but there are some possible
    inode initialization steps that must happen first.

    Since the call was already being made on the lower inode, the change in
    functionality should be minimal, except for the case of a file extending
    truncate call. In that case, inode_newsize_ok() was never being
    called on the eCryptfs inode. Rather than inode_newsize_ok() catching
    maximum file size errors early on, eCryptfs would encrypt zeroed pages
    and write them to the lower filesystem until the lower filesystem's
    write path caught the error in generic_write_checks(). This patch
    introduces a new function, called ecryptfs_inode_newsize_ok(), which
    checks if the new lower file size is within the appropriate limits when
    the truncate operation will be growing the lower file.

    In summary this change prevents eCryptfs truncate operations (and the
    resulting page encryptions), which would exceed the lower filesystem
    limits or FSIZE rlimits, from ever starting.
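
    A sketch of the new check (upper_size_to_lower_size is eCryptfs's
    existing size-conversion helper; the rest of the shape is assumed):

    int ecryptfs_inode_newsize_ok(struct inode *inode, loff_t offset)
    {
            struct ecryptfs_crypt_stat *crypt_stat =
                    &ecryptfs_inode_to_private(inode)->crypt_stat;
            loff_t lower_oldsize, lower_newsize;

            lower_oldsize = upper_size_to_lower_size(crypt_stat,
                                                     i_size_read(inode));
            lower_newsize = upper_size_to_lower_size(crypt_stat, offset);
            if (lower_newsize > lower_oldsize)
                    /* growing: enforce limits before encrypting pages */
                    return inode_newsize_ok(inode, lower_newsize);
            return 0;
    }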

    Signed-off-by: Tyler Hicks
    Reviewed-by: Li Wang
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit 5e6f0d769017cc49207ef56996e42363ec26c1f0 upstream.

    ecryptfs_write() handles the truncation of eCryptfs inodes. It grabs a
    page, zeroes out the appropriate portions, and then encrypts the page
    before writing it to the lower filesystem. It was unkillable and due to
    the lack of sparse file support could result in tying up a large portion
    of system resources, while encrypting pages of zeros, with no way for
    the truncate operation to be stopped from userspace.

    This patch adds the ability for ecryptfs_write() to detect a pending
    fatal signal and return as gracefully as possible. The intent is to
    leave the lower file in a useable state, while still allowing a user to
    break out of the encryption loop. If a pending fatal signal is detected,
    the eCryptfs inode size is updated to reflect the modified inode size
    and then -EINTR is returned.
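
    A sketch of the loop-exit check (the surrounding loop shape is
    assumed):

    while (pos < new_end_pos) {
            if (fatal_signal_pending(current)) {
                    /* record how far we got, then bail out */
                    i_size_write(ecryptfs_inode, pos);
                    rc = -EINTR;
                    break;
            }
            /* zero, encrypt and write the next page ... */
    }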

    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit 30373dc0c87ffef68d5628e77d56ffb1fa22e1ee upstream.

    Print the inode on metadata read failure. The only real
    way of dealing with metadata read failures is to delete
    the underlying file system file. Having the inode
    allows one to 'find . -inum INODE'.

    [tyhicks@canonical.com: Removed some minor not-for-stable parts]
    Signed-off-by: Tim Gardner
    Reviewed-by: Kees Cook
    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Tim Gardner
     
  • commit db10e556518eb9d21ee92ff944530d84349684f4 upstream.

    A malicious count value specified when writing to /dev/ecryptfs may
    result in a very large kernel memory allocation.

    This patch peeks at the specified packet payload size, adds that to the
    size of the packet headers and compares the result with the write count
    value. The resulting maximum memory allocation size is approximately 532
    bytes.

    Signed-off-by: Tyler Hicks
    Reported-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     

26 Jan, 2012

11 commits

  • commit 85e72aa5384b1a614563ad63257ded0e91d1a620 upstream.

    /proc/pid/clear_refs is used to clear the Referenced and YOUNG bits for
    pages and corresponding page table entries of the task with PID pid, which
    includes any special mappings inserted into the page tables in order to
    provide things like vDSOs and user helper functions.

    On ARM this causes a problem because the vectors page is mapped as a
    global mapping and since ec706dab ("ARM: add a vma entry for the user
    accessible vector page"), a VMA is also inserted into each task for this
    page to aid unwinding through signals and syscall restarts. Since the
    vectors page is required for handling faults, clearing the YOUNG bit (and
    subsequently writing a faulting pte) means that we lose the vectors page
    *globally* and cannot fault it back in. This results in a system deadlock
    on the next exception.

    To see this problem in action, just run:

    $ echo 1 > /proc/self/clear_refs

    on an ARM platform (as any user) and watch your system hang. I think this
    has been the case since 2.6.37.

    This patch avoids clearing the aforementioned bits for reserved pages,
    therefore leaving the vectors page intact on ARM. Since reserved pages
    are not candidates for swap, this change should not have any impact on the
    usefulness of clear_refs.

    Signed-off-by: Will Deacon
    Reported-by: Moussa Ba
    Acked-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Russell King
    Acked-by: Nicolas Pitre
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     
  • commit ce91acb3acae26f4163c5a6f1f695d1a1e8d9009 upstream.

    We've had some reports of servers (namely, the Solaris in-kernel CIFS
    server) that don't deal properly with writes that are "too large" even
    though they set CAP_LARGE_WRITE_ANDX. Change the default to better
    mirror what windows clients do.

    Cc: Pavel Shilovsky
    Reported-by: Nick Davis
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit b1c770c273a4787069306fc82aab245e9ac72e9d upstream.

    When finding the longest extent in an AG, we read the value directly
    out of the AGF buffer without endian conversion. This will give an
    incorrect length, resulting in FITRIM operations potentially not
    trimming everything that they should.
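
    The essence of the fix, sketched (agf_longest is the on-disk
    big-endian AGF field):

    /* wrong: raw big-endian value used as if it were CPU-endian */
    longest = agf->agf_longest;
    /* right: convert from the on-disk format first */
    longest = be32_to_cpu(agf->agf_longest);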

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit e268337dfe26dfc7efd422a804dbb27977a3cccc upstream.

    Jüri Aedla reported that the /proc/<pid>/mem handling really isn't very
    robust, and it also doesn't match the permission checking of any of the
    other related files.

    This changes it to do the permission checks at open time, and instead of
    tracking the process, it tracks the VM at the time of the open. That
    simplifies the code a lot, but does mean that if you hold the file
    descriptor open over an execve(), you'll continue to read from the _old_
    VM.

    That is different from our previous behavior, but much simpler. If
    somebody actually finds a load where this matters, we'll need to revert
    this commit.

    I suspect that nobody will ever notice - because the process mapping
    addresses will also have changed as part of the execve. So you cannot
    actually usefully access the fd across a VM change simply because all
    the offsets for IO would have changed too.

    Reported-by: Jüri Aedla
    Cc: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit c3e0ef9a298e028a82ada28101ccd5cf64d209ee upstream.

    For 32-bit architectures using standard jiffies the idletime calculation
    in uptime_proc_show will quickly overflow. It takes (2^32 / HZ) seconds
    of idle-time, or e.g. 12.45 days with no load on a quad-core with HZ=1000.
    Switch to 64-bit calculations.
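
    The arithmetic: with HZ=1000, a 32-bit sum of jiffies wraps after
    2^32 / 1000 ≈ 4,294,967 seconds ≈ 49.7 days. /proc/uptime adds up the
    idle time of every CPU, so a fully idle quad-core accumulates four
    idle ticks per jiffy and wraps in roughly 49.7 / 4 ≈ 12.4 days,
    matching the figure above.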

    Cc: Michael Abbott
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Greg Kroah-Hartman

    Martin Schwidefsky
     
  • commit 74a6eeb44ca6174d9cc93b9b8b4d58211c57bc80 upstream.

    One bio can have at most BIO_MAX_PAGES pages. We should limit it because otherwise
    bio_alloc will fail when there are many pages in one read/write_pagelist.
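
    A sketch of the clamp (variable names assumed):

    npg = min_t(int, npg, BIO_MAX_PAGES);   /* bio_alloc() fails above this */
    bio = bio_alloc(GFP_NOIO, npg);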

    Signed-off-by: Peng Tao
    Signed-off-by: Benny Halevy
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit 93a3844ee0f843b05a1df4b52e1a19ff26b98d24 upstream.

    bl_free_block_dev() may sleep. We can not call it with a spinlock held.
    Besides, there is no need to take bm_lock, as we are the last user freeing bm_devlist.

    Signed-off-by: Peng Tao
    Signed-off-by: Benny Halevy
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit 39e567ae36fe03c2b446e1b83ee3d39bea08f90b upstream.

    When calling _add_entry, we should take the im_lock to protect
    against other modifiers.

    Signed-off-by: Peng Tao
    Signed-off-by: Benny Halevy
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit eaf5f9073533cde21c7121c136f1c3f072d9cf59 upstream.

    Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may
    cause shrink_dcache_parent() to loop forever.

    Here's what appears to happen:

    1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1

    2 - CPU1: select_parent(P) locks P->d_lock

    3 - CPU0: shrink_dentry_list() locks C->d_lock
    dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock

    4 - CPU1: select_parent(P) locks C->d_lock,
    moves C from dispose list being processed on CPU0 to the new
    dispose list, returns 1

    5 - CPU0: shrink_dentry_list() finds dispose list empty, returns

    6 - Goto 2 with CPU0 and CPU1 switched

    Basically select_parent() steals the dentry from shrink_dentry_list() and thinks
    it found a new one, causing shrink_dentry_list() to think it's making progress
    and loop over and over.

    One way to trigger this is to make udev call stat() on a sysfs file while it
    is going away.

    Having a file in /lib/udev/rules.d/ with only this one rule seems to do the trick:

    ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true"

    Then execute the following loop:

    while true; do
    echo -bond0 > /sys/class/net/bonding_masters
    echo +bond0 > /sys/class/net/bonding_masters
    echo -bond1 > /sys/class/net/bonding_masters
    echo +bond1 > /sys/class/net/bonding_masters
    done

    One fix would be to check all callers and prevent concurrent calls to
    shrink_dcache_parent(). But I think a better solution is to stop the
    stealing behavior.

    This patch adds a new dentry flag that is set when the dentry is added to the
    dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a
    new reference just before being pruned.

    If the dentry has this flag, select_parent() will skip it and let
    shrink_dentry_list() retry pruning it. With select_parent() skipping those
    dentries there will not be the appearance of progress (new dentries found) when
    there is none, hence shrink_dcache_parent() will not loop forever.
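
    In select_parent(), the skip might look like this (sketch; the flag
    name follows the description above):

    if (dentry->d_flags & DCACHE_SHRINK_LIST)
            continue;   /* already on another CPU's dispose list:
                         * leave it for shrink_dentry_list() to retry */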

    The flag is also set in prune_dcache_sb() for consistency, as suggested
    by Linus.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit b48f03b319ba78f3abf9a7044d1f436d8d90f4f9 upstream.

    select_parent currently abuses the dentry cache LRU to provide
    cleanup features for child dentries that need to be freed. It moves
    them to the tail of the LRU, then tells shrink_dcache_parent() to
    call __shrink_dcache_sb to unconditionally move them to a dispose
    list (as DCACHE_REFERENCED is ignored). __shrink_dcache_sb() has to
    relock the dentries to move them off the LRU onto the dispose list,
    but otherwise does not touch the dentries that select_parent() moved
    to the tail of the LRU. It then passes the dispose list to
    shrink_dentry_list() which tries to free the dentries.

    IOWs, the use of __shrink_dcache_sb() is superfluous - we can build
    exactly the same list of dentries for disposal directly in
    select_parent() and call shrink_dentry_list() instead of calling
    __shrink_dcache_sb() to do that. This means that we avoid long holds
    on the lru lock walking the LRU moving dentries to the dispose list.
    We also avoid the need to relock each dentry just to move it off the
    LRU, reducing the number of times we lock each dentry to dispose of
    them in shrink_dcache_parent() from 3 to 2 times.

    Further, we remove one of the two callers of __shrink_dcache_sb().
    This also means that __shrink_dcache_sb can be moved back into
    prune_dcache_sb() and we no longer have to handle referenced
    dentries conditionally, simplifying the code.

    Signed-off-by: Dave Chinner
    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit fed474857efbed79cd390d0aee224231ca718f63 upstream.

    Removing the parent of a watched file results in "kernel BUG at
    fs/notify/mark.c:139".

    To reproduce

    add "-w /tmp/audit/dir/watched_file" to audit.rules
    rm -rf /tmp/audit/dir

    This is caused by fsnotify_destroy_mark() being called without an
    extra reference taken by the caller.

    Reported by Francesco Cosoleto here:

    https://bugzilla.novell.com/show_bug.cgi?id=689860

    Fix by removing the BUG_ON and adding a comment about not accessing mark after
    the iput.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi