Eric Lee / smarc-fsl-linux-kernel

15 Mar, 2012

1 commit

f1cbd03f5 Merge branch 'for-linus' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block fixes from Jens Axboe:
"Been sitting on this for a while, but lets get this out the door.
This fixes various important bugs for 3.3 final, along with a few more
trivial ones. Please pull!"

* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix ioc leak in put_io_context
block, sx8: fix pointer math issue getting fw version
Block: use a freezable workqueue for disk-event polling
drivers/block/DAC960: fix -Wuninitialized warning
drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning
block: fix __blkdev_get and add_disk race condition
block: Fix setting bio flags in drivers (sd_dif/floppy)
block: Fix NULL pointer dereference in sd_revalidate_disk
block: exit_io_context() should call elevator_exit_icq_fn()
block: simplify ioc_release_fn()
block: replace icq->changed with icq->flags

Linus Torvalds
2012-03-15 08:16:45 +0800

14 Mar, 2012

1 commit

8e8bb96d2 Merge git://git.samba.org/sfrench/cifs-2.6 ... Browse Code »

Pull CIFS fixes from Steve French.

* git://git.samba.org/sfrench/cifs-2.6:
CIFS: Do not kmalloc under the flocks spinlock
cifs: possible memory leak in xattr.

Linus Torvalds
2012-03-14 08:03:53 +0800

11 Mar, 2012

5 commits

310fa7a36 restore smp_mb() in unlock_new_inode() ... Browse Code »

wait_on_inode() doesn't have ->i_lock

Signed-off-by: Al Viro

Al Viro
2012-03-11 06:07:28 +0800
7f6c7e62f vfs: fix return value from do_last() ... Browse Code »
1

complete_walk() returns either ECHILD or ESTALE. do_last() turns this into
ECHILD unconditionally. If not in RCU mode, this error will reach userspace
which is complete nonsense.

Signed-off-by: Miklos Szeredi
CC: stable@vger.kernel.org
Signed-off-by: Al Viro

Miklos Szeredi
2012-03-11 06:05:30 +0800
097b180ca vfs: fix double put after complete_walk() ... Browse Code »
1

complete_walk() already puts nd->path, no need to do it again at cleanup time.

This would result in Oopses if triggered, apparently the codepath is not too
well exercised.

Signed-off-by: Miklos Szeredi
CC: stable@vger.kernel.org
Signed-off-by: Al Viro

Miklos Szeredi
2012-03-11 06:05:30 +0800
f6940fe90 udf: Fix deadlock in udf_release_file() ... Browse Code »

udf_release_file() can be called from munmap() path with mmap_sem held. Thus
we cannot take i_mutex there because that ranks above mmap_sem. Luckily,
i_mutex is not needed in udf_release_file() anymore since protection by
i_data_sem is enough to protect from races with write and truncate.

Reported-by: Al Viro
Reviewed-by: Namjae Jeon
Signed-off-by: Jan Kara
Signed-off-by: Al Viro

Jan Kara
2012-03-11 05:05:38 +0800
978d6d8c4 vfs: Correctly set the dir i_mutex lockdep class ... Browse Code »

9a7aa12f3911853a introduced additional logic around setting the i_mutex
lockdep class for directory inodes. The idea was that some filesystems
may want their own special lockdep class for different directory
inodes and calling unlock_new_inode() should not clobber one of
those special classes.

I believe that the added conditional, around the *negated* return value
of lockdep_match_class(), caused directory inodes to be placed in the
wrong lockdep class.

inode_init_always() sets the i_mutex lockdep class with i_mutex_key for
all inodes. If the filesystem did not change the class during inode
initialization, then the conditional mentioned above was false and the
directory inode was incorrectly left in the non-directory lockdep class.
If the filesystem did set a special lockdep class, then the conditional
mentioned above was true and that class was clobbered with
i_mutex_dir_key.

This patch removes the negation from the conditional so that the i_mutex
lockdep class is properly set for directory inodes. Special classes are
preserved and directory inodes with unmodified classes are set with
i_mutex_dir_key.

Signed-off-by: Tyler Hicks
Reviewed-by: Jan Kara
Signed-off-by: Al Viro

Tyler Hicks
2012-03-11 05:05:38 +0800

10 Mar, 2012

3 commits

c7b285550 aio: fix the "too late munmap()" race ... Browse Code »
1

Current code has put_ioctx() called asynchronously from aio_fput_routine();
that's done *after* we have killed the request that used to pin ioctx,
so there's nothing to stop io_destroy() waiting in wait_for_all_aios()
from progressing. As the result, we can end up with async call of
put_ioctx() being the last one and possibly happening during exit_mmap()
or elf_core_dump(), neither of which expects stray munmap() being done
to them...

We do need to prevent _freeing_ ioctx until aio_fput_routine() is done
with that, but that's all we care about - neither io_destroy() nor
exit_aio() will progress past wait_for_all_aios() until aio_fput_routine()
does really_put_req(), so the ioctx teardown won't be done until then
and we don't care about the contents of ioctx past that point.

Since actual freeing of these suckers is RCU-delayed, we don't need to
bump ioctx refcount when request goes into list for async removal.
All we need is rcu_read_lock held just over the ->ctx_lock-protected
area in aio_fput_routine().

Signed-off-by: Al Viro
Reviewed-by: Jeff Moyer
Acked-by: Benjamin LaHaise
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Al Viro
2012-03-10 10:59:59 +0800
86b62a2cb aio: fix io_setup/io_destroy race ... Browse Code »
1

Have ioctx_alloc() return an extra reference, so that caller would drop it
on success and not bother with re-grabbing it on failure exit. The current
code is obviously broken - io_destroy() from another thread that managed
to guess the address io_setup() would've returned would free ioctx right
under us; gets especially interesting if aio_context_t * we pass to
io_setup() points to PROT_READ mapping, so put_user() fails and we end
up doing io_destroy() on kioctx another thread has just got freed...

Signed-off-by: Al Viro
Acked-by: Benjamin LaHaise
Reviewed-by: Jeff Moyer
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Al Viro
2012-03-10 10:59:59 +0800
86e060083 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs updates from Chris Mason:
"I have two additional and btrfs fixes in my for-linus branch. One is
a casting error that leads to memory corruption on i386 during scrub,
and the other fixes a corner case in the backref walking code (also
triggered by scrub)."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix casting error in scrub reada code
btrfs: fix locking issues in find_parent_nodes()

Linus Torvalds
2012-03-10 10:09:18 +0800

07 Mar, 2012

3 commits

d5751469f CIFS: Do not kmalloc under the flocks spinlock ... Browse Code »

Reorganize the code to make the memory already allocated before
spinlock'ed loop.

Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton
Signed-off-by: Pavel Shilovsky
Signed-off-by: Steve French

Pavel Shilovsky
2012-03-07 11:50:15 +0800
b0f8ef202 cifs: possible memory leak in xattr. ... Browse Code »

Memory is allocated irrespective of whether CIFS_ACL is configured
or not. But free is happenning only if CIFS_ACL is set. This is a
possible memory leak scenario.

Fix is:
Allocate and free memory only if CIFS_ACL is configured.

Signed-off-by: Santosh Nayak
Reviewed-by: Shirish Pargaonkar
Signed-off-by: Steve French

Santosh Nayak
2012-03-07 11:46:53 +0800
71fece951 Merge git://git.samba.org/sfrench/cifs-2.6 ... Browse Code »

Pull CIFS fixes from Steve French

* git://git.samba.org/sfrench/cifs-2.6:
cifs: fix dentry refcount leak when opening a FIFO on lookup
CIFS: Fix mkdir/rmdir bug for the non-POSIX case

Linus Torvalds
2012-03-07 08:55:50 +0800

06 Mar, 2012

5 commits

3e85fb9cd Merge branch 'akpm' (Andrew's patch bomb) ... Browse Code »

Merge the emailed seties of 19 patches from Andrew Morton

* akpm:
rapidio/tsi721: fix queue wrapping bug in inbound doorbell handler
memcg: fix mapcount check in move charge code for anonymous page
mm: thp: fix BUG on mm->nr_ptes
alpha: fix 32/64-bit bug in futex support
memcg: fix GPF when cgroup removal races with last exit
debugobjects: Fix selftest for static warnings
floppy/scsi: fix setting of BIO flags
memcg: fix deadlock by inverting lrucare nesting
drivers/rtc/rtc-r9701.c: fix crash in r9701_remove()
c2port: class_create() returns an ERR_PTR
pps: class_create() returns an ERR_PTR, not NULL
hung_task: fix the broken rcu_lock_break() logic
vfork: kill PF_STARTING
coredump_wait: don't call complete_vfork_done()
vfork: make it killable
vfork: introduce complete_vfork_done()
aio: wake up waiters when freeing unused kiocbs
kprobes: return proper error code from register_kprobe()
kmsg_dump: don't run on non-error paths by default

Linus Torvalds
2012-03-06 07:50:25 +0800
57b59c4a1 coredump_wait: don't call complete_vfork_done() ... Browse Code »

Now that CLONE_VFORK is killable, coredump_wait() no longer needs
complete_vfork_done(). zap_threads() should find and kill all tasks with
the same ->mm, this includes our parent if ->vfork_done is set.

mm_release() becomes the only caller, unexport complete_vfork_done().

Signed-off-by: Oleg Nesterov
Acked-by: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-03-06 07:49:42 +0800
c415c3b47 vfork: introduce complete_vfork_done() ... Browse Code »

No functional changes.

Move the clear-and-complete-vfork_done code into the new trivial helper,
complete_vfork_done().

Signed-off-by: Oleg Nesterov
Acked-by: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-03-06 07:49:42 +0800
880641bb9 aio: wake up waiters when freeing unused kiocbs ... Browse Code »

Bart Van Assche reported a hung fio process when either hot-removing
storage or when interrupting the fio process itself. The (pruned) call
trace for the latter looks like so:

fio D 0000000000000001 0 6849 6848 0x00000004
ffff880092541b88 0000000000000046 ffff880000000000 ffff88012fa11dc0
ffff88012404be70 ffff880092541fd8 ffff880092541fd8 ffff880092541fd8
ffff880128b894d0 ffff88012404be70 ffff880092541b88 000000018106f24d
Call Trace:
schedule+0x3f/0x60
io_schedule+0x8f/0xd0
wait_for_all_aios+0xc0/0x100
exit_aio+0x55/0xc0
mmput+0x2d/0x110
exit_mm+0x10d/0x130
do_exit+0x671/0x860
do_group_exit+0x44/0xb0
get_signal_to_deliver+0x218/0x5a0
do_signal+0x65/0x700
do_notify_resume+0x65/0x80
int_signal+0x12/0x17

The problem lies with the allocation batching code. It will
opportunistically allocate kiocbs, and then trim back the list of iocbs
when there is not enough room in the completion ring to hold all of the
events.

In the case above, what happens is that the pruning back of events ends
up freeing up the last active request and the context is marked as dead,
so it is thus responsible for waking up waiters. Unfortunately, the
code does not check for this condition, so we end up with a hung task.

Signed-off-by: Jeff Moyer
Reported-by: Bart Van Assche
Tested-by: Bart Van Assche
Cc: [3.2.x only]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jeff Moyer
2012-03-06 07:49:42 +0800
6414fa6a1 aout: move setup_arg_pages() prior to reading/mapping the binary ... Browse Code »

Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds

Al Viro
2012-03-06 05:51:32 +0800

05 Mar, 2012

1 commit

5483f18e9 vfs: move dentry_cmp from <linux/dcache.h> to fs/dcache.c ... Browse Code »

It's only used inside fs/dcache.c, and we're going to play games with it
for the word-at-a-time patches. This time we really don't even want to
export it, because it really is an internal function to fs/dcache.c, and
has been since it was introduced.

Having it in that extremely hot header file (it's included in pretty
much everything, thanks to ) is a disaster for testing
different versions, and is utterly pointless.

We really should have some kind of header file diet thing, where we
figure out which parts of header files are really better off private and
only result in more expensive compiles.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-03-05 07:51:42 +0800

03 Mar, 2012

7 commits

a175423c8 Btrfs: fix casting error in scrub reada code ... Browse Code »

The reada code from scrub was casting down a u64 to
an unsigned long so it could insert it into a radix tree.

What it really wanted to do was cast down the result of a shift, instead
of casting down the u64. The bug resulted in trying to insert our
reada struct into the wrong place, which caused soft lockups and other
problems.

Signed-off-by: Chris Mason

Chris Mason
2012-03-03 20:42:35 +0800
d3b010640 btrfs: fix locking issues in find_parent_nodes() ... Browse Code »

- We might unlock head->mutex while it was not locked
- We might leave the function without unlocking delayed_refs->lock

Signed-off-by: Li Zefan
Signed-off-by: Chris Mason

Li Zefan
2012-03-03 20:41:15 +0800
ae942ae71 vfs: export full_name_hash() function to modules ... Browse Code »

Commit 5707c87f "vfs: uninline full_name_hash()" broke the modular
build, because it needs exporting now that it isn't inlined any more.

Reported-by: Tetsuo Handa
Signed-off-by: Linus Torvalds

Linus Torvalds
2012-03-03 11:40:57 +0800
200e9ef7a vfs: split up name hashing in link_path_walk() into helper function ... Browse Code »

The code in link_path_walk() that finds out the length and the hash of
the next path component is some of the hottest code in the kernel. And
I have a version of it that does things at the full width of the CPU
wordsize at a time, but that means that we *really* want to split it up
into a separate helper function.

So this re-organizes the code a bit and splits the hashing part into a
helper function called "hash_name()". It returns the length of the
pathname component, while at the same time computing and writing the
hash to the appropriate location.

The code generation is slightly changed by this patch, but generally for
the better - and the added abstraction actually makes the code easier to
read too. And the new interface is well suited for replacing just the
"hash_name()" function with alternative implementations.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-03-03 06:49:24 +0800
0145acc20 vfs: uninline full_name_hash() ... Browse Code »

.. and also use it in lookup_one_len() rather than open-coding it.

There aren't any performance-critical users, so inlining it is silly.
But it wouldn't matter if it wasn't for the fact that the word-at-a-time
dentry name patches want to conditionally replace the function, and
uninlining it sets the stage for that.

So again, this is a preparatory patch that doesn't change any semantics,
and only prepares for a much cleaner and testable word-at-a-time dentry
name accessor patch.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-03-03 06:32:59 +0800
8966be903 vfs: trivial __d_lookup_rcu() cleanups ... Browse Code »

These don't change any semantics, but they clean up the code a bit and
mark some arguments appropriately 'const'.

They came up as I was doing the word-at-a-time dcache name accessor
code, and cleaning this up now allows me to send out a smaller relevant
interesting patch for the experimental stuff.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-03-03 06:23:30 +0800
c8e252586 regset: Prevent null pointer reference on readonly regsets ... Browse Code »
1

The regset common infrastructure assumed that regsets would always
have .get and .set methods, but not necessarily .active methods.
Unfortunately people have since written regsets without .set methods.

Rather than putting in stub functions everywhere, handle regsets with
null .get or .set methods explicitly.

Signed-off-by: H. Peter Anvin
Reviewed-by: Oleg Nesterov
Acked-by: Roland McGrath
Cc:
Signed-off-by: Linus Torvalds

H. Peter Anvin
2012-03-03 03:38:15 +0800

02 Mar, 2012

1 commit

fe316bf2d block: Fix NULL pointer dereference in sd_revalidate_disk ... Browse Code »
1

Since 2.6.39 (1196f8b), when a driver returns -ENOMEDIUM for open(),
__blkdev_get() calls rescan_partitions() to remove
in-kernel partition structures and raise KOBJ_CHANGE uevent.

However it ends up calling driver's revalidate_disk without open
and could cause oops.

In the case of SCSI:

process A process B
----------------------------------------------
sys_open
__blkdev_get
sd_open
returns -ENOMEDIUM
scsi_remove_device

rescan_partitions
sd_revalidate_disk

Oopses are reported here:
http://marc.info/?l=linux-scsi&m=132388619710052

This patch separates the partition invalidation from rescan_partitions()
and use it for -ENOMEDIUM case.

Reported-by: Huajun Li
Signed-off-by: Jun'ichi Nomura
Acked-by: Tejun Heo
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Jun'ichi Nomura
2012-03-02 17:38:33 +0800

29 Feb, 2012

1 commit

164974a8f ecryptfs: fix printk format warning for size_t ... Browse Code »

Fix printk format warning (from Linus's suggestion):

on i386:
fs/ecryptfs/miscdev.c:433:38: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'unsigned int'

and on x86_64:
fs/ecryptfs/miscdev.c:433:38: warning: format '%u' expects type 'unsigned int', but argument 4 has type 'long unsigned int'

Signed-off-by: Randy Dunlap
Cc: Geert Uytterhoeven
Cc: Tyler Hicks
Cc: Dustin Kirkland
Cc: ecryptfs@vger.kernel.org
Signed-off-by: Linus Torvalds

Randy Dunlap
2012-02-29 08:55:30 +0800

28 Feb, 2012

4 commits

a365fbf35 GFS2: Read resource groups on mount ... Browse Code »

This makes mount take slightly longer, but at the same time, the first
write to the filesystem will be faster too. It also means that if there
is a problem in the resource index, then we can refuse to mount rather
than having to try and report that when the first write occurs.

In addition, to avoid recursive locking, we hvae to take account of
instances when the rindex glock may already be held when we are
trying to update the rbtree of resource groups.

Signed-off-by: Steven Whitehouse

Steven Whitehouse
2012-02-28 17:52:39 +0800
9e73f571e GFS2: Ensure rindex is uptodate for fallocate ... Browse Code »

This patch fixes a problem whereby gfs2_grow was failing and causing GFS2
to assert. The problem was that when GFS2's fallocate operation tried to
acquire an "allocation" it made sure the rindex was up to date, and if not,
it called gfs2_rindex_update. However, if the file being fallocated was
the rindex itself, it was already locked at that point. By calling
gfs2_rindex_update at an earlier point in time, we bring rindex up to date
and thereby avoid trying to lock it when the "allocation" is acquired.

Signed-off-by: Bob Peterson
Signed-off-by: Steven Whitehouse

Bob Peterson
2012-02-28 17:48:30 +0800
718b97bd6 GFS2: Read in rindex if necessary during unlink ... Browse Code »

This patch fixes a problem whereby you were unable to delete
files until other file system operations were done (such as
statfs, touch, writes, etc.) that caused the rindex to be
read in.

Signed-off-by: Bob Peterson
Signed-off-by: Steven Whitehouse

Bob Peterson
2012-02-28 17:48:02 +0800
4043b886b GFS2: Fix race between lru_list and glock ref count ... Browse Code »

This patch fixes a narrow race window between the glock ref count
hitting zero and glocks being removed from the lru_list.

Signed-off-by: Steven Whitehouse

Steven Whitehouse
2012-02-28 17:43:07 +0800

27 Feb, 2012

3 commits

f621c5334 Merge branch 'master' of /Volumes/CaseSensitiveDisk/linux Browse Code »

Anton Altaparmakov
2012-02-27 17:01:22 +0800
5bccda0eb cifs: fix dentry refcount leak when opening a FIFO on lookup ... Browse Code »
1

The cifs code will attempt to open files on lookup under certain
circumstances. What happens though if we find that the file we opened
was actually a FIFO or other special file?

Currently, the open filehandle just ends up being leaked leading to
a dentry refcount mismatch and oops on umount. Fix this by having the
code close the filehandle on the server if it turns out not to be a
regular file. While we're at it, change this spaghetti if statement
into a switch too.

Cc: stable@vger.kernel.org
Reported-by: CAI Qian
Tested-by: CAI Qian
Reviewed-by: Shirish Pargaonkar
Signed-off-by: Jeff Layton
Signed-off-by: Steve French

Jeff Layton
2012-02-27 13:16:26 +0800
6de2ce423 CIFS: Fix mkdir/rmdir bug for the non-POSIX case ... Browse Code »

Currently we do inc/drop_nlink for a parent directory for every
mkdir/rmdir calls. That's wrong when Unix extensions are disabled
because in this case a server doesn't follow the same semantic and
returns the old value on the next QueryInfo request. As the result,
we update our value with the server one and then decrement it on
every rmdir call - go to negative nlink values.

Fix this by removing inc/drop_nlink for the parent directory from
mkdir/rmdir, setting it for a revalidation and ignoring NumberOfLinks
for directories when Unix extensions are disabled.

Signed-off-by: Pavel Shilovsky
Reviewed-by: Jeff Layton
Signed-off-by: Steve French

Pavel Shilovsky
2012-02-27 12:59:43 +0800

26 Feb, 2012

1 commit

a32744d4a autofs: work around unhappy compat problem on x86-64 ... Browse Code »
89

When the autofs protocol version 5 packet type was added in commit
5c0a32fc2cd0 ("autofs4: add new packet type for v5 communications"), it
obvously tried quite hard to be word-size agnostic, and uses explicitly
sized fields that are all correctly aligned.

However, with the final "char name[NAME_MAX+1]" array at the end, the
actual size of the structure ends up being not very well defined:
because the struct isn't marked 'packed', doing a "sizeof()" on it will
align the size of the struct up to the biggest alignment of the members
it has.

And despite all the members being the same, the alignment of them is
different: a "__u64" has 4-byte alignment on x86-32, but native 8-byte
alignment on x86-64. And while 'NAME_MAX+1' ends up being a nice round
number (256), the name[] array starts out a 4-byte aligned.

End result: the "packed" size of the structure is 300 bytes: 4-byte, but
not 8-byte aligned.

As a result, despite all the fields being in the same place on all
architectures, sizeof() will round up that size to 304 bytes on
architectures that have 8-byte alignment for u64.

Note that this is *not* a problem for 32-bit compat mode on POWER, since
there __u64 is 8-byte aligned even in 32-bit mode. But on x86, 32-bit
and 64-bit alignment is different for 64-bit entities, and as a result
the structure that has exactly the same layout has different sizes.

So on x86-64, but no other architecture, we will just subtract 4 from
the size of the structure when running in a compat task. That way we
will write the properly sized packet that user mode expects.

Not pretty. Sadly, this very subtle, and unnecessary, size difference
has been encoded in user space that wants to read packets of *exactly*
the right size, and will refuse to touch anything else.

Reported-and-tested-by: Thomas Meyer
Signed-off-by: Ian Kent
Signed-off-by: Linus Torvalds

Ian Kent
2012-02-26 04:10:27 +0800

25 Feb, 2012

3 commits

971316f05 epoll: ep_unregister_pollwait() can use the freed pwq->whead ... Browse Code »
1

signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
this is not enough. eppoll_entry->whead still points to the memory
we are going to free, ep_unregister_pollwait()->remove_wait_queue()
is obviously unsafe.

Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
change ep_unregister_pollwait() to check pwq->whead != NULL under
rcu_read_lock() before remove_wait_queue(). We add the new helper,
ep_remove_wait_queue(), for this.

This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
ep_unregister_pollwait()->remove_wait_queue() can play with already
freed and potentially reused ->sighand, but this is fine. This memory
must have the valid ->signalfd_wqh until rcu_read_unlock().

Reported-by: Maxime Bizon
Cc:
Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-02-25 03:42:50 +0800
d80e731ec epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() ... Browse Code »
1

This patch is intentionally incomplete to simplify the review.
It ignores ep_unregister_pollwait() which plays with the same wqh.
See the next change.

epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
f_op->poll() needs. In particular it assumes that the wait queue
can't go away until eventpoll_release(). This is not true in case
of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
which is not connected to the file.

This patch adds the special event, POLLFREE, currently only for
epoll. It expects that init_poll_funcptr()'ed hook should do the
necessary cleanup. Perhaps it should be defined as EPOLLFREE in
eventpoll.

__cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
->signalfd_wqh is not empty, we add the new signalfd_cleanup()
helper.

ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
This make this poll entry inconsistent, but we don't care. If you
share epoll fd which contains our sigfd with another process you
should blame yourself. signalfd is "really special". I simply do
not know how we can define the "right" semantics if it used with
epoll.

The main problem is, epoll calls signalfd_poll() once to establish
the connection with the wait queue, after that signalfd_poll(NULL)
returns the different/inconsistent results depending on who does
EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
has nothing to do with the file, it works with the current thread.

In short: this patch is the hack which tries to fix the symptoms.
It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.

Note:

- we do not have wake_up_all_poll() but wake_up_poll()
is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

- signalfd_cleanup() uses POLLHUP along with POLLFREE,
we need a couple of simple changes in eventpoll.c to
make sure it can't be "lost".

Reported-by: Maxime Bizon
Cc:
Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-02-25 03:42:50 +0800
855a85f70 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Quoth Chris:
"This is later than I wanted because I got backed up running through
btrfs bugs from the Oracle QA teams. But they are all bug fixes that
we've queued and tested since rc1.

Nothing in particular stands out, this just reflects bug fixing and QA
done in parallel by all the btrfs developers. The most user visible
of these is:

Btrfs: clear the extent uptodate bits during parent transid failures

Because that helps deal with out of date drives (say an iscsi disk
that has gone away and come back). The old code wasn't always
properly retrying the other mirror for this type of failure."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
Btrfs: fix compiler warnings on 32 bit systems
Btrfs: increase the global block reserve estimates
Btrfs: clear the extent uptodate bits during parent transid failures
Btrfs: add extra sanity checks on the path names in btrfs_mksubvol
Btrfs: make sure we update latest_bdev
Btrfs: improve error handling for btrfs_insert_dir_item callers
Btrfs: be less strict on finding next node in clear_extent_bit
Btrfs: fix a bug on overcommit stuff
Btrfs: kick out redundant stuff in convert_extent_bit
Btrfs: skip states when they does not contain bits to clear
Btrfs: check return value of lookup_extent_mapping() correctly
Btrfs: fix deadlock on page lock when doing auto-defragment
Btrfs: fix return value check of extent_io_ops
btrfs: honor umask when creating subvol root
btrfs: silence warning in raid array setup
btrfs: fix structs where bitfields and spinlock/atomic share 8B word
btrfs: delalloc for page dirtied out-of-band in fixup worker
Btrfs: fix memory leak in load_free_space_cache()
btrfs: don't check DUP chunks twice
Btrfs: fix trim 0 bytes after a device delete
...

Linus Torvalds
2012-02-25 01:02:53 +0800

24 Feb, 2012

1 commit

e77266e4c Btrfs: fix compiler warnings on 32 bit systems ... Browse Code »

The enospc tracing code added some interesting uses of
u64 pointer casts.

Signed-off-by: Chris Mason

Chris Mason
2012-02-24 23:39:05 +0800