Eric Lee / smarc-fsl-linux-kernel

01 Dec, 2016

1 commit

a107bf8b3 isofs: add KERN_CONT to printing of ER records

The ER records are printed without explicit log level presuming line
continuation until "\n". After the commit 4bcc595ccd8 (printk:
reinstate KERN_CONT for printing continuation lines), the ER records are
printed a character per line.

Adding KERN_CONT to appropriate printk statements restores the printout
behavior.

Signed-off-by: Mike Rapoport
Signed-off-by: Linus Torvalds

Mike Rapoport
2016-12-01 02:41:26 +0800

18 Oct, 2016

1 commit

a2ed0b391 isofs: Do not return EACCES for unknown filesystems

When isofs_mount() is called to mount a device read-write, it returns
EACCES even before it checks that the device actually contains an isofs
filesystem. This may confuse mount(8) which then tries to mount all
subsequent filesystem types in read-only mode.

Fix the problem by returning EACCES only once we verify that the device
indeed contains an iso9660 filesystem.

CC: stable@vger.kernel.org
Fixes: 17b7f7cf58926844e1dd40f5eb5348d481deca6a
Reported-by: Kent Overstreet
Reported-by: Karel Zak
Signed-off-by: Jan Kara

Jan Kara
2016-10-18 17:28:21 +0800
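
The ordering described in this commit can be illustrated with a small user-space sketch. It is not the kernel code: `looks_like_iso9660()` and `sketch_isofs_mount()` are hypothetical names, and the only format detail relied on is that an ISO 9660 image carries the identifier "CD001" at byte 1 of sector 16 (2048-byte sectors).

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* ISO 9660 places its first volume descriptor in sector 16;
 * bytes 1..5 of that descriptor hold the identifier "CD001". */
#define ISO_SECTOR_SIZE     2048
#define ISO_FIRST_VD_SECTOR 16

/* Hypothetical helper: returns 1 if the image carries the "CD001" magic. */
static int looks_like_iso9660(const unsigned char *img, size_t len)
{
    size_t off = (size_t)ISO_FIRST_VD_SECTOR * ISO_SECTOR_SIZE + 1;

    if (len < off + 5)
        return 0;
    return memcmp(img + off, "CD001", 5) == 0;
}

/* Hypothetical mount routine: returns 0 or a negative errno value. */
static int sketch_isofs_mount(const unsigned char *img, size_t len,
                              int read_only)
{
    /* Identify the filesystem first ... */
    if (!looks_like_iso9660(img, len))
        return -EINVAL;
    /* ... and only then refuse a read-write mount. */
    if (!read_only)
        return -EACCES;
    return 0;
}
```

With this ordering a caller that probes several filesystem types sees EACCES only for a device that really holds an iso9660 image, so it has no reason to retry the remaining types read-only.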

01 Aug, 2016

1 commit

6fa67e707 get rid of 'parent' argument of ->d_compare()

Signed-off-by: Al Viro

Al Viro
2016-08-01 04:37:25 +0800

29 Jul, 2016

2 commits

6784725ab Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs updates from Al Viro:
"Assorted cleanups and fixes.

Probably the most interesting part long-term is ->d_init() - that will
have a bunch of followups in (at least) ceph and lustre, but we'll
need to sort the barrier-related rules before it can get used for
really non-trivial stuff.

Another fun thing is the merge of ->d_iput() callers (dentry_iput()
and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
except the one in __d_lookup_lru())"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
fs/dcache.c: avoid soft-lockup in dput()
vfs: new d_init method
vfs: Update lookup_dcache() comment
bdev: get rid of ->bd_inodes
Remove last traces of ->sync_page
new helper: d_same_name()
dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
vfs: clean up documentation
vfs: document ->d_real()
vfs: merge .d_select_inode() into .d_real()
unify dentry_iput() and dentry_unlink_inode()
binfmt_misc: ->s_root is not going anywhere
drop redundant ->owner initializations
ufs: get rid of redundant checks
orangefs: constify inode_operations
missed comment updates from ->direct_IO() prototype change
file_inode(f)->i_mapping is f->f_mapping
trim fsnotify hooks a bit
9p: new helper - v9fs_parent_fid()
debugfs: ->d_parent is never NULL or negative
...

Linus Torvalds
2016-07-29 03:59:05 +0800

554828ee0 Merge branch 'salted-string-hash'

This changes the vfs dentry hashing to mix in the parent pointer at the
_beginning_ of the hash, rather than at the end.

That actually improves both the hash and the code generation, because we
can move more of the computation to the "static" part of the dcache
setup, and do less at lookup runtime.

It turns out that a lot of other hash users also really wanted to mix in
a base pointer as a 'salt' for the hash, and so the slightly extended
interface ends up working well for other cases too.

Users that want a string hash that is purely about the string pass in a
'salt' pointer of NULL.

* merge branch 'salted-string-hash':
fs/dcache.c: Save one 32-bit multiply in dcache lookup
vfs: make the string hashes salt the hash

Linus Torvalds
2016-07-29 03:26:31 +0800

01 Jul, 2016

1 commit

f4e6d844b Remove last traces of ->sync_page

Commit 7eaceaccab5f removed ->sync_page, but a few mentions of it still
existed in documentation and comments.

Signed-off-by: Matthew Wilcox
Signed-off-by: Al Viro

Matthew Wilcox
2016-07-01 11:30:52 +0800

11 Jun, 2016

1 commit

8387ff257 vfs: make the string hashes salt the hash

We always mixed in the parent pointer into the dentry name hash, but we
did it late at lookup time. It turns out that we can simplify that
lookup-time action by salting the hash with the parent pointer early
instead of late.

A few other users of our string hashes also wanted to mix in their own
pointers into the hash, and those are updated to use the same mechanism.

Hash users that don't have any particular initial salt can just use the
NULL pointer as a no-salt.

Cc: Vegard Nossum
Cc: George Spelvin
Cc: Al Viro
Signed-off-by: Linus Torvalds

Linus Torvalds
2016-06-11 11:21:46 +0800
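
As a rough illustration of salting a string hash at the start rather than mixing the pointer in at the end, here is a user-space sketch. The mixing step is an FNV-style stand-in chosen for brevity, not the kernel's hash, and `salted_name_hash()` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical salted string hash: the salt pointer seeds the state
 * before any name bytes are mixed in. A NULL salt means "no salt". */
static uint32_t salted_name_hash(const void *salt, const char *name,
                                 size_t len)
{
    uint32_t hash = (uint32_t)(uintptr_t)salt;  /* early salting */
    size_t i;

    for (i = 0; i < len; i++)
        hash = (hash ^ (unsigned char)name[i]) * 16777619u;
    return hash;
}
```

The same name hashed under two different parent pointers lands in different places, and a caller that only cares about the string passes NULL.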

08 Jun, 2016

1 commit

dfec8a14f fs: have ll_rw_block users pass in op and flags separately

This has ll_rw_block users pass in the operation and flags separately,
so ll_rw_block can setup the bio op and bi_rw flags on the bio that
is submitted.

Signed-off-by: Mike Christie
Reviewed-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Mike Christie
2016-06-08 03:41:38 +0800

10 May, 2016

1 commit

e89910899 isofs: switch to ->iterate_shared()

Signed-off-by: Al Viro

Al Viro
2016-05-10 00:53:03 +0800

09 May, 2016

2 commits

e17a21d3b get_acorn_filename(): deobfuscate a bit

Lots of Idiotic Silly Parentheses is -> that way... What that
condition checks is that there are exactly 32 bytes between the
end of the name and the end of the entire directory record.

Signed-off-by: Al Viro

Al Viro
2016-05-09 23:42:20 +0800
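
The condition described in this commit can be written out as a small sketch. It assumes the usual ISO 9660 directory record layout (33 bytes of fixed fields before the name, with the name area padded to an even total); `has_acorn_trailer()` and `DIR_RECORD_HEADER_LEN` are hypothetical names, not the kernel's.

```c
#include <stddef.h>

/* Assumed layout: 33 bytes of fixed fields precede the file name. */
#define DIR_RECORD_HEADER_LEN 33u

/* Returns 1 when exactly 32 bytes lie between the end of the (padded)
 * name and the end of the whole directory record, 0 otherwise. */
static int has_acorn_trailer(unsigned int record_len, unsigned int name_len)
{
    unsigned int std = DIR_RECORD_HEADER_LEN + name_len;

    if (std & 1)
        std++;  /* a pad byte keeps the standard part even-sized */
    return record_len >= std && record_len - std == 32;
}
```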

a063ff1e4 Merge branch 'for-linus' into work.lookups

Al Viro
2016-05-09 23:41:30 +0800

08 May, 2016

1 commit

99d825822 get_rock_ridge_filename(): handle malformed NM entries

Payloads of NM entries are not supposed to contain NUL. When we run
into such, only the part prior to the first NUL goes into the
concatenation (i.e. the directory entry name being encoded by a bunch
of NM entries). We do stop when the amount collected so far + the
claimed amount in the current NM entry exceed 254. So far, so good,
but what we return as the total length is the sum of *claimed*
sizes, not the actual amount collected. And that can grow pretty
large - not unlimited, since you'd need to put CE entries in
between to be able to get more than the maximum that could be
contained in one isofs directory entry / continuation chunk and
we stop once we've encountered 32 CEs, but you can get about 8Kb
easily. And that's what will be passed to readdir callback as the
name length. 8Kb __copy_to_user() from a buffer allocated by
__get_free_page()

Cc: stable@vger.kernel.org # 0.98pl6+ (yes, really)
Signed-off-by: Al Viro

Al Viro
2016-05-08 10:52:39 +0800
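
The difference between claimed and actually collected lengths can be sketched in user space as follows. `struct nm_chunk` and `join_nm_chunks()` are hypothetical stand-ins for the Rock Ridge NM handling, and the 254-byte limit is taken from the message above; the point is only that the returned length counts the bytes really copied.

```c
#include <stddef.h>
#include <string.h>

#define NAME_LIMIT 254

/* Hypothetical view of one NM entry: payload plus its claimed size. */
struct nm_chunk {
    const char *payload;
    size_t claimed_len;
};

/* Concatenates chunk payloads into out (at least NAME_LIMIT + 1 bytes)
 * and returns the number of bytes actually collected. */
static size_t join_nm_chunks(const struct nm_chunk *chunks, size_t n,
                             char *out)
{
    size_t total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Only the part before the first NUL counts. */
        const char *nul = memchr(chunks[i].payload, '\0',
                                 chunks[i].claimed_len);
        size_t actual = nul ? (size_t)(nul - chunks[i].payload)
                            : chunks[i].claimed_len;

        if (total + actual > NAME_LIMIT)
            break;
        memcpy(out + total, chunks[i].payload, actual);
        total += actual;  /* sum of actual, not claimed, sizes */
    }
    out[total] = '\0';
    return total;
}
```

For two chunks claiming 7 and 2 bytes where the first contains an embedded NUL after "ab", the sketch reports 4 bytes ("abcd") rather than 9.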

05 Apr, 2016

1 commit

09cbfeaf1 mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.

This promise never materialized, and it is unlikely that it ever will.

We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is a revert of changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.

There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2016-04-05 01:41:08 +0800

15 Jan, 2016

1 commit

5d097056c kmemcg: account certain kmem allocations to memcg

Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:

- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Cc: Greg Thelen
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2016-01-15 08:00:49 +0800

09 Dec, 2015

1 commit

21fc61c73 don't put symlink bodies in pagecache into highmem

kmap() in page_follow_link_light() needed to go - allowing an
arbitrary number of kmaps to be held for long is a great way to deadlock
the system.

A new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlink inodes; done for all in-tree cases. page_follow_link_light()
is instrumented to yell about anything missed.

Signed-off-by: Al Viro

Al Viro
2015-12-09 11:41:36 +0800

16 Apr, 2015

1 commit

2b0143b5c VFS: normal filesystems (and lustre): d_inode() annotations

that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2015-04-16 03:06:57 +0800

07 Jan, 2015

1 commit

e4a93be6c isofs: Fix bug in the way to check if the year is a leap year

Replaced the whole algorithm with a call to mktime64, which takes
care of all those details.

Signed-off-by: Oscar Forner Martinez
Signed-off-by: Jan Kara

Oscar Forner Martinez
2015-01-07 16:51:49 +0800

19 Dec, 2014

1 commit

4e2024624 isofs: Fix unchecked printing of ER records

We didn't check the length of rock ridge ER records before printing them.
Thus a corrupted isofs image can cause us to access and print some memory
beyond the buffer with obvious consequences.

Reported-and-tested-by: Carl Henrik Lunde
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara

Jan Kara
2014-12-19 18:29:24 +0800
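
The kind of length check this commit adds can be sketched in user space. The sketch assumes an ER record with 8 header bytes (signature, length, version, len_id, len_des, len_src, ext_ver) followed by the identifier; `copy_er_id()` and `ER_HEADER_LEN` are hypothetical names, and the real kernel check is written differently.

```c
#include <stddef.h>
#include <string.h>

/* Assumed layout: 8 header bytes, then len_id bytes of identifier. */
#define ER_HEADER_LEN 8u

/* Copies the ER identifier into out as a NUL-terminated string.
 * Returns its length, or -1 if the claimed identifier length does not
 * fit inside the record (or inside out). */
static int copy_er_id(const unsigned char *rec, size_t rec_len,
                      char *out, size_t out_cap)
{
    size_t len_id;

    if (rec_len < ER_HEADER_LEN)
        return -1;
    len_id = rec[4];
    /* Refuse to read past the end of the record. */
    if (ER_HEADER_LEN + len_id > rec_len)
        return -1;
    if (len_id + 1 > out_cap)
        return -1;
    memcpy(out, rec + ER_HEADER_LEN, len_id);
    out[len_id] = '\0';
    return (int)len_id;
}
```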

17 Dec, 2014

1 commit

31f48fc8f Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull isofs and reiserfs fixes from Jan Kara:
"A reiserfs and an isofs fix. They arrived after I sent you my first
pull request and I don't want to delay them unnecessarily till rc2"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
isofs: Fix infinite looping over CE entries
reiserfs: destroy allocated commit workqueue

Linus Torvalds
2014-12-17 07:46:01 +0800

15 Dec, 2014

1 commit

f54e18f1b isofs: Fix infinite looping over CE entries

Rock Ridge extensions define so-called Continuation Entries (CE), which
specify where further space with Rock Ridge data is located. A corrupted
isofs image can contain an arbitrarily long chain of these, including one
containing a loop, thus causing the kernel to end up in an infinite loop
when traversing these entries.

Limit the traversal to 32 entries, which should be more than enough space
to store all the Rock Ridge data.

Reported-by: P J P
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara

Jan Kara
2014-12-15 22:53:26 +0800
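
A bounded traversal of this kind can be sketched as follows. The chain is modelled as an array of "next" indices instead of on-disk CE entries, and `walk_continuations()` and `MAX_CE_HOPS` are hypothetical names; only the limit of 32 comes from the commit message.

```c
#include <stddef.h>

#define MAX_CE_HOPS 32

/* next[i] is the index of the area that area i continues into, or a
 * negative value if the chain ends there. Returns the number of
 * continuations followed, or -1 if the chain is longer than
 * MAX_CE_HOPS (for example because it loops) or leaves the table. */
static int walk_continuations(const int *next, size_t n_areas, int start)
{
    int hops = 0;
    int cur = start;

    while (cur >= 0 && (size_t)cur < n_areas) {
        if (next[cur] < 0)
            return hops;      /* chain ended normally */
        if (++hops > MAX_CE_HOPS)
            return -1;        /* too long: treat as corrupted */
        cur = next[cur];
    }
    return -1;                /* link points outside the table */
}
```

A fixed hop limit is cheaper than remembering visited entries and still guarantees termination on a looping chain.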

20 Nov, 2014

1 commit

7ca2f2344 isofs: avoid unused function warning

With the isofs_hash() function removed, isofs_hash_ms() is the only user
of isofs_hash_common(), but it's defined inside of an #ifdef, which triggers
this gcc warning in ARM axm55xx_defconfig starting with v3.18-rc3:

fs/isofs/inode.c:177:1: warning: 'isofs_hash_common' defined but not used [-Wunused-function]

This patch moves the function inside of the same #ifdef section to avoid that
warning, which seems the best compromise of a relatively harmless patch for
a late -rc.

Signed-off-by: Arnd Bergmann
Fixes: b0afd8e5db7b ("isofs: don't bother with ->d_op for normal case")
Signed-off-by: Al Viro

Arnd Bergmann
2014-11-20 02:09:37 +0800

31 Oct, 2014

1 commit

b0afd8e5d isofs: don't bother with ->d_op for normal case

we only need it for joliet and case-insensitive mounts

Signed-off-by: Al Viro

Al Viro
2014-10-31 18:33:17 +0800

29 Oct, 2014

1 commit

f643ff550 isofs_cmp(): we'll never see a dentry for . or ..

Signed-off-by: Al Viro

Al Viro
2014-10-29 06:37:40 +0800

14 Oct, 2014

1 commit

a97df4277 isofs: replace strnicmp with strncasecmp

The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics and
a slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp
was renamed to strncasecmp, and strnicmp made into a wrapper for the new
strncasecmp to avoid breaking existing users.

To allow the compat wrapper strnicmp to be removed at some point in the
future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.

Signed-off-by: Rasmus Villemoes
Reviewed-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rasmus Villemoes
2014-10-14 08:18:24 +0800

20 Aug, 2014

1 commit

410dd3cf4 isofs: Fix unbounded recursion when processing relocated directories

We did not check the relocated directory in any way when processing the
Rock Ridge 'CL' tag. Thus a corrupted isofs image can have a CL
entry pointing to another CL entry, leading to possibly unbounded
recursion in kernel code and thus stack overflow or deadlocks (if there
is a loop created from CL entries).

Fix the problem by not allowing a CL entry to point to a directory entry
with a CL entry (such use makes no good sense anyway) and by checking
that a CL entry doesn't point to itself.

CC: stable@vger.kernel.org
Reported-by: Chris Evans
Signed-off-by: Jan Kara

Jan Kara
2014-08-20 00:29:30 +0800

09 Aug, 2014

1 commit

d97b07c54 initramfs: support initramfs that is bigger than 2GiB

Now with 64-bit bzImage and kexec tools, we support a ramdisk whose size
is bigger than 2G, as we can put it above 4G.

We found that a compressed initramfs image could not be decompressed
properly. It turns out that the image length is an int during decompress
detection, and it becomes < 0 when the length is more than 2G.
Furthermore, during decompression, len as an int is used for the inbuf
count, which has the same problem.

Change len to long; that should be ok, as on 32-bit platforms long is
32 bits.

Tested with the following compressed initramfs images as root with kexec:
gzip, bzip2, xz, lzma, lzop, lz4.
Run time for populate_rootfs():

size        name            Nehalem-EX  Westmere-EX  Ivybridge-EX
9034400256  root_img      : 26s         24s          30s
3561095057  root_img.lz4  : 28s         27s          27s
3459554629  root_img.lzo  : 29s         29s          28s
3219399480  root_img.gz   : 64s         62s          49s
2251594592  root_img.xz   : 262s        260s         183s
2226366598  root_img.lzma : 386s        376s         277s
2901482513  root_img.bz2  : 635s        599s

Signed-off-by: Yinghai Lu
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Rashika Kheria
Cc: Josh Triplett
Cc: Kyungsik Lee
Cc: P J P
Cc: Al Viro
Cc: Tetsuo Handa
Cc: "Daniel M. Weeks"
Cc: Alexandre Courbot
Cc: Jan Beulich
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yinghai Lu
2014-08-09 06:57:26 +0800

08 Apr, 2014

1 commit

a7963eb7f Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull ext3 improvements, cleanups, reiserfs fix from Jan Kara:
"various cleanups for ext2, ext3, udf, isofs, a documentation update
for quota, and a fix of a race in reiserfs readdir implementation"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: fix race in readdir
ext2: acl: remove unneeded include of linux/capability.h
ext3: explicitly remove inode from orphan list after failed direct io
fs/isofs/inode.c add __init to init_inodecache()
ext3: Speedup WB_SYNC_ALL pass
fs/quota/Kconfig: Update filesystems
ext3: Update outdated comment before ext3_ordered_writepage()
ext3: Update PF_MEMALLOC handling in ext3_write_inode()
ext2/3: use prandom_u32() instead of get_random_bytes()
ext3: remove an unneeded check in ext3_new_blocks()
ext3: remove unneeded check in ext3_ordered_writepage()
fs: Mark function as static in ext3/xattr_security.c
fs: Mark function as static in ext3/dir.c
fs: Mark function as static in ext2/xattr_security.c
ext3: Add __init macro to init_inodecache
ext2: Add __init macro to init_inodecache
udf: Add __init macro to init_inodecache
fs: udf: parse_options: blocksize check

Linus Torvalds
2014-04-08 08:59:17 +0800

13 Mar, 2014

2 commits

02b9984d6 fs: push sync_filesystem() down to the file system's remount_fs()

Previously, the no-op "mount -o remount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs(). This seems pretty stupid, and it's certainly
not documented or guaranteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior. In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o"
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig
Cc: Artem Bityutskiy
Cc: Adrian Hunter
Cc: Evgeniy Dushistov
Cc: Jan Kara
Cc: OGAWA Hirofumi
Cc: Anders Larsen
Cc: Phillip Lougher
Cc: Kees Cook
Cc: Mikulas Patocka
Cc: Petr Vandrovec
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org

Theodore Ts'o
2014-03-13 22:14:33 +0800

b3b749b7a fs/isofs/inode.c add __init to init_inodecache()

init_inodecache is only called by __init init_iso9660_fs

Signed-off-by: Fabian Frederick
Signed-off-by: Jan Kara

Fabian Frederick
2014-03-13 05:52:39 +0800

25 Oct, 2013

1 commit

966c1f75f isofs: don't pass dentry to isofs_hash{i,}_common()

Signed-off-by: Al Viro

Al Viro
2013-10-25 11:34:59 +0800

01 Aug, 2013

1 commit

17b7f7cf5 isofs: Refuse RW mount of the filesystem instead of making it RO

Refuse RW mount of isofs filesystem. So far we just silently changed it
to RO mount but when the media is writeable, block layer won't notice
this change and thus will think device is used RW and will block eject
button of the drive. That is unexpected by users because for
non-writeable media eject button works just fine.

Userspace mount(8) command handles this just fine and retries mounting
with MS_RDONLY set so userspace shouldn't see any regression. Plus any
tool mounting isofs is likely confronted with the case of read-only
media where block layer already refuses to mount the filesystem without
MS_RDONLY set so our behavior shouldn't be anything new for it.

Reported-by: Hui Wang
Signed-off-by: Jan Kara

Jan Kara
2013-08-01 04:14:50 +0800

29 Jun, 2013

2 commits

da53be12b Don't pass inode to ->d_hash() and ->d_compare()

Instances either don't look at it at all (the majority of cases) or
only want it to find the superblock (which can be had as dentry->d_sb).
A few cases that want more are actually safe with dentry->d_inode -
the only precaution needed is the check that it hadn't been replaced with
NULL by rmdir() or by overwriting rename(), which case should be simply
treated as cache miss.

Signed-off-by: Linus Torvalds
Signed-off-by: Al Viro

Linus Torvalds
2013-06-29 16:57:36 +0800

bfee7169c [readdir] convert isofs

Signed-off-by: Al Viro

Al Viro
2013-06-29 16:56:47 +0800

13 Mar, 2013

1 commit

fa7614ddd fs: Readd the fs module aliases.

I had assumed that the only use of module aliases for filesystems
prior to "fs: Limit sys_mount to only request filesystem modules."
was in request_module. It turns out I was wrong. At least mkinitcpio
in Arch Linux uses these aliases.

So re-add the preexisting aliases, to keep from breaking userspace.

Userspace eventually will have to follow and use the same aliases the
kernel does. So at some point we may be able to delete these aliases
without problems. However that day is not today.

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-03-13 09:55:21 +0800

04 Mar, 2013

1 commit

7f78e0351 fs: Limit sys_mount to only request filesystem modules.

Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.

A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.

Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.

Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives. Allowing simple, safe,
well understood work-arounds to known problematic software.

This also addresses a rare but unfortunate problem where the filesystem
name is not the same as its module name and module auto-loading
would not work. While writing this patch I saw a handful of such
cases. The most significant being autofs that lives in the module
autofs4.

This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.

After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module. The common pattern in the kernel is to call request_module()
without regard to the user's permissions. In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesystem module unless the filesystem is mounted. In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.

Acked-by: Serge Hallyn
Acked-by: Kees Cook
Reported-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-03-04 11:36:31 +0800
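
The naming scheme described in this commit can be sketched in user space: the requested module name is the filesystem type with an "fs-" prefix. `fs_module_alias()` is a hypothetical helper, not a kernel function; it only shows how a user-supplied type string ends up confined to the "fs-" namespace.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Builds the module alias "fs-<fstype>" in out.
 * Returns 0 on success, -1 if the result would not fit. */
static int fs_module_alias(char *out, size_t cap, const char *fstype)
{
    /* "fs-" plus the name plus the terminating NUL must fit. */
    if (strlen(fstype) + sizeof("fs-") > cap)
        return -1;
    snprintf(out, cap, "fs-%s", fstype);
    return 0;
}
```

For the type "iso9660" the sketch yields "fs-iso9660", so a mount request can only ever ask for modules that advertise a matching "fs-" alias.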

26 Feb, 2013

1 commit

94e07a759 fs: encode_fh: return FILEID_INVALID if invalid fid_type

This patch is a follow up on below patch:

[PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63

Signed-off-by: Namjae Jeon
Signed-off-by: Vivek Trivedi
Acked-by: Steven Whitehouse
Acked-by: Sage Weil
Signed-off-by: Al Viro

Namjae Jeon
2013-02-26 15:46:10 +0800

23 Feb, 2013

1 commit

496ad9aa8 new helper: file_inode(file)

Signed-off-by: Al Viro

Al Viro
2013-02-23 12:31:31 +0800

10 Oct, 2012

1 commit

35c2a7f49 tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking

Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
u64 inum = fid->raw[2];
which is unhelpfully reported as at the end of shmem_alloc_inode():

BUG: unable to handle kernel paging request at ffff880061cd3000
IP: [] shmem_alloc_inode+0x40/0x40
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Call Trace:
[] ? exportfs_decode_fh+0x79/0x2d0
[] do_handle_open+0x163/0x2c0
[] sys_open_by_handle_at+0xc/0x10
[] tracesys+0xe1/0xe6

Right, tmpfs is being stupid to access fid->raw[2] before validating that
fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
fall at the end of a page, and the next page not be present.

But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
could oops in the same way: add the missing fh_len checks to those.

Reported-by: Sasha Levin
Signed-off-by: Hugh Dickins
Cc: Al Viro
Cc: Sage Weil
Cc: Steven Whitehouse
Cc: Christoph Hellwig
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro

Hugh Dickins
2012-10-10 11:33:55 +0800
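
The missing check can be sketched in user space as below. It assumes, as the message implies, that fh_len counts the 32-bit words actually present in the handle; `decode_inum()` is a hypothetical helper, and the way the two words are combined is only an illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes a 64-bit inode number from words 1 and 2 of a file handle.
 * fh_len is the number of 32-bit words the caller really supplied.
 * Returns 0 on success, -1 if the handle is too short. */
static int decode_inum(const uint32_t *raw, int fh_len, uint64_t *inum)
{
    /* Validate the length before touching raw[2]. */
    if (fh_len < 3)
        return -1;
    *inum = ((uint64_t)raw[2] << 32) | raw[1];
    return 0;
}
```

Checking first matters because the handle buffer may end exactly at a page boundary, so reading even one word past fh_len can fault.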

03 Oct, 2012

2 commits

aab174f0d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs update from Al Viro:

- big one - consolidation of descriptor-related logics; almost all of
that is moved to fs/file.c

(BTW, I'm seriously tempted to rename the result to fd.c. As it is,
we have a situation when file_table.c is about handling of struct
file and file.c is about handling of descriptor tables; the reasons
are historical - file_table.c used to be about a static array of
struct file we used to have way back).

A lot of stray ends got cleaned up and converted to saner primitives,
disgusting mess in android/binder.c is still disgusting, but at least
doesn't poke so much in descriptor table guts anymore. A bunch of
relatively minor races got fixed in process, plus an ext4 struct file
leak.

- related thing - fget_light() partially unuglified; see fdget() in
there (and yes, it generates the code as good as we used to have).

- also related - bits of Cyrill's procfs stuff that got entangled into
that work; _not_ all of it, just the initial move to fs/proc/fd.c and
switch of fdinfo to seq_file.

- Alex's fs/coredump.c splitoff - the same story, had been easier to
take that commit than mess with conflicts. The rest is a separate
pile, this was just a mechanical code movement.

- a few misc patches all over the place. Not all for this cycle,
there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
MAX_LFS_FILESIZE should be a loff_t
compat: fs: Generic compat_sys_sendfile implementation
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
btrfs: reada_extent doesn't need kref for refcount
coredump: move core dump functionality into its own file
coredump: prevent double-free on an error path in core dumper
usb/gadget: fix misannotations
fcntl: fix misannotations
ceph: don't abuse d_delete() on failure exits
hypfs: ->d_parent is never NULL or negative
vfs: delete surplus inode NULL check
switch simple cases of fget_light to fdget
new helpers: fdget()/fdput()
switch o2hb_region_dev_write() to fget_light()
proc_map_files_readdir(): don't bother with grabbing files
make get_file() return its argument
vhost_set_vring(): turn pollstart/pollstop into bool
switch prctl_set_mm_exe_file() to fget_light()
switch xfs_find_handle() to fget_light()
switch xfs_swapext() to fget_light()
...

Linus Torvalds
2012-10-03 11:25:04 +0800

8c0a85377 fs: push rcu_barrier() from deactivate_locked_super() to filesystems

There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.

Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

Signed-off-by: Kirill A. Shutemov
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Kirill A. Shutemov
2012-10-03 09:35:55 +0800