Eric Lee / smarc-fsl-linux-kernel

22 Jan, 2014

27 commits

df32e43a5 Merge branch 'akpm' (incoming from Andrew) ... Browse Code »

Merge first patch-bomb from Andrew Morton:

- a couple of misc things

- inotify/fsnotify work from Jan

- ocfs2 updates (partial)

- about half of MM

* emailed patches from Andrew Morton : (117 commits)
mm/migrate: remove unused function, fail_migrate_page()
mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages
mm/migrate: correct failure handling if !hugepage_migration_support()
mm/migrate: add comment about permanent failure path
mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure
mm: compaction: reset scanner positions immediately when they meet
mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
mm: compaction: detect when scanners meet in isolate_freepages
mm: compaction: reset cached scanner pfn's before reading them
mm: compaction: encapsulate defer reset logic
mm: compaction: trace compaction begin and end
memcg, oom: lock mem_cgroup_print_oom_info
sched: add tracepoints related to NUMA task migration
mm: numa: do not automatically migrate KSM pages
mm: numa: trace tasks that fail migration due to rate limiting
mm: numa: limit scope of lock for NUMA migrate rate limiting
mm: numa: make NUMA-migrate related functions static
lib/show_mem.c: show num_poisoned_pages when oom
mm/hwpoison: add '#' to hwpoison_inject
mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter
...

Linus Torvalds
2014-01-22 11:05:45 +0800
ff0bc6cc7 Merge tag 'dlm-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm ... Browse Code »

Pull dlm update from David Teigland:
"A single change to speed up recovery times when using SCTP
connections"

* tag 'dlm-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: set zero linger time on sctp socket

Linus Torvalds
2014-01-22 09:42:55 +0800
2182c815f Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw ... Browse Code »

Pull GFS2 updates from Steven Whitehouse:
"The main topics this time are allocation, in the form of Bob's
improvements when searching resource groups and several updates to
quotas which should increase scalability. The quota changes follow on
from those in the last merge window, and there will likely be further
work to come in this area in due course.

There are also a few patches which help to improve efficiency of
adding entries into directories, and clean up some of that code.

One on-disk change is included this time, which is to write some
additional information which should be useful to fsck and also
potentially for debugging.

Other than that, its just a few small random bug fixes and clean ups"

* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (24 commits)
GFS2: revert "GFS2: d_splice_alias() can't return error"
GFS2: Small cleanup
GFS2: Don't use ENOBUFS when ENOMEM is the correct error code
GFS2: Fix kbuild test robot reported warning
GFS2: Move quota bitmap operations under their own lock
GFS2: Clean up quota slot allocation
GFS2: Only run logd and quota when mounted read/write
GFS2: Use RCU/hlist_bl based hash for quotas
GFS2: No need to invalidate pages for a dio read
GFS2: Add initialization for address space in super block
GFS2: Add hints to directory leaf blocks
GFS2: For exhash conversion, only one block is needed
GFS2: Increase i_writecount during gfs2_setattr_chown
GFS2: Remember directory insert point
GFS2: Consolidate transaction blocks calculation for dir add
GFS2: Add directory addition info structure
GFS2: Use only a single address space for rgrps
GFS2: Use range based functions for rgrp sync/invalidation
GFS2: Remove test which is always true
GFS2: Remove gfs2_quota_change_host structure
...

Linus Torvalds
2014-01-22 09:42:00 +0800
34e431b0a /proc/meminfo: provide estimated available memory ... Browse Code »

Many load balancing and workload placing programs check /proc/meminfo to
estimate how much free memory is available. They generally do this by
adding up "free" and "cached", which was fine ten years ago, but is
pretty much guaranteed to be wrong today.

It is wrong because Cached includes memory that is not freeable as page
cache, for example shared memory segments, tmpfs, and ramfs, and it does
not include reclaimable slab memory, which can take up a large fraction
of system memory on mostly idle systems with lots of files.

Currently, the amount of memory that is available for a new workload,
without pushing the system into swap, can be estimated from MemFree,
Active(file), Inactive(file), and SReclaimable, as well as the "low"
watermarks from /proc/zoneinfo.

However, this may change in the future, and user space really should not
be expected to know kernel internals to come up with an estimate for the
amount of free memory.

It is more convenient to provide such an estimate in /proc/meminfo. If
things change in the future, we only have to change it in one place.

Signed-off-by: Rik van Riel
Reported-by: Erik Mouw
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2014-01-22 08:19:43 +0800
af52b040e fs/ramfs: don't use module_init for non-modular core code ... Browse Code »

The ramfs is always built in. It will never be modular, so using
module_init as an alias for __initcall is rather misleading.

Fix this up now, so that we can relocate module_init from init.h into
module.h in the future. If we don't do this, we'd have to add module.h
to obviously non-modular code, and that would be a worse thing.

Note that direct use of __initcall is discouraged, vs. one of the
priority categorized subgroups. As __initcall gets mapped onto
device_initcall, our use of fs_initcall (which makes sense for fs code)
will thus change this registration from level 6-device to level 5-fs
(i.e. slightly earlier). However no observable impact of that small
difference has been observed during testing, or is expected.

Also note that this change uncovers a missing semicolon bug in the
registration of the initcall.

Signed-off-by: Paul Gortmaker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Gortmaker
2014-01-22 08:19:42 +0800
b5bd856a0 fs/super.c: fix WARN on alloc_super() fail path ... Browse Code »

On fail path alloc_super() calls destroy_super(), which issues a warning
if the sb's s_mounts list is not empty, in particular if it has not been
initialized. That said s_mounts must be initialized in alloc_super()
before any possible failure, but currently it is initialized close to
the end of the function leading to a useless warning dumped to log if
either percpu_counter_init() or list_lru_init() fails. Let's fix this.

Signed-off-by: Vladimir Davydov
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2014-01-22 08:19:42 +0800
4e4f9e66a fs/read_write.c:compat_readv(): remove bogus area verify ... Browse Code »

The compat_do_readv_writev() function was doing a verify_area on the
incoming iov, but the nr_segs value is not checked. If someone passes
in a -1 for nr_segs, for instance, the function should return an EINVAL.
However, it returns a EFAULT because the verify_area fails because it is
checking an array of size MAX_UINT. The check is bogus, anyway, because
the next check, compat_rw_copy_check_uvector(), will do all the
necessary checking, anyway. The non-compat do_readv_writev() function
doesn't do this check, so I think it's safe to just remove the code.

Signed-off-by: Corey Minyard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Corey Minyard
2014-01-22 08:19:42 +0800
38316c8ab fs/compat_ioctl.c: fix an underflow issue (harmless) ... Browse Code »

We cap "nmsgs" at I2C_RDRW_IOCTL_MAX_MSGS (42) but the current code
allows negative values. It's harmless but it makes my static checker
upset so I've made nsmgs unsigned.

Signed-off-by: Dan Carpenter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Carpenter
2014-01-22 08:19:42 +0800
0afaa1204 posix_acl: uninlining ... Browse Code »

Uninline vast tracts of nested inline functions in
include/linux/posix_acl.h.

This reduces the text+data+bss size of x86_64 allyesconfig vmlinux by
8026 bytes.

The patch also regularises the positioning of the EXPORT_SYMBOLs in
posix_acl.c.

Cc: Alexander Viro
Cc: J. Bruce Fields
Cc: Trond Myklebust
Tested-by: Benny Halevy
Cc: Benny Halevy
Cc: Andreas Gruenbacher
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2014-01-22 08:19:42 +0800
75f82eaa5 ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously ... Browse Code »

2 nodes cluster, say Node A and Node B, mount the same ocfs2 volume, and
create a file 1.

Node A Node B
open 1, get open lock
rm 1, and then add 1 to orphan_dir
storage link down,
o2hb_write_timeout
->o2quo_disk_timeout
->emergency_restart
at the moment, Node B dismount and do
ocfs2rec simultaneously
1) ocfs2_dismount_volume
->ocfs2_recovery_exit
->wait_event(osb->recovery_event)
->flush_workqueue(ocfs2_wq)
2) ocfs2rec
->queue_work(&journal->j_recovery_work)
->ocfs2_recover_orphans
->ocfs2_commit_truncate
->queue_delayed_work(&osb->osb_truncate_log_wq)

In ocfs2_recovery_exit, it flushes workqueue and then releases system
inodes. When doing ocfs2rec, it will call ocfs2_flush_truncate_log
which will try to get sys_root_inode, and NULL pointer dereference
occurs.

Signed-off-by: Yiwen Jiang
Signed-off-by: joyce
Signed-off-by: Joseph Qi
Cc: Joel Becker
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yiwen Jiang
2014-01-22 08:19:42 +0800
a2a3b3982 ocfs2: punch hole should return EINVAL if the length argument in ioctl is negative ... Browse Code »

An unreserve space ioctl OCFS2_IOC_UNRESVSP/64 should reject a negative
length.

Orabug:14789508

Signed-off-by: Tariq Saseed
Signed-off-by: Srinivas Eeda
Cc: Joel Becker
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tariq Saeed
2014-01-22 08:19:42 +0800
16eac4be4 ocfs2: fix sparse non static symbol warning ... Browse Code »

Fixes the following sparse warning:

fs/ocfs2/stack_user.c:930:32: warning:
symbol 'ocfs2_ls_ops' was not declared. Should it be static?

Signed-off-by: Wei Yongjun
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wei Yongjun
2014-01-22 08:19:42 +0800
1ba2212bb ocfs2: adjust minlen with discard_granularity in the FITRIM ioctl ... Browse Code »

Adjust minlen with discard_granularity for FITRIM ioctl(2) if the given
minimum size in bytes is less than it because, discard granularity is
used to tell us that the minimum size of extent that can be discarded by
the storage device.

This is inspired by ext4 commit 5c2ed62fd447 ("ext4: Adjust minlen with
discard_granularity in the FITRIM ioctl") from Lukas Czerner.

Signed-off-by: Jie Liu
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jie Liu
2014-01-22 08:19:42 +0800
aa89762c5 ocfs2: return EINVAL if the given range to discard is less than block size ... Browse Code »

For FITRIM ioctl(2), we should not keep silence if the given range
length ls less than a block size as there is no data blocks would be
discareded. Hence it should return EINVAL instead. This issue can be
verified via xfstests/generic/288 which is used for FITRIM argument
handling tests.

Signed-off-by: Jie Liu
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jie Liu
2014-01-22 08:19:42 +0800
19e8ac272 ocfs2: return EOPNOTSUPP if the device does not support discard ... Browse Code »

For FITRIM ioctl(2), we should return EOPNOTSUPP to inform the user that
the storage device does not support discard if it is, otherwise return
success would confuse the user even though there is no free blocks were
trimmed at all.

Signed-off-by: Jie Liu
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jie Liu
2014-01-22 08:19:42 +0800
0a2fcd898 ocfs2: remove redundant ocfs2_alloc_dinode_update_counts() and ocfs2_block_group_set_bits() ... Browse Code »

ocfs2_alloc_dinode_update_counts() and ocfs2_block_group_set_bits() are
already provided in suballoc.c. So, the same functions in
move_extents.c are not needed any more.

Declare the functions in suballoc.h and remove redundant functions in
move_extents.c.

Signed-off-by: Younger Liu
Cc: Younger Liu
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Younger Liu
2014-01-22 08:19:42 +0800
c994c2ebd ocfs2: use the new DLM operation callbacks while requesting new lockspace ... Browse Code »

Attempt to use the new DLM operations. If it is not supported, use the
traditional ocfs2_controld.

To exchange ocfs2 versioning, we use the LVB of the version dlm lock.
It first attempts to take the lock in EX mode (non-blocking). If
successful (which means it is the first mount), it writes the version
number and downconverts to PR lock. If it is unsuccessful, it reads the
version from the lock.

If this becomes the standard (with o2cb as well), it could simplify
userspace tools to check if the filesystem is mounted on other nodes.

Dan: Since ocfs2_protocol_version are two u8 values, the additional
checks with LONG* don't make sense.

Signed-off-by: Goldwyn Rodrigues
Signed-off-by: Dan Carpenter
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Cc: Dan Carpenter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Goldwyn Rodrigues
2014-01-22 08:19:42 +0800
415036303 ocfs2: framework for version LVB ... Browse Code »

Use the native DLM locks for version control negotiation. Most of the
framework is taken from gfs2/lock_dlm.c

Signed-off-by: Goldwyn Rodrigues
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Goldwyn Rodrigues
2014-01-22 08:19:41 +0800
3e8341516 ocfs2: pass ocfs2_cluster_connection to ocfs2_this_node ... Browse Code »

This is done to differentiate between using and not using controld and
use the connection information accordingly.

We need to be backward compatible. So, we use a new enum
ocfs2_connection_type to identify when controld is used and when it is
not.

Signed-off-by: Goldwyn Rodrigues
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Goldwyn Rodrigues
2014-01-22 08:19:41 +0800
24aa33861 ocfs2: shift allocation ocfs2_live_connection to user_connect() ... Browse Code »

We perform this because the DLM recovery callbacks will require the
ocfs2_live_connection structure to record the node information when
dlm_new_lockspace() is updated (in the last patch of the series).

Before calling dlm_new_lockspace(), we need the structure ready for the
.recover_done() callback, which would set oc_this_node. This is the
reason we allocate ocfs2_live_connection beforehand in user_connect().

[AKPM] rc initialization is not required because it assigned in case of
errors. It will be cleared by compiler anyways.

Signed-off-by: Goldwyn Rodrigues
Reveiwed-by: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Goldwyn Rodrigues
2014-01-22 08:19:41 +0800
66e188fc3 ocfs2: add DLM recovery callbacks ... Browse Code »

These are the callbacks called by the fs/dlm code in case the membership
changes. If there is a failure while/during calling any of these, the
DLM creates a new membership and relays to the rest of the nodes.

- recover_prep() is called when DLM understands a node is down.
- recover_slot() is called once all nodes have acknowledged
recover_prep and recovery can begin.
- recover_done() is called once the recovery is complete. It returns
the new membership.

Signed-off-by: Goldwyn Rodrigues
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Goldwyn Rodrigues
2014-01-22 08:19:41 +0800
c74a3bdd9 ocfs2: add clustername to cluster connection ... Browse Code »

This is an effort of removing ocfs2_controld.pcmk and getting ocfs2 DLM
handling up to the times with respect to DLM (>=4.0.1) and corosync
(2.3.x). AFAIK, cman also is being phased out for a unified corosync
cluster stack.

fs/dlm performs all the functions with respect to fencing and node
management and provides the API's to do so for ocfs2. For all future
references, DLM stands for fs/dlm code.

The advantages are:
+ No need to run an additional userspace daemon (ocfs2_controld)
+ No controld device handling and controld protocol
+ Shifting responsibilities of node management to DLM layer

For backward compatibility, we are keeping the controld handling code.
Once enough time has passed we can remove a significant portion of the
code. This was tested by using the kernel with changes on older
unmodified tools. The kernel used ocfs2_controld as expected, and
displayed the appropriate warning message.

This feature requires modification in the userspace ocfs2-tools. The
changes can be found at: https://github.com/goldwynr/ocfs2-tools branch:
nocontrold Currently, not many checks are present in the userspace code,
but that would change soon.

This patch (of 6):

Add clustername to cluster connection.

Signed-off-by: Goldwyn Rodrigues
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Goldwyn Rodrigues
2014-01-22 08:19:41 +0800
ff8fb3352 ocfs2: remove versioning information ... Browse Code »

The versioning information is confusing for end-users. The numbers are
stuck at 1.5.0 when the tools version have moved to 1.8.2. Remove the
versioning system in the OCFS2 modules and let the kernel version be the
guide to debug issues.

Signed-off-by: Goldwyn Rodrigues
Acked-by: Sunil Mushran
Cc: Mark Fasheh
Acked-by: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Goldwyn Rodrigues
2014-01-22 08:19:41 +0800
56b27cf60 fsnotify: remove pointless NULL initializers ... Browse Code »

We usually rely on the fact that struct members not specified in the
initializer are set to NULL. So do that with fsnotify function pointers
as well.

Signed-off-by: Jan Kara
Reviewed-by: Christoph Hellwig
Cc: Eric Paris
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2014-01-22 08:19:41 +0800
83c4c4b0a fsnotify: remove .should_send_event callback ... Browse Code »

After removing event structure creation from the generic layer there is
no reason for separate .should_send_event and .handle_event callbacks.
So just remove the first one.

Signed-off-by: Jan Kara
Reviewed-by: Christoph Hellwig
Cc: Eric Paris
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2014-01-22 08:19:41 +0800
7053aee26 fsnotify: do not share events between notification groups ... Browse Code »

Currently fsnotify framework creates one event structure for each
notification event and links this event into all interested notification
groups. This is done so that we save memory when several notification
groups are interested in the event. However the need for event
structure shared between inotify & fanotify bloats the event structure
so the result is often higher memory consumption.

Another problem is that fsnotify framework keeps path references with
outstanding events so that fanotify can return open file descriptors
with its events. This has the undesirable effect that filesystem cannot
be unmounted while there are outstanding events - a regression for
inotify compared to a situation before it was converted to fsnotify
framework. For fanotify this problem is hard to avoid and users of
fanotify should kind of expect this behavior when they ask for file
descriptors from notified files.

This patch changes fsnotify and its users to create separate event
structure for each group. This allows for much simpler code (~400 lines
removed by this patch) and also smaller event structures. For example
on 64-bit system original struct fsnotify_event consumes 120 bytes, plus
additional space for file name, additional 24 bytes for second and each
subsequent group linking the event, and additional 32 bytes for each
inotify group for private data. After the conversion inotify event
consumes 48 bytes plus space for file name which is considerably less
memory unless file names are long and there are several groups
interested in the events (both of which are uncommon). Fanotify event
fits in 56 bytes after the conversion (fanotify doesn't care about file
names so its events don't have to have it allocated). A win unless
there are four or more fanotify groups interested in the event.

The conversion also solves the problem with unmount when only inotify is
used as we don't have to grab path references for inotify events.

[hughd@google.com: fanotify: fix corruption preventing startup]
Signed-off-by: Jan Kara
Reviewed-by: Christoph Hellwig
Cc: Eric Paris
Cc: Al Viro
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2014-01-22 08:19:41 +0800
e9fe69045 inotify: provide function for name length rounding ... Browse Code »

Rounding of name length when passing it to userspace was done in several
places. Provide a function to do it and use it in all places.

Signed-off-by: Jan Kara
Reviewed-by: Christoph Hellwig
Cc: Eric Paris
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2014-01-22 08:19:41 +0800

21 Jan, 2014

1 commit

d3bad75a6 Merge tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core ... Browse Code »

Pull driver core / sysfs patches from Greg KH:
"Here's the big driver core and sysfs patch set for 3.14-rc1.

There's a lot of work here moving sysfs logic out into a "kernfs" to
allow other subsystems to also have a virtual filesystem with the same
attributes of sysfs (handle device disconnect, dynamic creation /
removal as needed / unneeded, etc)

This is primarily being done for the cgroups filesystem, but the goal
is to also move debugfs to it when it is ready, solving all of the
known issues in that filesystem as well. The code isn't completed
yet, but all should be stable now (there is a big section that was
reverted due to problems found when testing)

There's also some other smaller fixes, and a driver core addition that
allows for a "collection" of objects, that the DRM people will be
using soon (it's in this tree to make merges after -rc1 easier)

All of this has been in linux-next with no reported issues"

* tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits)
kernfs: associate a new kernfs_node with its parent on creation
kernfs: add struct dentry declaration in kernfs.h
kernfs: fix get_active failure handling in kernfs_seq_*()
Revert "kernfs: fix get_active failure handling in kernfs_seq_*()"
Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"
Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"
Revert "kernfs: remove KERNFS_REMOVED"
Revert "kernfs: restructure removal path to fix possible premature return"
Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"
Revert "kernfs: remove kernfs_addrm_cxt"
Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed"
Revert "kernfs: implement kernfs_{de|re}activate[_self]()"
Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"
Revert "pci: use device_remove_file_self() instead of device_schedule_callback()"
Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()"
Revert "s390: use device_remove_file_self() instead of device_schedule_callback()"
Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"
kernfs: remove unnecessary NULL check in __kernfs_remove()
drivers/base: provide an infrastructure for componentised subsystems
...

Linus Torvalds
2014-01-21 07:49:44 +0800

18 Jan, 2014

3 commits

d57b9c9a9 GFS2: revert "GFS2: d_splice_alias() can't return error" ... Browse Code »

0d0d110720d7960b77c03c9f2597faaff4b484ae asserts that "d_splice_alias()
can't return error unless it was given an IS_ERR(inode)".

That was true of the implementation of d_splice_alias, but this is
really a problem with d_splice_alias: at a minimum it should be able to
return -ELOOP in the case where inserting the given dentry would cause a
directory loop.

Signed-off-by: J. Bruce Fields
Signed-off-by: Steven Whitehouse

J. Bruce Fields
2014-01-18 17:50:53 +0800
48ba620aa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull namespace fixes from Eric Biederman:
"This is a set of 3 regression fixes.

This fixes /proc/mounts when using "ip netns add " to display
the actual mount point.

This fixes a regression in clone that broke lxc-attach.

This fixes a regression in the permission checks for mounting /proc
that made proc unmountable if binfmt_misc was in use. Oops.

My apologies for sending this pull request so late. Al Viro gave
interesting review comments about the d_path fix that I wanted to
address in detail before I sent this pull request. Unfortunately a
bad round of colds kept from addressing that in detail until today.
The executive summary of the review was:

Al: Is patching d_path really sufficient?
The prepend_path, d_path, d_absolute_path, and __d_path family of
functions is a really mess.

Me: Yes, patching d_path is really sufficient. Yes, the code is mess.
No it is not appropriate to rewrite all of d_path for a regression
that has existed for entirely too long already, when a two line
change will do"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
vfs: Fix a regression in mounting proc
fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
vfs: In d_path don't call d_dname on a mount point

Linus Torvalds
2014-01-18 09:29:36 +0800
db4aad209 kernfs: associate a new kernfs_node with its parent on creation ... Browse Code »

Once created, a kernfs_node is always destroyed by kernfs_put().
Since ba7443bc656e ("sysfs, kernfs: implement
kernfs_create/destroy_root()"), kernfs_put() depends on kernfs_root()
to locate the ino_ida. kernfs_root() in turn depends on
kernfs_node->parent being set for !dir nodes. This means that
kernfs_put() of a !dir node requires its ->parent to be initialized.

This leads to oops when a newly created !dir node is destroyed without
going through kernfs_add_one() or after failing kernfs_add_one()
before ->parent is set. kernfs_root() invoked from kernfs_put() will
try to dereference NULL parent.

Fix it by moving parent association to kernfs_new_node() from
kernfs_add_one(). kernfs_new_node() now takes @parent instead of
@root and determines the root from the parent and also sets the new
node's parent properly. @parent parameter is removed from
kernfs_add_one(). As there's no parent when creating the root node,
__kernfs_new_node() which takes @root as before and doesn't set the
parent is used in that case.

This ensures that a kernfs_node in any stage in its life has its
parent associated and thus can be put.

Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-01-18 03:50:07 +0800

16 Jan, 2014

3 commits

8b127d049 GFS2: Small cleanup ... Browse Code »

This is a small cleanup to function gfs2_rgrp_go_lock so that it
uses rgd instead of its more complicated twin.

Signed-off-by: Bob Peterson
Signed-off-by: Steven Whitehouse

Bob Peterson
2014-01-16 22:22:12 +0800
ac3beb6a5 GFS2: Don't use ENOBUFS when ENOMEM is the correct error code ... Browse Code »

Al Viro has tactfully pointed out that we are using the incorrect
error code in some cases. This patch fixes that, and also removes
the (unused) return value for glock dumping.

> * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
> "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
> we don't support STREAMS it's probably fair game, but... what the hell?

Signed-off-by: Steven Whitehouse
Cc: Al Viro

Steven Whitehouse
2014-01-16 18:31:13 +0800
70b23ce34 Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux ... Browse Code »

Pull writeback fix from Wu Fengguang:
"Fix data corruption on NFS writeback.

It has been in linux-next for one month"

* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Fix data corruption on NFS

Linus Torvalds
2014-01-16 09:23:34 +0800

15 Jan, 2014

6 commits

1e3d36206 GFS2: Fix kbuild test robot reported warning ... Browse Code »

Well I don't get the same warning locally as the kbuild
robot, but I guess this should fix the problem, anyway.
Here is the warning:

head: 2d9e72303d538024627fb1fe2cbde48aec12acc0
commit: ee2411a8db49a21bc55dc124e1b434ba194c8903 [19/20] GFS2: Clean up quota slot allocation
config: make ARCH=powerpc allmodconfig

All error/warnings:

fs/gfs2/quota.c: In function 'gfs2_quota_init':
>> fs/gfs2/quota.c:1246:3: error: implicit declaration of function '__vmalloc' [-Werror=implicit-function-declaration]
sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
^
>> fs/gfs2/quota.c:1246:24: warning: assignment makes pointer from integer without a cast [enabled by default]
sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
^
fs/gfs2/quota.c: In function 'gfs2_quota_cleanup':
>> fs/gfs2/quota.c:1361:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
vfree(sdp->sd_quota_bitmap);

Signed-off-by: Steven Whitehouse

Steven Whitehouse
2014-01-15 20:57:25 +0800
70f2fe3a2 nilfs2: fix segctor bug that causes file system corruption ... Browse Code »

There is a bug in the function nilfs_segctor_collect, which results in
active data being written to a segment, that is marked as clean. It is
possible, that this segment is selected for a later segment
construction, whereby the old data is overwritten.

The problem shows itself with the following kernel log message:

nilfs_sufile_do_cancel_free: segment 6533 must be clean

Usually a few hours later the file system gets corrupted:

NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0
NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660)

The issue can be reproduced with a file system that is nearly full and
with the cleaner running, while some IO intensive task is running.
Although it is quite hard to reproduce.

This is what happens:

1. The cleaner starts the segment construction
2. nilfs_segctor_collect is called
3. sc_stage is on NILFS_ST_SUFILE and segments are freed
4. sc_stage is on NILFS_ST_DAT current segment is full
5. nilfs_segctor_extend_segments is called, which
allocates a new segment
6. The new segment is one of the segments freed in step 3
7. nilfs_sufile_cancel_freev is called and produces an error message
8. Loop around and the collection starts again
9. sc_stage is on NILFS_ST_SUFILE and segments are freed
including the newly allocated segment, which will contain active
data and can be allocated at a later time
10. A few hours later another segment construction allocates the
segment and causes file system corruption

This can be prevented by simply reordering the statements. If
nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments
the freed segments are marked as dirty and cannot be allocated any more.

Signed-off-by: Andreas Rohner
Reviewed-by: Ryusuke Konishi
Tested-by: Andreas Rohner
Signed-off-by: Ryusuke Konishi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andreas Rohner
2014-01-15 15:19:42 +0800
2d9e72303 GFS2: Move quota bitmap operations under their own lock ... Browse Code »

Gradually, the global qd_lock is being used for less and less.
After this patch it will only be used for the per super block
list whose purpose is to allow syncing of changes back to the
master quota file from the local quota changes file. Fixing
up that process to make it more efficient will be the subject
of a later patch, however this patch removes another barrier
to doing that.

Signed-off-by: Steven Whitehouse
Cc: Abhijith Das

Steven Whitehouse
2014-01-15 03:29:06 +0800
ee2411a8d GFS2: Clean up quota slot allocation ... Browse Code »

Quota slot allocation has historically used a vector of pages
and a set of homegrown find/test/set/clear bit functions. Since
the size of the bitmap is likely to be based on the default
qc file size, thats a couple of pages at most. So we ought
to be able to allocate that as a single chunk, with a vmalloc
fallback, just in case of memory fragmentation.

We are then able to use the kernel's own find/test/set/clear
bit functions, rather than rolling our own.

Signed-off-by: Steven Whitehouse
Cc: Abhijith Das

Steven Whitehouse
2014-01-15 03:28:49 +0800
8ad151c2a GFS2: Only run logd and quota when mounted read/write ... Browse Code »

While investigating a rather strange bit of code in the quota
clean up function, I spotted that the reason for its existence
was that when remounting read only, we were not stopping the
quotad thread, and thus it was possible for it to still have
a reference to some of the quotas in that case.

This patch moves the logd and quota thread start and stop into
the make_fs_rw/ro functions, so that we now stop those threads
when mounted read only.

This means that quotad will always be stopped before we call
the quota clean up function, and we can thus dispose of the
(rather hackish) code that waits for it to give up its
reference on the quotas.

Signed-off-by: Steven Whitehouse
Cc: Abhijith Das

Steven Whitehouse
2014-01-15 03:28:25 +0800
c754fbbb1 GFS2: Use RCU/hlist_bl based hash for quotas ... Browse Code »

Prior to this patch, GFS2 kept all the quotas for each
super block in a single linked list. This is rather slow
when there are large numbers of quotas.

This patch introduces a hlist_bl based hash table, similar
to the one used for glocks. The initial look up of the quota
is now lockless in the case where it is already cached,
although we still have to take the per quota spinlock in
order to bump the ref count. Either way though, this is a
big improvement on what was there before.

The qd_lock and the per super block list is preserved, for
the time being. However it is intended that since this is no
longer used for its original role, it should be possible to
shrink the number of items on that list in due course and
remove the requirement to take qd_lock in qd_get.

Signed-off-by: Steven Whitehouse
Cc: Abhijith Das
Cc: Paul E. McKenney

Steven Whitehouse
2014-01-15 03:27:56 +0800