14 Jan, 2012

6 commits

  • Since commit 080d676de095 ("aio: allocate kiocbs in batches") iocbs are
    allocated in a batch during processing of first iocbs. All iocbs in a
    batch are automatically added to ctx->active_reqs list and accounted in
    ctx->reqs_active.

    If one (not the last one) of iocbs submitted by an user fails, further
    iocbs are not processed, but they are still present in ctx->active_reqs
    and accounted in ctx->reqs_active. This causes process to stuck in a D
    state in wait_for_all_aios() on exit since ctx->reqs_active will never
    go down to zero. Furthermore since kiocb_batch_free() frees iocb
    without removing it from active_reqs list the list become corrupted
    which may cause oops.

    Fix this by removing iocb from ctx->active_reqs and updating
    ctx->reqs_active in kiocb_batch_free().

    Signed-off-by: Gleb Natapov
    Reviewed-by: Jeff Moyer
    Cc: stable@kernel.org # 3.2
    Signed-off-by: Linus Torvalds

    Gleb Natapov
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
    Squashfs: fix i_blocks calculation with extended regular files
    Squashfs: fix mount time sanity check for corrupted superblock
    Squashfs: optimise squashfs_cache_get entry search
    Squashfs: Update documentation to include xattrs
    Squashfs: add missing block release on error condition

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
    GFS2: Fix nlink setting on inode creation
    GFS2: fail mount if journal recovery fails
    GFS2: let spectator mount do read only recovery
    GFS2: Fix a use-after-free that coverity spotted
    GFS2: dlm based recovery coordination

    Linus Torvalds
     
  • * 'linux-next' of git://git.infradead.org/ubifs-2.6:
    UBIFS: fix key printing
    UBIFS: use snprintf instead of sprintf when printing keys
    UBIFS: fix debugging messages
    UBIFS: make debugging messages light again
    UBI: fix debugging messages
    UBI: make vid_hdr non-static

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: ensure prealloc_blob is in place when removing xattr
    rbd: initialize snap_rwsem in rbd_add()
    ceph: enable/disable dentry complete flags via mount option
    vfs: export symbol d_find_any_alias()
    ceph: always initialize the dentry in open_root_dentry()
    libceph: remove useless return value for osd_client __send_request()
    ceph: avoid iput() while holding spinlock in ceph_dir_fsync
    ceph: avoid useless dget/dput in encode_fh
    ceph: dereference pointer after checking for NULL
    crush: fix force for non-root TAKE
    ceph: remove unnecessary d_fsdata conditional checks
    ceph: Use kmemdup rather than duplicating its implementation

    Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
    always initialize the dentry in open_root_dentry)

    Linus Torvalds
     
  • I don't know how I missed this obvious mistake when I
    reviewed Als' patches, sorry.

    [ Quoting Al:

    Grr... Note to self: do git status *and* git stash show -p
    before git push. Nothing like "WTF? I'd fixed that braino"
    feeling ;-/

    Al sent the same patch - it got broken in commit d668dc56631d:
    "autofs4: deal with autofs4_write/autofs4_write races". ]

    Reported-and-tested-by: Dave Airlie
    Signed-off-by: Ian Kent
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Ian Kent
     

13 Jan, 2012

16 commits

  • Before commit 56e46742e846e4de167dde0e1e1071ace1c882a5 we have had locking
    around all printing macros and we could use static buffers for creating
    key strings and printing them. However, now we do not have that locking and
    we cannot use static buffers. This commit removes the old DBGKEY() macros
    and introduces few new helper macros for printing debugging messages plus
    a key at the end. Thankfully, all the messages are already structures in
    a way that the key is printed in the end.

    Signed-off-by: Artem Bityutskiy

    Artem Bityutskiy
     
  • Switch to 'snprintf()' which is more secure and reliable. This is also a
    preparation to the subsequent key printing fixes.

    Signed-off-by: Artem Bityutskiy

    Artem Bityutskiy
     
  • Andrew explains:

    - various misc stuff

    - Most of the rest of MM: memcg, threaded hugepages, others.

    - cpumask

    - kexec

    - kdump

    - some direct-io performance tweaking

    - radix-tree optimisations

    - new selftests code

    A note on this: often people will develop a new userspace-visible
    feature and will develop userspace code to exercise/test that
    feature. Then they merge the patch and the selftest code dies.
    Sometimes we paste it into the changelog. Sometimes the code gets
    thrown into Documentation/(!).

    This saddens me. So this patch creates a bare-bones framework which
    will henceforth allow me to ask people to include their test apps in
    the kernel tree so we can keep them alive. Then when people enhance
    or fix the feature, I can ask them to update the test app too.

    The infrastruture is terribly trivial at present - let's see how it
    evolves.

    - checkpoint/restart feature work.

    A note on this: this is a project by various mad Russians to perform
    c/r mainly from userspace, with various oddball helper code added
    into the kernel where the need is demonstrated.

    So rather than some large central lump of code, what we have is
    little bits and pieces popping up in various places which either
    expose something new or which permit something which is normally
    kernel-private to be modified.

    The overall project is an ongoing thing. I've judged that the size
    and scope of the thing means that we're more likely to be successful
    with it if we integrate the support into mainline piecemeal rather
    than allowing it all to develop out-of-tree.

    However I'm less confident than the developers that it will all
    eventually work! So what I'm asking them to do is to wrap each piece
    of new code inside CONFIG_CHECKPOINT_RESTORE. So if it all
    eventually comes to tears and the project as a whole fails, it should
    be a simple matter to go through and delete all trace of it.

    This lot pretty much wraps up the -rc1 merge for me.

    * akpm: (96 commits)
    unlzo: fix input buffer free
    ramoops: update parameters only after successful init
    ramoops: fix use of rounddown_pow_of_two()
    c/r: prctl: add PR_SET_MM codes to set up mm_struct entries
    c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4
    c/r: introduce CHECKPOINT_RESTORE symbol
    selftests: new x86 breakpoints selftest
    selftests: new very basic kernel selftests directory
    radix_tree: take radix_tree_path off stack
    radix_tree: remove radix_tree_indirect_to_ptr()
    dio: optimize cache misses in the submission path
    vfs: cache request_queue in struct block_device
    fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
    drivers/parport/parport_pc.c: fix warnings
    panic: don't print redundant backtraces on oops
    sysctl: add the kernel.ns_last_pid control
    kdump: add udev events for memory online/offline
    include/linux/crash_dump.h needs elf.h
    kdump: fix crash_kexec()/smp_send_stop() race in panic()
    kdump: crashk_res init check for /sys/kernel/kexec_crash_size
    ...

    Linus Torvalds
     
  • The mm->start_code/end_code, mm->start_data/end_data, mm->start_brk are
    involved into calculation of program text/data segment sizes (which might
    be seen in /proc//statm) and into brk() call final address.

    For restore we need to know all these values. While
    mm->start_code/end_code already present in /proc/$pid/stat, the rest
    members are not, so this patch brings them in.

    The restore procedure of these members is addressed in another patch using
    prctl().

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Serge Hallyn
    Reviewed-by: Kees Cook
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Alexey Dobriyan
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Some investigation of a transaction processing workload showed that a
    major consumer of cycles in __blockdev_direct_IO is the cache miss while
    accessing the block size. This is because it has to walk the chain from
    block_dev to gendisk to queue.

    The block size is needed early on to check alignment and sizes. It's only
    done if the check for the inode block size fails. But the costly block
    device state is unconditionally fetched.

    - Reorganize the code to only fetch block dev state when actually
    needed.

    Then do a prefetch on the block dev early on in the direct IO path. This
    is worth it, because there is substantial code run before we actually
    touch the block dev now.

    - I also added some unlikelies to make it clear the compiler that block
    device fetch code is not normally executed.

    This gave a small, but measurable improvement on a large database
    benchmark (about 0.3%)

    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: using prefetch requires including prefetch.h]
    Signed-off-by: Andi Kleen
    Cc: Jeff Moyer
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • This makes it possible to get from the inode to the request_queue with one
    less cache miss. Used in followon optimization.

    The livetime of the pointer is the same as the gendisk.

    This assumes that the queue will always stay the same in the gendisk while
    it's visible to block_devices. I think that's safe correct?

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • In get_more_blocks(), we use dio_count to calcuate fs_count and do some
    tricky things to increase fs_count if dio_count isn't aligned. But
    actually it still has some corner cases that can't be coverd. See the
    following example:

    dio_write foo -s 1024 -w 4096

    (direct write 4096 bytes at offset 1024). The same goes if the offset
    isn't aligned to fs_blocksize.

    In this case, the old calculation counts fs_count to be 1, but actually we
    will write into 2 different blocks (if fs_blocksize=4096). The old code
    just works, since it will call get_block twice (and may have to allocate
    and create extents twice for filesystems like ext4). So we'd better call
    get_block just once with the proper fs_count.

    Signed-off-by: Tao Ma
    Cc: "Theodore Ts'o"
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tao Ma
     
  • This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
    mode that avoids writing back pages to backing storage. Async compaction
    maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
    For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
    used.

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Asynchronous compaction is used when allocating transparent hugepages to
    avoid blocking for long periods of time. Due to reports of stalling,
    there was a debate on disabling synchronous compaction but this severely
    impacted allocation success rates. Part of the reason was that many dirty
    pages are skipped in asynchronous compaction by the following check;

    if (PageDirty(page) && !sync &&
    mapping->a_ops->migratepage != migrate_page)
    rc = -EBUSY;

    This skips over all mapping aops using buffer_migrate_page() even though
    it is possible to migrate some of these pages without blocking. This
    patch updates the ->migratepage callback with a "sync" parameter. It is
    the responsibility of the callback to fail gracefully if migration would
    block.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The current epoll code can be tickled to run basically indefinitely in
    both loop detection path check (on ep_insert()), and in the wakeup paths.
    The programs that tickle this behavior set up deeply linked networks of
    epoll file descriptors that cause the epoll algorithms to traverse them
    indefinitely. A couple of these sample programs have been previously
    posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop detection path check algorithms, I simply keep track of
    the epoll nodes that have been already visited. Thus, the loop detection
    becomes proportional to the number of epoll file descriptor and links.
    This dramatically decreases the run-time of the loop check algorithm. In
    one diabolical case I tried it reduced the run-time from 15 mintues (all
    in kernel time) to .3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar manner
    by keeping track of nodes that have already been visited, but the
    complexity is harder, since there can be multiple wakeups on different
    cpus...Thus, I've opted to limit the number of possible wakeup paths when
    the paths are created.

    This is accomplished, by noting that the end file descriptor points that
    are found during the loop detection pass (from the newly added link), are
    actually the sources for wakeup events. I keep a list of these file
    descriptors and limit the number and length of these paths that emanate
    from these 'source file descriptors'. In the current implemetation I
    allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
    length 4 and 10 of length 5. Note that it is sufficient to check the
    'source file descriptors' reachable from the newly added link, since no
    other 'source file descriptors' will have newly added links. This allows
    us to check only the wakeup paths that may have gotten too long, and not
    re-check all possible wakeup paths on the system.

    In terms of the path limit selection, I think its first worth noting that
    the most common case for epoll, is probably the model where you have 1
    epoll file descriptor that is monitoring n number of 'source file
    descriptors'. In this case, each 'source file descriptor' has a 1 path of
    length 1. Thus, I believe that the limits I'm proposing are quite
    reasonable and in fact may be too generous. Thus, I'm hoping that the
    proposed limits will not prevent any workloads that currently work to
    fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently its only used in a subset
    of the add paths. I need to hold the epmutex, so that we can correctly
    traverse a coherent graph, to check the number of paths. I believe that
    this additional locking is probably ok, since its in the setup/teardown
    paths, and doesn't affect the running paths, but it certainly is going to
    add some extra overhead. Also, worth noting is that the epmuex was
    recently added to the ep_ctl add operations in the initial path loop
    detection code using the argument that it was not on a critical path.

    Another thing to note here, is the length of epoll chains that is allowed.
    Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced during the loop
    check detection code, and only when the epoll file descriptors are added
    in a certain order. Thus, this limit is currently easily bypassed. The
    newly added check for wakeup paths, stricly limits the wakeup paths to a
    length of 5, regardless of the order in which ep's are linked together.
    Thus, a side-effect of the new code is a more consistent enforcement of
    the graph depth.

    Thus far, I've tested this, using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    testing using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expectded.

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
    with size bigger than kmalloc() can alloc it spits out an ugly warning:

    ------------[ cut here ]------------
    WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
    Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
    Call Trace:
    warn_slowpath_common+0x75/0xb0
    warn_slowpath_null+0x15/0x20
    __alloc_pages_nodemask+0x5d3/0x7a0
    __get_free_pages+0x12/0x50
    __kmalloc+0x12b/0x150
    pipe_set_size+0x75/0x120
    pipe_fcntl+0xf8/0x140
    do_fcntl+0x2d4/0x410
    sys_fcntl+0x66/0xa0
    system_call_fastpath+0x16/0x1b
    ---[ end trace 432f702e6db7b5ee ]---

    Instead, make kcalloc() handle the overflow case and fail quietly.

    [akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
    Signed-off-by: Sasha Levin
    Cc: Alexander Viro
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • get_proc_task() can fail to search the task and return NULL,
    put_task_struct() will then bomb the kernel with following oops:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    IP: [] proc_pid_permission+0x64/0xe0
    PGD 112075067 PUD 112814067 PMD 0
    Oops: 0002 [#1] PREEMPT SMP

    This is a regression introduced by commit 0499680a ("procfs: add hidepid=
    and gid= mount options"). The kernel should return -ESRCH if
    get_proc_task() failed.

    Signed-off-by: Xiaotian Feng
    Cc: Al Viro
    Cc: Vasiliy Kulikov
    Cc: Stephen Wilson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    FUSE: Notifying the kernel of deletion.
    fuse: support ioctl on directories
    fuse: Use kcalloc instead of kzalloc to allocate array
    fuse: llseek optimize SEEK_CUR and SEEK_SET

    Linus Torvalds
     
  • In __ceph_build_xattrs_blob(), if a ceph inode's extended attributes
    are marked dirty, all attributes recorded in its rb_tree index are
    formatted into a "blob" buffer. The target buffer is recorded in
    ceph_inode->i_xattrs.prealloc_blob, and it is expected to exist and
    be of sufficient size to hold the attributes.

    The extended attributes are marked dirty in two cases: when a new
    attribute is added to the inode; or when one is removed. In the
    former case work is done to ensure the prealloc_blob buffer is
    properly set up, but in the latter it is not.

    Change the logic in ceph_removexattr() so it matches what is
    done in ceph_setxattr(). Note that this is done in a way that
    keeps the two blocks of code nearly identical, in anticipation
    of a subsequent patch that encapsulates some of this logic into
    one or more helper routines.

    Signed-off-by: Alex Elder
    Signed-off-by: Sage Weil

    Alex Elder
     
  • Enable/disable use of the dentry dir 'complete' flag via a mount option.
    This lets the admin control whether ceph uses the dcache to satisfy
    negative lookups or readdir when it has the entire directory contents in
    its cache.

    This is purely a performance optimization; correctness is guaranteed
    whether it is enabled or not.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sage Weil

    Sage Weil
     
  • Ceph needs this.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sage Weil

    Sage Weil
     

12 Jan, 2012

3 commits

  • When open_root_dentry() gets a dentry via d_obtain_alias() it does
    not get initialized. If the dentry obtained came from the cache,
    this is OK. But if not, the result is an improperly initialized
    dentry.

    To fix this, call ceph_init_dentry() regardless of which path
    produced the dentry. That function returns immediately for a dentry
    that is already initialized, it is safe to use either way.

    (Credit to Sage, who suggested this fix.)

    Signed-off-by: Alex Elder

    Alex Elder
     
  • Patch 56e46742e846e4de167dde0e1e1071ace1c882a5 broke UBIFS debugging messages:
    before that commit when UBIFS debugging was enabled, users saw few useful
    debugging messages after mount. However, that patch turned 'dbg_msg()' into
    'pr_debug()', so to enable the debugging messages users have to enable them
    first via /sys/kernel/debug/dynamic_debug/control, which is very impractical.

    This commit makes 'dbg_msg()' to use 'printk()' instead of 'pr_debug()', just
    as it was before the breakage.

    Signed-off-by: Artem Bityutskiy
    Cc: stable@kernel.org [3.0+]

    Artem Bityutskiy
     
  • We switch to dynamic debugging in commit
    56e46742e846e4de167dde0e1e1071ace1c882a5 but did not take into account that
    now we do not control anymore whether a specific message is enabled or not.
    So now we lock the "dbg_lock" and release it in every debugging macro, which
    make them not so light-weight.

    This commit removes the "dbg_lock" protection from the debugging macros to
    fix the issue.

    The downside is that now our DBGKEY() stuff is broken, but this is not
    critical at all and will be fixed later.

    Signed-off-by: Artem Bityutskiy
    Cc: stable@kernel.org [3.0+]

    Artem Bityutskiy
     

11 Jan, 2012

15 commits

  • Since the nlink count will be 0, we need to use set_nlink rather
    than inc_nlink in order to avoid triggering the inc_nlink warning
    which was added recently.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • If the first mounter fails to recover one of the journals
    during mount, the mount should fail.

    Signed-off-by: David Teigland
    Signed-off-by: Steven Whitehouse

    David Teigland
     
  • Previously, a spectator mount would not even attempt to do
    journal recovery for a failed node. This meant that if all
    mounted nodes were spectators, everyone would be stuck after
    a node failed, all waiting for recovery to be performed.
    This is unnecessary since the failed node had a clean journal.

    Instead, allow a spectator mount to do a partial "read only"
    recovery, which means it will check if the failed journal is
    clean, and if so, report a successful recovery. If the failed
    journal is not clean, it reports that journal recovery failed.
    This makes it work the same as a read only mount on a read only
    block device.

    Signed-off-by: David Teigland
    Signed-off-by: Steven Whitehouse

    David Teigland
     
  • In function gfs2_inplace_release it was trying to unlock a gfs2_holder
    structure associated with a reservation, after said reservation was
    freed. The problem is that the statements have the wrong order.
    This patch corrects the order so that the reservation is freed after
    the gfs2_holder is unlocked.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • This new method of managing recovery is an alternative to
    the previous approach of using the userland gfs_controld.

    - use dlm slot numbers to assign journal id's
    - use dlm recovery callbacks to initiate journal recovery
    - use a dlm lock to determine the first node to mount fs
    - use a dlm lock to track journals that need recovery

    Signed-off-by: David Teigland
    Signed-off-by: Steven Whitehouse

    David Teigland
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    autofs4: deal with autofs4_write/autofs4_write races
    autofs4: catatonic_mode vs. notify_daemon race
    autofs4: autofs4_wait() vs. autofs4_catatonic_mode() race
    hfsplus: creation of hidden dir on mount can fail
    block_dev: Suppress bdev_cache_init() kmemleak warninig
    fix shrink_dcache_parent() livelock
    coda: switch coda_cnode_make() to sane API as well, clean coda_lookup()
    coda: deal correctly with allocation failure from coda_cnode_makectl()
    securityfs: fix object creation races

    Linus Torvalds
     
  • Just serialize the actual writing of packets into pipe on
    a new mutex, independent from everything else in the locking
    hierarchy. As soon as something has started feeding a piece
    of packet into the pipe to daemon, we *want* everything else
    about to try the same to wait until we are done.

    Acked-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     
  • we need to hold ->wq_mutex while we are forming the packet to send,
    lest we have autofs4_catatonic_mode() setting wq->name.name to NULL
    just as autofs4_notify_daemon() decides to memcpy() from it...

    We do have check for catatonic mode immediately after that (under
    ->wq_mutex, as it ought to be) and packet won't be actually sent,
    but it'll be too late for us if we oops on that memcpy() from NULL...

    Fix is obvious - just extend the area covered by ->wq_mutex over
    that switch and check whether it's catatonic *before* doing anything
    else.

    Acked-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     
  • We need to recheck ->catatonic after autofs4_wait() got ->wq_mutex
    for good, or we might end up with wq inserted into queue after
    autofs4_catatonic_mode() had done its thing. It will stick there
    forever, since there won't be anything to clear its ->name.name.

    A bit of a complication: validate_request() drops and regains ->wq_mutex.
    It actually ends up the most convenient place to stick the check into...

    Acked-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     
  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
  • Andrew elucidates:
    - First installmeant of MM. We have a HUGE number of MM patches this
    time. It's crazy.
    - MAINTAINERS updates
    - backlight updates
    - leds
    - checkpatch updates
    - misc ELF stuff
    - rtc updates
    - reiserfs
    - procfs
    - some misc other bits

    * akpm: (124 commits)
    user namespace: make signal.c respect user namespaces
    workqueue: make alloc_workqueue() take printf fmt and args for name
    procfs: add hidepid= and gid= mount options
    procfs: parse mount options
    procfs: introduce the /proc//map_files/ directory
    procfs: make proc_get_link to use dentry instead of inode
    signal: add block_sigmask() for adding sigmask to current->blocked
    sparc: make SA_NOMASK a synonym of SA_NODEFER
    reiserfs: don't lock root inode searching
    reiserfs: don't lock journal_init()
    reiserfs: delay reiserfs lock until journal initialization
    reiserfs: delete comments referring to the BKL
    drivers/rtc/interface.c: fix alarm rollover when day or month is out-of-range
    drivers/rtc/rtc-twl.c: add DT support for RTC inside twl4030/twl6030
    drivers/rtc/: remove redundant spi driver bus initialization
    drivers/rtc/rtc-jz4740.c: make jz4740_rtc_driver static
    drivers/rtc/rtc-mc13xxx.c: make mc13xxx_rtc_idtable static
    rtc: convert drivers/rtc/* to use module_platform_driver()
    drivers/rtc/rtc-wm831x.c: convert to devm_kzalloc()
    drivers/rtc/rtc-wm831x.c: remove unused period IRQ handler
    ...

    Linus Torvalds
     
  • Add support for mount options to restrict access to /proc/PID/
    directories. The default backward-compatible "relaxed" behaviour is left
    untouched.

    The first mount option is called "hidepid" and its value defines how much
    info about processes we want to be available for non-owners:

    hidepid=0 (default) means the old behavior - anybody may read all
    world-readable /proc/PID/* files.

    hidepid=1 means users may not access any /proc// directories, but
    their own. Sensitive files like cmdline, sched*, status are now protected
    against other users. As permission checking done in proc_pid_permission()
    and files' permissions are left untouched, programs expecting specific
    files' modes are not confused.

    hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
    users. It doesn't mean that it hides whether a process exists (it can be
    learned by other means, e.g. by kill -0 $PID), but it hides process' euid
    and egid. It compicates intruder's task of gathering info about running
    processes, whether some daemon runs with elevated privileges, whether
    another user runs some sensitive program, whether other users run any
    program at all, etc.

    gid=XXX defines a group that will be able to gather all processes' info
    (as in hidepid=0 mode). This group should be used instead of putting
    nonroot user in sudoers file or something. However, untrusted users (like
    daemons, etc.) which are not supposed to monitor the tasks in the whole
    system should not be added to the group.

    hidepid=1 or higher is designed to restrict access to procfs files, which
    might reveal some sensitive private information like precise keystrokes
    timings:

    http://www.openwall.com/lists/oss-security/2011/11/05/3

    hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
    conky gracefully handle EPERM/ENOENT and behave as if the current user is
    the only user running processes. pstree shows the process subtree which
    contains "pstree" process.

    Note: the patch doesn't deal with setuid/setgid issues of keeping
    preopened descriptors of procfs files (like
    https://lkml.org/lkml/2011/2/7/368). We rely on that the leaked
    information like the scheduling counters of setuid apps doesn't threaten
    anybody's privacy - only the user started the setuid program may read the
    counters.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Add support for procfs mount options. Actual mount options are coming in
    the next patches.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • This one behaves similarly to the /proc//fd/ one - it contains
    symlinks one for each mapping with file, the name of a symlink is
    "vma->vm_start-vma->vm_end", the target is the file. Opening a symlink
    results in a file that point exactly to the same inode as them vma's one.

    For example the ls -l of some arbitrary /proc//map_files/

    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

    This *helps* checkpointing process in three ways:

    1. When dumping a task mappings we do know exact file that is mapped
    by particular region. We do this by opening
    /proc/$pid/map_files/$address symlink the way we do with file
    descriptors.

    2. This also helps in determining which anonymous shared mappings are
    shared with each other by comparing the inodes of them.

    3. When restoring a set of processes in case two of them has a mapping
    shared, we map the memory by the 1st one and then open its
    /proc/$pid/map_files/$address file and map it by the 2nd task.

    Using /proc/$pid/maps for this is quite inconvenient since it brings
    repeatable re-reading and reparsing for this text file which slows down
    restore procedure significantly. Also as being pointed in (3) it is a way
    easier to use top level shared mapping in children as
    /proc/$pid/map_files/$address when needed.

    [akpm@linux-foundation.org: coding-style fixes]
    [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Vasiliy Kulikov
    Reviewed-by: "Kirill A. Shutemov"
    Cc: Tejun Heo
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Prepare the ground for the next "map_files" patch which needs a name of a
    link file to analyse.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Cc: Vasiliy Kulikov
    Cc: "Kirill A. Shutemov"
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov