13 Jan, 2012

5 commits

  • This patch adds a lightweight sync migrate mode, MIGRATE_SYNC_LIGHT,
    that avoids writing back pages to backing storage. Async compaction
    maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
    For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
    used.

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.
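
    For reference, the resulting three-way split looks roughly like the enum
    below (a sketch of what include/linux/migrate_mode.h carries; the
    comments are illustrative rather than the actual kernel comments):

    enum migrate_mode {
            MIGRATE_ASYNC,          /* async compaction: never block */
            MIGRATE_SYNC_LIGHT,     /* may block, but not on ->writepage */
            MIGRATE_SYNC,           /* may block and write back (hotplug) */
    };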

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Asynchronous compaction is used when allocating transparent hugepages to
    avoid blocking for long periods of time. Due to reports of stalling,
    there was a debate on disabling synchronous compaction but this severely
    impacted allocation success rates. Part of the reason was that many dirty
    pages are skipped in asynchronous compaction by the following check:

    if (PageDirty(page) && !sync &&
        mapping->a_ops->migratepage != migrate_page)
            rc = -EBUSY;

    This skips over all mapping aops using buffer_migrate_page() even though
    it is possible to migrate some of these pages without blocking. This
    patch updates the ->migratepage callback with a "sync" parameter. It is
    the responsibility of the callback to fail gracefully if migration would
    block.
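
    A minimal sketch of what a mode-aware callback might look like under
    that contract; example_migratepage() and would_block_to_migrate() are
    illustrative names, not part of the actual patch:

    static int example_migratepage(struct address_space *mapping,
                                   struct page *newpage, struct page *page,
                                   bool sync)
    {
            /* If completing the migration would mean sleeping on buffer
             * locks or writeback and the caller is async, fail gracefully
             * so migrate_pages() simply moves on to the next page. */
            if (!sync && would_block_to_migrate(page))
                    return -EBUSY;

            /* ... otherwise copy page state and data as usual ... */
            return 0;
    }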

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The current epoll code can be tickled to run basically indefinitely in
    both the loop detection path check (on ep_insert()) and in the wakeup
    paths.
    The programs that tickle this behavior set up deeply linked networks of
    epoll file descriptors that cause the epoll algorithms to traverse them
    indefinitely. A couple of these sample programs have been previously
    posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop detection path check algorithms, I simply keep track of
    the epoll nodes that have already been visited. Thus, the loop detection
    becomes proportional to the number of epoll file descriptors and links.
    This dramatically decreases the run-time of the loop check algorithm. In
    one diabolical case I tried, it reduced the run-time from 15 minutes (all
    in kernel time) to 0.3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar manner
    by keeping track of nodes that have already been visited, but the
    complexity is greater, since there can be multiple wakeups on different
    CPUs. Thus, I've opted to limit the number of possible wakeup paths when
    the paths are created.

    This is accomplished by noting that the end-point file descriptors found
    during the loop detection pass (from the newly added link) are actually
    the sources for wakeup events. I keep a list of these file descriptors
    and limit the number and length of the paths that emanate from these
    'source file descriptors'. In the current implementation I allow 1000
    paths of length 1, 500 of length 2, 100 of length 3, 50 of length 4 and
    10 of length 5. Note that it is sufficient to check the 'source file
    descriptors' reachable from the newly added link, since no other 'source
    file descriptors' will have newly added links. This allows us to check
    only the wakeup paths that may have gotten too long, and not re-check
    all possible wakeup paths on the system.
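
    A sketch of how those limits can be encoded (fs/eventpoll.c uses
    something along these lines; the names are approximate):

    #define PATH_ARR_SIZE 5
    /* allowed paths of length 1..5 emanating from one 'source fd' */
    static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
    static int path_count[PATH_ARR_SIZE];

    static int path_count_inc(int nests)
    {
            if (++path_count[nests] > path_limits[nests])
                    return -1;      /* too many wakeup paths: reject link */
            return 0;
    }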

    In terms of the path limit selection, it is first worth noting that the
    most common case for epoll is probably the model where you have 1 epoll
    file descriptor that is monitoring n 'source file descriptors'. In this
    case, each 'source file descriptor' has a single path of length 1. Thus,
    I believe that the limits I'm proposing are quite reasonable and in fact
    may be too generous. I'm hoping that the proposed limits will not cause
    any workloads that currently work to fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently it's only used in a subset
    of the add paths. I need to hold the epmutex so that we can correctly
    traverse a coherent graph to check the number of paths. I believe that
    this additional locking is probably ok, since it's in the setup/teardown
    paths and doesn't affect the running paths, but it certainly is going to
    add some extra overhead. Also worth noting is that the epmutex was
    recently added to the epoll_ctl add operations in the initial path loop
    detection code, using the argument that it was not on a critical path.

    Another thing to note here is the allowed length of epoll chains.
    Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced during the loop
    check detection code, and only when the epoll file descriptors are added
    in a certain order. Thus, this limit is currently easily bypassed. The
    newly added check for wakeup paths strictly limits the wakeup paths to a
    length of 5, regardless of the order in which ep's are linked together.
    Thus, a side-effect of the new code is a more consistent enforcement of
    the graph depth.

    Thus far, I've tested this using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    tested using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expected.

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
    to a size bigger than kmalloc() can allocate, it spits out an ugly
    warning:

    ------------[ cut here ]------------
    WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
    Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
    Call Trace:
    warn_slowpath_common+0x75/0xb0
    warn_slowpath_null+0x15/0x20
    __alloc_pages_nodemask+0x5d3/0x7a0
    __get_free_pages+0x12/0x50
    __kmalloc+0x12b/0x150
    pipe_set_size+0x75/0x120
    pipe_fcntl+0xf8/0x140
    do_fcntl+0x2d4/0x410
    sys_fcntl+0x66/0xa0
    system_call_fastpath+0x16/0x1b
    ---[ end trace 432f702e6db7b5ee ]---

    Instead, make kcalloc() handle the overflow case and fail quietly.
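
    The allocation change in pipe_set_size() is small; a sketch from memory
    of the patch (treat the exact flags as an assumption):

    struct pipe_buffer *bufs;

    /* kcalloc() catches a nr_pages * sizeof(*bufs) overflow, and
     * __GFP_NOWARN keeps the page allocator quiet when the request is
     * simply too large to satisfy */
    bufs = kcalloc(nr_pages, sizeof(*bufs), GFP_KERNEL | __GFP_NOWARN);
    if (!bufs)
            return -ENOMEM;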

    [akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
    Signed-off-by: Sasha Levin
    Cc: Alexander Viro
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • get_proc_task() can fail to find the task and return NULL;
    put_task_struct() will then bomb the kernel with the following oops:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    IP: [] proc_pid_permission+0x64/0xe0
    PGD 112075067 PUD 112814067 PMD 0
    Oops: 0002 [#1] PREEMPT SMP

    This is a regression introduced by commit 0499680a ("procfs: add hidepid=
    and gid= mount options"). The kernel should return -ESRCH if
    get_proc_task() failed.
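
    The shape of the fix in proc_pid_permission() is a one-line early
    return; a sketch (helper names as in fs/proc/base.c at the time, from
    memory):

    task = get_proc_task(inode);
    if (!task)
            return -ESRCH;          /* task already gone */
    has_perms = has_pid_permissions(pid, task, hide_pid_min);
    put_task_struct(task);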

    Signed-off-by: Xiaotian Feng
    Cc: Al Viro
    Cc: Vasiliy Kulikov
    Cc: Stephen Wilson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     

11 Jan, 2012

31 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    autofs4: deal with autofs4_write/autofs4_write races
    autofs4: catatonic_mode vs. notify_daemon race
    autofs4: autofs4_wait() vs. autofs4_catatonic_mode() race
    hfsplus: creation of hidden dir on mount can fail
    block_dev: Suppress bdev_cache_init() kmemleak warning
    fix shrink_dcache_parent() livelock
    coda: switch coda_cnode_make() to sane API as well, clean coda_lookup()
    coda: deal correctly with allocation failure from coda_cnode_makectl()
    securityfs: fix object creation races

    Linus Torvalds
     
  • Just serialize the actual writing of packets into the pipe on
    a new mutex, independent of everything else in the locking
    hierarchy. As soon as something has started feeding a piece
    of a packet into the pipe to the daemon, we *want* everything else
    about to try the same to wait until we are done.
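
    A sketch of the idea in autofs4_write(): a dedicated pipe_mutex taken
    only around the packet write loop (details abbreviated):

    mutex_lock(&sbi->pipe_mutex);
    while (bytes &&
           (wr = file->f_op->write(file, data, bytes, &file->f_pos)) > 0) {
            data += wr;
            bytes -= wr;
    }
    mutex_unlock(&sbi->pipe_mutex);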

    Acked-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     
  • we need to hold ->wq_mutex while we are forming the packet to send,
    lest we have autofs4_catatonic_mode() setting wq->name.name to NULL
    just as autofs4_notify_daemon() decides to memcpy() from it...

    We do have a check for catatonic mode immediately after that (under
    ->wq_mutex, as it ought to be) and the packet won't actually be sent,
    but it'll be too late for us if we oops on that memcpy() from NULL...

    Fix is obvious - just extend the area covered by ->wq_mutex over
    that switch and check whether it's catatonic *before* doing anything
    else.

    Acked-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     
  • We need to recheck ->catatonic after autofs4_wait() got ->wq_mutex
    for good, or we might end up with wq inserted into queue after
    autofs4_catatonic_mode() had done its thing. It will stick there
    forever, since there won't be anything to clear its ->name.name.

    A bit of a complication: validate_request() drops and regains ->wq_mutex.
    It actually ends up the most convenient place to stick the check into...

    Acked-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     
  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
  • Andrew elucidates:
    - First installment of MM. We have a HUGE number of MM patches this
    time. It's crazy.
    - MAINTAINERS updates
    - backlight updates
    - leds
    - checkpatch updates
    - misc ELF stuff
    - rtc updates
    - reiserfs
    - procfs
    - some misc other bits

    * akpm: (124 commits)
    user namespace: make signal.c respect user namespaces
    workqueue: make alloc_workqueue() take printf fmt and args for name
    procfs: add hidepid= and gid= mount options
    procfs: parse mount options
    procfs: introduce the /proc/<pid>/map_files/ directory
    procfs: make proc_get_link to use dentry instead of inode
    signal: add block_sigmask() for adding sigmask to current->blocked
    sparc: make SA_NOMASK a synonym of SA_NODEFER
    reiserfs: don't lock root inode searching
    reiserfs: don't lock journal_init()
    reiserfs: delay reiserfs lock until journal initialization
    reiserfs: delete comments referring to the BKL
    drivers/rtc/interface.c: fix alarm rollover when day or month is out-of-range
    drivers/rtc/rtc-twl.c: add DT support for RTC inside twl4030/twl6030
    drivers/rtc/: remove redundant spi driver bus initialization
    drivers/rtc/rtc-jz4740.c: make jz4740_rtc_driver static
    drivers/rtc/rtc-mc13xxx.c: make mc13xxx_rtc_idtable static
    rtc: convert drivers/rtc/* to use module_platform_driver()
    drivers/rtc/rtc-wm831x.c: convert to devm_kzalloc()
    drivers/rtc/rtc-wm831x.c: remove unused period IRQ handler
    ...

    Linus Torvalds
     
  • Add support for mount options to restrict access to /proc/PID/
    directories. The default backward-compatible "relaxed" behaviour is left
    untouched.

    The first mount option is called "hidepid" and its value defines how much
    info about processes we want to be available for non-owners:

    hidepid=0 (default) means the old behavior - anybody may read all
    world-readable /proc/PID/* files.

    hidepid=1 means users may not access any /proc/<pid>/ directories but
    their own. Sensitive files like cmdline, sched*, status are now protected
    against other users. As permission checking is done in
    proc_pid_permission() and files' permissions are left untouched, programs
    expecting specific files' modes are not confused.

    hidepid=2 means hidepid=1 plus all /proc/<pid>/ directories will be
    invisible to other users. It doesn't mean that it hides whether a process
    exists (that can be learned by other means, e.g. by kill -0 $PID), but it
    hides a process' euid and egid. It complicates an intruder's task of
    gathering info about running processes: whether some daemon runs with
    elevated privileges, whether another user runs some sensitive program,
    whether other users run any program at all, etc.

    gid=XXX defines a group that will be able to gather all processes' info
    (as in hidepid=0 mode). This group should be used instead of putting a
    nonroot user in the sudoers file or similar. However, untrusted users
    (like daemons, etc.) which are not supposed to monitor the tasks in the
    whole system should not be added to the group.
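
    As a usage example, the options can be applied to an already-mounted
    /proc via remount; a minimal sketch (the gid value 1001 is arbitrary):

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            /* equivalent to: mount -o remount,hidepid=2,gid=1001 /proc */
            if (mount("proc", "/proc", "proc", MS_REMOUNT,
                      "hidepid=2,gid=1001") != 0)
                    perror("remount /proc");
            return 0;
    }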

    hidepid=1 or higher is designed to restrict access to procfs files which
    might reveal some sensitive private information like precise keystroke
    timings:

    http://www.openwall.com/lists/oss-security/2011/11/05/3

    hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
    conky gracefully handle EPERM/ENOENT and behave as if the current user is
    the only user running processes. pstree shows the process subtree which
    contains "pstree" process.

    Note: the patch doesn't deal with setuid/setgid issues of keeping
    preopened descriptors of procfs files (like
    https://lkml.org/lkml/2011/2/7/368). We rely on the fact that the leaked
    information, like the scheduling counters of setuid apps, doesn't
    threaten anybody's privacy - only the user who started the setuid
    program may read the counters.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Add support for procfs mount options. Actual mount options are coming in
    the next patches.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • This one behaves similarly to the /proc/<pid>/fd/ one - it contains one
    symlink for each file-backed mapping; the name of a symlink is
    "vma->vm_start-vma->vm_end" and the target is the file. Opening a symlink
    results in a file that points to exactly the same inode as the vma's one.

    For example, the ls -l of some arbitrary /proc/<pid>/map_files/:

    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

    This *helps* checkpointing process in three ways:

    1. When dumping a task's mappings we know the exact file that is mapped
    by a particular region. We do this by opening the
    /proc/$pid/map_files/$address symlink the way we do with file
    descriptors.

    2. This also helps in determining which anonymous shared mappings are
    shared with each other, by comparing their inodes.

    3. When restoring a set of processes, in case two of them share a
    mapping, we map the memory by the 1st one and then open its
    /proc/$pid/map_files/$address file and map it by the 2nd task.

    Using /proc/$pid/maps for this is quite inconvenient since it brings
    repeated re-reading and re-parsing of this text file, which slows down
    the restore procedure significantly. Also, as pointed out in (3), it is
    way easier to use the top-level shared mapping in children as
    /proc/$pid/map_files/$address when needed.
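
    As a usage sketch, a checkpoint tool can resolve a mapping with a plain
    readlink(); the address range in the path below is purely illustrative:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char target[4096];
            ssize_t n = readlink("/proc/self/map_files/400000-401000",
                                 target, sizeof(target) - 1);
            if (n > 0) {
                    target[n] = '\0';
                    printf("mapped file: %s\n", target);
            }
            return 0;
    }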

    [akpm@linux-foundation.org: coding-style fixes]
    [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Vasiliy Kulikov
    Reviewed-by: "Kirill A. Shutemov"
    Cc: Tejun Heo
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Prepare the ground for the next "map_files" patch, which needs the name
    of a link file to analyse.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Cc: Vasiliy Kulikov
    Cc: "Kirill A. Shutemov"
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Nothing requires that we lock the filesystem until the root inode is
    provided.

    Also iget5_locked() triggers a warning because we are holding the
    filesystem lock while allocating the inode, which results in a lockdep
    suspicion that we have a lock inversion against the reclaim path:

    [ 1986.896979] =================================
    [ 1986.896990] [ INFO: inconsistent lock state ]
    [ 1986.896997] 3.1.1-main #8
    [ 1986.897001] ---------------------------------
    [ 1986.897007] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    [ 1986.897016] kswapd0/16 [HC0[0]:SC0[0]:HE1:SE1] takes:
    [ 1986.897023] (&REISERFS_SB(s)->lock){+.+.?.}, at: [] reiserfs_write_lock+0x20/0x2a
    [ 1986.897044] {RECLAIM_FS-ON-W} state was registered at:
    [ 1986.897050] [] mark_held_locks+0xae/0xd0
    [ 1986.897060] [] lockdep_trace_alloc+0x7d/0x91
    [ 1986.897068] [] kmem_cache_alloc+0x1a/0x93
    [ 1986.897078] [] reiserfs_alloc_inode+0x13/0x3d
    [ 1986.897088] [] alloc_inode+0x14/0x5f
    [ 1986.897097] [] iget5_locked+0x62/0x13a
    [ 1986.897106] [] reiserfs_fill_super+0x410/0x8b9
    [ 1986.897114] [] mount_bdev+0x10b/0x159
    [ 1986.897123] [] get_super_block+0x10/0x12
    [ 1986.897131] [] mount_fs+0x59/0x12d
    [ 1986.897138] [] vfs_kern_mount+0x45/0x7a
    [ 1986.897147] [] do_kern_mount+0x2f/0xb0
    [ 1986.897155] [] do_mount+0x5c2/0x612
    [ 1986.897163] [] sys_mount+0x61/0x8f
    [ 1986.897170] [] sysenter_do_call+0x12/0x32
    [ 1986.897181] irq event stamp: 7509691
    [ 1986.897186] hardirqs last enabled at (7509691): [] kmem_cache_alloc+0x6e/0x93
    [ 1986.897197] hardirqs last disabled at (7509690): [] kmem_cache_alloc+0x24/0x93
    [ 1986.897209] softirqs last enabled at (7508896): [] __do_softirq+0xee/0xfd
    [ 1986.897222] softirqs last disabled at (7508859): [] do_softirq+0x50/0x9d
    [ 1986.897234]
    [ 1986.897235] other info that might help us debug this:
    [ 1986.897242] Possible unsafe locking scenario:
    [ 1986.897244]
    [ 1986.897250] CPU0
    [ 1986.897254] ----
    [ 1986.897257] lock(&REISERFS_SB(s)->lock);
    [ 1986.897265]
    [ 1986.897269] lock(&REISERFS_SB(s)->lock);
    [ 1986.897276]
    [ 1986.897277] *** DEADLOCK ***
    [ 1986.897278]
    [ 1986.897286] no locks held by kswapd0/16.
    [ 1986.897291]
    [ 1986.897292] stack backtrace:
    [ 1986.897299] Pid: 16, comm: kswapd0 Not tainted 3.1.1-main #8
    [ 1986.897306] Call Trace:
    [ 1986.897314] [] ? printk+0xf/0x11
    [ 1986.897324] [] print_usage_bug+0x20e/0x21a
    [ 1986.897332] [] ? print_irq_inversion_bug+0x172/0x172
    [ 1986.897341] [] mark_lock+0x27f/0x483
    [ 1986.897349] [] __lock_acquire+0x628/0x1472
    [ 1986.897358] [] lock_acquire+0x47/0x5e
    [ 1986.897366] [] ? reiserfs_write_lock+0x20/0x2a
    [ 1986.897384] [] ? reiserfs_write_lock+0x20/0x2a
    [ 1986.897397] [] mutex_lock_nested+0x35/0x26f
    [ 1986.897409] [] ? reiserfs_write_lock+0x20/0x2a
    [ 1986.897421] [] reiserfs_write_lock+0x20/0x2a
    [ 1986.897433] [] map_block_for_writepage+0xc9/0x590
    [ 1986.897448] [] ? create_empty_buffers+0x33/0x8f
    [ 1986.897461] [] ? get_parent_ip+0xb/0x31
    [ 1986.897472] [] ? sub_preempt_count+0x81/0x8e
    [ 1986.897485] [] ? _raw_spin_unlock+0x27/0x3d
    [ 1986.897496] [] ? get_parent_ip+0xb/0x31
    [ 1986.897508] [] reiserfs_writepage+0x1b9/0x3e7
    [ 1986.897521] [] ? clear_page_dirty_for_io+0xcb/0xde
    [ 1986.897533] [] ? trace_hardirqs_on_caller+0x108/0x138
    [ 1986.897546] [] ? trace_hardirqs_on+0xb/0xd
    [ 1986.897559] [] shrink_page_list+0x34f/0x5e2
    [ 1986.897572] [] shrink_inactive_list+0x172/0x22c
    [ 1986.897585] [] shrink_zone+0x303/0x3b1
    [ 1986.897597] [] ? _raw_spin_unlock+0x27/0x3d
    [ 1986.897611] [] kswapd+0x3b7/0x5f2

    The deadlock shouldn't happen since we are doing that allocation in the
    mount path; the filesystem is not yet available for any reclaim. Still,
    the warning is annoying.

    To solve this, acquire the lock later, only where we need it, right
    before calling reiserfs_read_locked_inode(), which wants the lock to
    walk the tree.

    Reported-by: Knut Petersen
    Signed-off-by: Frederic Weisbecker
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Jeff Mahoney
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • journal_init() doesn't need the lock since no operation on the filesystem
    is involved there. journal_read() and get_list_bitmap() have yet to be
    reviewed carefully, though, before removing the lock there. Just keep it
    around these two calls for safety.

    Signed-off-by: Frederic Weisbecker
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Jeff Mahoney
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • In the mount path, transactions that are made before journal
    initialization don't involve the filesystem. We can delay the reiserfs
    lock until we play with the journal.

    Signed-off-by: Frederic Weisbecker
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Jeff Mahoney
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • Signed-off-by: Davidlohr Bueso
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Randomization of PIE load address is hard coded in binfmt_elf.c for X86
    and ARM. Create a new Kconfig variable
    (CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE) for this and use it instead. Thus
    architecture specific policy is pushed out of the generic binfmt_elf.c and
    into the architecture Kconfig files.

    X86 and ARM Kconfigs are modified to select the new variable so there is
    no change in behavior. A follow on patch will select it for MIPS too.

    Signed-off-by: David Daney
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Acked-by: H. Peter Anvin
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
     
  • oom_score_adj is used for guarding processes from the OOM killer. One
    problem is that it's inherited at fork(). When a daemon sets
    oom_score_adj and makes children, it's hard to know where the value is
    set.

    This patch adds 3 tracepoints useful for debugging:
    - creating a new task
    - renaming a task (exec)
    - setting oom_score_adj

    To debug, users need to enable some tracepoints. Filtering may be
    useful, as in:

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    The output will look like this:
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Tell the page allocator that pages allocated for a buffered write are
    expected to become dirty soon.
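
    The hint itself is just an extra gfp flag on the page cache allocation;
    a sketch of the relevant line in grab_cache_page_write_begin() (from
    memory, so treat the exact placement as an assumption):

    gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;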

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Inode cache pruning indirectly reclaims page cache by invalidating mapping
    pages. Let's account them in reclaim_state so that this progress is
    noticed by the memory reclaimer.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Ext4 commits for 3.3 merge window

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
    ext4: fix undefined behavior in ext4_fill_flex_info()
    ext4: make more symbols static
    ext4: make local symbol ext4_initxattrs static
    jbd2: fix hung processes in jbd2_journal_lock_updates()
    ext4: reserve new feature flag codepoints
    ext4: Report max_batch_time option correctly
    ext4: add missing ext4_resize_end on error paths
    ext4: let ext4_group_add() use common code
    ext4: let ext4_group_extend() use common code
    ext4: add new online resize interface
    ext4: add a new function which adds a flex group to a fs
    ext4: add a new function which allocates bitmaps and inode tables
    ext4: pass verify_reserved_gdb() the number of group descriptors
    ext4: add a function which updates the super block during online resizing
    ext4: add a function which sets up a block group descriptors of a flex bg
    ext4: add a function which sets up group blocks of a flex bg
    ext4: add a structure which will be used by 64bit-resize interface
    ext4: add a function which adds a new group descriptors to a fs
    ext4: add a function which extends a group without checking parameters
    ext4: use proper little-endian bitops
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    fs/9p: iattr_valid flags are kernel internal flags map them to 9p values.
    fs/9p: We should not allocate a new inode when creating hardlinks.
    fs/9p: v9fs_stat2inode should update suid/sgid bits.
    9p: Reduce object size with CONFIG_NET_9P_DEBUG
    fs/9p: check schedule_timeout_interruptible return value

    Fix up trivial conflicts in fs/9p/{vfs_inode.c,vfs_inode_dotl.c} due to
    debug messages having changed to use p9_debug() on one hand, and the
    changes for umode_t on the other.

    Linus Torvalds
     
  • * 'nfs-for-3.3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4: Change the default setting of the nfs4_disable_idmapping parameter
    NFSv4: Save the owner/group name string when doing open
    NFS: Remove pNFS bloat from the generic write path
    pnfs-obj: Must return layout on IO error
    pnfs-obj: pNFS errors are communicated on iodata->pnfs_error
    NFS: Cache state owners after files are closed
    NFS: Clean up nfs4_find_state_owners_locked()
    NFSv4: include bitmap in nfsv4 get acl data
    nfs: fix a minor do_div portability issue
    NFSv4.1: cleanup comment and debug printk
    NFSv4.1: change nfs4_free_slot parameters for dynamic slots
    NFSv4.1: cleanup init and reset of session slot tables
    NFSv4.1: fix backchannel slotid off-by-one bug
    nfs: fix regression in handling of context= option in NFSv4
    NFS - fix recent breakage to NFS error handling.
    NFS: Retry mounting NFSROOT
    SUNRPC: Clean up the RPCSEC_GSS service ticket requests

    Linus Torvalds
     
  • * 'linux-next' of git://git.infradead.org/ubifs-2.6:
    UBI: fix use-after-free on error path
    UBI: fix missing scrub when there is a bit-flip
    UBIFS: Use kmemdup rather than duplicating its implementation

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
    dlm: add recovery callbacks
    dlm: add node slots and generation
    dlm: move recovery barrier calls
    dlm: convert rsb list to rb_tree

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • MTD pull for 3.3

    * tag 'for-linus-3.3' of git://git.infradead.org/mtd-2.6: (113 commits)
    mtd: Fix dependency for MTD_DOC200x
    mtd: do not use mtd->block_markbad directly
    logfs: do not use 'mtd->block_isbad' directly
    mtd: introduce mtd_can_have_bb helper
    mtd: do not use mtd->suspend and mtd->resume directly
    mtd: do not use mtd->lock, unlock and is_locked directly
    mtd: do not use mtd->sync directly
    mtd: harmonize mtd_writev usage
    mtd: do not use mtd->lock_user_prot_reg directly
    mtd: do not use mtd->write_user_prot_reg directly
    mtd: do not use mtd->read_*_prot_reg directly
    mtd: do not use mtd->get_*_prot_info directly
    mtd: do not use mtd->read_oob directly
    mtd: mtdoops: do not use mtd->panic_write directly
    romfs: do not use mtd->get_unmapped_area directly
    mtd: do not use mtd->get_unmapped_area directly
    mtd: do not use mtd->point directly
    mtd: introduce mtd_has_oob helper
    mtd: mtdcore: export symbols cleanup
    mtd: clean-up the default_mtd_writev function
    ...

    Fix up trivial edit/remove conflict in drivers/staging/spectra/lld_mtd.c

    Linus Torvalds
     
  • Kmemleak reports the following warning in bdev_cache_init()
    [ 0.003738] kmemleak: Object 0xffff880153035200 (size 256):
    [ 0.003823] kmemleak: comm "swapper/0", pid 0, jiffies 4294667299
    [ 0.003909] kmemleak: min_count = 1
    [ 0.003988] kmemleak: count = 0
    [ 0.004066] kmemleak: flags = 0x1
    [ 0.004144] kmemleak: checksum = 0
    [ 0.004224] kmemleak: backtrace:
    [ 0.004303] [] kmemleak_alloc+0x21/0x3e
    [ 0.004446] [] kmem_cache_alloc+0xca/0x1dc
    [ 0.004592] [] alloc_vfsmnt+0x1f/0x198
    [ 0.004736] [] vfs_kern_mount+0x36/0xd2
    [ 0.004879] [] kern_mount_data+0x18/0x32
    [ 0.005025] [] bdev_cache_init+0x51/0x81
    [ 0.005169] [] vfs_caches_init+0x101/0x10d
    [ 0.005313] [] start_kernel+0x344/0x383
    [ 0.005456] [] x86_64_start_reservations+0xae/0xb2
    [ 0.005602] [] x86_64_start_kernel+0x102/0x111
    [ 0.005747] [] 0xffffffffffffffff
    [ 0.008653] kmemleak: Trying to color unknown object at 0xffff880153035220 as Grey
    [ 0.008754] Pid: 0, comm: swapper/0 Not tainted 3.3.0-rc0-dbg-04200-g8180888-dirty #888
    [ 0.008856] Call Trace:
    [ 0.008934] [] ? find_and_get_object+0x44/0x118
    [ 0.009023] [] paint_ptr+0x57/0x8f
    [ 0.009109] [] kmemleak_not_leak+0x23/0x42
    [ 0.009195] [] bdev_cache_init+0x72/0x81
    [ 0.009282] [] vfs_caches_init+0x101/0x10d
    [ 0.009368] [] start_kernel+0x344/0x383
    [ 0.009466] [] x86_64_start_reservations+0xae/0xb2
    [ 0.009555] [] ? early_idt_handlers+0x140/0x140
    [ 0.009643] [] x86_64_start_kernel+0x102/0x111

    due to an attempt to mark a pointer to `struct vfsmount' as a gray
    object, which is embedded into `struct mount' returned from
    alloc_vfsmnt().

    Make `bd_mnt' static, avoiding the need to tell kmemleak to mark it
    gray, as suggested by Al Viro.

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Al Viro

    Sergey Senozhatsky
     
  • Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may
    cause shrink_dcache_parent() to loop forever.

    Here's what appears to happen:

    1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1

    2 - CPU1: select_parent(P) locks P->d_lock

    3 - CPU0: shrink_dentry_list() locks C->d_lock
    dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock

    4 - CPU1: select_parent(P) locks C->d_lock,
    moves C from dispose list being processed on CPU0 to the new
    dispose list, returns 1

    5 - CPU0: shrink_dentry_list() finds dispose list empty, returns

    6 - Goto 2 with CPU0 and CPU1 switched

    Basically select_parent() steals the dentry from shrink_dentry_list() and thinks
    it found a new one, causing shrink_dentry_list() to think it's making progress
    and loop over and over.

    One way to trigger this is to make udev call stat() on a sysfs file
    while it is going away.

    Having a file in /lib/udev/rules.d/ with only this one rule seems to do
    the trick:

    ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true"

    Then execute the following loop:

    while true; do
            echo -bond0 > /sys/class/net/bonding_masters
            echo +bond0 > /sys/class/net/bonding_masters
            echo -bond1 > /sys/class/net/bonding_masters
            echo +bond1 > /sys/class/net/bonding_masters
    done

    One fix would be to check all callers and prevent concurrent calls to
    shrink_dcache_parent(). But I think a better solution is to stop the
    stealing behavior.

    This patch adds a new dentry flag that is set when the dentry is added to the
    dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a
    new reference just before being pruned.

    If the dentry has this flag, select_parent() will skip it and let
    shrink_dentry_list() retry pruning it. With select_parent() skipping those
    dentries there will not be the appearance of progress (new dentries found) when
    there is none, hence shrink_dcache_parent() will not loop forever.

    The flag is also set in prune_dcache_sb() for consistency, as suggested
    by Linus.
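
    A sketch of the select_parent() side of this (the flag is called
    DCACHE_SHRINK_LIST in the patch, if memory serves):

    if (dentry->d_flags & DCACHE_SHRINK_LIST) {
            /* already claimed by a shrink_dentry_list() on another CPU;
             * leave it alone instead of stealing it */
    } else if (!dentry->d_count) {
            dentry_lru_move_list(dentry, dispose);
            dentry->d_flags |= DCACHE_SHRINK_LIST;
            found++;
    }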

    Signed-off-by: Miklos Szeredi
    CC: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Conflicts:
    fs/ext4/ioctl.c

    Theodore Ts'o
     
  • Commit 503358ae01b70ce6909d19dd01287093f6b6271c ("ext4: avoid divide by
    zero when trying to mount a corrupted file system") fixes CVE-2009-4307
    by performing a sanity check on s_log_groups_per_flex, since it can be
    set to a bogus value by an attacker.

    sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
    groups_per_flex = 1 << sbi->s_log_groups_per_flex;

    if (groups_per_flex < 2) { ... }

    This patch fixes two potential issues in the previous commit.

    1) The sanity check might only work on architectures like PowerPC.
    On x86, 5 bits are used for the shifting amount. That means, given a
    large s_log_groups_per_flex value like 36, groups_per_flex = 1 << 36
    is essentially 1 << 4 = 16, rather than 0. This will bypass the check,
    leaving s_log_groups_per_flex and groups_per_flex inconsistent.

    2) The sanity check relies on undefined behavior, i.e., oversized shift.
    A standard-conforming C compiler could rewrite the check in unexpected
    ways. Consider the following equivalent form, assuming groups_per_flex
    is unsigned for simplicity.

    groups_per_flex = 1 << sbi->s_log_groups_per_flex;
    if (groups_per_flex == 0 || groups_per_flex == 1) {

    We compile the code snippet using Clang 3.0 and GCC 4.6. Clang will
    completely optimize away the check groups_per_flex == 0, leaving the
    patched code as vulnerable as the original. GCC keeps the check, but
    there is no guarantee that future versions will do the same.
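
    One undefined-behavior-free way to write the check is to validate the
    shift amount itself before shifting, e.g. (a sketch along the lines of
    the eventual fix):

    sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
    if (sbi->s_log_groups_per_flex < 1 || sbi->s_log_groups_per_flex > 31) {
            sbi->s_log_groups_per_flex = 0;
            return 1;
    }
    groups_per_flex = 1 << sbi->s_log_groups_per_flex;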

    Signed-off-by: Xi Wang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Xi Wang
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • lookup should fail with ENOMEM, not silently make the dentry negative.
    Switched to saner calling conventions, while we are at it.

    Signed-off-by: Al Viro

    Al Viro
     

10 Jan, 2012

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: new helper - d_make_root()
    dcache: use a dispose list in select_parent
    ceph: d_alloc_root() may fail
    ext4: fix failure exits
    isofs: inode leak on mount failure

    Linus Torvalds
     
  • d_alloc_root() with iput() in case of allocation failure...
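
    In other words, the new helper wraps that pattern so callers stop
    open-coding it; roughly (a sketch from memory):

    struct dentry *d_make_root(struct inode *root_inode)
    {
            struct dentry *res = NULL;

            if (root_inode) {
                    res = d_alloc_root(root_inode);
                    if (!res)
                            iput(root_inode);
            }
            return res;
    }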

    Signed-off-by: Al Viro

    Al Viro
     
  • select_parent currently abuses the dentry cache LRU to provide
    cleanup features for child dentries that need to be freed. It moves
    them to the tail of the LRU, then tells shrink_dcache_parent() to
    call __shrink_dcache_sb() to unconditionally move them to a dispose
    list (as DCACHE_REFERENCED is ignored). __shrink_dcache_sb() has to
    relock the dentries to move them off the LRU onto the dispose list,
    but otherwise does not touch the dentries that select_parent() moved
    to the tail of the LRU. It then passes the dispose list to
    shrink_dentry_list() which tries to free the dentries.

    IOWs, the use of __shrink_dcache_sb() is superfluous - we can build
    exactly the same list of dentries for disposal directly in
    select_parent() and call shrink_dentry_list() instead of calling
    __shrink_dcache_sb() to do that. This means that we avoid long holds
    on the lru lock walking the LRU moving dentries to the dispose list.
    We also avoid the need to relock each dentry just to move it off the
    LRU, reducing the number of times we lock each dentry to dispose of
    them in shrink_dcache_parent() from 3 to 2 times.

    Further, we remove one of the two callers of __shrink_dcache_sb().
    This also means that __shrink_dcache_sb() can be moved back into
    prune_dcache_sb() and we no longer have to handle referenced
    dentries conditionally, simplifying the code.
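
    The resulting shrink_dcache_parent() loop is then essentially (a
    sketch):

    void shrink_dcache_parent(struct dentry *parent)
    {
            LIST_HEAD(dispose);
            int found;

            while ((found = select_parent(parent, &dispose)) != 0)
                    shrink_dentry_list(&dispose);
    }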

    Signed-off-by: Dave Chinner
    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro

    Dave Chinner
     
  • ... and ceph_init_dentry(NULL) will oops

    Signed-off-by: Al Viro

    Al Viro