12 Jan, 2012

1 commit


11 Jan, 2012

39 commits

  • lib: use generic pci_iomap on all architectures

    Many architectures don't want to pull in iomap.c,
    so they ended up duplicating pci_iomap from that file.
    That function isn't trivial, and we are going to modify it
    https://lkml.org/lkml/2011/11/14/183
    so the duplication hurts.

    This reduces the scope of the problem significantly,
    by moving pci_iomap to a separate file and
    referencing that from all architectures.

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    alpha: drop pci_iomap/pci_iounmap from pci-noop.c
    mn10300: switch to GENERIC_PCI_IOMAP
    mn10300: add missing __iomap markers
    frv: switch to GENERIC_PCI_IOMAP
    tile: switch to GENERIC_PCI_IOMAP
    tile: don't panic on iomap
    sparc: switch to GENERIC_PCI_IOMAP
    sh: switch to GENERIC_PCI_IOMAP
    powerpc: switch to GENERIC_PCI_IOMAP
    parisc: switch to GENERIC_PCI_IOMAP
    mips: switch to GENERIC_PCI_IOMAP
    microblaze: switch to GENERIC_PCI_IOMAP
    arm: switch to GENERIC_PCI_IOMAP
    alpha: switch to GENERIC_PCI_IOMAP
    lib: add GENERIC_PCI_IOMAP
    lib: move GENERIC_IOMAP to lib/Kconfig

    Fix up trivial conflicts due to changes nearby in arch/{m68k,score}/Kconfig

    Linus Torvalds
     
  • * tag 'for-linux-3.3-merge-window' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming: (29 commits)
    C6X: replace tick_nohz_stop/restart_sched_tick calls
    C6X: add register_cpu call
    C6X: deal with memblock API changes
    C6X: fix timer64 initialization
    C6X: fix layout of EMIFA registers
    C6X: MAINTAINERS
    C6X: DSCR - Device State Configuration Registers
    C6X: EMIF - External Memory Interface
    C6X: general SoC support
    C6X: library code
    C6X: headers
    C6X: ptrace support
    C6X: loadable module support
    C6X: cache control
    C6X: clocks
    C6X: build infrastructure
    C6X: syscalls
    C6X: interrupt handling
    C6X: time management
    C6X: signal management
    ...

    Linus Torvalds
     
  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
  • Andrew elucidates:
    - First installment of MM. We have a HUGE number of MM patches this
    time. It's crazy.
    - MAINTAINERS updates
    - backlight updates
    - leds
    - checkpatch updates
    - misc ELF stuff
    - rtc updates
    - reiserfs
    - procfs
    - some misc other bits

    * akpm: (124 commits)
    user namespace: make signal.c respect user namespaces
    workqueue: make alloc_workqueue() take printf fmt and args for name
    procfs: add hidepid= and gid= mount options
    procfs: parse mount options
    procfs: introduce the /proc/<pid>/map_files/ directory
    procfs: make proc_get_link to use dentry instead of inode
    signal: add block_sigmask() for adding sigmask to current->blocked
    sparc: make SA_NOMASK a synonym of SA_NODEFER
    reiserfs: don't lock root inode searching
    reiserfs: don't lock journal_init()
    reiserfs: delay reiserfs lock until journal initialization
    reiserfs: delete comments referring to the BKL
    drivers/rtc/interface.c: fix alarm rollover when day or month is out-of-range
    drivers/rtc/rtc-twl.c: add DT support for RTC inside twl4030/twl6030
    drivers/rtc/: remove redundant spi driver bus initialization
    drivers/rtc/rtc-jz4740.c: make jz4740_rtc_driver static
    drivers/rtc/rtc-mc13xxx.c: make mc13xxx_rtc_idtable static
    rtc: convert drivers/rtc/* to use module_platform_driver()
    drivers/rtc/rtc-wm831x.c: convert to devm_kzalloc()
    drivers/rtc/rtc-wm831x.c: remove unused period IRQ handler
    ...

    Linus Torvalds
     
  • alloc_workqueue() currently expects the passed in @name pointer to remain
    accessible. This is inconvenient and a bit silly given that the whole wq
    is being dynamically allocated. This patch updates alloc_workqueue() and
    friends to take printf format string instead of opaque string and matching
    varargs at the end. The name is allocated together with the wq and
    formatted.

    alloc_ordered_workqueue() is converted to a macro to unify varargs
    handling with alloc_workqueue(), and, while at it, add comment to
    alloc_workqueue().

    None of the current in-kernel users pass in string with '%' as constant
    name and this change shouldn't cause any problem.
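
    As a rough userspace illustration of the idea (not the kernel code), the
    sketch below formats a name from a printf-style string and stores it in
    the same allocation as the object, so the caller's buffer does not have
    to stay accessible. struct demo_wq and demo_wq_alloc() are hypothetical
    names invented for this example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a workqueue; not the kernel's structure. */
struct demo_wq {
	int max_active;
	char name[];		/* formatted name lives in the same allocation */
};

/* Let the compiler check the format string against the arguments. */
struct demo_wq *demo_wq_alloc(int max_active, const char *fmt, ...)
	__attribute__((format(printf, 2, 3)));

struct demo_wq *demo_wq_alloc(int max_active, const char *fmt, ...)
{
	va_list args, args_copy;
	struct demo_wq *wq;
	int len;

	va_start(args, fmt);

	/* First pass only measures how long the formatted name will be. */
	va_copy(args_copy, args);
	len = vsnprintf(NULL, 0, fmt, args_copy);
	va_end(args_copy);
	if (len < 0) {
		va_end(args);
		return NULL;
	}

	/* One allocation holds both the object and its name. */
	wq = malloc(sizeof(*wq) + (size_t)len + 1);
	if (wq) {
		wq->max_active = max_active;
		vsnprintf(wq->name, (size_t)len + 1, fmt, args);
	}
	va_end(args);
	return wq;
}
```

    A caller would write something like demo_wq_alloc(1, "wq-%s-%d", dev, n)
    and release the result with free(). A constant name that contains '%'
    would be interpreted as a format, which is the compatibility caveat the
    paragraph above refers to.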

    [akpm@linux-foundation.org: use __printf]
    Signed-off-by: Tejun Heo
    Suggested-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Add support for mount options to restrict access to /proc/PID/
    directories. The default backward-compatible "relaxed" behaviour is left
    untouched.

    The first mount option is called "hidepid" and its value defines how much
    info about processes we want to be available for non-owners:

    hidepid=0 (default) means the old behavior - anybody may read all
    world-readable /proc/PID/* files.

    hidepid=1 means users may not access any /proc/<pid>/ directories, but
    their own. Sensitive files like cmdline, sched*, status are now protected
    against other users. As permission checking is done in proc_pid_permission()
    and files' permissions are left untouched, programs expecting specific
    files' modes are not confused.

    hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
    users. It doesn't mean that it hides whether a process exists (it can be
    learned by other means, e.g. by kill -0 $PID), but it hides process' euid
    and egid. It complicates an intruder's task of gathering info about running
    processes, whether some daemon runs with elevated privileges, whether
    another user runs some sensitive program, whether other users run any
    program at all, etc.

    gid=XXX defines a group that will be able to gather all processes' info
    (as in hidepid=0 mode). This group should be used instead of putting
    nonroot user in sudoers file or something. However, untrusted users (like
    daemons, etc.) which are not supposed to monitor the tasks in the whole
    system should not be added to the group.
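
    The three hidepid levels and the gid= exemption can be summarised as two
    predicates. This is only a sketch of the policy as described here: the
    names are hypothetical, and ownership is modelled as a plain uid
    comparison, which is an assumption rather than the check the kernel
    performs.

```c
#include <stdbool.h>

/* Hypothetical description of the process that is reading /proc. */
struct demo_reader {
	unsigned int uid;	/* user running the reader */
	bool in_proc_gid;	/* member of the group named by gid= */
};

/* hidepid >= 1: may the reader open files under the task's /proc/PID/? */
bool demo_proc_may_access(int hidepid, const struct demo_reader *r,
			  unsigned int task_uid)
{
	if (hidepid < 1 || r->in_proc_gid)
		return true;		/* old behaviour, or exempt group */
	return r->uid == task_uid;	/* assumption: owner means same uid */
}

/* hidepid >= 2: does the task's /proc/PID/ entry show up at all? */
bool demo_proc_is_visible(int hidepid, const struct demo_reader *r,
			  unsigned int task_uid)
{
	if (hidepid < 2 || r->in_proc_gid)
		return true;
	return r->uid == task_uid;
}
```

    With hidepid=1 a foreign directory is still listed but cannot be entered;
    with hidepid=2 it is not listed either, unless the reader belongs to the
    gid= group.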

    hidepid=1 or higher is designed to restrict access to procfs files, which
    might reveal some sensitive private information like precise keystrokes
    timings:

    http://www.openwall.com/lists/oss-security/2011/11/05/3

    hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
    conky gracefully handle EPERM/ENOENT and behave as if the current user is
    the only user running processes. pstree shows the process subtree which
    contains "pstree" process.

    Note: the patch doesn't deal with setuid/setgid issues of keeping
    preopened descriptors of procfs files (like
    https://lkml.org/lkml/2011/2/7/368). We rely on the fact that leaked
    information such as the scheduling counters of setuid apps doesn't threaten
    anybody's privacy: only the user who started the setuid program may read
    the counters.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • This one behaves similarly to the /proc/<pid>/fd/ one - it contains
    symlinks, one for each mapping with a file; the name of a symlink is
    "vma->vm_start-vma->vm_end", the target is the file. Opening a symlink
    results in a file that points to exactly the same inode as the vma's one.

    For example the ls -l of some arbitrary /proc/<pid>/map_files/

    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

    This *helps* checkpointing process in three ways:

    1. When dumping a task's mappings we know the exact file that is mapped
    by a particular region. We do this by opening the
    /proc/$pid/map_files/$address symlink the way we do with file
    descriptors.

    2. This also helps in determining which anonymous shared mappings are
    shared with each other by comparing their inodes.

    3. When restoring a set of processes, in case two of them have a shared
    mapping, we map the memory by the 1st one and then open its
    /proc/$pid/map_files/$address file and map it by the 2nd task.

    Using /proc/$pid/maps for this is quite inconvenient since it requires
    repeatedly re-reading and reparsing this text file, which slows down the
    restore procedure significantly. Also, as pointed out in (3), it is way
    easier to use the top-level shared mapping in children as
    /proc/$pid/map_files/$address when needed.
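
    A consumer of this directory has to turn each symlink name back into an
    address range. A minimal sketch, assuming the name is two hexadecimal
    addresses joined by a dash as in the listing above;
    demo_parse_map_name() is a hypothetical helper, not part of any kernel
    or libc API:

```c
#include <stdio.h>

/*
 * Parse a map_files entry name of the form "<start>-<end>" (hex, no 0x
 * prefix in the listing) into its two addresses.
 * Returns 0 on success, -1 if the name is malformed or the range is empty.
 */
int demo_parse_map_name(const char *name,
			unsigned long long *start,
			unsigned long long *end)
{
	char trailing;

	/* Exactly two conversions must succeed; a third means trailing junk. */
	if (sscanf(name, "%llx-%llx%c", start, end, &trailing) != 2)
		return -1;
	if (*start >= *end)
		return -1;
	return 0;
}
```

    For the first entry in the listing this yields a range of 0x1000 bytes,
    that is, a single 4096-byte page.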

    [akpm@linux-foundation.org: coding-style fixes]
    [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Vasiliy Kulikov
    Reviewed-by: "Kirill A. Shutemov"
    Cc: Tejun Heo
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Prepare the ground for the next "map_files" patch which needs a name of a
    link file to analyse.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Cc: Vasiliy Kulikov
    Cc: "Kirill A. Shutemov"
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Abstract the code sequence for adding a signal handler's sa_mask to
    current->blocked because the sequence is identical for all architectures.
    Furthermore, in the past some architectures actually got this code wrong,
    so introduce a wrapper that all architectures can use.
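
    The sequence being abstracted can be mimicked in userspace with the POSIX
    signal-set calls. This is an analogy only, under the assumption that the
    handled signal itself is also blocked unless SA_NODEFER is set (the usual
    sigaction semantics); demo_block_sigmask() and DEMO_MAX_SIG are
    hypothetical names, and the kernel helper works on current->blocked
    rather than on a caller-supplied set.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>

/* Assumption: signal numbers run from 1 to 64, as on Linux. */
#define DEMO_MAX_SIG 64

/*
 * Compute the set of signals blocked while a handler runs:
 * the previously blocked set, plus the handler's sa_mask, plus the
 * signal being delivered unless the handler asked for SA_NODEFER.
 */
void demo_block_sigmask(sigset_t *blocked, const struct sigaction *sa,
			int signo)
{
	for (int sig = 1; sig <= DEMO_MAX_SIG; sig++) {
		/* sigismember() returns -1 for invalid numbers; skip those. */
		if (sigismember(&sa->sa_mask, sig) == 1)
			sigaddset(blocked, sig);
	}

	if (!(sa->sa_flags & SA_NODEFER))
		sigaddset(blocked, signo);
}
```

    Keeping the SA_NODEFER test inside the helper is the point of such a
    wrapper: callers cannot forget it or apply it the wrong way round.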

    Signed-off-by: Matt Fleming
    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Tejun Heo
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Fleming
     
  • TI's TCA6507 is the LED driver in the GTA04 Openmoko motherboard. The
    driver provides full support for brightness levels and hardware blinking.

    This driver can drive each of 7 outputs as an LED or a GPIO output,
    and provides hardware-assist blinking.

    [akpm@linux-foundation.org: fix __mod_i2c_device_table alias]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: NeilBrown
    Cc: Richard Purdie
    Cc: Randy Dunlap
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • mpol_equal() logically returns a boolean. Use a bool type to slightly
    improve readability.

    Signed-off-by: KOSAKI Motohiro
    Cc: Stephen Wilson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • oom_score_adj is used for guarding processes from the OOM-Killer. One
    problem is that it's inherited at fork(). When a daemon sets oom_score_adj
    and creates children, it's hard to know where the value was set.

    This patch adds some tracepoints useful for debugging. This patch adds
    3 trace points.
    - creating new task
    - renaming a task (exec)
    - set oom_score_adj

    To debug, users need to enable some tracepoints. Filtering may be useful, as in:

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    output will be like this.
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to not serializing properly against mremap
    PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead to migrate
    missing some pte.

    This patch adds an anon_vma_moveto_tail() function to force the dst vma at
    the end of the list before mremap starts to solve the problem.

    If the mremap is very large and there are a lot of parents or children
    sharing the anon_vma root lock, this should still scale better than taking
    the anon_vma root lock around every pte copy practically for the whole
    duration of mremap.

    Update: Hugh noticed special care is needed in the error path where
    move_page_tables goes in the reverse direction, a second
    anon_vma_moveto_tail() call is needed in the error path.

    This program exercises the anon_vma_moveto_tail:

    ===

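    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/time.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>

    /* SIZE is not defined in the text as preserved here; 3 MiB is assumed. */
    #define SIZE (3*1024*1024)

    int main()
    {
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        /* Three 2 MiB-aligned buffers: source, reference copy, scratch. */
        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        printf("%p\n", p);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);
        if (memcmp(p, p2, SIZE))
            printf("error\n");

        /* Move the second half of p onto p3, then back where it was. */
        p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        if (p4 != p3)
            perror("mremap"), exit(1);
        p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
        if (p4 != p+SIZE/2)
            perror("mremap"), exit(1);

        /* The contents must have survived both moves. */
        if (memcmp(p, p2, SIZE))
            printf("error\n");
        printf("ok\n");

        return 0;
    }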
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The maximum number of dirty pages that exist in the system at any time is
    determined by a number of pages considered dirtyable and a user-configured
    percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.
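
    The arithmetic described above can be sketched as follows. All names are
    hypothetical and the structure is a simplified stand-in for a zone; the
    only things taken from the text are that reserves (lowmem reserve plus
    high watermark) are excluded from dirtyable memory, that the limit is a
    percentage of what remains, and that the first zone under its limit wins.

```c
#include <stddef.h>

/* Simplified, hypothetical view of one memory zone (all values in pages). */
struct demo_zone {
	unsigned long free_pages;
	unsigned long reclaimable_pages;
	unsigned long lowmem_reserve;
	unsigned long high_wmark;
	unsigned long dirty_pages;
};

/* Pages that may be dirtied: everything usable minus the reserves. */
unsigned long demo_zone_dirtyable(const struct demo_zone *z)
{
	unsigned long total = z->free_pages + z->reclaimable_pages;
	unsigned long reserve = z->lowmem_reserve + z->high_wmark;

	return total > reserve ? total - reserve : 0;
}

/* The zone's share of the dirty limit, given a ratio in percent. */
unsigned long demo_zone_dirty_limit(const struct demo_zone *z,
				    unsigned int ratio)
{
	return demo_zone_dirtyable(z) * ratio / 100;
}

/*
 * Walk the zones in preference order and return the index of the first
 * one that has not exceeded its dirty limit, or -1 if none qualifies.
 */
int demo_first_zone_under_limit(const struct demo_zone *zones, size_t n,
				unsigned int ratio)
{
	for (size_t i = 0; i < n; i++) {
		if (zones[i].dirty_pages <= demo_zone_dirty_limit(&zones[i], ratio))
			return (int)i;
	}
	return -1;
}
```

    For a zone with 10000 usable pages, 500 reserved pages and a 40% ratio,
    the limit comes out at 3800 pages. A return value of -1 corresponds to
    the case where every considered zone is over its limit.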

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the amount of pages that
    "spill over" are limited themselves by the lower zones' dirty constraints,
    and thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.

    Test results

    15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

              seconds                   nr_vmscan_write
                     (stddev)         min|     median|        max
    xfs
    vanilla:  549.747( 3.492)       0.000|      0.000|      0.000
    patched:  550.996( 3.802)       0.000|      0.000|      0.000

    fuse-ntfs
    vanilla: 1183.094(53.178)   54349.000|  59341.000|  65163.000
    patched:  558.049(17.914)       0.000|      0.000|     43.000

    btrfs
    vanilla:  573.679(14.015)  156657.000| 460178.000| 606926.000
    patched:  563.365(11.368)       0.000|      0.000|   1362.000

    ext4
    vanilla:  561.197(15.782)       0.000|2725438.000|4143837.000
    patched:  568.806(17.496)       0.000|      0.000|      0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                     |
    |          |  |                     |
    |          |  -- dirty limit old    -- flusher old
    |          |                        |
    +----------+                       --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Calling alloc_pages_exact_node() means the allocation only passes the
    zonelist of a single node into the page allocator. If that node isn't
    online, its zonelist may never have been initialized causing a strange
    oops that may not immediately be clear.

    I recently debugged an issue where node 0 wasn't online and an allocator
    was passing 0 to alloc_pages_exact_node() and it resulted in a NULL
    pointer on zonelist->_zoneref. If CONFIG_DEBUG_VM is enabled, though, it
    would be nice to catch this a bit earlier.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
    on access (read,write) to an unallocated page, which permits us to catch
    code which corrupts memory. However the kernel is trying to maximise
    memory usage, hence there are usually few free pages in the system and
    buggy code usually corrupts some crucial data.

    This patch changes the buddy allocator to keep more free/protected pages
    and to interlace free/protected and allocated pages to increase the
    probability of catching corruption.

    When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
    debug_guardpage_minorder defines the minimum order used by the page
    allocator to grant a request. The requested size will be returned with
    the remaining pages used as guard pages.
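
    Reading that description literally, the number of guard pages attached
    to a request is simple arithmetic. The helper below is a hypothetical
    illustration of that reading, for a request served by splitting a block
    of at least the minimum order; it is not code from the patch.

```c
/*
 * Guard pages left over when a request of 2^order pages is granted from
 * a block of 2^minorder pages. Requests at or above the minimum order
 * leave no guard pages.
 */
unsigned long demo_guard_pages(unsigned int order, unsigned int minorder)
{
	if (order >= minorder)
		return 0;
	return (1UL << minorder) - (1UL << order);
}
```

    With the default minimum order of zero the result is always zero, which
    matches the statement that the default leaves behaviour unchanged. A
    single-page request with a minimum order of 2 would be followed by three
    guard pages.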

    The default value of debug_guardpage_minorder is zero: no change from
    current behaviour.

    [akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
    Signed-off-by: Stanislaw Gruszka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: "Rafael J. Wysocki"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislaw Gruszka
     
  • We can place this in definitions that we expect the compiler to remove by
    dead code elimination. If this assertion fails, we get a nice error
    message at build time.

    The GCC function attribute error("message") was added in version 4.3, so
    we define a new macro __linktime_error(message) to expand to this for
    GCC-4.3 and later. This will give us an error diagnostic from the
    compiler on the line that fails. For other compilers
    __linktime_error(message) expands to nothing, and we have to be content
    with a link time error, but at least we will still get a build error.

    BUILD_BUG() expands to the undefined function __build_bug_failed() and
    will fail at link time if the compiler ever emits code for it. On GCC-4.3
    and later, attribute((error())) is used so that the failure will be noted
    at compile time instead.

    Signed-off-by: David Daney
    Acked-by: David Rientjes
    Cc: DM
    Cc: Ralf Baechle
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
     
  • Colin Cross reported:

    Under the following conditions, __alloc_pages_slowpath can loop forever:
    gfp_mask & __GFP_WAIT is true
    gfp_mask & __GFP_FS is false
    reclaim and compaction make no progress
    order
    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Rename mm_page_free_direct into mm_page_free and mm_pagevec_free into
    mm_page_free_batched

    Since v2.6.33-5426-gc475dab the kernel triggers mm_page_free_direct for
    all freed pages, not only for directly freed. So, let's name it properly.
    For pages freed via page-list we also trigger mm_page_free_batched event.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • It is not exported and now nobody uses it.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch adds helper free_hot_cold_page_list() to free list of 0-order
    pages. It frees pages directly from list without temporary page-vector.
    It also calls trace_mm_pagevec_free() to simulate pagevec_free()
    behaviour.

    bloat-o-meter:

    add/remove: 1/1 grow/shrink: 1/3 up/down: 267/-295 (-28)
    function old new delta
    free_hot_cold_page_list - 264 +264
    get_page_from_freelist 2129 2132 +3
    __pagevec_free 243 239 -4
    split_free_page 380 373 -7
    release_pages 606 510 -96
    free_page_list 188 - -188

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • The tracing ring-buffer used this function briefly, but not anymore.
    Make it local to the writeback code again.

    Also, move the function so that no forward declaration needs to be
    reintroduced.

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Ext4 commits for 3.3 merge window

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
    ext4: fix undefined behavior in ext4_fill_flex_info()
    ext4: make more symbols static
    ext4: make local symbol ext4_initxattrs static
    jbd2: fix hung processes in jbd2_journal_lock_updates()
    ext4: reserve new feature flag codepoints
    ext4: Report max_batch_time option correctly
    ext4: add missing ext4_resize_end on error paths
    ext4: let ext4_group_add() use common code
    ext4: let ext4_group_extend() use common code
    ext4: add new online resize interface
    ext4: add a new function which adds a flex group to a fs
    ext4: add a new function which allocates bitmaps and inode tables
    ext4: pass verify_reserved_gdb() the number of group decriptors
    ext4: add a function which updates the super block during online resizing
    ext4: add a function which sets up a block group descriptors of a flex bg
    ext4: add a function which sets up group blocks of a flex bg
    ext4: add a structure which will be used by 64bit-resize interface
    ext4: add a function which adds a new group descriptors to a fs
    ext4: add a function which extends a group without checking parameters
    ext4: use proper little-endian bitops
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    fs/9p: iattr_valid flags are kernel internal flags map them to 9p values.
    fs/9p: We should not allocate a new inode when creating hardlines.
    fs/9p: v9fs_stat2inode should update suid/sgid bits.
    9p: Reduce object size with CONFIG_NET_9P_DEBUG
    fs/9p: check schedule_timeout_interruptible return value

    Fix up trivial conflicts in fs/9p/{vfs_inode.c,vfs_inode_dotl.c} due to
    debug messages having changed to use p9_debug() on one hand, and the
    changes for umode_t on the other.

    Linus Torvalds
     
  • * 'nfs-for-3.3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4: Change the default setting of the nfs4_disable_idmapping parameter
    NFSv4: Save the owner/group name string when doing open
    NFS: Remove pNFS bloat from the generic write path
    pnfs-obj: Must return layout on IO error
    pnfs-obj: pNFS errors are communicated on iodata->pnfs_error
    NFS: Cache state owners after files are closed
    NFS: Clean up nfs4_find_state_owners_locked()
    NFSv4: include bitmap in nfsv4 get acl data
    nfs: fix a minor do_div portability issue
    NFSv4.1: cleanup comment and debug printk
    NFSv4.1: change nfs4_free_slot parameters for dynamic slots
    NFSv4.1: cleanup init and reset of session slot tables
    NFSv4.1: fix backchannel slotid off-by-one bug
    nfs: fix regression in handling of context= option in NFSv4
    NFS - fix recent breakage to NFS error handling.
    NFS: Retry mounting NFSROOT
    SUNRPC: Clean up the RPCSEC_GSS service ticket requests

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
    dlm: add recovery callbacks
    dlm: add node slots and generation
    dlm: move recovery barrier calls
    dlm: convert rsb list to rb_tree

    Linus Torvalds
     
  • MTD pull for 3.3

    * tag 'for-linus-3.3' of git://git.infradead.org/mtd-2.6: (113 commits)
    mtd: Fix dependency for MTD_DOC200x
    mtd: do not use mtd->block_markbad directly
    logfs: do not use 'mtd->block_isbad' directly
    mtd: introduce mtd_can_have_bb helper
    mtd: do not use mtd->suspend and mtd->resume directly
    mtd: do not use mtd->lock, unlock and is_locked directly
    mtd: do not use mtd->sync directly
    mtd: harmonize mtd_writev usage
    mtd: do not use mtd->lock_user_prot_reg directly
    mtd: mtd->write_user_prot_reg directly
    mtd: do not use mtd->read_*_prot_reg directly
    mtd: do not use mtd->get_*_prot_info directly
    mtd: do not use mtd->read_oob directly
    mtd: mtdoops: do not use mtd->panic_write directly
    romfs: do not use mtd->get_unmapped_area directly
    mtd: do not use mtd->get_unmapped_area directly
    mtd: do use mtd->point directly
    mtd: introduce mtd_has_oob helper
    mtd: mtdcore: export symbols cleanup
    mtd: clean-up the default_mtd_writev function
    ...

    Fix up trivial edit/remove conflict in drivers/staging/spectra/lld_mtd.c

    Linus Torvalds
     
  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (53 commits)
    iommu/amd: Set IOTLB invalidation timeout
    iommu/amd: Init stats for iommu=pt
    iommu/amd: Remove unnecessary cache flushes in amd_iommu_resume
    iommu/amd: Add invalidate-context call-back
    iommu/amd: Add amd_iommu_device_info() function
    iommu/amd: Adapt IOMMU driver to PCI register name changes
    iommu/amd: Add invalid_ppr callback
    iommu/amd: Implement notifiers for IOMMUv2
    iommu/amd: Implement IO page-fault handler
    iommu/amd: Add routines to bind/unbind a pasid
    iommu/amd: Implement device aquisition code for IOMMUv2
    iommu/amd: Add driver stub for AMD IOMMUv2 support
    iommu/amd: Add stat counter for IOMMUv2 events
    iommu/amd: Add device errata handling
    iommu/amd: Add function to get IOMMUv2 domain for pdev
    iommu/amd: Implement function to send PPR completions
    iommu/amd: Implement functions to manage GCR3 table
    iommu/amd: Implement IOMMUv2 TLB flushing routines
    iommu/amd: Add support for IOMMUv2 domain mode
    iommu/amd: Add amd_iommu_domain_direct_map function
    ...

    Linus Torvalds
     
  • * 'drm-core-next' of git://people.freedesktop.org/~airlied/linux: (307 commits)
    drm/nouveau/pm: fix build with HWMON off
    gma500: silence gcc warnings in mid_get_vbt_data()
    drm/ttm: fix condition (and vs or)
    drm/radeon: double lock typo in radeon_vm_bo_rmv()
    drm/radeon: use after free in radeon_vm_bo_add()
    drm/sis|via: don't return stack garbage from free_mem ioctl
    drm/radeon/kms: remove pointless CS flags priority struct
    drm/radeon/kms: check if vm is supported in VA ioctl
    drm: introduce drm_can_sleep and use in intel/radeon drivers. (v2)
    radeon: Fix disabling PCI bus mastering on big endian hosts.
    ttm: fix agp since ttm tt rework
    agp: Fix multi-line warning message whitespace
    drm/ttm/dma: Fix accounting error when calling ttm_mem_global_free_page and don't try to free freed pages.
    drm/ttm/dma: Only call set_pages_array_wb when the page is not in WB pool.
    drm/radeon/kms: sync across multiple rings when doing bo moves v3
    drm/radeon/kms: Add support for multi-ring sync in CS ioctl (v2)
    drm/radeon: GPU virtual memory support v22
    drm: make DRM_UNLOCKED ioctls with their own mutex
    drm: no need to hold global mutex for static data
    drm/radeon/benchmark: common modes sweep ignores 640x480@32
    ...

    Fix up trivial conflicts in radeon/evergreen.c and vmwgfx/vmwgfx_kms.c

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (64 commits)
    Input: tc3589x-keypad - add missing kerneldoc
    Input: ucb1400-ts - switch to using dev_xxx() for diagnostic messages
    Input: ucb1400_ts - convert to threaded IRQ
    Input: ucb1400_ts - drop inline annotations
    Input: usb1400_ts - add __devinit/__devexit section annotations
    Input: ucb1400_ts - set driver owner
    Input: ucb1400_ts - convert to use dev_pm_ops
    Input: psmouse - make sure we do not use stale methods
    Input: evdev - do not block waiting for an event if fd is nonblock
    Input: evdev - if no events and non-block, return EAGAIN not 0
    Input: evdev - only allow reading events if a full packet is present
    Input: add driver for pixcir i2c touchscreens
    Input: samsung-keypad - implement runtime power management support
    Input: tegra-kbc - report wakeup key for some platforms
    Input: tegra-kbc - add device tree bindings
    Input: add driver for AUO In-Cell touchscreens using pixcir ICs
    Input: mpu3050 - configure the sampling method
    Input: mpu3050 - ensure we enable interrupts
    Input: mpu3050 - add of_match table for device-tree probing
    Input: sentelic - document the latest hardware
    ...

    Fix up fairly trivial conflicts (device tree matching conflicting with
    some independent cleanups) in drivers/input/keyboard/samsung-keypad.c

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (68 commits)
    hid-input/battery: add FEATURE quirk
    hid-input/battery: remove battery_val
    hid-input/battery: power-supply type really *is* a battery
    hid-input/battery: make the battery setup common for INPUTs and FEATUREs
    hid-input/battery: deal with both FEATURE and INPUT report batteries
    hid-input/battery: add quirks for battery
    hid-input/battery: remove apparently redundant kmalloc
    hid-input: add support for HID devices reporting Battery Strength
    HID: hid-multitouch: add support 9 new Xiroku devices
    HID: multitouch: add support for 3M 32"
    HID: multitouch: add support of Atmel multitouch panels
    HID: usbhid: defer LED setting to a workqueue
    HID: usbhid: hid-core: submit queued urbs before suspend
    HID: usbhid: remove LED_ON
    HID: emsff: use symbolic name instead of hardcoded PID constant
    HID: Enable HID_QUIRK_MULTI_INPUT for Trio Linker Plus II
    HID: Kconfig: fix syntax
    HID: introduce proper dependency of HID_BATTERY on POWER_SUPPLY
    HID: multitouch: support PixArt optical touch screen
    HID: make parser more verbose about parsing errors by default
    ...

    Fix up rename/delete conflict in drivers/hid/hid-hyperv.c (removed in
    staging, moved in this branch) and similarly for the rules for same file
    in drivers/staging/hv/{Kconfig,Makefile}.

    Linus Torvalds
     
  • SCSI updates for post 3.2 merge window

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (67 commits)
    [SCSI] lpfc 8.3.28: Update driver version to 8.3.28
    [SCSI] lpfc 8.3.28: Add Loopback support for SLI4 adapters
    [SCSI] lpfc 8.3.28: Critical Miscellaneous fixes
    [SCSI] Lpfc 8.3.28: FC and SCSI Discovery Fixes
    [SCSI] lpfc 8.3.28: Add support for ABTS failure handling
    [SCSI] lpfc 8.3.28: SLI fixes and added SLI4 support
    [SCSI] lpfc 8.3.28: Miscellaneous fixes in sysfs and mgmt interfaces
    [SCSI] mpt2sas: Removed redundant calling of _scsih_probe_devices() from _scsih_probe
    [SCSI] mac_scsi: Remove obsolete IRQ_FLG_* users
    [SCSI] qla4xxx: Update driver version to 5.02.00-k10
    [SCSI] qla4xxx: check for FW alive before calling chip_reset
    [SCSI] qla4xxx: Fix qla4xxx_dump_buffer to dump buffer correctly
    [SCSI] qla4xxx: Fix the IDC locking mechanism
    [SCSI] qla4xxx: Wait for disable_acb before doing set_acb
    [SCSI] qla4xxx: Don't recover adapter if device state is FAILED
    [SCSI] qla4xxx: fix call trace on rmmod with ql4xdontresethba=1
    [SCSI] qla4xxx: Fix CPU lockups when ql4xdontresethba set
    [SCSI] qla4xxx: Perform context resets in case of context failures.
    [SCSI] iscsi class: export pid of process that created
    [SCSI] mpt2sas: Remove unused duplicate diag_buffer_enable param
    ...

    Linus Torvalds
     
  • * git://www.linux-watchdog.org/linux-watchdog:
    watchdog: omap_wdt.c: fix the WDIOC_GETBOOTSTATUS ioctl if not implemented.
    watchdog: new driver for VIA chipsets
    watchdog: ath79_wdt: flush register writes
    drivers/watchdog/lantiq_wdt.c: drop iounmap for devm_ allocated data
    watchdog: documentation: describe nowayout in coversion-guide
    watchdog: documentation: update index file
    watchdog: Convert wm831x driver to devm_kzalloc()
    watchdog: add nowayout helpers to Watchdog Timer Driver Kernel API
    watchdog: convert drivers/watchdog/* to use module_platform_driver()
    watchdog: Use DEFINE_SPINLOCK() for static spinlocks
    watchdog: Convert Wolfson drivers to module_platform_driver

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (40 commits)
    regulator: set constraints.apply_uV to 0 in of_get_fixed_voltage_config
    regulator: max8925: fix enabled/disabled judgement mistake
    regulator: add regulator_bulk_force_disable function
    regulator: pass regulator_register of_node in fixed voltage driver
    regulator: add regulator_force_disable() definition for !CONFIG_REGULATOR
    regulator: Enable supply regulator if child rail is enabled.
    regulator: mc13892: Convert to devm_kzalloc()
    regulator: mc13783: Convert to devm_kzalloc()
    regulator: Fix checking return value of create_regulator
    regulator: Fix the error handling if create_regulator fails
    regulator: Export regulator_is_supported_voltage()
    regulator: mc13892: add device tree probe support
    regulator: mc13892: remove the unnecessary prefix from regulator name
    regulator: Convert wm831x regulator drivers to devm_kzalloc()
    regulator: da9052: Staticize non-exported symbols
    regulator: Replace kzalloc with devm_kzalloc and if-else with a switch-case for da9052-regulator
    regulator: Update da9052-regulator for DT changes
    regulator: DA9052/53 Regulator support
    regulator: pass device_node to of_get_regulator_init_data()
    regulator: If a single voltage is set with device tree then set apply_uV
    ...

    Linus Torvalds
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (31 commits)
    pinctrl: remove unnecessary max pin number
    pinctrl: correct a offset while enumerating pins
    pinctrl: some typo fixes
    pinctrl: rename U300 and SIRF pin controllers
    pinctrl: pass name instead of device to pin_config_*
    pinctrl: add "struct seq_file;" to pinconf.h
    pinctrl: conjure names for unnamed pins
    pinctrl: add a group-specific hog macro
    pinctrl: don't create a device for each pin controller
    arm/u300: don't use PINMUX_MAP_PRIMARY*
    pinctrl: implement PINMUX_MAP_SYS_HOG
    pinctrl: add a pin config interface
    pinctrl/coh901: driver to request its pins
    pinctrl: u300-pinmux: register proper GPIO ranges
    pinctrl: move the U300 GPIO driver to pinctrl
    ARM: u300: localize GPIO assignments
    pinctrl: make it possible to add multiple maps
    pinctrl: make a copy of pinmux map
    pinctrl: GPIO direction support for muxing
    pinctrl: print pin range in GPIO range debugs
    ...

    Linus Torvalds
     
  • * 'upstream-linus' of git://github.com/jgarzik/libata-dev:
    ahci: support the STA2X11 I/O Hub
    pata_bf54x: fix BMIDE status register emulation
    ata: add ata port hibernate callbacks
    ata: update ata port's runtime status during system resume
    [SCSI] runtime resume parent for child's system-resume
    ahci: platform support for suspend/resume
    libata-core: kill duplicate statement in ata_do_set_mode()
    pata_of_platform: remove direct dependency on OF_IRQ
    SATA/PATA: convert drivers/ata/* to use module_platform_driver()
    pata_cs5536: forward port changes from cs5536
    libata-sff: use ATAPI_{COD|IO}
    ata: add ata port runtime PM callbacks
    ata: add ata port system PM callbacks
    [SCSI] sd: check runtime PM status in sd_shutdown
    [SCSI] check runtime PM status in system PM
    [SCSI] add flag to skip the runtime PM calls on the host
    ata: make ata port as parent device of scsi host
    ahci: start engine only during soft/hard resets

    Linus Torvalds
     
  • * 'stable/for-linus-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: (37 commits)
    xen/pciback: Expand the warning message to include domain id.
    xen/pciback: Fix "device has been assigned to X domain!" warning
    xen/pciback: Move the PCI_DEV_FLAGS_ASSIGNED ops to the "[un|]bind"
    xen/xenbus: don't reimplement kvasprintf via a fixed size buffer
    xenbus: maximum buffer size is XENSTORE_PAYLOAD_MAX
    xen/xenbus: Reject replies with payload > XENSTORE_PAYLOAD_MAX.
    Xen: consolidate and simplify struct xenbus_driver instantiation
    xen-gntalloc: introduce missing kfree
    xen/xenbus: Fix compile error - missing header for xen_initial_domain()
    xen/netback: Enable netback on HVM guests
    xen/grant-table: Support mappings required by blkback
    xenbus: Use grant-table wrapper functions
    xenbus: Support HVM backends
    xen/xenbus-frontend: Fix compile error with randconfig
    xen/xenbus-frontend: Make error message more clear
    xen/privcmd: Remove unused support for arch specific privcmp mmap
    xen: Add xenbus_backend device
    xen: Add xenbus device driver
    xen: Add privcmd device driver
    xen/gntalloc: fix reference counts on multi-page mappings
    ...

    Linus Torvalds
     
  • * 'kvm-updates/3.3' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (74 commits)
    KVM: PPC: Whitespace fix for kvm.h
    KVM: Fix whitespace in kvm_para.h
    KVM: PPC: annotate kvm_rma_init as __init
    KVM: x86 emulator: implement RDPMC (0F 33)
    KVM: x86 emulator: fix RDPMC privilege check
    KVM: Expose the architectural performance monitoring CPUID leaf
    KVM: VMX: Intercept RDPMC
    KVM: SVM: Intercept RDPMC
    KVM: Add generic RDPMC support
    KVM: Expose a version 2 architectural PMU to a guests
    KVM: Expose kvm_lapic_local_deliver()
    KVM: x86 emulator: Use opcode::execute for Group 9 instruction
    KVM: x86 emulator: Use opcode::execute for Group 4/5 instructions
    KVM: x86 emulator: Use opcode::execute for Group 1A instruction
    KVM: ensure that debugfs entries have been created
    KVM: drop bsp_vcpu pointer from kvm struct
    KVM: x86: Consolidate PIT legacy test
    KVM: x86: Do not rely on implicit inclusions
    KVM: Make KVM_INTEL depend on CPU_SUP_INTEL
    KVM: Use memdup_user instead of kmalloc/copy_from_user
    ...

    Linus Torvalds