03 Nov, 2011

36 commits

  • include/linux/sem.h contains several structures that are only used within
    ipc/sem.c.

    The patch moves them into ipc/sem.c - there is no need to expose the
    structures to the whole kernel.

    No functional changes, only whitespace cleanups and 80-char per line
    fixes.

    Signed-off-by: Manfred Spraul
    Acked-by: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • semtimedop() does not handle spurious wakeups: it returns -EINTR to user
    space. Most other schedule() users just loop and do not return to user
    space. The patch adds such a loop to semtimedop().
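
    The shape of such a loop, as a hedged sketch rather than the literal
    ipc/sem.c diff (queue, timeout and jiffies_left follow the naming of the
    surrounding semtimedop() code):

        /* sketch: only return once there is a real result, a signal,
         * or an expired timeout - a bare wakeup just loops */
        for (;;) {
                __set_current_state(TASK_INTERRUPTIBLE);
                if (queue.status != -EINTR)     /* waker posted a result */
                        break;
                if (signal_pending(current))    /* genuine interruption */
                        break;
                if (timeout && !jiffies_left)   /* timer really expired */
                        break;
                jiffies_left = schedule_timeout(jiffies_left);
        }
        __set_current_state(TASK_RUNNING);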

    Signed-off-by: Manfred Spraul
    Reported-by: Peter Zijlstra
    Acked-by: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • sys_semtimedop() may return -EIDRM although the semaphore operation
    completed successfully:

    thread 1 (semop, semctl):              thread 2 (semtimedop):

                                           semtimedop(), sleeps
    semop():
    * acquires sem_lock()
                                           woken up due to timeout,
                                           sem_lock() loops
    * notices that thread 2 could be
      completed
    * performs the operations that
      thread 2 is sleeping on
    * marks the semaphore operation
      as IN_WAKEUP
    * drops sem_lock(), does wakeup,
      sets return code to 0
                                           thread delayed due to
                                           interrupt, whatever
    * returns to user space
                                           thread still delayed
    semctl(IPC_RMID):
    * acquires sem_lock()
    * ipc_rmid(), ipcp->deleted = 1
    * drops sem_lock()
                                           thread finally continues -
                                           but sem_lock() now fails
                                           due to ipcp->deleted == 1
                                           * returns -EIDRM instead of 0

    The fix is trivial: Always use the return code in queue.status.

    In the real world, the race probably doesn't matter:
    if the semaphore array is destroyed, the app is probably not interested
    in whether the last operation succeeded or was already cancelled.
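
    A hedged sketch of that shape (queue.status and IN_WAKEUP follow
    ipc/sem.c naming, but this is not the literal diff):

        /* the waker publishes IN_WAKEUP first and the real result last,
         * so wait out the transient state, then trust the result */
        while ((error = queue.status) == IN_WAKEUP)
                cpu_relax();
        if (error != -EINTR)
                goto out_free;          /* completed: report it, even if
                                         * the array was removed since */

        sma = sem_lock(ns, semid);
        if (IS_ERR(sma)) {
                error = -EIDRM;         /* no result and the array is gone */
                goto out_free;
        }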

    Signed-off-by: Manfred Spraul
    Cc: Thomas Gleixner
    Cc: Mike Galbraith
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • It's often convenient to be able to release resource from IRQ context.
    Make ida_simple_*() use irqsave/restore spin ops so that they are IRQ
    safe.
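
    The pattern, sketched (simple_ida_lock is the lib/idr.c lock of this era;
    illustrative, not the exact diff):

        unsigned long flags;

        spin_lock_irqsave(&simple_ida_lock, flags);     /* was: spin_lock() */
        ida_remove(ida, id);                            /* release the id */
        spin_unlock_irqrestore(&simple_ida_lock, flags);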

    Signed-off-by: Tejun Heo
    Acked-by: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • fd* files are restricted to the task's owner, and other users may not get
    direct access to them. But one may open any of these files and then run a
    setuid program, keeping the file descriptors open. As there are permission
    checks on open(), but not on readdir() and read(), operations on the kept
    file descriptors are not checked, which makes it possible to violate the
    procfs permission model.

    Reading fdinfo/* may disclose the current fds' positions and flags, and
    reading the directory contents of fdinfo/ and fd/ may disclose the number
    of files the target task has open. This information is not sensitive per
    se, but it can reveal some private information (like the length of a
    password stored in a file) under certain conditions.

    Use the existing (un)lock_trace functions to check ptrace_may_access(),
    but return EACCES instead of its EPERM return code, to be consistent with
    the existing proc_pid_follow_link()/proc_pid_readlink() return code. If
    they differed, an attacker could guess which fds exist by analyzing the
    stat() return code. Patched handlers: stat() for fd/*, stat() and read()
    for fdinfo/*, readdir() and lookup() for fd/ and fdinfo/.
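
    A hedged sketch of the check (the function name is illustrative, not the
    literal fs/proc/base.c patch):

        static int check_fd_access(struct inode *inode)
        {
                struct task_struct *task = get_proc_task(inode);
                int allowed = 0;

                if (task) {
                        allowed = ptrace_may_access(task, PTRACE_MODE_READ);
                        put_task_struct(task);
                }
                /* EACCES, not EPERM, so stat() can't be used as an oracle */
                return allowed ? 0 : -EACCES;
        }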

    Signed-off-by: Vasiliy Kulikov
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • On reading sysctl dirs we should return -EISDIR instead of -EINVAL.
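
    The idea in sketch form (hedged; in fs/proc/proc_sysctl.c terms, a table
    entry without a proc_handler is a directory):

        /* in the read/write handler */
        if (!table->proc_handler)
                return -EISDIR;         /* was: -EINVAL */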

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • {get,put}_mems_allowed() exist so that general kernel code may locklessly
    access a task's set of allowable nodes without having the chance that a
    concurrent write will cause the nodemask to be empty on configurations
    where MAX_NUMNODES > BITS_PER_LONG.

    This could incur a significant delay, however, especially in low memory
    conditions because the page allocator is blocking and reclaim requires
    get_mems_allowed() itself. It is not atypical to see writes to
    cpuset.mems take over 2 seconds to complete, for example. In low memory
    conditions, this is problematic because it's one of the most important
    times to change cpuset.mems in the first place!

    The only way a task's set of allowable nodes may change is through
    cpusets, by writing to cpuset.mems or by attaching the task to a different
    cpuset. The writer currently sets all the new nodes, waits until generic
    code is no longer reading the nodemask with get_mems_allowed(), and only
    then clears all the old nodes. This prevents the possibility that a
    reader will see an empty nodemask while the writer is storing a new one.

    If at least one node remains unchanged, though, it's possible to simply
    set all the new nodes and then clear all the old nodes without waiting.
    Changing a task's nodemask is protected by cgroup_mutex, so two threads
    can never change the same task's nodemask at the same time; each store is
    therefore complete before another thread changes the mask and determines
    whether a node remains set or not.
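
    A hedged sketch of the fast path this enables (the nodemask API is real,
    the control flow is illustrative):

        nodemask_t old = tsk->mems_allowed;

        if (nodes_intersects(old, *newmems)) {
                /* a shared node means no reader can see an empty mask */
                nodes_or(tsk->mems_allowed, old, *newmems); /* add new */
                tsk->mems_allowed = *newmems;               /* drop old */
        } else {
                /* disjoint masks: keep the slow path that waits out
                 * get_mems_allowed() readers */
        }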

    Signed-off-by: David Rientjes
    Cc: Miao Xie
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • warning: symbol 'swap_cgroup_ctrl' was not declared. Should it be static?
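
    The fix is the usual one for this sparse warning; as a sketch against the
    mm/page_cgroup.c naming (hedged):

        /* internal linkage: the array is only used within this file */
        static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];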

    Signed-off-by: H Hartley Sweeten
    Cc: Paul Menage
    Cc: Li Zefan
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • Various code in memcontrol.c calls this_cpu_read() in calculations that
    combine two different percpu variables, or does an open-coded
    read-modify-write on a single percpu variable.

    Disable preemption throughout these operations so that the reads and
    writes go to the correct places.
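
    An illustrative sketch of the pattern (the counters are hypothetical, not
    the memcontrol.c statistics):

        #include <linux/percpu.h>

        static DEFINE_PER_CPU(long, nr_charged);
        static DEFINE_PER_CPU(long, nr_uncharged);

        static long charged_delta(void)
        {
                long delta;

                preempt_disable();      /* pin both reads to one CPU */
                delta = __this_cpu_read(nr_charged) -
                        __this_cpu_read(nr_uncharged);
                preempt_enable();
                return delta;
        }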

    [hannes@cmpxchg.org: added this_cpu to __this_cpu conversion]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Steven Rostedt
    Cc: Greg Thelen
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • There is a potential race between a thread charging a page and another
    thread putting it back to the LRU list:

    charge:                        putback:
    SetPageCgroupUsed              SetPageLRU
    PageLRU && add to memcg LRU    PageCgroupUsed && add to memcg LRU

    The order of setting one flag and checking the other is crucial, otherwise
    the charge may observe !PageLRU while the putback observes !PageCgroupUsed
    and the page is not linked to the memcg LRU at all.

    Global memory pressure may fix this by trying to isolate and put back the
    page for reclaim, where that putback would link it to the memcg LRU again.
    Without that, the memory cgroup is undeletable because of a charge whose
    physical page cannot be found and moved out.
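
    The required ordering, sketched (hedged; the helper name is hypothetical
    and the real fix's barrier placement may differ):

        /* charge side */
        SetPageCgroupUsed(pc);
        smp_mb();                         /* set own flag before testing */
        if (PageLRU(page))
                link_to_memcg_lru(page);  /* hypothetical helper */

        /* putback side */
        SetPageLRU(page);
        smp_mb();
        if (PageCgroupUsed(pc))
                link_to_memcg_lru(page);

    With each side setting its own flag before testing the other's, at least
    one of the two paths observes both flags and links the page.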

    Signed-off-by: Johannes Weiner
    Cc: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reclaim decides to skip scanning an active list when the corresponding
    inactive list is already above a certain relative size, in order to leave
    the assumed working set alone while there are still enough reclaim
    candidates around.

    The memcg implementation of comparing those lists instead reports whether
    the whole memcg is low on the requested type of inactive pages,
    considering all nodes and zones.

    This can lead to an oversized active list not being scanned because of the
    state of the other lists in the memcg, as well as an active list being
    scanned while its corresponding inactive list has enough pages.

    Not only is this wrong, it's also a scalability hazard, because the global
    memory state over all nodes and zones has to be gathered for each memcg
    and zone scanned.

    Make these calculations purely based on the size of the two LRU lists
    that are actually affected by the outcome of the decision.
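
    The decision then needs nothing but the two list sizes; a hedged sketch
    of its shape (not the literal vmscan.c code):

        /* scan the active list only once the inactive list has shrunk
         * below active / inactive_ratio */
        static bool inactive_list_is_low(unsigned long active,
                                         unsigned long inactive,
                                         unsigned int inactive_ratio)
        {
                return inactive * inactive_ratio < active;
        }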

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Reviewed-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If somebody is touching data too early, it might be easier to diagnose a
    problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to
    understand why mem_cgroup_per_zone is uninitialized or only partly
    initialized.

    Signed-off-by: Igor Mammedov
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Igor Mammedov
     
  • Before calling schedule_timeout(), the task state should be changed.
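
    The required idiom, as a generic sketch:

        /* without this, the task is still TASK_RUNNING and
         * schedule_timeout() returns immediately */
        set_current_state(TASK_INTERRUPTIBLE);
        schedule_timeout(HZ / 10);      /* now actually sleeps */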

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses
    "struct mem_cgroup *memcg". Rename all mem variables to memcg in the
    source file.

    Signed-off-by: Raghavendra K T
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra K T
     
  • When the cgroup base was allocated with kmalloc, it was necessary to
    annotate the variable with kmemleak_not_leak(). But it has recently been
    changed to be allocated with alloc_page() (which skips kmemleak checks),
    and that now causes a warning on boot up.

    I was triggering this output:

    allocated 8388608 bytes of page_cgroup
    please try 'cgroup_disable=memory' option if you don't want memory cgroups
    kmemleak: Trying to color unknown object at 0xf5840000 as Grey
    Pid: 0, comm: swapper Not tainted 3.0.0-test #12
    Call Trace:
    [] ? printk+0x1d/0x1f
    [] paint_ptr+0x4f/0x78
    [] kmemleak_not_leak+0x58/0x7d
    [] ? __rcu_read_unlock+0x9/0x7d
    [] kmemleak_init+0x19d/0x1e9
    [] start_kernel+0x346/0x3ec
    [] ? loglevel+0x18/0x18
    [] i386_start_kernel+0xaa/0xb0

    After a bit of debugging I tracked the object 0xf5840000 (and others) down
    to the cgroup code. With the change from allocating the base with kmalloc
    to alloc_page(), the base no longer goes through kmemleak_alloc(), which
    would add the pointer to the object_tree_root; kmemleak_not_leak(),
    however, still adds it to the crt_early_log[] table. On kmemleak_init(),
    the entry is then found in the early_log[] but not in the
    object_tree_root, and this error message is displayed.

    If alloc_page() fails, the code falls back to vmalloc(), which still goes
    through kmemleak_alloc(), so the kmemleak_not_leak() call is still needed
    there. The solution is to call kmemleak_alloc() directly when
    alloc_page() succeeds.
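
    A sketch close to the described solution (the helper naming follows
    mm/page_cgroup.c of this era, but treat it as illustrative):

        static void *alloc_page_cgroup_sketch(size_t size, int nid)
        {
                gfp_t flags = GFP_KERNEL | __GFP_NOWARN;
                void *addr = alloc_pages_exact_nid(nid, size, flags);

                if (addr) {
                        /* the page allocator bypasses kmemleak, so
                         * register the object by hand */
                        kmemleak_alloc(addr, size, 1, flags);
                        return addr;
                }
                /* the vmalloc fallback is already tracked by kmemleak */
                return vzalloc_node(size, nid);
        }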

    Reviewed-by: Michal Hocko
    Signed-off-by: Steven Rostedt
    Acked-by: Catalin Marinas
    Signed-off-by: Jonathan Nieder
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • If a task has exited to the point it has called cgroup_exit() already,
    then we can't migrate it to another cgroup anymore.

    This can happen when we are attaching a task to a new cgroup between the
    call to ->can_attach_task() on subsystems and the migration that is
    eventually tried in cgroup_task_migrate().

    In this case cgroup_task_migrate() returns -ESRCH and we don't want to
    attach the task to the subsystems because the attachment to the new cgroup
    itself failed.

    Fix this by only calling ->attach_task() on the subsystems if the cgroup
    migration succeeded.
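
    In sketch form (cgroup internals of this era, hedged):

        retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
        if (!retval) {
                /* only a successful migration notifies the subsystems */
                for_each_subsys(root, ss) {
                        if (ss->attach_task)
                                ss->attach_task(cgrp, tsk);
                }
        }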

    Reported-by: Oleg Nesterov
    Signed-off-by: Ben Blum
    Acked-by: Paul Menage
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Fix unstable tasklist locking in cgroup_attach_proc.

    According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is
    not sufficient to guarantee the tasklist is stable w.r.t. de_thread and
    exit. Taking tasklist_lock for reading, instead of rcu_read_lock, ensures
    proper exclusion.
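
    The locking change, sketched:

        read_lock(&tasklist_lock);      /* was: rcu_read_lock() */
        /* ... walk the leader's thread group; de_thread() and exit
         * are now excluded ... */
        read_unlock(&tasklist_lock);    /* was: rcu_read_unlock() */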

    Signed-off-by: Ben Blum
    Acked-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: Frederic Weisbecker
    Cc: "Paul E. McKenney"
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Clement Lecigne reports a filesystem which causes a kernel oops in
    hfs_find_init() trying to dereference sb->ext_tree which is NULL.

    This proves to be because the filesystem has a corrupted MDB extent
    record, where the extents file does not fit into the first three extents
    in the file record (the first blocks).

    In hfs_get_block() when looking up the blocks for the extent file
    (HFS_EXT_CNID), it fails the first blocks special case, and falls
    through to the extent code (which ultimately calls hfs_find_init())
    which is in the process of being initialised.

    HFS avoids this scenario by always having the extents B-tree fit into the
    first blocks (the extents B-tree can't have overflow extents).

    The fix is to check at mount time that the B-tree fits into the first
    blocks, i.e. fail if HFS_I(inode)->alloc_blocks >=
    HFS_I(inode)->first_blocks.
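
    Sketched as a mount-time check (hedged; the field names are from fs/hfs,
    the exact placement and message may differ):

        /* the extents B-tree cannot describe itself through overflow
         * extents, so it must fit within the MDB's first extents */
        if (HFS_I(tree->inode)->alloc_blocks >=
            HFS_I(tree->inode)->first_blocks) {
                printk(KERN_ERR "hfs: B-tree exceeds first extents\n");
                goto fail;
        }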

    Note, the existing commit 47f365eb57573 ("hfs: fix oops on mount with
    corrupted btree extent records") becomes subsumed into this as a special
    case, but only for the extents B-tree (HFS_EXT_CNID); it is perfectly
    acceptable for the catalog B-tree file to grow beyond three extents,
    with the remaining extent descriptors in the extents overflow file.

    This fixes CVE-2011-2203

    Reported-by: Clement LECIGNE
    Signed-off-by: Phillip Lougher
    Cc: Jeff Mahoney
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phillip Lougher
     
  • Use mpage_readpages() instead of multiple calls to isofs_readpage() to
    reduce CPU utilization and improve read performance.
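
    The added hook, sketched (hedged, though such conversions usually look
    exactly like this):

        static int isofs_readpages(struct file *file,
                                   struct address_space *mapping,
                                   struct list_head *pages,
                                   unsigned nr_pages)
        {
                /* batch the block lookups instead of calling
                 * isofs_readpage() once per page */
                return mpage_readpages(mapping, pages, nr_pages,
                                       isofs_get_block);
        }

    It is wired up through the .readpages member of the isofs
    address_space_operations.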

    Signed-off-by: Namjae Jeon
    Cc: Al Viro
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namjae Jeon
     
  • One can get this information from minix/inode.c, but adding the
    explanations at the definition sites is more appropriate.

    Signed-off-by: Sami Kerola
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sami Kerola
     
  • The driver is added using platform_driver_probe(), so the callbacks can be
    discarded more aggressively.
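
    Why that works: platform_driver_probe() binds once at registration and
    never again, so the probe routine can live in init memory. A generic,
    hedged sketch (driver names hypothetical):

        static int __init foo_rtc_probe(struct platform_device *pdev)
        {
                return 0;       /* one-shot bind; freed after init */
        }

        static struct platform_driver foo_rtc_driver = {
                .driver = { .name = "foo-rtc" },  /* no .probe here */
        };

        static int __init foo_rtc_init(void)
        {
                return platform_driver_probe(&foo_rtc_driver,
                                             foo_rtc_probe);
        }
        module_init(foo_rtc_init);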

    Signed-off-by: Uwe Kleine-König
    Cc: Alessandro Zummo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • Add initial support for the microchip mcp7941x series of real time clocks.

    The mcp7941x series is generally compatible with the ds1307 and ds1337 RTC
    devices from Dallas Semiconductor. Minor differences include a backup
    battery enable bit and the polarity of the oscillator enable bit.

    Signed-off-by: David Anders
    Cc: Alessandro Zummo
    Reviewed-by: Wolfram Sang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Anders
     
  • This is the one use of an ida that doesn't retry on receiving -EAGAIN.
    I'm assuming doing so will cause no harm and may help on rare occasions.
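
    The retry idiom being adopted, sketched (the ida name is illustrative):

        int id, err;

        do {
                if (!ida_pre_get(&rtc_ida, GFP_KERNEL))
                        return -ENOMEM;
                err = ida_get_new(&rtc_ida, &id);  /* may be -EAGAIN */
        } while (err == -EAGAIN);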

    Signed-off-by: Jonathan Cameron
    Cc: Alessandro Zummo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Cameron
     
  • When a cramfs ramdisk padded with 512 bytes is given to the kernel, the
    current identify_ramdisk_image function fails to identify it.

    Tested with a padded cramfs image on an ARM based board.
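
    The idea, sketched in init/do_mounts_rd.c style (hedged, not the literal
    diff):

        /* a padded image has its superblock 512 bytes further in */
        sys_lseek(fd, start_block * BLOCK_SIZE + 512, 0);
        sys_read(fd, buf, size);
        if (cramfsb->magic == CRAMFS_MAGIC) {
                printk(KERN_NOTICE
                       "RAMDISK: cramfs filesystem found at block %d\n",
                       start_block);
                goto done;
        }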

    Signed-off-by: Neil Armstrong
    Cc: Namhyung Kim
    Cc: Davidlohr Bueso
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Armstrong
     
  • Since ramfs is hard-selected to "y", the module leftovers make no sense.

    Signed-off-by: Richard Weinberger
    Reviewed-by: WANG Cong
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • The case of address space randomization being disabled at runtime through
    the randomize_va_space sysctl is not treated properly in
    load_elf_binary(), resulting in SIGKILL coming at exec() time for certain
    PIE-linked binaries if the randomization has been disabled at runtime
    prior to calling exec().

    Handle the randomize_va_space == 0 case the same way as if we were not
    supporting .text randomization at all.
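
    A hedged sketch of the idea inside load_elf_binary() (PF_RANDOMIZE is
    only set when randomization is in effect; the actual diff may differ):

        /* for ET_DYN (PIE) binaries: with randomization off, fall back
         * to the fixed ET_DYN base instead of a zero load_bias that
         * assumed a randomized mmap base */
        if (current->flags & PF_RANDOMIZE)
                load_bias = 0;
        else
                load_bias = ELF_PAGESTART(ELF_ET_DYN_BASE - vaddr);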

    Based on original patch by H.J. Lu and Josh Boyer.

    Signed-off-by: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Russell King
    Cc: H.J. Lu
    Tested-by: Josh Boyer
    Acked-by: Nicolas Pitre
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • This avoids duplicating the function in every arch gup_fast.

    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Up to this point the code assumed old refcounting for hugepages (pre-thp).
    This updates the code directly to the thp mapcount tail page refcounting.

    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Acked-by: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • s390 didn't return 0 in that case; if it's rolling back the *nr pointer,
    it should also return zero to avoid adding pages to the array at the
    wrong offset.

    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Up to this point the code assumed old refcounting for hugepages (pre-thp).
    This updates the code directly to the thp mapcount tail page refcounting.

    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • powerpc didn't return 0 in that case; if it's rolling back the *nr
    pointer, it should also return zero to avoid adding pages to the array at
    the wrong offset.

    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Acked-by: David Gibson
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Up to this point the code assumed old refcounting for hugepages (pre-thp).
    This updates the code directly to the thp mapcount tail page refcounting.

    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We have only taken "refs" pins on the head page, not "*nr" pins.

    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Acked-by: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • "page" may have changed to point to the next hugepage after the loop
    completed, The references have been taken on the head page, so the
    put_page must happen there too.

    This is a longstanding issue pre-thp inclusion.

    It's totally unclear why these page_cache_add_speculative and
    pte_val(pte) != pte_val(*ptep) checks are necessary across all the
    powerpc gup_fast code, when x86 doesn't need any of that: there's no way
    the page can be freed with irqs disabled, so we're guaranteed the
    atomic_inc will happen on a page with page_count > 0 (so the speculative
    check isn't needed).

    The pte check is also meaningless on x86: there's no need to roll back on
    x86 if the pte changed, because the pte can still change a CPU tick after
    the check succeeded, and it won't be rolled back in that case. The
    important thing is that we got a reference on a valid page that was
    mapped there a CPU tick ago. So, not knowing the software TLB refill code
    of ppc64 in great detail, I'm not removing the "speculative" page_count
    increase and the pte checks across all the code, but unless there's a
    strong reason for them they should be cleaned up later too.

    If a pte can change from huge to non-huge (as could happen with THP),
    passing a pte_t *ptep to gup_hugepte() would also require repeating the
    is_hugepd check in gup_hugepte(), but that shouldn't happen with
    hugetlbfs only, so I'm not altering that.

    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Acked-by: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This part of gup_fast doesn't seem capable of handling hugetlbfs ptes;
    those should be handled by gup_hugepd only, so these checks are
    superfluous.

    Plus, if this wasn't a noop it would have oopsed, because the insistence
    on using speculative refcounting would trigger a VM_BUG_ON if a tail
    page were encountered in page_cache_get_speculative().

    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Acked-by: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
    wasn't safe if the pfn ended up being a tail page of a transparent
    hugepage under splitting by __split_huge_page_refcount().

    He then found that the problem could also theoretically materialize with
    page_cache_get_speculative() during the speculative radix tree lookups
    that use get_page_unless_zero() in SMP, if the radix tree page is freed
    and reallocated and get_user_pages is called on it before
    page_cache_get_speculative has a chance to call get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).

    While debugging this s/_count/_mapcount/ change I also noticed that
    get_page is called by direct-io.c on pages returned by get_user_pages.
    That wasn't entirely safe because the two atomic increments in get_page
    weren't atomic as a pair. Other get_user_pages users, like a
    secondary-MMU page fault handler establishing shadow pagetables, would
    never call a superfluous get_page after get_user_pages returns. It's
    safer to make get_page universally safe for tail pages and to use
    get_page_foll() within follow_page (inside get_user_pages()).
    get_page_foll() is safe to do the refcounting for tail pages without
    taking any locks because it is run within PT-lock-protected critical
    sections (the PT lock for pte, and page_table_lock for pmd_trans_huge).

    The standard get_page() as invoked by direct-io will instead now take
    the compound_lock, but still only for tail pages. The direct-io paths
    are usually I/O bound, and the compound_lock is per THP and thus very
    fine-grained, so there's no risk of scalability issues with it. A simple
    direct-io benchmark with lockdep, prove-locking and spinlock debugging
    infrastructure all enabled shows identical performance and no overhead,
    so it's worth it. Ideally direct-io should stop calling get_page() on
    pages returned by get_user_pages(). The spinlock in get_page() is
    already optimized away for no-THP builds, but doing get_page() on tail
    pages returned by GUP is generally a rare operation and usually only
    run in I/O paths.

    This new refcounting on page_tail->_mapcount, in addition to avoiding new
    RCU critical sections, will also allow the working set estimation code to
    work without any further complexity associated with tail page
    refcounting under THP.
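
    A hedged sketch of the GUP-side accounting described above (close in
    spirit to the mm/internal.h helper of this era, not guaranteed verbatim):

        static inline void get_page_foll(struct page *page)
        {
                if (unlikely(PageTail(page))) {
                        /* safe without the compound_lock: the PT lock
                         * keeps __split_huge_page_refcount() away */
                        atomic_inc(&page->first_page->_count); /* pin head */
                        atomic_inc(&page->_mapcount);  /* tail ref; tail
                                                        * _count stays 0 */
                } else {
                        atomic_inc(&page->_count);
                }
        }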

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

02 Nov, 2011

4 commits

  • * 'next/soc' of git://git.linaro.org/people/arnd/arm-soc: (21 commits)
    MAINTAINERS: add ARM/FREESCALE IMX6 entry
    arm/imx: merge i.MX3 and i.MX6
    arm/imx6q: add suspend/resume support
    arm/imx6q: add device tree machine support
    arm/imx6q: add smp and cpu hotplug support
    arm/imx6q: add core drivers clock, gpc, mmdc and src
    arm/imx: add gic_handle_irq function
    arm/imx6q: add core definitions and low-level debug uart
    arm/imx6q: add device tree source
    ARM: highbank: add suspend support
    ARM: highbank: Add cpu hotplug support
    ARM: highbank: add SMP support
    MAINTAINERS: add Calxeda Highbank ARM platform
    ARM: add Highbank core platform support
    ARM: highbank: add devicetree source
    ARM: l2x0: add empty l2x0_of_init
    picoxcell: add a definition of VMALLOC_END
    picoxcell: remove custom ioremap implementation
    picoxcell: add the DTS for the PC7302 board
    picoxcell: add the DTS for pc3x2 and pc3x3 devices
    ...

    Fix up trivial conflicts in arch/arm/Kconfig, and some more header file
    conflicts in arch/arm/mach-omap2/board-generic.c (as per an earlier merge
    by Arnd).

    Linus Torvalds
     
  • * 'next/dt' of git://git.linaro.org/people/arnd/arm-soc:
    ARM: gic: use module.h instead of export.h
    ARM: gic: fix irq_alloc_descs handling for sparse irq
    ARM: gic: add OF based initialization
    ARM: gic: add irq_domain support
    irq: support domains with non-zero hwirq base
    of/irq: introduce of_irq_init
    ARM: at91: add at91sam9g20 and Calao USB A9G20 DT support
    ARM: at91: dt: at91sam9g45 family and board device tree files
    arm/mx5: add device tree support for imx51 babbage
    arm/mx5: add device tree support for imx53 boards
    ARM: msm: Add devicetree support for msm8660-surf
    msm_serial: Add devicetree support
    msm_serial: Use relative resources for iomem

    Fix up conflicts in arch/arm/mach-at91/{at91sam9260.c,at91sam9g45.c}

    Linus Torvalds
     
  • * 'next/cleanup2' of git://git.linaro.org/people/arnd/arm-soc: (31 commits)
    ARM: OMAP: Warn if omap_ioremap is called before SoC detection
    ARM: OMAP: Move set_globals initialization to happen in init_early
    ARM: OMAP: Map SRAM later on with ioremap_exec()
    ARM: OMAP: Remove calls to SRAM allocations for framebuffer
    ARM: OMAP: Avoid cpu_is_omapxxxx usage until map_io is done
    ARM: OMAP1: Use generic map_io, init_early and init_irq
    arm/dts: OMAP3+: Add mpu, dsp and iva nodes
    arm/dts: OMAP4: Add a main ocp entry bound to l3-noc driver
    ARM: OMAP2+: l3-noc: Add support for device-tree
    ARM: OMAP2+: board-generic: Add i2c static init
    ARM: OMAP2+: board-generic: Add DT support to generic board
    arm/dts: Add support for OMAP3 Beagle board
    arm/dts: Add initial device tree support for OMAP3 SoC
    arm/dts: Add support for OMAP4 SDP board
    arm/dts: Add support for OMAP4 PandaBoard
    arm/dts: Add initial device tree support for OMAP4 SoC
    ARM: OMAP: omap_device: Add a method to build an omap_device from a DT node
    ARM: OMAP: omap_device: Add omap_device_[alloc|delete] for DT integration
    of: Add helpers to get one string in multiple strings property
    ARM: OMAP2+: devices: Remove all omap_device_pm_latency structures
    ...

    Fix up trivial header file conflicts in arch/arm/mach-omap2/board-generic.c

    Linus Torvalds
     
  • * 'next/cross-platform' of git://git.linaro.org/people/arnd/arm-soc:
    arm/imx: use Kconfig choice for low-level debug UART selection
    ARM: realview: use Kconfig choice for debug UART selection
    ARM: plat-samsung: use Kconfig choice for debug UART selection
    ARM: versatile: convert logical CPU numbers to physical numbers
    ARM: ux500: convert logical CPU numbers to physical numbers
    ARM: shmobile: convert logical CPU numbers to physical numbers
    ARM: msm: convert logical CPU numbers to physical numbers
    ARM: exynos4: convert logical CPU numbers to physical numbers

    Fix up trivial conflict (config DEBUG_S3C_UART move/split vs addition of
    ARM_KPROBES_TEST option) in arch/arm/Kconfig.debug

    Linus Torvalds