15 Mar, 2013

1 commit

  • The vm_flags introduced in 6d7825b10dbe ("mm/fremap.c: fix oops on error
    path") is supposed to avoid a compiler warning about uninitialized
    vm_flags without changing the generated code.

    However I am concerned that this is going to be very brittle, and fail
    with some compiler versions. The failure could be either of:

    - compiler could actually load vma->vm_flags before checking for the
    !vma condition, thus reintroducing the oops

    - compiler could optimize out the !vma check, since the pointer just got
    dereferenced shortly before (so the compiler knows it can't be NULL!)

    I propose reversing this part of the change and initializing vm_flags to 0
    just to avoid the bogus uninitialized use warning.

    Signed-off-by: Michel Lespinasse
    Cc: Tommi Rantala
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
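
    A minimal userspace sketch of the approach proposed in this commit, not
    the kernel code itself: vm_flags starts at 0 purely to silence the
    warning, and the pointer is tested before any field is read. The names
    struct vma_sketch and lookup_flags() are hypothetical.

    ```c
    #include <stddef.h>

    /* Hypothetical stand-in for struct vm_area_struct. */
    struct vma_sketch {
        unsigned long vm_flags;
    };

    /*
     * Returns 0 and stores the flags on success, -1 if the lookup failed.
     * vm_flags is initialised to 0 only to keep the compiler quiet; the
     * dereference happens strictly after the NULL check.
     */
    static int lookup_flags(const struct vma_sketch *vma, unsigned long *out)
    {
        unsigned long vm_flags = 0;

        if (!vma)
            return -1;      /* error path: vma is never touched */

        vm_flags = vma->vm_flags;
        *out = vm_flags;
        return 0;
    }
    ```

    Because no load from vma precedes the test, the compiler has no grounds
    to assume the pointer is non-NULL and drop the check.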
     

14 Mar, 2013

2 commits

  • If find_vma() fails, sys_remap_file_pages() will dereference `vma', which
    contains NULL. Fix it by checking the pointer.

    (We could alternatively check for err==0, but this seems more direct)

    (The vm_flags change is to squish a bogus used-uninitialised warning
    without adding extra code).

    Reported-by: Tommi Rantala
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • remove_memory() calls walk_memory_range() with [start_pfn, end_pfn), where
    end_pfn is exclusive in this range. Therefore, end_pfn needs to be set to
    the next page of the end address.

    Signed-off-by: Toshi Kani
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Kamezawa Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
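
    The half-open convention above can be checked with a little arithmetic.
    This is a sketch under assumptions: 4 KiB pages, and the helper names
    pfn_down() and exclusive_end_pfn() are hypothetical, not taken from the
    patch.

    ```c
    #include <stdint.h>

    #define SKETCH_PAGE_SHIFT 12    /* assumption: 4 KiB pages */

    /* Frame number of the page containing addr. */
    static uint64_t pfn_down(uint64_t addr)
    {
        return addr >> SKETCH_PAGE_SHIFT;
    }

    /*
     * Exclusive end frame for `size` bytes starting at `start`: one past
     * the frame that holds the last byte, start + size - 1.
     */
    static uint64_t exclusive_end_pfn(uint64_t start, uint64_t size)
    {
        return pfn_down(start + size - 1) + 1;
    }
    ```

    With this, end_pfn - start_pfn is the number of frames walked, and the
    last frame of the region is included in [start_pfn, end_pfn).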
     

13 Mar, 2013

2 commits

  • In commit 887cbce0adea ("arch Kconfig: centralise ARCH_NO_VIRT_TO_BUS")
    I introduced the config symbol HAVE_VIRT_TO_BUS and selected that where
    needed. I am not sure what I was thinking. Instead, just directly
    select VIRT_TO_BUS where it is needed.

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to
    compat_process_vm_rw() shows that the compatibility code requires an
    explicit "access_ok()" check before calling
    compat_rw_copy_check_uvector(). The same difference seems to appear when
    we compare fs/read_write.c:do_readv_writev() to
    fs/compat.c:compat_do_readv_writev().

    This subtle difference between the compat and non-compat requirements
    should probably be debated, as it seems to be error-prone. In fact,
    there are two other sites that use this function in the Linux kernel,
    and they both seem to get it wrong:

    Now shifting our attention to fs/aio.c, we see that aio_setup_iocb()
    also ends up calling compat_rw_copy_check_uvector() through
    aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to
    be missing. Same situation for
    security/keys/compat.c:compat_keyctl_instantiate_key_iov().

    I propose that we add the access_ok() check directly into
    compat_rw_copy_check_uvector(), so callers don't have to worry about it,
    and it therefore makes the compat call code similar to its non-compat
    counterpart. Place the access_ok() check in the same location where
    copy_from_user() can trigger a -EFAULT error in the non-compat code, so
    the ABI behaviors are alike on both compat and non-compat.

    While we are here, fix compat_do_readv_writev() so it checks for
    compat_rw_copy_check_uvector() negative return values.

    And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error
    handling.

    Acked-by: Linus Torvalds
    Acked-by: Al Viro
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
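
    The idea of moving the range check into the helper, so that no caller
    can forget it, can be sketched in userspace as follows. Everything here
    is a hypothetical model: range_ok() only imitates the role of
    access_ok(), and check_uvector_sketch() is not the kernel function.

    ```c
    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical model of access_ok(): is [addr, addr + len) below limit? */
    static int range_ok(uintptr_t addr, size_t len, uintptr_t limit)
    {
        return len <= limit && addr <= limit - len;
    }

    /*
     * Validates the vector itself before any element would be read.
     * Returns the segment count on success or a negative errno, so callers
     * only need to test for a negative return value.
     */
    static long check_uvector_sketch(uintptr_t uvec, size_t nr_segs,
                                     size_t seg_size, uintptr_t limit)
    {
        if (nr_segs == 0)
            return 0;
        if (seg_size == 0 || nr_segs > SIZE_MAX / seg_size)
            return -EINVAL;     /* size computation would overflow */
        if (!range_ok(uvec, nr_segs * seg_size, limit))
            return -EFAULT;     /* same error a failed copy would give */
        return (long)nr_segs;
    }
    ```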
     

09 Mar, 2013

4 commits

  • Fix a warning from lockdep caused by calling cancel_work_sync() for an
    uninitialized struct work. This path was triggered by destroying a
    kmem-cache hierarchy via its root kmem-cache.

    cache ffff88003c072d80
    obj ffff88003b410000 cache ffff88003c072d80
    obj ffff88003b924000 cache ffff88003c20bd40
    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    Pid: 2825, comm: insmod Tainted: G O 3.9.0-rc1-next-20130307+ #611
    Call Trace:
    __lock_acquire+0x16a2/0x1cb0
    lock_acquire+0x8a/0x120
    flush_work+0x38/0x2a0
    __cancel_work_timer+0x89/0xf0
    cancel_work_sync+0xb/0x10
    kmem_cache_destroy_memcg_children+0x81/0xb0
    kmem_cache_destroy+0xf/0xe0
    init_module+0xcb/0x1000 [kmem_test]
    do_one_initcall+0x11a/0x170
    load_module+0x19b0/0x2320
    SyS_init_module+0xc6/0xf0
    system_call_fastpath+0x16/0x1b

    Example module to demonstrate:

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/mm.h>
    #include <linux/workqueue.h>

    int __init mod_init(void)
    {
        int size = 256;
        struct kmem_cache *cache;
        void *obj;
        struct page *page;

        cache = kmem_cache_create("kmem_cache_test", size, size, 0, NULL);
        if (!cache)
            return -ENOMEM;

        printk("cache %p\n", cache);

        obj = kmem_cache_alloc(cache, GFP_KERNEL);
        if (obj) {
            page = virt_to_head_page(obj);
            printk("obj %p cache %p\n", obj, page->slab_cache);
            kmem_cache_free(cache, obj);
        }

        flush_scheduled_work();

        obj = kmem_cache_alloc(cache, GFP_KERNEL);
        if (obj) {
            page = virt_to_head_page(obj);
            printk("obj %p cache %p\n", obj, page->slab_cache);
            kmem_cache_free(cache, obj);
        }

        kmem_cache_destroy(cache);

        return -EBUSY;
    }

    module_init(mod_init);
    MODULE_LICENSE("GPL");

    Signed-off-by: Konstantin Khlebnikov
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • A CONFIG_DISCONTIGMEM=y m68k config gave

    mm/ksm.c: In function `get_kpfn_nid':
    mm/ksm.c:492: error: implicit declaration of function `pfn_to_nid'

    linux/mmzone.h declares it for CONFIG_SPARSEMEM and CONFIG_FLATMEM, but
    expects the arch's asm/mmzone.h to declare it for CONFIG_DISCONTIGMEM
    (see arch/mips/include/asm/mmzone.h for example).

    Or perhaps it is only expected when CONFIG_NUMA=y: too much of a maze,
    and m68k got away without it so far, so fix the build in mm/ksm.c.

    Signed-off-by: Hugh Dickins
    Reported-by: Geert Uytterhoeven
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Currently, n_new is wrongly initialized: the start and end parameters
    are inverted. Let's fix it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Hillf Danton
    Cc: Sasha Levin
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • n->end is accessed in sp_insert(), so it should be updated before
    calling sp_insert(). This mistake may cause a kernel panic.

    Signed-off-by: Hillf Danton
    Signed-off-by: KOSAKI Motohiro
    Cc: Sasha Levin
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Jones
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
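
    A small sketch of why the ordering matters, assuming n describes a
    half-open range [start, end) that is split around a hole. The names
    struct range_sketch and split_around() are hypothetical, and the real
    sp_insert() is not modelled here.

    ```c
    /* Hypothetical half-open range [start, end). */
    struct range_sketch {
        unsigned long start;
        unsigned long end;
    };

    /*
     * Cut the hole [start, end) out of *n, which must strictly contain it.
     * *tail receives the remainder after the hole. n->end is shrunk here,
     * before the caller inserts *tail into any ordered structure, so that
     * an insertion routine comparing against n->end sees the final value.
     */
    static void split_around(struct range_sketch *n, unsigned long start,
                             unsigned long end, struct range_sketch *tail)
    {
        tail->start = end;      /* the tail begins where the hole ends */
        tail->end = n->end;     /* and runs to the old end of n */
        n->end = start;         /* shrink n before tail is inserted */
    }
    ```

    Swapping the two tail assignments, or inserting the tail while n still
    covers its old range, would leave two nodes that appear to overlap.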
     

04 Mar, 2013

1 commit

  • Pull more VFS bits from Al Viro:
    "Unfortunately, it looks like xattr series will have to wait until the
    next cycle ;-/

    This pile contains 9p cleanups and fixes (races in v9fs_fid_add()
    etc), fixup for nommu breakage in shmem.c, several cleanups and a bit
    more file_inode() work"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    constify path_get/path_put and fs_struct.c stuff
    fix nommu breakage in shmem.c
    cache the value of file_inode() in struct file
    9p: if v9fs_fid_lookup() gets to asking server, it'd better have hashed dentry
    9p: make sure ->lookup() adds fid to the right dentry
    9p: untangle ->lookup() a bit
    9p: double iput() in ->lookup() if d_materialise_unique() fails
    9p: v9fs_fid_add() can't fail now
    v9fs: get rid of v9fs_dentry
    9p: turn fid->dlist into hlist
    9p: don't bother with private lock in ->d_fsdata; dentry->d_lock will do just fine
    more file_inode() open-coded instances
    selinux: opened file can't have NULL or negative ->f_path.dentry

    (In the meantime, the hlist traversal macros have changed, so this
    required a semantic conflict fixup for the newly hlistified fid->dlist)

    Linus Torvalds
     

03 Mar, 2013

1 commit

  • Tim found:

    WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
    Hardware name: S2600CP
    sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
    smpboot: Booting Node 1, Processors #1
    Modules linked in:
    Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
    Call Trace:
    set_cpu_sibling_map+0x279/0x449
    start_secondary+0x11d/0x1e5

    Don Morris reproduced on a HP z620 workstation, and bisected it to
    commit e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock
    is ready")

    It turns out movable_map has some problems, and it breaks several things:

    1. numa_init is called several times, NOT just for SRAT, so
           nodes_clear(numa_nodes_parsed)
           memset(&numa_meminfo, 0, sizeof(numa_meminfo))
       cannot simply be removed. The sequence to consider is: numaq, srat,
       amd, dummy, and the fallback path must keep working.

    2. The patch simply splits acpi_numa_init into early_parse_srat.
       a. early_parse_srat is NOT called for ia64, so ia64 breaks.
       b. for (i = 0; i < MAX_LOCAL_APIC; i++)
              set_apicid_to_node(i, NUMA_NO_NODE)
          is still left in numa_init, so it just clears the result from
          early_parse_srat. It should be moved before that.
       c. It breaks ACPI_TABLE_OVERIDE, as the ACPI table scan is moved
          early, before the override from INITRD is settled.

    3. The patch TITLE is totally misleading: there is NO x86 in the title,
       but it changes critical x86 code. As a result the x86 people did not
       pay attention and find the problem early. Those patches really should
       be routed via tip/x86/mm.

    4. After that commit, the following ranges cannot use movable RAM:
       a. real_mode code... well, funny, legacy Node0 [0,1M) could be
          hot-removed?
       b. initrd... it will be freed after booting, so it could be on
          movable...
       c. crashkernel for kdump: it looks like we cannot put the kdump
          kernel above 4G anymore.
       d. init_mem_mapping: cannot put the page table high anymore.
       e. initmem_init: vmemmap cannot be high on the local node anymore.
          That is not good.

    If a node is hotpluggable, memory-related ranges like the page table and
    vmemmap could be on that node without problem and should be on that
    node.

    We have a workaround patch that could fix some problems, but some cannot
    be fixed.

    So just remove that offending commit and related ones including:

    f7210e6c4ac7 ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
    protect movablecore_map in memblock_overlaps_region().")

    01a178a94e8e ("acpi, memory-hotplug: support getting hotplug info from
    SRAT")

    27168d38fa20 ("acpi, memory-hotplug: extend movablemem_map ranges to
    the end of node")

    e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock is
    ready")

    fb06bc8e5f42 ("page_alloc: bootmem limit with movablecore_map")

    42f47e27e761 ("page_alloc: make movablemem_map have higher priority")

    6981ec31146c ("page_alloc: introduce zone_movable_limit[] to keep
    movable limit for nodes")

    34b71f1e04fc ("page_alloc: add movable_memmap kernel parameter")

    4d59a75125d5 ("x86: get pg_data_t's memory from other node")

    Later we should have patches that make sure the kernel puts the page
    table and vmemmap on local node RAM instead of pushing them down to
    node0. We also need to find a way to put other kernel-used RAM on local
    node RAM.

    Reported-by: Tim Gardner
    Reported-by: Don Morris
    Bisected-by: Don Morris
    Tested-by: Don Morris
    Signed-off-by: Yinghai Lu
    Cc: Tony Luck
    Cc: Thomas Renninger
    Cc: Tejun Heo
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

02 Mar, 2013

1 commit


01 Mar, 2013

2 commits

  • Pull writeback fixes from Wu Fengguang:
    "Two writeback fixes

    - fix negative (setpoint - dirty) in 32bit archs

    - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()"

    * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    Negative (setpoint-dirty) in bdi_position_ratio()
    vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them

    Linus Torvalds
     
  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

28 Feb, 2013

4 commits

  • I'm not sure why, but the hlist for-each-entry iterators were conceived
    differently from the list ones, which look like this:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
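
    To illustrate the three-argument form described above, here is a
    userspace sketch of an hlist-style iterator whose cursor is the entry
    pointer itself. It assumes GCC or Clang (for __typeof__); the names
    hnode, hhead, hentry_or_null and hsketch_for_each_entry are hypothetical
    and only modelled on the kernel's list.h.

    ```c
    #include <stddef.h>

    struct hnode { struct hnode *next; };
    struct hhead { struct hnode *first; };

    /* Map an embedded node pointer back to its container, or yield NULL. */
    #define hentry_or_null(ptr, type, member) \
        ((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

    /* Three-argument iterator: pos is both the cursor and the entry. */
    #define hsketch_for_each_entry(pos, head, member)                     \
        for (pos = hentry_or_null((head)->first,                          \
                                  __typeof__(*(pos)), member);            \
             pos;                                                         \
             pos = hentry_or_null((pos)->member.next,                     \
                                  __typeof__(*(pos)), member))

    struct item {
        int value;
        struct hnode link;
    };

    /* Example user: no separate node cursor has to be declared. */
    static int sum_items(struct hhead *head)
    {
        struct item *it;
        int sum = 0;

        hsketch_for_each_entry(it, head, link)
            sum += it->value;
        return sum;
    }
    ```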
     
  • Change it to CONFIG_HAVE_VIRT_TO_BUS and set it in all architectures
    that already provide virt_to_bus().

    Signed-off-by: Stephen Rothwell
    Reviewed-by: James Hogan
    Cc: Bjorn Helgaas
    Cc: H Hartley Sweeten
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "David S. Miller"
    Cc: Paul Mundt
    Cc: Vineet Gupta
    Cc: James Bottomley
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
    at a time. When munlocking THP pages (or the huge zero page), this
    resulted in taking the mm->page_table_lock 512 times in a row.

    We can do better by making use of the page_mask returned by
    follow_page_mask (for the huge zero page case), or the size of the page
    munlock_vma_page() operated on (for the true THP page case).

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
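
    The saving can be sketched by counting loop iterations. This is a
    userspace model under assumptions: 4 KiB base pages, page_mask equal to
    (pages per mapping - 1), and an increment formula written for this
    sketch rather than quoted from the patch. Both function names are
    hypothetical.

    ```c
    #define SK_PAGE_SHIFT 12UL      /* assumption: 4 KiB base pages */
    #define SK_PAGE_SIZE  (1UL << SK_PAGE_SHIFT)

    /*
     * Pages from addr to the end of the mapping that contains it.
     * page_mask is 0 for a small page and 511 for a 512-page huge page.
     */
    static unsigned long pages_to_advance(unsigned long addr,
                                          unsigned long page_mask)
    {
        return 1 + (~(addr >> SK_PAGE_SHIFT) & page_mask);
    }

    /* Number of loop iterations needed to cover [start, end). */
    static unsigned long count_steps(unsigned long start, unsigned long end,
                                     unsigned long page_mask)
    {
        unsigned long steps = 0;

        while (start < end) {
            start += pages_to_advance(start, page_mask) * SK_PAGE_SIZE;
            steps++;
        }
        return steps;
    }
    ```

    For one 2 MiB huge page the loop body runs once instead of 512 times,
    which is where the repeated lock acquisitions go away.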
     
  • The stack vma is designed to grow automatically (marked with VM_GROWSUP
    or VM_GROWSDOWN depending on architecture) when an access is made beyond
    the existing boundary. However, particularly if you have not limited
    your stack at all ("ulimit -s unlimited"), this can cause the stack to
    grow even if the access was really just one past *another* segment.

    And that's wrong, especially since we first grow the segment, but then
    immediately later enforce the stack guard page on the last page of the
    segment. So _despite_ first growing the stack segment as a result of
    the access, the kernel will then make the access cause a SIGSEGV anyway!

    So do the same logic as the guard page check does, and consider an
    access to within one page of the next segment to be a bad access, rather
    than growing the stack to abut the next segment.

    Reported-and-tested-by: Heiko Carstens
    Signed-off-by: Linus Torvalds

    Linus Torvalds
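
    The rule in the last paragraph can be sketched for an upward-growing
    stack. This is a simplified model, not the kernel's check: it assumes
    4 KiB pages, ignores whether the neighbouring segment is itself a
    stack, and may_grow_up() is a hypothetical name.

    ```c
    #include <stdbool.h>

    #define SK_GUARD_PAGE 4096UL    /* assumption: 4 KiB pages */

    /*
     * Decide whether the stack may grow to cover addr. Growth is refused
     * when the page after the accessed page already belongs to the next
     * segment, because the guard page could then not be honoured.
     */
    static bool may_grow_up(unsigned long addr, bool has_next,
                            unsigned long next_start)
    {
        unsigned long page = addr & ~(SK_GUARD_PAGE - 1);

        if (has_next && page + SK_GUARD_PAGE >= next_start)
            return false;   /* would abut the next segment: bad access */
        return true;
    }
    ```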
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

26 Feb, 2013

4 commits

  • This patch is a follow-up to the patch below:

    [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
    commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63

    Signed-off-by: Namjae Jeon
    Signed-off-by: Vivek Trivedi
    Acked-by: Steven Whitehouse
    Acked-by: Sage Weil
    Signed-off-by: Al Viro

    Namjae Jeon
     
  • Note that provided ->d_dname() reproduces what we used to get for
    those guys in e.g. /proc/self/maps; it might be a good idea to change
    that to something less ugly, but for now let's keep the existing
    user-visible behaviour

    Signed-off-by: Al Viro

    Al Viro
     
  • Pull user namespace and namespace infrastructure changes from Eric W Biederman:
    "This set of changes starts with a few small enhancements to the user
    namespace: reboot support, allowing more arbitrary mappings, and
    support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
    user namespace root.

    I do my best to document that if you care about limiting your
    unprivileged users that when you have the user namespace support
    enabled you will need to enable memory control groups.

    There is a minor bug fix to prevent overflowing the stack if someone
    creates way too many user namespaces.

    The bulk of the changes are a continuation of the kuid/kgid push down
    work through the filesystems. These changes make using uids and gids
    typesafe which ensures that these filesystems are safe to use when
    multiple user namespaces are in use. The filesystems converted for
    3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
    changes for these filesystems were a little more involved so I split
    the changes into smaller hopefully obviously correct changes.

    XFS is the only filesystem that remains. I was hoping I could get
    that in this release so that user namespace support would be enabled
    with an allyesconfig or an allmodconfig but it looks like the xfs
    changes need another couple of days before they are ready."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
    cifs: Enable building with user namespaces enabled.
    cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
    cifs: Convert struct cifs_sb_info to use kuids and kgids
    cifs: Modify struct smb_vol to use kuids and kgids
    cifs: Convert struct cifsFileInfo to use a kuid
    cifs: Convert struct cifs_fattr to use kuid and kgids
    cifs: Convert struct tcon_link to use a kuid.
    cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
    cifs: Convert from a kuid before printing current_fsuid
    cifs: Use kuids and kgids SID to uid/gid mapping
    cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
    cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
    cifs: Override unmappable incoming uids and gids
    nfsd: Enable building with user namespaces enabled.
    nfsd: Properly compare and initialize kuids and kgids
    nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
    nfsd: Modify nfsd4_cb_sec to use kuids and kgids
    nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
    nfsd: Convert nfsxdr to use kuids and kgids
    nfsd: Convert nfs3xdr to use kuids and kgids
    ...

    Linus Torvalds
     
  • Pull module update from Rusty Russell:
    "The sweeping change is to make add_taint() explicitly indicate whether
    to disable lockdep, but it's a mechanical change."

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    MODSIGN: Add option to not sign modules during modules_install
    MODSIGN: Add -s option to sign-file
    MODSIGN: Specify the hash algorithm on sign-file command line
    MODSIGN: Simplify Makefile with a Kconfig helper
    module: clean up load_module a little more.
    modpost: Ignore ARC specific non-alloc sections
    module: constify within_module_*
    taint: add explicit flag to show whether lock dep is still OK.
    module: printk message when module signature fail taints kernel.

    Linus Torvalds
     

24 Feb, 2013

17 commits

  • It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
    allocated, particularly when very few users will ever actually tune
    merge_across_nodes 0 to use more than 1+1 of those trees. Not a big
    deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.

    Start off with 1+1 statically allocated, then if merge_across_nodes is
    ever tuned, allocate for nr_node_ids+nr_node_ids. Do not attempt to
    free up the extra if it's tuned back, that would be a waste of effort.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
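
    The allocation strategy described above can be sketched in userspace:
    start with a single static root and switch to a heap array only when
    per-node trees are first requested. The sketch models one family of
    roots rather than both, and root_sketch, one_root, roots, nr_roots and
    grow_roots() are all hypothetical names.

    ```c
    #include <stdlib.h>

    struct root_sketch { void *node; };     /* stand-in for a tree root */

    static struct root_sketch one_root[1];  /* the statically allocated one */
    static struct root_sketch *roots = one_root;
    static unsigned int nr_roots = 1;

    /*
     * Called when the tunable first asks for per-node trees.
     * Returns 0 on success, -1 if the allocation failed.
     */
    static int grow_roots(unsigned int nr_nodes)
    {
        struct root_sketch *buf;
        unsigned int i;

        if (nr_roots >= nr_nodes)
            return 0;       /* big enough already; never shrink back */

        buf = calloc(nr_nodes, sizeof(*buf));
        if (!buf)
            return -1;

        for (i = 0; i < nr_roots; i++)
            buf[i] = roots[i];  /* carry over the existing roots */
        if (roots != one_root)
            free(roots);
        roots = buf;
        nr_roots = nr_nodes;
        return 0;
    }
    ```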
     
  • I dislike the way in which "swapcache" gets used in do_swap_page():
    there is always a page from swapcache there (even if maybe uncached by
    the time we lock it), but tests are made according to "swapcache".
    Rework that with "page != swapcache", as has been done in unuse_pte().

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Before establishing that KSM page migration was the cause of my
    WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
    lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in
    many respects is equivalent to faulting in a page.

    In fact I've never caught that as the cause: but in theory it does at
    least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to
    avoid bringing a KSM page back in when it's not supposed to be.

    I intended to copy how it's done in do_swap_page(), but have a strong
    aversion to how "swapcache" ends up being used there: rework it with
    "page != swapcache".

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In "ksm: remove old stable nodes more thoroughly" I said that I'd never
    seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing,
    but it soon appeared once I tried fuller tests on the whole series.

    It turned out to be due to the KSM page migration itself: unmerge_and_
    remove_all_rmap_items() failed to locate and replace all the KSM pages,
    because of that hiatus in page migration when old pte has been replaced
    by migration entry, but not yet by new pte. follow_page() finds no page
    at that instant, but a KSM page reappears shortly after, without a
    fault.

    Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
    for KSM's break_cow(). I'd have preferred to avoid another flag, and do
    it every time, in case someone else makes the same easy mistake; but did
    not find another transgressor (the common get_user_pages() is of course
    safe), and cannot be sure that every follow_page() caller is prepared to
    sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
    already sleep there, since anon_vma locking was changed to mutex, but
    maybe that's somehow excluded.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Think of struct rmap_item as an extension of struct page (restricted to
    MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
    small, especially on 32-bit architectures of limited lowmem.

    Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
    making no change to its 64-byte struct rmap_item; but bloats the 32-bit
    struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
    rounds up to 40 bytes once allocated from slab. We'd better avoid that.

    Hey, I only just remembered that the anon_vma pointer in struct
    rmap_item has no purpose until the rmap_item is hung from a stable tree
    node (which has its own nid field); and rmap_item's nid field has no
    purpose other than to say which tree root to tell rb_erase() when
    unlinking from an unstable tree.

    Double them up in a union. There's just one place where we set anon_vma
    early (when we already hold mmap_sem): now we must remove tree_rmap_item
    from its unstable tree there, before overwriting nid. No need to
    spatter BUG()s around: we'd be seeing oopses if this were wrong.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
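
    The effect of doubling the two fields up can be seen with sizeof. The
    structs below are illustrative only: the field set is an assumption and
    is smaller than the real struct rmap_item, and the anonymous union
    needs C11 or a GNU-compatible compiler.

    ```c
    #include <stddef.h>

    /* nid stored as its own field after the checksum. */
    struct rmap_separate {
        void *anon_vma;
        void *mm;
        unsigned long address;
        unsigned int oldchecksum;
        int nid;
    };

    /* nid shares storage with anon_vma: only one is live at a time. */
    struct rmap_shared {
        union {
            void *anon_vma;     /* used once linked into a stable tree */
            int nid;            /* used while in an unstable tree */
        };
        void *mm;
        unsigned long address;
        unsigned int oldchecksum;
    };

    /* Bytes saved per item by sharing the storage. */
    static size_t bytes_saved(void)
    {
        return sizeof(struct rmap_separate) - sizeof(struct rmap_shared);
    }
    ```

    On typical 64-bit targets padding hides the extra int, so both layouts
    come out the same size; on 32-bit targets the union avoids the growth.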
     
  • An inconsistency emerged in reviewing the NUMA node changes to KSM: when
    meeting a page from the wrong NUMA node in a stable tree, we say that
    it's okay for comparisons, but not as a leaf for merging; whereas when
    meeting a page from the wrong NUMA node in an unstable tree, we bail out
    immediately.

    Now, it might be that a wrong NUMA node in an unstable tree is more
    likely to correlate with instability (different content, with rbnode
    now misplaced) than page migration; but even so, we are accustomed to
    instability in the unstable tree.

    Without strong evidence for which strategy is generally better, I'd
    rather be consistent with what's done in the stable tree: accept a page
    from the wrong NUMA node for comparison, but not as a leaf for merging.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Added slightly more detail to the Documentation of merge_across_nodes, a
    few comments in areas indicated by review, and renamed get_ksm_page()'s
    argument from "locked" to "lock_it". No functional change.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fix several mempolicy leaks in the tmpfs mount logic. These leaks are
    slow - on the order of one object leaked per mount attempt.

    Leak 1 (umount doesn't free mpol allocated in mount):
    while true; do
        mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
        umount /mnt
    done

    Leak 2 (errors parsing remount options will leak mpol):
    mount -t tmpfs -o size=100M nodev /mnt
    while true; do
        mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
    done
    umount /mnt

    Leak 3 (multiple mpol per mount leak mpol):
    while true; do
        mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
        umount /mnt
    done

    This patch fixes all of the above. I could have broken the patch into
    three pieces but it seemed easier to review as one.

    [akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh]
    Signed-off-by: Greg Thelen
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
    option is not specified in the remount request. A new policy can be
    specified if mpol=M is given.

    Before this patch, remounting an mpol-bound tmpfs without specifying
    the mpol= mount option in the remount request would set the filesystem's
    mempolicy object to a freed mempolicy object.

    To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
    # mkdir /tmp/x

    # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x

    # grep /tmp/x /proc/mounts
    nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0

    # mount -o remount,size=200M nodev /tmp/x

    # grep /tmp/x /proc/mounts
    nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
    # note ? garbage in mpol=... output above

    # dd if=/dev/zero of=/tmp/x/f count=1
    # panic here

    Panic:
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [< (null)>] (null)
    [...]
    Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
    Call Trace:
    mpol_shared_policy_init+0xa5/0x160
    shmem_get_inode+0x209/0x270
    shmem_mknod+0x3e/0xf0
    shmem_create+0x18/0x20
    vfs_create+0xb5/0x130
    do_last+0x9a1/0xea0
    path_openat+0xb3/0x4d0
    do_filp_open+0x42/0xa0
    do_sys_open+0xfe/0x1e0
    compat_sys_open+0x1b/0x20
    cstar_dispatch+0x7/0x1f

    Non-debug kernels will not crash immediately because referencing the
    dangling mpol will not cause a fault. Instead the filesystem will
    reference a freed mempolicy object, which will cause unpredictable
    behavior.

    The problem boils down to a dropped mpol reference below if
    shmem_parse_options() does not allocate a new mpol:

    config = *sbinfo
    shmem_parse_options(data, &config, true)
    mpol_put(sbinfo->mpol)
    sbinfo->mpol = config.mpol /* BUG: saves unreferenced mpol */

    This patch avoids the crash by not releasing the mempolicy if
    shmem_parse_options() doesn't create a new mpol.
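
    The reference handling described above can be modelled in user space.
    This is only a minimal sketch: struct policy, struct sb_info,
    policy_new(), policy_put() and remount_apply() are hypothetical
    stand-ins for the mempolicy and superblock objects, not kernel API, and
    the real change in the shmem remount path may be structured differently.

    ```c
    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for a reference-counted mempolicy. */
    struct policy {
        int refcnt;
    };

    /* Hypothetical stand-in for the per-superblock info holding the policy. */
    struct sb_info {
        struct policy *mpol;
    };

    /* Allocate a policy holding one reference (NULL on allocation failure). */
    static struct policy *policy_new(void)
    {
        struct policy *p = malloc(sizeof(*p));

        if (p)
            p->refcnt = 1;
        return p;
    }

    /* Drop one reference; free the object when the count reaches zero. */
    static int policy_put(struct policy *p)
    {
        if (!p)
            return 0;
        if (--p->refcnt == 0) {
            free(p);
            return 0;
        }
        return p->refcnt;
    }

    /*
     * Remount model. new_mpol is whatever option parsing allocated, or NULL
     * when no mpol= option was given. The old policy is released only when a
     * new one actually replaces it, so the kept pointer never dangles.
     */
    static void remount_apply(struct sb_info *sb, struct policy *new_mpol)
    {
        if (new_mpol) {
            policy_put(sb->mpol);
            sb->mpol = new_mpol;
        }
        /* else: keep sb->mpol and the reference it already holds */
    }
    ```

    In this model the buggy sequence corresponds to calling policy_put()
    unconditionally and then storing the same, now freed, pointer back.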

    How far back does this issue go? I see it in both 2.6.36 and 3.3. I did
    not look back further.

    Signed-off-by: Greg Thelen
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Rob van der Heij reported the following (paraphrased) on private mail.

    The scenario is that I want to keep backups from filling up the page
    cache and purging stuff that is more likely to be used again (this is
    with s390x Linux on z/VM, so I don't give it so much memory that
    we don't care anymore). So I have something with LD_PRELOAD that
    intercepts the close() call (from tar, in this case) and issues
    a posix_fadvise() just before closing the file.

    This mostly works, except for small files (less than 14 pages)
    that remain in page cache after the fact.

    Unfortunately Rob has not had a chance to test this exact patch but the
    test program below should be reproducing the problem he described.

    The issue is the per-cpu pagevecs for LRU additions. If the pages are
    added by one CPU but fadvise() is called on another then the pages
    remain resident as the invalidate_mapping_pages() only drains the local
    pagevecs via its call to pagevec_release(). The user-visible effect is
    that a program that uses fadvise() properly is not obeyed.

    A possible fix for this is to put the necessary smarts into
    invalidate_mapping_pages() to globally drain the LRU pagevecs if a
    pagevec page could not be discarded. The downside is that an inode
    cache shrink would then send a global IPI, and letting memory pressure
    cause global IPI storms is very undesirable.

    Instead, this patch adds a check to fadvise(POSIX_FADV_DONTNEED) that
    tests whether invalidate_mapping_pages() discarded all the requested
    pages. If only a subset of the pages was discarded, it drains the LRU
    pagevecs and tries
    again. If the second attempt fails, it assumes it is due to the pages
    being mapped, locked or dirty and does not care. With this patch, an
    application using fadvise() correctly will be obeyed but there is a
    downside that a malicious application can force the kernel to send
    global IPIs and increase overhead.
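
    The retry logic described above can be illustrated with a toy model.
    This is a sketch under stated assumptions, not the kernel code: struct
    toy_cache and the toy_*() helpers are hypothetical, and pages are
    reduced to two counters (already on the LRU, or still parked in another
    CPU's pagevec).

    ```c
    /*
     * Toy model: a page is either on the LRU (discardable) or still parked
     * in a remote CPU's pagevec (not discardable until drained).
     */
    struct toy_cache {
        unsigned long on_lru;
        unsigned long in_remote_pagevec;
        unsigned int drains;    /* how many global drains were needed */
    };

    /* Discard every page currently on the LRU; return how many went away. */
    static unsigned long toy_invalidate(struct toy_cache *c)
    {
        unsigned long n = c->on_lru;

        c->on_lru = 0;
        return n;
    }

    /* Stand-in for the global pagevec drain (the expensive IPI step). */
    static void toy_drain_all(struct toy_cache *c)
    {
        c->on_lru += c->in_remote_pagevec;
        c->in_remote_pagevec = 0;
        c->drains++;
    }

    /*
     * DONTNEED model: try once; only if fewer pages than requested were
     * discarded, drain globally and try exactly one more time.
     */
    static unsigned long toy_dontneed(struct toy_cache *c, unsigned long requested)
    {
        unsigned long discarded = toy_invalidate(c);

        if (discarded < requested) {
            toy_drain_all(c);
            discarded += toy_invalidate(c);
        }
        return discarded;
    }
    ```

    The common case, where the first pass discards everything, never pays
    for the global drain; only a partial result triggers it.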

    If accepted, I would like this to be considered as a -stable candidate.
    It's not an urgent issue but it's a system call that is not working as
    advertised which is weak.

    The following test program demonstrates the problem. It should never
    report that pages are still resident but will without this patch. It
    assumes that CPU 0 and 1 exist.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    /*
     * Assumed value: the definition is missing from this copy of the log.
     * The report concerns files smaller than 14 pages.
     */
    #define FILESIZE_PAGES 8

    int main(void) {
        int fd;
        int pagesize = getpagesize();
        ssize_t written = 0, expected;
        char *buf;
        unsigned char *vec;
        int resident, i, err;
        cpu_set_t set;

        /* Prepare a buffer for writing */
        expected = FILESIZE_PAGES * pagesize;
        buf = malloc(expected + 1);
        if (buf == NULL) {
            printf("ENOMEM\n");
            exit(EXIT_FAILURE);
        }
        buf[expected] = 0;
        memset(buf, 'a', expected);

        /* Prepare the mincore vec */
        vec = malloc(FILESIZE_PAGES);
        if (vec == NULL) {
            printf("ENOMEM\n");
            exit(EXIT_FAILURE);
        }

        /* Bind ourselves to CPU 0 */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }

        /* open file, unlink and write buffer (O_CREAT requires a mode) */
        fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR, 0600);
        if (fd == -1) {
            perror("open");
            exit(EXIT_FAILURE);
        }
        unlink("fadvise-test-file");
        while (written < expected) {
            ssize_t this_write;
            this_write = write(fd, buf + written, expected - written);

            if (this_write == -1) {
                perror("write");
                exit(EXIT_FAILURE);
            }

            written += this_write;
        }
        free(buf);

        /*
         * Force ourselves to another CPU. If fadvise only flushes the local
         * CPUs pagevecs then the fadvise will fail to discard all file pages
         */
        CPU_ZERO(&set);
        CPU_SET(1, &set);
        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }

        /* sync and fadvise to discard the page cache */
        fsync(fd);
        /* posix_fadvise() returns an error number instead of setting errno */
        err = posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED);
        if (err != 0) {
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
            exit(EXIT_FAILURE);
        }

        /* map the file and use mincore to see which parts of it are resident */
        buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            exit(EXIT_FAILURE);
        }
        if (mincore(buf, expected, vec) == -1) {
            perror("mincore");
            exit(EXIT_FAILURE);
        }

        /* Check residency */
        for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
            if (vec[i])
                resident++;
        }
        if (resident != 0) {
            printf("Nr unexpected pages resident: %d\n", resident);
            exit(EXIT_FAILURE);
        }

        munmap(buf, expected);
        close(fd);
        free(vec);
        exit(EXIT_SUCCESS);
    }

    Signed-off-by: Mel Gorman
    Reported-by: Rob van der Heij
    Tested-by: Rob van der Heij
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • We at SGI have a need to address some very high physical address ranges
    with our GRU (global reference unit), sometimes across partitioned
    machine boundaries and sometimes with larger addresses than the cpu
    supports. We do this with the aid of our own 'extended vma' module
    which mimics the vma. When something (either unmap or exit) frees an
    'extended vma' we use the mmu notifiers to clean them up.

    We had been able to mimic the functions
    __mmu_notifier_invalidate_range_start() and
    __mmu_notifier_invalidate_range_end() by locking the per-mm lock and
    walking the per-mm notifier list. But with the change to a global srcu
    lock (static in mmu_notifier.c) we can no longer do that. Our module has
    no access to that lock.

    So we request that these two functions be exported.

    Signed-off-by: Cliff Wickman
    Acked-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • This change adds a follow_page_mask function which is equivalent to
    follow_page, but with an extra page_mask argument.

    follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
    a THP page, and to 0 in other cases.

    __get_user_pages() makes use of this in order to accelerate populating
    THP ranges - that is, when both the pages and vmas arrays are NULL, we
    don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
    we also avoid taking mm->page_table_lock that many times).
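
    The saving can be seen from how far the walk may advance once
    page_mask is known. A minimal sketch, assuming 4 KiB base pages and
    2 MiB THP (PAGE_SHIFT 12, HPAGE_PMD_NR 512); page_increment() is a
    hypothetical helper name used only for this illustration.

    ```c
    #define PAGE_SHIFT   12     /* assumed: 4 KiB base pages */
    #define HPAGE_PMD_NR 512    /* assumed: 2 MiB THP = 512 base pages */

    /*
     * Number of base pages to advance from addr, given the page_mask that
     * was reported for the page at addr. With page_mask == 0 this is 1;
     * with page_mask == HPAGE_PMD_NR - 1 it is the number of base pages
     * left up to the end of the huge page containing addr.
     */
    static unsigned long page_increment(unsigned long addr, unsigned long page_mask)
    {
        return 1 + (~(addr >> PAGE_SHIFT) & page_mask);
    }
    ```

    For an address at the start of a huge page the step is HPAGE_PMD_NR, so
    one lookup covers the whole THP instead of 512 separate lookups.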

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Use long type for page counts in mm_populate() so as to avoid integer
    overflow when running the following test code:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
    }
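
    The arithmetic behind the overflow is easy to check. A minimal sketch,
    assuming a 64-bit (LP64) system with 4 KiB pages; nr_pages() and
    fits_in_int() are hypothetical helper names used only here.

    ```c
    #include <limits.h>

    #define PAGE_SHIFT 12   /* assumed: 4 KiB pages */

    /* Page count for a byte length, kept in a long (64-bit on LP64). */
    static long nr_pages(unsigned long bytes)
    {
        return (long)(bytes >> PAGE_SHIFT);
    }

    /* Nonzero if the page count could be stored in an int without loss. */
    static int fits_in_int(long n)
    {
        return n >= INT_MIN && n <= INT_MAX;
    }
    ```

    The 0x100000000000-byte mapping above is 2^44 bytes, which is 2^32
    pages: one more than a 32-bit unsigned value can hold, and well beyond
    INT_MAX.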

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
    are horribly badly named, so accurately document them with code comments
    to guard against misuse.

    [akpm@linux-foundation.org: tweak comments]
    Reviewed-by: Randy Dunlap
    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • error_states[] has two separate states "unevictable LRU page" and
    "mlocked LRU page", and the former one has the higher priority now. But
    because of that the latter one is rarely chosen, since pages with
    PageMlocked are highly likely to have PG_unevictable set. On the other hand,
    PG_unevictable without PageMlocked is common for ramfs or SHM_LOCKed
    shared memory, so reversing the priority of these two states helps us
    clearly distinguish them.
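
    Why the order matters can be shown with a first-match table. This is a
    sketch only: the flag bits, enum page_kind, the states[] table and
    classify() are hypothetical and merely imitate the shape of a
    priority-ordered lookup.

    ```c
    /* Hypothetical flag bits for the illustration. */
    #define F_UNEVICTABLE (1u << 0)
    #define F_MLOCKED     (1u << 1)

    enum page_kind { KIND_UNKNOWN, KIND_UNEVICTABLE, KIND_MLOCKED };

    struct state {
        unsigned int mask;      /* which bits to look at */
        unsigned int res;       /* required value of those bits */
        enum page_kind kind;
    };

    /*
     * First match wins, so the more specific state (mlocked) must come
     * before the more general one (unevictable).
     */
    static const struct state states[] = {
        { F_MLOCKED,     F_MLOCKED,     KIND_MLOCKED },
        { F_UNEVICTABLE, F_UNEVICTABLE, KIND_UNEVICTABLE },
        { 0,             0,             KIND_UNKNOWN },   /* catch-all */
    };

    static enum page_kind classify(unsigned int flags)
    {
        const struct state *ps;

        /* The catch-all entry guarantees that the loop terminates. */
        for (ps = states; ; ps++)
            if ((flags & ps->mask) == ps->res)
                return ps->kind;
    }
    ```

    With the two entries swapped, a page carrying both bits would always be
    reported as unevictable and the mlocked entry would rarely be reached.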

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Chen Gong
    Cc: Tony Luck
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • memory_failure() can't handle memory errors on mlocked pages correctly,
    because page_action() judges such errors as ones on "unknown pages"
    instead of ones on "unevictable LRU page" or "mlocked LRU page". To
    determine the page state, page_action() checks the page flags at the
    time of the judgement, but those flags are no longer the same as they
    were when memory_failure() was called, because memory_failure() unmaps
    the error page before calling page_action(). This unmapping changes the
    page state: in particular, page_remove_rmap() (called from
    try_to_unmap_one()) clears PG_mlocked, so page_action() can't catch
    mlocked pages after that.

    With this patch, we store the page flag of the error page before doing
    unmap, and (only) if the first check with page flags at the time decided
    the error page is unknown, we do the second check with the stored page
    flag. This implementation doesn't change error handling for the page
    types for which the first check can determine the page state correctly.
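
    The two-step check can be sketched as follows. The flag bits, enum
    page_kind, lookup() and classify_after_unmap() are hypothetical names
    for this illustration and are not the identifiers used in
    mm/memory-failure.c.

    ```c
    /* Hypothetical flag bits for the illustration. */
    #define F_MLOCKED (1u << 0)
    #define F_DIRTY   (1u << 1)

    enum page_kind { KIND_UNKNOWN, KIND_MLOCKED, KIND_DIRTY };

    /* Classify a page purely from a set of flag bits. */
    static enum page_kind lookup(unsigned int flags)
    {
        if (flags & F_MLOCKED)
            return KIND_MLOCKED;
        if (flags & F_DIRTY)
            return KIND_DIRTY;
        return KIND_UNKNOWN;
    }

    /*
     * saved_flags were sampled before the unmap, current_flags after it.
     * The current flags are tried first; the saved copy is consulted only
     * when the first check ends in "unknown", so pages that were already
     * classified correctly keep exactly the same result.
     */
    static enum page_kind classify_after_unmap(unsigned int saved_flags,
                                               unsigned int current_flags)
    {
        enum page_kind kind = lookup(current_flags);

        if (kind == KIND_UNKNOWN)
            kind = lookup(saved_flags);
        return kind;
    }
    ```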

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
    I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
    "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
    used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.

    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins