31 Jul, 2012

2 commits

  • __mem_open() which is called by both /proc/<pid>/environ and
    /proc/<pid>/mem ->open() handlers will allow the use of negative offsets.
    /proc/<pid>/mem has negative offsets but not /proc/<pid>/environ.

    Clean this by moving the 'force FMODE_UNSIGNED_OFFSET flag' to mem_open()
    to allow negative offsets only on /proc/<pid>/mem.

    Signed-off-by: Djalal Harouni
    Cc: Oleg Nesterov
    Cc: Brad Spengler
    Acked-by: Kees Cook
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Djalal Harouni
     
  • Currently the following offset and environment address range check in
    environ_read() of /proc/<pid>/environ is buggy:

    int this_len = mm->env_end - (mm->env_start + src);
    if (this_len <= 0)
            break;

    Large or negative offsets on /proc/<pid>/environ converted to 'unsigned
    long' may pass this check since '(mm->env_start + src)' can overflow and
    'this_len' will be positive.

    This can turn /proc/<pid>/environ to act like /proc/<pid>/mem since
    (mm->env_start + src) will point and read from another VMA.

    There are two fixes here plus some code cleaning:

    1) Fix the overflow by checking if the offset that was converted to
    unsigned long will always point to the [mm->env_start, mm->env_end]
    address range.

    2) Remove the truncation that was made to the result of the check,
    storing the result in 'int this_len' will alter its value and we can
    not depend on it.

    For kernels that have commit b409e578d ("proc: clean up
    /proc/<pid>/environ handling") which adds the appropriate ptrace check and
    saves the 'mm' at ->open() time, this is not a security issue.

    This patch is taken from the grsecurity patch since it was just made
    available.

    Signed-off-by: Djalal Harouni
    Cc: Oleg Nesterov
    Cc: Brad Spengler
    Acked-by: Kees Cook
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Djalal Harouni
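
The overflow described in this commit can be reproduced in a small userspace model. This is only a sketch: `struct env_range` is a hypothetical stand-in for the two `mm_struct` fields named above, and `safe_len()` shows one overflow-safe way to do the range check, not necessarily the exact code that was merged.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for mm->env_start and mm->env_end. */
struct env_range {
    unsigned long env_start;
    unsigned long env_end;
};

/*
 * Old-style check: the sum (env_start + src) can wrap around, and the
 * result is truncated to int. Returns true when the check lets the
 * read proceed.
 */
static bool old_check_passes(const struct env_range *mm, unsigned long src)
{
    int this_len = mm->env_end - (mm->env_start + src);

    return this_len > 0;
}

/*
 * Overflow-safe variant: compare the offset against the size of the
 * environment area before doing any address arithmetic. Returns the
 * number of readable bytes at offset src, or 0 if src is out of range.
 */
static size_t safe_len(const struct env_range *mm, unsigned long src)
{
    unsigned long size = mm->env_end - mm->env_start;

    if (src >= size)
        return 0;
    return size - src;
}
```

With `env_start = 0x7ffd0000`, `env_end = 0x7ffd1000` and an offset equal to `-0x1000` converted to `unsigned long`, the old check passes even though the address lies below the environment area, while `safe_len()` returns 0.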
     

24 Jul, 2012

1 commit

  • Pull powerpc updates from Benjamin Herrenschmidt:
    "Notable highlights:

    - iommu improvements from Anton removing the per-iommu global lock in
    favor of dividing the DMA space into pools, each with its own lock,
    and hashed on the CPU number. Along with making the locking more
    fine grained, this gives significant improvements in multiqueue
    networking scalability.

    - Still from Anton, we now provide a vdso based variant of getcpu
    which makes sched_getcpu with the appropriate glibc patch something
    like 18 times faster.

    - More Anton goodness (he's been busy!) in other areas such as a
    faster __clear_user and copy_page on P7, various perf fixes to
    improve sampling quality, etc...

    - One more step toward removing legacy i2c interfaces by using new
    device-tree based probing of platform devices for the AOA audio
    drivers

    - A nice series of patches from Michael Neuling that helps avoid
    confusion between register numbers and literals in assembly code,
    trying to enforce the use of "%rN" register names in gas rather
    than plain numbers.

    - A pile of FSL updates

    - The usual bunch of small fixes, cleanups etc...

    You may spot a change to drivers/char/mem. The patch got no comment
    or ack from outside, it's a trivial patch to allow the architecture to
    skip creating /dev/port, which we use to disable it on ppc64 systems that
    don't have a legacy bridge. On those, IO ports 0...64K are not mapped
    in kernel space at all, so accesses to /dev/port cause oopses (and
    yes, distros -still- ship userspace that bangs hard coded ports such
    as kbdrate)."

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits)
    powerpc/mpic: Create a revmap with enough entries for IPIs and timers
    Remove stale .rej file
    powerpc/iommu: Fix iommu pool initialization
    powerpc/eeh: Check handle_eeh_events() return value
    powerpc/85xx: Add phy nodes in SGMII mode for MPC8536/44/72DS & P2020DS
    powerpc/e500: add paravirt QEMU platform
    powerpc/mpc85xx_ds: convert to unified PCI init
    powerpc/fsl-pci: get PCI init out of board files
    powerpc/85xx: Update corenet64_smp_defconfig
    powerpc/85xx: Update corenet32_smp_defconfig
    powerpc/85xx: Rename P1021RDB-PC device trees to be consistent
    powerpc/watchdog: move booke watchdog param related code to setup-common.c
    sound/aoa: Adapt to new i2c probing scheme
    i2c/powermac: Improve detection of devices from device-tree
    powerpc: Disable /dev/port interface on systems without an ISA bridge
    of: Improve prom_update_property() function
    powerpc: Add "memory" attribute for mfmsr()
    powerpc/ftrace: Fix assembly trampoline register usage
    powerpc/hw_breakpoints: Fix incorrect pointer access
    powerpc: Put the gpr save/restore functions in their own section
    ...

    Linus Torvalds
     

14 Jul, 2012

5 commits

  • Pass mount flags to sget() so that it can use them in initialising a new
    superblock before the set function is called. They could also be passed to the
    compare function.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • Add a helper that abstracts out the jump to an already parsed struct path
    from ->follow_link operation from procfs. Not only does this clean up
    the code by moving the two sides of this game into a single helper, but
    it also prepares for making struct nameidata private to namei.c

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Currently the non-nd_set_link based versions of ->follow_link are expected
    to do a path_put(&nd->path) on failure. This calling convention is unexpected,
    undocumented and doesn't match what the nd_set_link-based instances do.

    Move the path_put out of the only non-nd_set_link based ->follow_link
    instance into the caller.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Just the flags; only NFS cares even about that, but there are
    legitimate uses for such argument. And getting rid of that
    completely would require splitting ->lookup() into a couple
    of methods (at least), so let's leave that alone for now...

    Signed-off-by: Al Viro

    Al Viro
     
  • Just the lookup flags. Die, bastard, die...

    Signed-off-by: Al Viro

    Al Viro
     

11 Jul, 2012

1 commit

  • prom_update_property() currently fails if the property doesn't
    actually exist yet which isn't what we want. Change to add-or-update
    instead of update-only, then we can remove a lot of duplicated lines.

    Suggested-by: Grant Likely
    Signed-off-by: Dong Aisheng
    Acked-by: Rob Herring
    Signed-off-by: Benjamin Herrenschmidt

    Dong Aisheng
     

05 Jun, 2012

1 commit

  • Cyrill Gorcunov reports that I broke the fdinfo files with commit
    30a08bf2d31d ("proc: move fd symlink i_mode calculations into
    tid_fd_revalidate()"), and he's quite right.

    The tid_fd_revalidate() function is not just used for the /fd
    symlinks, it's also used for the /fdinfo/ files, and the
    permission model for those is different.

    So do the dynamic symlink permission handling just for symlinks, making
    the fdinfo files once more appear as the proper regular files they are.

    Of course, Al Viro argued (probably correctly) that we shouldn't do the
    symlink permission games at all, and make the symlinks always just be
    the normal 'lrwxrwxrwx'. That would have avoided this issue too, but
    since somebody noticed that the permissions had changed (which was the
    reason for that original commit 30a08bf2d31d in the first place), people
    do apparently use this feature.

    [ Basically, you can use the symlink permission data as a cheap "fdinfo"
    replacement, since you see whether the file is open for reading and/or
    writing by just looking at st_mode of the symlink. So the feature
    does make sense, even if the pain it has caused means we probably
    shouldn't have done it to begin with. ]

    Reported-and-tested-by: Cyrill Gorcunov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Jun, 2012

11 commits

  • We would like to have an ability to restore command line arguments and
    program environment pointers but first we need to obtain them somehow.
    Thus we put these values into /proc/$pid/stat. The exit_code is needed to
    restore zombie tasks.

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Cc: Alexey Dobriyan
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
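
A userspace reader of the new values has to split the stat line into fields. The sketch below is one way to do that; `stat_field()` is a hypothetical helper, and the field positions are an assumption to be checked against proc(5), which, as far as I know, lists arg_start, arg_end, env_start, env_end and exit_code as fields 48 to 52.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy whitespace-separated field number n (1-based, n >= 3) of a
 * /proc/<pid>/stat line into out. Fields 1 and 2 (pid and "(comm)")
 * are skipped by searching for the last ')', because comm may itself
 * contain spaces. Returns 0 on success, -1 if the field is missing or
 * does not fit into out.
 */
static int stat_field(const char *line, int n, char *out, size_t outlen)
{
    const char *p = strrchr(line, ')');
    int idx = 2;    /* number of the last field already consumed */

    if (!p || n < 3 || outlen == 0)
        return -1;
    p++;
    for (;;) {
        size_t len;

        p += strspn(p, " \n");      /* skip separators */
        if (*p == '\0')
            return -1;              /* ran out of fields */
        len = strcspn(p, " \n");    /* length of this field */
        if (++idx == n) {
            if (len >= outlen)
                return -1;
            memcpy(out, p, len);
            out[len] = '\0';
            return 0;
        }
        p += len;
    }
}
```

The values are returned as text so the caller can pick the conversion (`strtoul()` for the addresses, `strtol()` for the exit code).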
     
  • When we checkpoint a task we need to know the list of children the
    task has, but there is no easy and fast way to generate the reverse
    parent->children chain from an arbitrary <pid> (while a parent pid is
    provided in the "PPid" field of /proc/<pid>/status).

    So instead of walking over all pids in the system (creating one big
    process tree in memory, just to figure out which children a task has) --
    we add an explicit /proc/<pid>/task/<tid>/children entry, because the kernel
    already has this kind of information but it is not yet exported.

    This lists first-level children only, not the whole process tree.

    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
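
Reading the new entry from userspace could look like the sketch below. It assumes the file content is a single line of space-separated decimal PIDs; `parse_children()` is a hypothetical helper that works on the text after it has been read into a buffer.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Parse space-separated decimal PIDs from text (the content of a
 * children file) into out. At most max entries are stored.
 * Returns the number of PIDs stored.
 */
static size_t parse_children(const char *text, long *out, size_t max)
{
    size_t n = 0;

    while (n < max) {
        char *end;
        long pid = strtol(text, &end, 10);  /* skips leading whitespace */

        if (end == text)    /* no digits left */
            break;
        out[n++] = pid;
        text = end;
    }
    return n;
}
```

Since only first-level children are listed, a caller that wants the whole subtree has to repeat the read for each returned PID.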
     
  • Currently, nonlinear mappings cannot be distinguished from ordinary
    mappings. This patch adds the line "Nonlinear: <size> kB" to
    /proc/pid/smaps, where size is the amount of nonlinear ptes in the vma.
    The line appears only if VM_NONLINEAR is set. This information may be
    useful not only for the checkpoint/restore project.

    Requested by Pavel Emelyanov.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Andi Kleen
    Cc: Pavel Emelyanov
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Currently smaps reports migration entries as "swap"; as a result "swap" can
    appear in shared mappings.

    This patch converts migration entries into pages and handles them as usual.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Andi Kleen
    Cc: Pavel Emelyanov
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This is an implementation of Andrew's proposal to extend the pagemap file
    bits to report what is missing about tasks' working set.

    The problem with the working set detection is multilateral. In the criu
    (checkpoint/restore) project we dump the tasks' memory into image files
    and to do it properly we need to detect which pages inside mappings are
    really in use. The mincore syscall I thought could help with this did not.
    First, it doesn't report swapped pages, thus we cannot find out which
    parts of anonymous mappings to dump. Next, it does report pages from page
    cache as present even if they are not mapped, and it doesn't distinguish
    pages that have been cow-ed from those that have not.

    Note, that issue with swap pages is critical -- we must dump swap pages to
    image file. But the issues with file pages are optimization -- we can
    take all file pages to image, this would be correct, but if we know that a
    page is not mapped or not cow-ed, we can remove them from dump file. The
    dump would still be self-consistent, though significantly smaller in size
    (up to 10 times smaller on real apps).

    Andrew noticed, that the proc pagemap file solved 2 of 3 above issues --
    it reports whether a page is present or swapped and it doesn't report not
    mapped page cache pages. But, it doesn't distinguish cow-ed file pages
    from not cow-ed.

    I would like to use the last unused bit in this file to report whether the
    page mapped into the respective pte is PageAnon or not.

    [comment stolen from Pavel Emelyanov's v1 patch]

    Signed-off-by: Konstantin Khlebnikov
    Cc: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
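
A userspace consumer would decode each 64-bit pagemap entry along the lines sketched below. The bit positions are an assumption based on the kernel's pagemap documentation as I remember it (bit 63 present, bit 62 swapped, bit 61 the new "file page or shared anon" bit, bits 0-54 the PFN when present); `struct pagemap_info` and `decode_pagemap()` are names made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of one /proc/pid/pagemap entry. */
struct pagemap_info {
    bool present;               /* bit 63: page is in RAM */
    bool swapped;               /* bit 62: page is in swap */
    bool file_or_shared_anon;   /* bit 61: page is not private anonymous */
    uint64_t pfn;               /* bits 0-54, meaningful only if present */
};

static struct pagemap_info decode_pagemap(uint64_t entry)
{
    struct pagemap_info info;

    info.present = (entry >> 63) & 1;
    info.swapped = (entry >> 62) & 1;
    info.file_or_shared_anon = (entry >> 61) & 1;
    info.pfn = info.present ? (entry & ((UINT64_C(1) << 55) - 1)) : 0;
    return info;
}
```

Under these assumptions, a dumper in the spirit of the commit message would skip a present page of a private file mapping whose file bit is set, since it has not been cow-ed, and would always dump swapped pages.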
     
  • - use int for priority and nice, since task_nice()/task_prio() return that

    - field 24: get_mm_rss() returns unsigned long

    Signed-off-by: Jan Engelhardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Engelhardt
     
  • Pass "fd" directly, not via pointer -- one less memory read.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • rcu_read_lock()/rcu_read_unlock() is nop for TINY_RCU, but is not a nop
    for, say, PREEMPT_RCU.

    proc_fill_cache() is called without RCU lock, there is no need to
    lock/unlock on error path, simply jump out of the loop.

    Signed-off-by: Alexey Dobriyan
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • mm_access() handles this much better, and avoids some race conditions.

    Signed-off-by: Cong Wang
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cong Wang
     
  • mm_for_maps() is a simple wrapper for mm_access(), and the name is
    misleading, so just remove it and use mm_access() directly.

    Signed-off-by: Cong Wang
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cong Wang
     
  • Similar to e268337dfe26 ("proc: clean up and fix /proc/<pid>/mem
    handling"), move the check of permission to open(), this will simplify
    read() code.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Cong Wang
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cong Wang
     

30 May, 2012

2 commits

  • The oom_score_adj scale ranges from -1000 to 1000 and represents the
    proportion of memory available to the process at allocation time. This
    means an oom_score_adj value of 300, for example, will bias a process as
    though it was using an extra 30.0% of available memory and a value of
    -350 will discount 35.0% of available memory from its usage.

    The oom killer badness heuristic also uses this scale to report the oom
    score for each eligible process in determining the "best" process to
    kill. Thus, it can only differentiate each process's memory usage by
    0.1% of system RAM.

    On large systems, this can end up being a large amount of memory: 256MB
    on 256GB systems, for example.

    This can be fixed by having the badness heuristic use the actual
    memory usage in scoring threads and then normalizing it to the
    oom_score_adj scale for userspace. This results in better comparison
    between eligible threads for kill and no change from the userspace
    perspective.

    Suggested-by: KOSAKI Motohiro
    Tested-by: Dave Jones
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
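
The arithmetic behind the 0.1% granularity can be checked with a small model. The functions below are illustrative only (their names are invented and this is not the kernel's badness code); they assume the scale is exactly 1000 steps over total RAM.

```c
#include <stdint.h>

/* Smallest memory difference that a 1000-step scale can express. */
static uint64_t oom_scale_step_bytes(uint64_t total_ram_bytes)
{
    return total_ram_bytes / 1000;
}

/* Bias added (or discounted) by an oom_score_adj value in [-1000, 1000]. */
static int64_t oom_adj_bias_bytes(uint64_t total_ram_bytes, int adj)
{
    return (int64_t)(total_ram_bytes / 1000) * adj;
}

/* Memory usage normalized to the 0..1000 scale reported to userspace. */
static unsigned int oom_normalized_score(uint64_t usage_bytes,
                                         uint64_t total_ram_bytes)
{
    return (unsigned int)(usage_bytes * 1000 / total_ram_bytes);
}
```

On a 256GB machine one step is roughly 256MB, so two processes using 64GB and 64GB + 100MB get the same normalized score. Comparing raw usage first and normalizing only for reporting avoids that tie.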
     
  • A missing validation of the value returned by find_vma() could cause a
    NULL ptr dereference when walking the pagetable.

    This is triggerable from usermode by an unprivileged user who tries to read
    page info out of /proc/pid/pagemap for a page that doesn't exist.

    Introduced by commit 025c5b2451e4 ("thp: optimize away unnecessary page
    table locking").

    Signed-off-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Cc: David Rientjes
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: [3.4.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
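
The class of bug fixed here can be illustrated with a userspace model of a find_vma()-style lookup. This is a simplified sketch: `struct range`, `find_range()` and `addr_is_mapped()` are hypothetical, and the ranges are assumed to be sorted and non-overlapping. The point is that the lookup returns the first range ending above the address, which may be NULL or may start above the address, so both cases need checking.

```c
#include <stddef.h>

/* Hypothetical model of a VMA: covers addresses [start, end). */
struct range {
    unsigned long start;
    unsigned long end;
};

/*
 * Like find_vma(): return the first range whose end is above addr,
 * or NULL if there is none. The result does not necessarily contain
 * addr.
 */
static const struct range *find_range(const struct range *r, size_t n,
                                      unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr < r[i].end)
            return &r[i];
    return NULL;
}

/* Validate the lookup result before dereferencing or trusting it. */
static int addr_is_mapped(const struct range *r, size_t n, unsigned long addr)
{
    const struct range *v = find_range(r, n, addr);

    return v != NULL && v->start <= addr;
}
```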
     

29 May, 2012

1 commit

  • Pull writeback tree from Wu Fengguang:
    "Mainly from Jan Kara to avoid iput() in the flusher threads."

    * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Avoid iput() from flusher thread
    vfs: Rename end_writeback() to clear_inode()
    vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
    writeback: Refactor writeback_single_inode()
    writeback: Remove wb->list_lock from writeback_single_inode()
    writeback: Separate inode requeueing after writeback
    writeback: Move I_DIRTY_PAGES handling
    writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
    writeback: Move clearing of I_SYNC into inode_sync_complete()
    writeback: initialize global_dirty_limit
    fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
    mm: page-writeback.c: local functions should not be exposed globally

    Linus Torvalds
     

24 May, 2012

1 commit

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed. Removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes, in performance or
    operationally, with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these values specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
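
The type-error guarantee described in the highlights comes from wrapping the raw id in a struct. The userspace sketch below models the idea; `make_kuid_model()` is a made-up constructor and the whole block is an illustration, not the kernel header.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Wrapping the raw value in a struct makes it a distinct type. */
typedef struct {
    uid_t val;
} kuid_t;

/* Hypothetical constructor for this example. */
static inline kuid_t make_kuid_model(uid_t uid)
{
    kuid_t k = { uid };

    return k;
}

/* Comparisons have to go through a helper ... */
static inline bool uid_eq(kuid_t a, kuid_t b)
{
    return a.val == b.val;
}

/*
 * ... because unconverted code such as
 *
 *     if (k == 0)
 *
 * no longer compiles: a struct cannot be compared with an integer.
 */
```

Any permission check that still compares plain integers therefore fails at build time and cannot silently ignore the namespace mapping.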
     

19 May, 2012

2 commits

  • Merge misc fixes from Andrew Morton.

    * emailed from Andrew Morton: (4 patches)
    frv: delete incorrect task prototypes causing compile fail
    slub: missing test for partial pages flush work in flush_all()
    fs, proc: fix ABBA deadlock in case of execution attempt of map_files/ entries
    drivers/rtc/rtc-pl031.c: configure correct wday for 2000-01-01

    Linus Torvalds
     
  • Instead of doing the i_mode calculations at proc_fd_instantiate() time,
    move them into tid_fd_revalidate(), which is where the other inode state
    (notably uid/gid information) is updated too.

    Otherwise we'll end up with stale i_mode information if an fd is re-used
    while the dentry still hangs around. Not that anything really *cares*
    (symlink permissions don't really matter), but Tetsuo Handa noticed that
    the owner read/write bits don't always match the state of the
    readability of the file descriptor, and we _used_ to get this right a
    long time ago in a galaxy far, far away.

    Besides, aside from fixing an ugly detail (that has apparently been this
    way since commit 61a28784028e: "proc: Remove the hard coded inode
    numbers" in 2006), this removes more lines of code than it adds. And it
    just makes sense to update i_mode in the same place we update i_uid/gid.

    Al Viro correctly points out that we could just do the inode fill in the
    inode iops ->getattr() function instead. However, that does require
    somewhat slightly more invasive changes, and adds yet *another* lookup
    of the file descriptor. We need to do the revalidate() for other
    reasons anyway, and have the file descriptor handy, so we might as well
    fill in the information at this point.

    Reported-by: Tetsuo Handa
    Cc: Al Viro
    Acked-by: Eric Biederman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 May, 2012

1 commit

  • map_files/ entries are never supposed to be executed, still curious
    minds might try to run them, which leads to the following deadlock

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.4.0-rc4-24406-g841e6a6 #121 Not tainted
    -------------------------------------------------------
    bash/1556 is trying to acquire lock:
    (&sb->s_type->i_mutex_key#8){+.+.+.}, at: do_lookup+0x267/0x2b1

    but task is already holding lock:
    (&sig->cred_guard_mutex){+.+.+.}, at: prepare_bprm_creds+0x2d/0x69

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&sig->cred_guard_mutex){+.+.+.}:
    validate_chain+0x444/0x4f4
    __lock_acquire+0x387/0x3f8
    lock_acquire+0x12b/0x158
    __mutex_lock_common+0x56/0x3a9
    mutex_lock_killable_nested+0x40/0x45
    lock_trace+0x24/0x59
    proc_map_files_lookup+0x5a/0x165
    __lookup_hash+0x52/0x73
    do_lookup+0x276/0x2b1
    walk_component+0x3d/0x114
    do_last+0xfc/0x540
    path_openat+0xd3/0x306
    do_filp_open+0x3d/0x89
    do_sys_open+0x74/0x106
    sys_open+0x21/0x23
    tracesys+0xdd/0xe2

    -> #0 (&sb->s_type->i_mutex_key#8){+.+.+.}:
    check_prev_add+0x6a/0x1ef
    validate_chain+0x444/0x4f4
    __lock_acquire+0x387/0x3f8
    lock_acquire+0x12b/0x158
    __mutex_lock_common+0x56/0x3a9
    mutex_lock_nested+0x40/0x45
    do_lookup+0x267/0x2b1
    walk_component+0x3d/0x114
    link_path_walk+0x1f9/0x48f
    path_openat+0xb6/0x306
    do_filp_open+0x3d/0x89
    open_exec+0x25/0xa0
    do_execve_common+0xea/0x2f9
    do_execve+0x43/0x45
    sys_execve+0x43/0x5a
    stub_execve+0x6c/0xc0

    This is because prepare_bprm_creds grabs task->signal->cred_guard_mutex
    and when do_lookup happens we try to grab task->signal->cred_guard_mutex
    again in lock_trace.

    Fix it using plain ptrace_may_access() helper in proc_map_files_lookup()
    and in proc_map_files_readdir() instead of lock_trace(), the caller must
    be CAP_SYS_ADMIN granted anyway.

    Signed-off-by: Cyrill Gorcunov
    Reported-by: Sasha Levin
    Cc: Konstantin Khlebnikov
    Cc: Pavel Emelyanov
    Cc: Dave Jones
    Cc: Vasiliy Kulikov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

16 May, 2012

2 commits


11 May, 2012

1 commit

  • Reset the current pagemap-entry if the current pte isn't present, or if
    current vma is over. Otherwise pagemap reports last entry again and
    again.

    Non-present pte reporting was broken in commit 092b50bacd1c ("pagemap:
    introduce data structure for pagemap entry")

    Reporting for holes was broken in commit 5aaabe831eb5 ("pagemap: avoid
    splitting thp when reading /proc/pid/pagemap")

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Pavel Emelyanov
    Cc: Naoya Horiguchi
    Cc: KAMEZAWA Hiroyuki
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

06 May, 2012

1 commit

  • After we moved inode_sync_wait() from end_writeback() it doesn't make sense
    to call the function end_writeback() anymore. Rename it to clear_inode()
    which describes well what the function really does: set the I_CLEAR flag.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

03 May, 2012

1 commit


26 Apr, 2012

2 commits

  • - Convert the old uid mapping functions into compatibility wrappers
    - Add a uid/gid mapping layer from user space uid and gids to kernel
    internal uids and gids that is extent based for simplicity and speed.
    * Working with number space after mapping uids/gids into their kernel
    internal version adds only mapping complexity over what we have today,
    leaving the kernel code easy to understand and test.
    - Add proc files /proc/self/uid_map /proc/self/gid_map
    These files display the mapping and allow a mapping to be added
    if a mapping does not exist.
    - Allow entering the user namespace without a uid or gid mapping.
    Since we are starting with an existing user our uids and gids
    still have global mappings so are still valid and useful they just don't
    have local mappings. The requirement for things to work are global uid
    and gid so it is odd but perfectly fine not to have a local uid
    and gid mapping.
    Not requiring global uid and gid mappings greatly simplifies
    the logic of setting up the uid and gid mappings by allowing
    the mappings to be set after the namespace is created which makes the
    slight weirdness worth it.
    - Make the mappings in the initial user namespace to the global
    uid/gid space explicit. Today it is an identity mapping
    but in the future we may want to twist this for debugging, similar
    to what we do with jiffies.
    - Document the memory ordering requirements of setting the uid and
    gid mappings. We only allow the mappings to be set once
    and there are no pointers involved so the requirements are
    trivial but a little atypical.

    Performance:

    In this scheme for the permission checks the performance is expected to
    stay the same as the actual machine instructions should remain the same.

    The worst case I could think of is ls -l on a large directory where
    all of the stat results need to be translated from kuids and
    kgids to uids and gids. So I benchmarked that case on my laptop
    with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache.

    My benchmark consisted of going to single user mode where nothing else
    was running. On an ext4 filesystem I opened 1,000,000 files, looped
    through all of the files 1000 times and called fstat on the
    individual files. This was to ensure I was benchmarking stat times
    where the inodes were in the kernel's cache, but the inode values were
    not in the processor's cache. My results:

    v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
    v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
    v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)

    All of the configurations ran in roughly 120ns when I performed tests
    that ran in the cpu cache.

    So in summary the performance impact is:
    1ns improvement in the worst case with user namespace support compiled out.
    8ns aka 5% slowdown in the worst case with user namespace support compiled in.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
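
The extent-based mapping layer can be pictured with the following userspace sketch. The struct layout and the name `map_id_down()` follow what I understand the kernel to use, but treat both as assumptions; `ID_UNMAPPED` is a made-up constant standing for the reserved `(uid_t)-1` value, and every extent is assumed to have a count of at least 1.

```c
#include <stdint.h>

/* One extent: ids [first, first + count) map to [lower_first, ...). */
struct id_extent {
    uint32_t first;         /* first id inside the namespace */
    uint32_t lower_first;   /* corresponding id in the parent/kernel space */
    uint32_t count;         /* number of consecutive ids covered */
};

/* Hypothetical marker for "no mapping", mirroring (uid_t)-1. */
#define ID_UNMAPPED UINT32_MAX

/* Translate a namespace-local id to the kernel-internal id space. */
static uint32_t map_id_down(const struct id_extent *ext, unsigned int n,
                            uint32_t id)
{
    for (unsigned int i = 0; i < n; i++) {
        uint32_t first = ext[i].first;
        uint32_t last = first + ext[i].count - 1;

        if (id >= first && id <= last)
            return id - first + ext[i].lower_first;
    }
    return ID_UNMAPPED;
}
```

An extent of `{0, 100000, 65536}` corresponds to writing "0 100000 65536" to a uid_map file: id 0 in the namespace becomes 100000 outside it, and id 65536 has no mapping. Because the lookup is a short linear scan over a handful of extents, its cost stays in the range of the few nanoseconds measured above.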
     
  • Revert commit 85e72aa5384 ("proc: clear_refs: do not clear reserved
    pages"), which was a quick fix suitable for -stable until ARM had been
    moved over to the gate_vma mechanism:

    https://lkml.org/lkml/2012/1/14/55

    With commit f9d4861f ("ARM: 7294/1: vectors: use gate_vma for vectors user
    mapping"), ARM does now use the gate_vma, so the PageReserved check can be
    removed from the proc code.

    Signed-off-by: Will Deacon
    Cc: Nicolas Pitre
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     

13 Apr, 2012

1 commit

  • Pull timer fixes from Thomas Gleixner:
    "The itimer removal one is not strictly a fix, but I really wanted to
    avoid a rebase of the urgent ones."

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "clocksource: Load the ACPI PM clocksource asynchronously"
    clockevents: tTack broadcast device mode change in tick_broadcast_switch_to_oneshot()
    itimer: Use printk_once instead of WARN_ONCE
    nohz: Fix stale jiffies update in tick_nohz_restart()
    tick: Document TICK_ONESHOT config option
    proc: stats: Use arch_idle_time for idle and iowait times if available
    itimer: Schedule silent NULL pointer fixup in setitimer() for removal

    Linus Torvalds
     

06 Apr, 2012

2 commits

  • Merge batch of fixes from Andrew Morton:
    "The simple_open() cleanup was held back while I waited for laggards to
    merge things.

    I still need to send a few checkpoint/restore patches. I've been
    wobbly about merging them because I'm wobbly about the overall
    prospects for success of the project. But after speaking with Pavel
    at the LSF conference, it sounds like they're further toward
    completion than I feared - apparently davem is at the "has stopped
    complaining" stage regarding the net changes. So I need to go back
    and re-review those patches and their (lengthy) discussion."

    * emailed from Andrew Morton: (16 patches)
    memcg swap: use mem_cgroup_uncharge_swap fix
    backlight: add driver for DA9052/53 PMIC v1
    C6X: use set_current_blocked() and block_sigmask()
    MAINTAINERS: add entry for sparse checker
    MAINTAINERS: fix REMOTEPROC F: typo
    alpha: use set_current_blocked() and block_sigmask()
    simple_open: automatically convert to simple_open()
    scripts/coccinelle/api/simple_open.cocci: semantic patch for simple_open()
    libfs: add simple_open()
    hugetlbfs: remove unregister_filesystem() when initializing module
    drivers/rtc/rtc-88pm860x.c: fix rtc irq enable callback
    fs/xattr.c:setxattr(): improve handling of allocation failures
    fs/xattr.c:listxattr(): fall back to vmalloc() if kmalloc() failed
    fs/xattr.c: suppress page allocation failure warnings from sys_listxattr()
    sysrq: use SEND_SIG_FORCED instead of force_sig()
    proc: fix mount -t proc -o AAA

    Linus Torvalds
     
  • The proc_parse_options() call from proc_mount() runs only once at boot
    time. So on any later mount attempt, any mount options are ignored
    because ->s_root is already initialized.

    As a consequence, "mount -o <options>" will ignore the options. The
    only way to change mount options is "mount -o remount,<options>".

    To fix this, parse the mount options unconditionally.

    Signed-off-by: Vasiliy Kulikov
    Reported-by: Arkadiusz Miskiewicz
    Tested-by: Arkadiusz Miskiewicz
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

30 Mar, 2012

2 commits

  • Git commit a25cac5198d4ff28 "proc: Consider NO_HZ when printing idle and
    iowait times" changes the code for /proc/stat to use get_cpu_idle_time_us
    and get_cpu_iowait_time_us if the system is running with nohz enabled.
    For architectures which define arch_idle_time (currently s390 only)
    this is a change for the worse. The result of arch_idle_time is supposed
    to be the exact sleep time of the target cpu and should be used instead
    of the value kept by the scheduler.

    Signed-off-by: Martin Schwidefsky
    Reviewed-by: Michal Hocko
    Reviewed-by: Srivatsa S. Bhat
    Link: http://lkml.kernel.org/r/20120330122308.18720283@de.ibm.com
    Signed-off-by: Thomas Gleixner

    Martin Schwidefsky
     
  • Pull x32 support for x86-64 from Ingo Molnar:
    "This tree introduces the X32 binary format and execution mode for x86:
    32-bit data space binaries using 64-bit instructions and 64-bit kernel
    syscalls.

    This allows applications whose working set fits into a 32-bit address
    space to make use of 64-bit instructions while using a 32-bit address
    space with shorter pointers, more compressed data structures, etc."

    Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}

    * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    x32: Fix alignment fail in struct compat_siginfo
    x32: Fix stupid ia32/x32 inversion in the siginfo format
    x32: Add ptrace for x32
    x32: Switch to a 64-bit clock_t
    x32: Provide separate is_ia32_task() and is_x32_task() predicates
    x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
    x86/x32: Fix the binutils auto-detect
    x32: Warn and disable rather than error if binutils too old
    x32: Only clear TIF_X32 flag once
    x32: Make sure TS_COMPAT is cleared for x32 tasks
    fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
    fs: Fix close_on_exec pointer in alloc_fdtable
    x32: Drop non-__vdso weak symbols from the x32 VDSO
    x32: Fix coding style violations in the x32 VDSO code
    x32: Add x32 VDSO support
    x32: Allow x32 to be configured
    x32: If configured, add x32 system calls to system call tables
    x32: Handle process creation
    x32: Signal-related system calls
    x86: Add #ifdef CONFIG_COMPAT to
    ...

    Linus Torvalds