11 Jan, 2012

5 commits

  • Add support for mount options to restrict access to /proc/PID/
    directories. The default backward-compatible "relaxed" behaviour is left
    untouched.

    The first mount option is called "hidepid" and its value defines how much
    info about processes we want to be available for non-owners:

    hidepid=0 (default) means the old behavior - anybody may read all
    world-readable /proc/PID/* files.

    hidepid=1 means users may not access any /proc// directories, but
    their own. Sensitive files like cmdline, sched*, status are now protected
    against other users. As permission checking done in proc_pid_permission()
    and files' permissions are left untouched, programs expecting specific
    files' modes are not confused.

    hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
    users. It doesn't mean that it hides whether a process exists (it can be
    learned by other means, e.g. by kill -0 $PID), but it hides process' euid
    and egid. It compicates intruder's task of gathering info about running
    processes, whether some daemon runs with elevated privileges, whether
    another user runs some sensitive program, whether other users run any
    program at all, etc.

    gid=XXX defines a group that will be able to gather all processes' info
    (as in hidepid=0 mode). This group should be used instead of putting
    nonroot user in sudoers file or something. However, untrusted users (like
    daemons, etc.) which are not supposed to monitor the tasks in the whole
    system should not be added to the group.

    hidepid=1 or higher is designed to restrict access to procfs files, which
    might reveal some sensitive private information like precise keystrokes
    timings:

    http://www.openwall.com/lists/oss-security/2011/11/05/3

    hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
    conky gracefully handle EPERM/ENOENT and behave as if the current user is
    the only user running processes. pstree shows the process subtree which
    contains "pstree" process.

    Note: the patch doesn't deal with setuid/setgid issues of keeping
    preopened descriptors of procfs files (like
    https://lkml.org/lkml/2011/2/7/368). We rely on that the leaked
    information like the scheduling counters of setuid apps doesn't threaten
    anybody's privacy - only the user started the setuid program may read the
    counters.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Add support for procfs mount options. Actual mount options are coming in
    the next patches.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • This one behaves similarly to the /proc//fd/ one - it contains
    symlinks one for each mapping with file, the name of a symlink is
    "vma->vm_start-vma->vm_end", the target is the file. Opening a symlink
    results in a file that point exactly to the same inode as them vma's one.

    For example the ls -l of some arbitrary /proc//map_files/

    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

    This *helps* checkpointing process in three ways:

    1. When dumping a task mappings we do know exact file that is mapped
    by particular region. We do this by opening
    /proc/$pid/map_files/$address symlink the way we do with file
    descriptors.

    2. This also helps in determining which anonymous shared mappings are
    shared with each other by comparing the inodes of them.

    3. When restoring a set of processes in case two of them has a mapping
    shared, we map the memory by the 1st one and then open its
    /proc/$pid/map_files/$address file and map it by the 2nd task.

    Using /proc/$pid/maps for this is quite inconvenient since it brings
    repeatable re-reading and reparsing for this text file which slows down
    restore procedure significantly. Also as being pointed in (3) it is a way
    easier to use top level shared mapping in children as
    /proc/$pid/map_files/$address when needed.

    [akpm@linux-foundation.org: coding-style fixes]
    [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Vasiliy Kulikov
    Reviewed-by: "Kirill A. Shutemov"
    Cc: Tejun Heo
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Prepare the ground for the next "map_files" patch which needs a name of a
    link file to analyse.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Cc: Vasiliy Kulikov
    Cc: "Kirill A. Shutemov"
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • oom_score_adj is used for guarding processes from OOM-Killer. One of
    problem is that it's inherited at fork(). When a daemon set oom_score_adj
    and make children, it's hard to know where the value is set.

    This patch adds some tracepoints useful for debugging. This patch adds
    3 trace points.
    - creating new task
    - renaming a task (exec)
    - set oom_score_adj

    To debug, users need to enable some trace pointer. Maybe filtering is useful as

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    output will be like this.
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

09 Jan, 2012

1 commit

  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
    reiserfs: Properly display mount options in /proc/mounts
    vfs: prevent remount read-only if pending removes
    vfs: count unlinked inodes
    vfs: protect remounting superblock read-only
    vfs: keep list of mounts for each superblock
    vfs: switch ->show_options() to struct dentry *
    vfs: switch ->show_path() to struct dentry *
    vfs: switch ->show_devname() to struct dentry *
    vfs: switch ->show_stats to struct dentry *
    switch security_path_chmod() to struct path *
    vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
    vfs: trim includes a bit
    switch mnt_namespace ->root to struct mount
    vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
    vfs: opencode mntget() mnt_set_mountpoint()
    vfs: spread struct mount - remaining argument of next_mnt()
    vfs: move fsnotify junk to struct mount
    vfs: move mnt_devname
    vfs: move mnt_list to struct mount
    vfs: switch pnode.h macros to struct mount *
    ...

    Linus Torvalds
     

07 Jan, 2012

2 commits

  • Al Viro
     
  • * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    sched/tracing: Add a new tracepoint for sleeptime
    sched: Disable scheduler warnings during oopses
    sched: Fix cgroup movement of waking process
    sched: Fix cgroup movement of newly created process
    sched: Fix cgroup movement of forking process
    sched: Remove cfs bandwidth period check in tg_set_cfs_period()
    sched: Fix load-balance lock-breaking
    sched: Replace all_pinned with a generic flags field
    sched: Only queue remote wakeups when crossing cache boundaries
    sched: Add missing rcu_dereference() around ->real_parent usage
    [S390] fix cputime overflow in uptime_proc_show
    [S390] cputime: add sparse checking and cleanup
    sched: Mark parent and real_parent as __rcu
    sched, nohz: Fix missing RCU read lock
    sched, nohz: Set the NOHZ_BALANCE_KICK flag for idle load balancer
    sched, nohz: Fix the idle cpu check in nohz_idle_balance
    sched: Use jump_labels for sched_feat
    sched/accounting: Fix parameter passing in task_group_account_field
    sched/accounting: Fix user/system tick double accounting
    sched/accounting: Re-use scheduler statistics for the root cgroup
    ...

    Fix up conflicts in
    - arch/ia64/include/asm/cputime.h, include/asm-generic/cputime.h
    usecs_to_cputime64() vs the sparse cleanups
    - kernel/sched/fair.c, kernel/time/tick-sched.c
    scheduler changes in multiple branches

    Linus Torvalds
     

04 Jan, 2012

4 commits


30 Dec, 2011

1 commit

  • Commit 2a95ea6c0d129b4 ("procfs: do not overflow get_{idle,iowait}_time
    for nohz") did not take into account that one some architectures jiffies
    and cputime use different units.

    This causes get_idle_time() to return numbers in the wrong units, making
    the idle time fields in /proc/stat wrong.

    Instead of converting the usec value returned by
    get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function
    usecs_to_cputime64 to convert it to the correct unit of cputime64_t.

    Signed-off-by: Andreas Schwab
    Acked-by: Michal Hocko
    Cc: Arnd Bergmann
    Cc: "Artem S. Tashkinov"
    Cc: Dave Jones
    Cc: Alexey Dobriyan
    Cc: Thomas Gleixner
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Schwab
     

20 Dec, 2011

1 commit


15 Dec, 2011

4 commits


09 Dec, 2011

3 commits

  • Since commit a25cac5198d4 ("proc: Consider NO_HZ when printing idle and
    iowait times") we are reporting idle/io_wait time also while a CPU is
    tickless. We rely on get_{idle,iowait}_time functions to retrieve
    proper data.

    These functions, however, use usecs_to_cputime to translate micro
    seconds time to cputime64_t. This is just an alias to usecs_to_jiffies
    which reduces the data type from u64 to unsigned int and also checks
    whether the given parameter overflows jiffies_to_usecs(MAX_JIFFY_OFFSET)
    and returns MAX_JIFFY_OFFSET in that case.

    When we overflow depends on CONFIG_HZ but especially for CONFIG_HZ_300
    it is quite low (1431649781) so we are getting MAX_JIFFY_OFFSET for
    >3000s! until we overflow unsigned int. Just for reference
    CONFIG_HZ_100 has an overflow window around 20s, CONFIG_HZ_250 ~8s and
    CONFIG_HZ_1000 ~2s.

    This results in a bug when people saw [h]top going mad reporting 100%
    CPU usage even though there was basically no CPU load. The reason was
    simply that /proc/stat stopped reporting idle/io_wait changes (and
    reported MAX_JIFFY_OFFSET) and so the only change happening was for user
    system time.

    Let's use nsecs_to_jiffies64 instead which doesn't reduce the precision
    to 32b type and it is much more appropriate for cumulative time values
    (unlike usecs_to_jiffies which intended for timeout calculations).

    Signed-off-by: Michal Hocko
    Tested-by: Artem S. Tashkinov
    Cc: Dave Jones
    Cc: Arnd Bergmann
    Cc: Alexey Dobriyan
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Fix the error message "directives may not be used inside a macro argument"
    which appears when the kernel is compiled for the cris architecture.

    Signed-off-by: Claudio Scordino
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Scordino
     
  • kern_mount() doesn't pair with plain mntput()...

    Signed-off-by: Al Viro

    Al Viro
     

06 Dec, 2011

1 commit

  • This patch changes fields in cpustat from a structure, to an
    u64 array. Math gets easier, and the code is more flexible.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Paul Tuner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1322498719-2255-2-git-send-email-glommer@parallels.com
    Signed-off-by: Ingo Molnar

    Glauber Costa
     

10 Nov, 2011

1 commit

  • This reverts commit aa6afca5bcaba8101f3ea09d5c3e4100b2b9f0e5.

    It escalates of some of the google-chrome SELinux problems with ptrace
    ("Check failed: pid_ > 0. Did not find zygote process"), and Andrew
    says that it is also causing mystery lockdep reports.

    Reported-by: Alex Villacís Lasso
    Requested-by: James Morris
    Requested-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

03 Nov, 2011

4 commits

  • Says Andrew:

    "60 patches. That's good enough for -rc1 I guess. I have quite a lot
    of detritus to be rechecked, work through maintainers, etc.

    - most of the remains of MM
    - rtc
    - various misc
    - cgroups
    - memcg
    - cpusets
    - procfs
    - ipc
    - rapidio
    - sysctl
    - pps
    - w1
    - drivers/misc
    - aio"

    * akpm: (60 commits)
    memcg: replace ss->id_lock with a rwlock
    aio: allocate kiocbs in batches
    drivers/misc/vmw_balloon.c: fix typo in code comment
    drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop
    w1: disable irqs in critical section
    drivers/w1/w1_int.c: multiple masters used same init_name
    drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal
    drivers/power/ds2780_battery.c: add a nolock function to w1 interface
    drivers/power/ds2780_battery.c: create central point for calling w1 interface
    w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it
    pps gpio client: add missing dependency
    pps: new client driver using GPIO
    pps: default echo function
    include/linux/dma-mapping.h: add dma_zalloc_coherent()
    sysctl: make CONFIG_SYSCTL_SYSCALL default to n
    sysctl: add support for poll()
    RapidIO: documentation update
    drivers/net/rionet.c: fix ethernet address macros for LE platforms
    RapidIO: fix potential null deref in rio_setup_device()
    RapidIO: add mport driver for Tsi721 bridge
    ...

    Linus Torvalds
     
  • Adding support for poll() in sysctl fs allows userspace to receive
    notifications of changes in sysctl entries. This adds a infrastructure to
    allow files in sysctl fs to be pollable and implements it for hostname and
    domainname.

    [akpm@linux-foundation.org: s/declare/define/ for definitions]
    Signed-off-by: Lucas De Marchi
    Cc: Greg KH
    Cc: Kay Sievers
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lucas De Marchi
     
  • fd* files are restricted to the task's owner, and other users may not get
    direct access to them. But one may open any of these files and run any
    setuid program, keeping opened file descriptors. As there are permission
    checks on open(), but not on readdir() and read(), operations on the kept
    file descriptors will not be checked. It makes it possible to violate
    procfs permission model.

    Reading fdinfo/* may disclosure current fds' position and flags, reading
    directory contents of fdinfo/ and fd/ may disclosure the number of opened
    files by the target task. This information is not sensible per se, but it
    can reveal some private information (like length of a password stored in a
    file) under certain conditions.

    Used existing (un)lock_trace functions to check for ptrace_may_access(),
    but instead of using EPERM return code from it use EACCES to be consistent
    with existing proc_pid_follow_link()/proc_pid_readlink() return code. If
    they differ, attacker can guess what fds exist by analyzing stat() return
    code. Patched handlers: stat() for fd/*, stat() and read() for fdindo/*,
    readdir() and lookup() for fd/ and fdinfo/.

    Signed-off-by: Vasiliy Kulikov
    Cc: Cyrill Gorcunov
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • On reading sysctl dirs we should return -EISDIR instead of -EINVAL.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

02 Nov, 2011

2 commits


01 Nov, 2011

4 commits

  • Some kernel components pin user space memory (infiniband and perf) (by
    increasing the page count) and account that memory as "mlocked".

    The difference between mlocking and pinning is:

    A. mlocked pages are marked with PG_mlocked and are exempt from
    swapping. Page migration may move them around though.
    They are kept on a special LRU list.

    B. Pinned pages cannot be moved because something needs to
    directly access physical memory. They may not be on any
    LRU list.

    I recently saw an mlockalled process where mm->locked_vm became
    bigger than the virtual size of the process (!) because some
    memory was accounted for twice:

    Once when the page was mlocked and once when the Infiniband
    layer increased the refcount because it needt to pin the RDMA
    memory.

    This patch introduces a separate counter for pinned pages and
    accounts them seperately.

    Signed-off-by: Christoph Lameter
    Cc: Mike Marciniszyn
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This removes mm->oom_disable_count entirely since it's unnecessary and
    currently buggy. The counter was intended to be per-process but it's
    currently decremented in the exit path for each thread that exits, causing
    it to underflow.

    The count was originally intended to prevent oom killing threads that
    share memory with threads that cannot be killed since it doesn't lead to
    future memory freeing. The counter could be fixed to represent all
    threads sharing the same mm, but it's better to remove the count since:

    - it is possible that the OOM_DISABLE thread sharing memory with the
    victim is waiting on that thread to exit and will actually cause
    future memory freeing, and

    - there is no guarantee that a thread is disabled from oom killing just
    because another thread sharing its mm is oom disabled.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm:
    use walk_page_range() instead of custom page table walking code").

    Reported-by: Stephen Hemminger
    Tested-by: Stephen Hemminger
    Reviewed-by: Stephen Wilson
    Cc: KOSAKI Motohiro
    Cc: Hugh Dickins
    Acked-by: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • These files were getting via an implicit include
    path, but we want to crush those out of existence since they cost
    time during compiles of processing thousands of lines of headers
    for no reason. Give them the lightweight header that just contains
    the EXPORT_SYMBOL infrastructure.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

26 Oct, 2011

1 commit

  • * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    time, s390: Get rid of compile warning
    dw_apb_timer: constify clocksource name
    time: Cleanup old CONFIG_GENERIC_TIME references that snuck in
    time: Change jiffies_to_clock_t() argument type to unsigned long
    alarmtimers: Fix error handling
    clocksource: Make watchdog reset lockless
    posix-cpu-timers: Cure SMP accounting oddities
    s390: Use direct ktime path for s390 clockevent device
    clockevents: Add direct ktime programming function
    clockevents: Make minimum delay adjustments configurable
    nohz: Remove "Switched to NOHz mode" debugging messages
    proc: Consider NO_HZ when printing idle and iowait times
    nohz: Make idle/iowait counter update conditional
    nohz: Fix update_ts_time_stat idle accounting
    cputime: Clean up cputime_to_usecs and usecs_to_cputime macros
    alarmtimers: Rework RTC device selection using class interface
    alarmtimers: Add try_to_cancel functionality
    alarmtimers: Add more refined alarm state tracking
    alarmtimers: Remove period from alarm structure
    alarmtimers: Remove interval cap limit hack
    ...

    Linus Torvalds
     

22 Sep, 2011

3 commits

  • This is modeled after the smaps code.

    It detects transparent hugepages and then does a single gather_stats()
    for the page as a whole. This has two benifits:
    1. It is more efficient since it does many pages in a single shot.
    2. It does not have to break down the huge page.

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • gather_pte_stats() does a number of checks on a target page
    to see whether it should even be considered for statistics.
    This breaks that code out in to a separate function so that
    we can use it in the transparent hugepage case in the next
    patch.

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Reviewed-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • We need to teach the numa_maps code about transparent huge pages. The
    first step is to teach gather_stats() that the pte it is dealing with
    might represent more than one page.

    Note that will we use this in a moment for transparent huge pages since
    they have use a single pmd_t which _acts_ as a "surrogate" for a bunch
    of smaller pte_t's.

    I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes
    for hugetlbfs pages and PAGE_SIZE for normal pages. That means that to
    figure out how many _bytes_ "dirty=1" means, you must first know the
    hugetlbfs page size. That's easier said than done especially if you
    don't have visibility in to the mount.

    But, that's probably a discussion for another day especially since it
    would change behavior to fix it. But, just in case anyone wonders why
    this patch only passes a '1' in the hugetlb case...

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

08 Sep, 2011

1 commit

  • show_stat handler of the /proc/stat file relies on kstat_cpu(cpu)
    statistics when priting information about idle and iowait times.
    This is OK if we are not using tickless kernel (CONFIG_NO_HZ) because
    counters are updated periodically.
    With NO_HZ things got more tricky because we are not doing idle/iowait
    accounting while we are tickless so the value might get outdated.
    Users of /proc/stat will notice that by unchanged idle/iowait values
    which is then interpreted as 0% idle/iowait time. From the user space
    POV this is an unexpected behavior and a change of the interface.

    Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the
    total idle/iowait time since boot and it doesn't rely on sampling or any
    other periodic activity. Fall back to the previous behavior if NO_HZ is
    disabled or not configured.

    Signed-off-by: Michal Hocko
    Cc: Dave Jones
    Cc: Arnd Bergmann
    Cc: Alexey Dobriyan
    Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.cz
    Signed-off-by: Thomas Gleixner

    Michal Hocko
     

07 Aug, 2011

1 commit

  • The CLOEXE bit is magical, and for performance (and semantic) reasons we
    don't actually maintain it in the file descriptor itself, but in a
    separate bit array. Which means that when we show f_flags, the CLOEXE
    status is shown incorrectly: we show the status not as it is now, but as
    it was when the file was opened.

    Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
    array.

    Uli needs this in order to re-implement the pfiles program:

    "For normal file descriptors (not sockets) this was the last piece of
    information which wasn't available. This is all part of my 'give
    Solaris users no reason to not switch' effort. I intend to offer the
    code to the util-linux-ng maintainers."

    Requested-by: Ulrich Drepper
    Signed-off-by: Linus Torvalds

    Linus Torvalds