22 Sep, 2011

3 commits

  • This is modeled after the smaps code.

    It detects transparent hugepages and then does a single gather_stats()
    for the page as a whole. This has two benefits:
    1. It is more efficient since it does many pages in a single shot.
    2. It does not have to break down the huge page.

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • gather_pte_stats() does a number of checks on a target page
    to see whether it should even be considered for statistics.
    This breaks that code out into a separate function so that
    we can use it in the transparent hugepage case in the next
    patch.

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Reviewed-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • We need to teach the numa_maps code about transparent huge pages. The
    first step is to teach gather_stats() that the pte it is dealing with
    might represent more than one page.

    Note that we will use this in a moment for transparent huge pages since
    they use a single pmd_t which _acts_ as a "surrogate" for a bunch
    of smaller pte_t's.

    I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes
    for hugetlbfs pages and PAGE_SIZE for normal pages. That means that to
    figure out how many _bytes_ "dirty=1" means, you must first know the
    hugetlbfs page size. That's easier said than done, especially if you
    don't have visibility into the mount.

    But, that's probably a discussion for another day especially since it
    would change behavior to fix it. But, just in case anyone wonders why
    this patch only passes a '1' in the hugetlb case...

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
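
The accounting change described in the three commits above can be sketched in user-space C. This is only an illustration under stated assumptions: struct numa_stats_sketch, gather_stats_sketch() and the four-node limit are hypothetical names and values, not the kernel's definitions. The figure of 512 base pages per huge page assumes 4 KiB base pages and a 2 MiB huge page.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for per-mapping NUMA statistics. */
struct numa_stats_sketch {
    unsigned long pages;      /* total base pages seen */
    unsigned long dirty;      /* base pages seen dirty */
    unsigned long node[4];    /* assumption: at most 4 NUMA nodes */
};

/*
 * Account nr_pages base pages with a single call. A normal pte passes 1;
 * a huge page mapped by one pmd passes the number of base pages it covers,
 * so the huge page never has to be broken down.
 */
void gather_stats_sketch(struct numa_stats_sketch *st, int nid,
                         int dirty, unsigned long nr_pages)
{
    st->pages += nr_pages;
    if (dirty)
        st->dirty += nr_pages;
    st->node[nid] += nr_pages;
}
```

One call with nr_pages set to 512 leaves the counters in the same state as 512 calls with nr_pages set to 1, which is where the efficiency gain comes from.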
     

07 Aug, 2011

2 commits

  • The CLOEXEC bit is magical, and for performance (and semantic) reasons we
    don't actually maintain it in the file descriptor itself, but in a
    separate bit array. Which means that when we show f_flags, the CLOEXEC
    status is shown incorrectly: we show the status not as it is now, but as
    it was when the file was opened.

    Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
    array.

    Uli needs this in order to re-implement the pfiles program:

    "For normal file descriptors (not sockets) this was the last piece of
    information which wasn't available. This is all part of my 'give
    Solaris users no reason to not switch' effort. I intend to offer the
    code to the util-linux-ng maintainers."

    Requested-by: Ulrich Drepper
    Signed-off-by: Linus Torvalds

    Linus Torvalds
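
From user space, the fact that close-on-exec is a per-descriptor property that can change after open() is visible through fcntl(). The following is a minimal sketch using the standard F_GETFD/F_SETFD commands; the helper names fd_is_cloexec() and fd_set_cloexec() are made up for this example.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd is close-on-exec right now, 0 if not, -1 on error. */
int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Sets (on != 0) or clears (on == 0) close-on-exec after the file was opened. */
int fd_set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}
```

Because the state can be flipped at any time like this, a value recorded at open time goes stale, which is the mismatch the commit above addresses.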
     
  • WARN_ONCE() is very annoying, in that it shows the stack trace that we
    don't care about at all, and also triggers various user-level "kernel
    oopsed" logic that we really don't care about. And it's not like the
    user can do anything about the applications (sshd) in question, it's a
    distro issue.

    Requested-by: Andi Kleen (and many others)
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Jul, 2011

1 commit

  • Since __proc_create() appends the name it is given to the end of the PDE
    structure that it allocates, there isn't a need to store a name pointer.
    Instead we can just replace the name pointer with a terminal char array of
    _unspecified_ length. The compiler will simply append the string to statically
    defined variables of PDE type overlapping any hole at the end of the structure
    and, unlike specifying an explicitly _zero_ length array, won't give a warning
    if you try to statically initialise it with a string of more than zero length.

    Also, whilst we're at it:

    (1) Move namelen to end just prior to name and reduce it to a single byte
    (name shouldn't be longer than NAME_MAX).

    (2) Move pde_unload_lock two places further on so that if it's four bytes in
    size on a 64-bit machine, it won't cause an unused hole in the PDE struct.

    Signed-off-by: David Howells
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    David Howells
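
The layout idea in the commit above (a one-byte length followed by a trailing char array of unspecified length, allocated together with the structure) can be sketched in portable C99. struct entry_sketch and entry_create() are hypothetical names for illustration; the real PDE structure has many more fields.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down analogue of a proc directory entry. */
struct entry_sketch {
    unsigned int  low_ino;
    unsigned char namelen;   /* a single byte is enough for names up to 255 */
    char          name[];    /* flexible array member: no separate pointer */
};

/* Allocates the entry and its name in one block; returns NULL on failure. */
struct entry_sketch *entry_create(const char *name)
{
    size_t len = strlen(name);
    struct entry_sketch *e;

    if (len == 0 || len > 255)
        return NULL;                    /* would not fit in namelen */

    e = malloc(sizeof(*e) + len + 1);   /* header + name + terminating NUL */
    if (!e)
        return NULL;

    e->low_ino = 0;
    e->namelen = (unsigned char)len;
    memcpy(e->name, name, len + 1);
    return e;
}
```

Statically initialising such a trailing array with a string, as the commit message mentions, relies on a GNU C extension and is therefore left out of this portable sketch.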
     

27 Jul, 2011

3 commits

  • This allows us to move duplicated code in
    (atomic_inc_not_zero() for now) to

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     
  • If an inode's mode permits opening /proc/PID/io and the resulting file
    descriptor is kept across execve() of a setuid or similar binary, the
    ptrace_may_access() check tries to prevent using this fd against the
    task with escalated privileges.

    Unfortunately, there is a race in the check against execve(). If
    execve() is processed after the ptrace check, but before the actual io
    information gathering, io statistics will be gathered from the
    privileged process. At least in theory this might lead to gathering
    sensitive information (like ssh/ftp password length) that wouldn't be
    available otherwise.

    Holding task->signal->cred_guard_mutex while gathering the io
    information should protect against the race.

    The order of locking is similar to the one inside of ptrace_attach():
    first goes cred_guard_mutex, then lock_task_sighand().

    Signed-off-by: Vasiliy Kulikov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Change the return value to ENOENT. This value is then returned
    when opening a proc entry that has been removed. For example,
    open("/proc/bus/pci/XX/YY") when the corresponding device is being
    hot-removed.

    Signed-off-by: Daisuke Ogino
    Cc: Jesse Barnes
    Acked-by: Alexey Dobriyan
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Ogino
     

26 Jul, 2011

1 commit

  • /proc/pid/oom_adj is deprecated and scheduled for removal in August 2012
    according to Documentation/feature-removal-schedule.txt.

    This patch makes the warning more verbose by making it appear as a more
    serious problem (the presence of a stack trace and being multiline should
    attract more attention) so that applications still using the old interface
    can get fixed.

    Very popular users of the old interface have been converted since the oom
    killer rewrite has been introduced. udevd switched to the
    /proc/pid/oom_score_adj interface for v162, kde switched in 4.6.1, and
    opensshd switched in 5.7p1.

    At the start of 2012, this should be changed into a WARN() to emit all
    such incidents and then finally remove the tunable in August 2012 as
    scheduled.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

23 Jul, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
    vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
    isofs: Remove global fs lock
    jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
    fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
    mm/truncate.c: fix build for CONFIG_BLOCK not enabled
    fs:update the NOTE of the file_operations structure
    Remove dead code in dget_parent()
    AFS: Fix silly characters in a comment
    switch d_add_ci() to d_splice_alias() in "found negative" case as well
    simplify gfs2_lookup()
    jfs_lookup(): don't bother with . or ..
    get rid of useless dget_parent() in btrfs rename() and link()
    get rid of useless dget_parent() in fs/btrfs/ioctl.c
    fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
    drivers: fix up various ->llseek() implementations
    fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
    Ext4: handle SEEK_HOLE/SEEK_DATA generically
    Btrfs: implement our own ->llseek
    fs: add SEEK_HOLE and SEEK_DATA flags
    reiserfs: make reiserfs default to barrier=flush
    ...

    Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
    shrinker callout for the inode cache, that clashed with the xfs code to
    start the periodic workers later.

    Linus Torvalds
     
  • * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
    ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
    ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
    connector: add an event for monitoring process tracers
    ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
    ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
    ptrace_init_task: initialize child->jobctl explicitly
    has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
    ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
    ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
    ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
    ptrace: ptrace_reparented() should check same_thread_group()
    redefine thread_group_leader() as exit_signal >= 0
    do not change dead_task->exit_signal
    kill task_detached()
    reparent_leader: check EXIT_DEAD instead of task_detached()
    make do_notify_parent() __must_check, update the callers
    __ptrace_detach: avoid task_detached(), check do_notify_parent()
    kill tracehook_notify_death()
    make do_notify_parent() return bool
    ptrace: s/tracehook_tracer_task()/ptrace_parent()/
    ...

    Linus Torvalds
     

21 Jul, 2011

1 commit

  • Moving the event counter into the dynamically allocated 'struct seq_file'
    allows poll() support without the need to allocate its own tracking
    structure.

    All current users are switched over to use the new counter.

    Requested-by: Andrew Morton akpm@linux-foundation.org
    Acked-by: NeilBrown
    Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
    Signed-off-by: Kay Sievers
    Signed-off-by: Al Viro

    Kay Sievers
     

20 Jul, 2011

4 commits


29 Jun, 2011

1 commit

  • /proc/PID/io may be used for gathering private information. E.g. for
    openssh and vsftpd daemons wchars/rchars may be used to learn the
    precise password length. Restrict it to processes being able to ptrace
    the target process.

    ptrace_may_access() is needed to prevent keeping open file descriptor of
    "io" file, executing setuid binary and gathering io information of the
    setuid'ed process.

    Signed-off-by: Vasiliy Kulikov
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

23 Jun, 2011

1 commit


20 Jun, 2011

2 commits


17 Jun, 2011

1 commit


16 Jun, 2011

1 commit


13 Jun, 2011

1 commit


30 May, 2011

1 commit


27 May, 2011

9 commits

  • This change introduces a few of the less controversial /proc and
    /proc/sys interfaces for tile, along with sysfs attributes for
    various things that were originally proposed as /proc/tile files.
    It also adjusts the "hardwall" proc API.

    Arnd Bergmann reviewed the initial arch/tile submission, which
    included a complete set of all the /proc/tile and /proc/sys/tile
    knobs that we had added in a somewhat ad hoc way during initial
    development, and provided feedback on where most of them should go.

    One knob turned out to be similar enough to the existing
    /proc/sys/debug/exception-trace that it was re-implemented to use
    that model instead.

    Another knob was /proc/tile/grid, which reported the "grid" dimensions
    of a tile chip (e.g. 8x8 processors = 64-core chip). Arnd suggested
    looking at sysfs for that, so this change moves that information
    to a pair of sysfs attributes (chip_width and chip_height) in the
    /sys/devices/system/cpu directory. We also put the "chip_serial"
    and "chip_revision" information from our old /proc/tile/board file
    as attributes in /sys/devices/system/cpu.

    Other information collected via hypervisor APIs is now placed in
    /sys/hypervisor. We create a /sys/hypervisor/type file (holding the
    constant string "tilera") to be parallel with the Xen use of
    /sys/hypervisor/type holding "xen". We create three top-level files,
    "version" (the hypervisor's own version), "config_version" (the
    version of the configuration file), and "hvconfig" (the contents of
    the configuration file). The remaining information from our old
    /proc/tile/board and /proc/tile/switch files becomes an attribute
    group appearing under /sys/hypervisor/board/.

    Finally, after some feedback from Arnd Bergmann for the previous
    version of this patch, the /proc/tile/hardwall file is split up into
    two conceptual parts. First, a directory /proc/tile/hardwall/ which
    contains one file per active hardwall, each file named after the
    hardwall's ID and holding a cpulist that says which cpus are enclosed by
    the hardwall. Second, a /proc/PID file "hardwall" that is either
    empty (for non-hardwall-using processes) or contains the hardwall ID.

    Finally, this change pushes the /proc/sys/tile/unaligned_fixup/
    directory, with knobs controlling the kernel code for handling the
    fixup of unaligned exceptions.

    Reviewed-by: Arnd Bergmann
    Signed-off-by: Chris Metcalf

    Chris Metcalf
     
  • The balloon driver in a Xen guest frees guest pages and marks them as
    mmio. When the kernel crashes and the crash kernel attempts to read the
    oldmem via /proc/vmcore a read from ballooned pages will generate 100%
    load in dom0 because Xen asks qemu-dm for the page content. Since the
    reads come in as 8byte requests each ballooned page is tried 512 times.

    With this change a hook can be registered which checks whether the given
    pfn is really ram. The hook has to return a value > 0 for ram pages, a
    value < 0 on error (because the hypercall is not known) and 0 for non-ram
    pages.

    This will reduce the time to read /proc/vmcore. Without this change a
    512M guest with 128M crashkernel region needs 200 seconds to read it, with
    this change it takes just 2 seconds.

    Signed-off-by: Olaf Hering
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Olaf Hering
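
The hook contract from the commit above (> 0 for ram, 0 for non-ram, < 0 on error) can be sketched as a single registered callback. All names here are hypothetical, including the example hook and its pfn ranges; treating an error like "is ram" is an assumption of this sketch, chosen so that an unknown page is still read rather than silently skipped.

```c
#include <stddef.h>

/* Contract: > 0 means ram, 0 means not ram, < 0 means the query failed. */
typedef int (*pfn_is_ram_fn)(unsigned long pfn);

static pfn_is_ram_fn pfn_is_ram_hook;   /* NULL while no hook is registered */

/* Registers the single hook; returns -1 if one is already installed. */
int register_pfn_is_ram_sketch(pfn_is_ram_fn fn)
{
    if (pfn_is_ram_hook)
        return -1;
    pfn_is_ram_hook = fn;
    return 0;
}

/* Returns 1 if the page should be read, 0 if it can be skipped. */
int should_read_pfn_sketch(unsigned long pfn)
{
    if (!pfn_is_ram_hook)
        return 1;                       /* no hook: old behaviour, read it */
    return pfn_is_ram_hook(pfn) != 0;   /* only a definite "not ram" skips */
}

/* Example hook with made-up ranges: pfns 100..199 are ballooned out. */
int example_balloon_hook(unsigned long pfn)
{
    if (pfn >= 1000)
        return -1;                      /* pretend the query is unsupported */
    return (pfn >= 100 && pfn < 200) ? 0 : 1;
}
```

The 512 attempts per page mentioned above follow from 4096-byte pages read in 8-byte requests (4096 / 8 = 512), so skipping a ballooned page avoids all of them at once.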
     
  • Currently, pagemap_read() has three error and/or corner case handling
    mistakes.

    (1) If the ppos parameter is wrong, the mm refcount is leaked.
    (2) If the count parameter is 0, the mm refcount is leaked too.
    (3) If the current task is sleeping in kmalloc() while the system
    is out of memory and the oom-killer kills the task associated with
    the proc file, the held mm refcount prevents that task from freeing
    its memory. The system may then hang.

    Cc: Hugh Dickins
    Cc: Jovi Zhang
    Acked-by: Hugh Dickins
    Cc: Stephen Wilson
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • It would be better to put check_mem_permission after __get_free_page in
    mem_write, the same as in mem_read.

    Hugh Dickins explained the reason.

    check_mem_permission gets a reference to the mm. If we __get_free_page
    after check_mem_permission, imagine what happens if the system is out
    of memory, and the mm we're looking at is selected for killing by the
    OOM killer: while we wait in __get_free_page for more memory, no memory
    is freed from the selected mm because it cannot reach exit_mmap while
    we hold that reference.

    Reported-by: Jovi Zhang
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Reviewed-by: Stephen Wilson
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
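
The ordering argument above can be sketched with a plain counter standing in for the mm reference. struct mm_sketch and the helpers are hypothetical; malloc() stands in for __get_free_page(), and the point is only the order of the steps: allocate first, take the reference second, and drop it on every path that took it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an object pinned by a reference count. */
struct mm_sketch {
    int users;
};

static void mm_get_sketch(struct mm_sketch *mm) { mm->users++; }
static void mm_put_sketch(struct mm_sketch *mm) { mm->users--; }

/* Returns the number of bytes handled, or -1 if the allocation failed. */
int mem_write_sketch(struct mm_sketch *mm, const char *src, size_t len)
{
    char *page;

    if (len == 0)
        return 0;             /* early exit: no reference taken, none leaked */

    page = malloc(len);       /* step 1: may stall or fail, nothing is pinned */
    if (!page)
        return -1;

    mm_get_sketch(mm);        /* step 2: pin the target only once memory is in hand */
    memcpy(page, src, len);   /* stand-in for the real copy into the target */
    mm_put_sketch(mm);        /* step 3: release before returning */

    free(page);
    return (int)len;
}
```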
     
  • There is a macro for the max size kmalloc can allocate, so use it instead
    of a hardcoded number.

    Signed-off-by: Yuanhan Liu
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yuanhan Liu
     
  • No need for this local array to be writable, so mark it const.

    Signed-off-by: Mike Frysinger
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Convert fs/proc/ from strict_strto*() to kstrto*() functions.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
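
For readers unfamiliar with the kstrto*() style of parsing, here is a rough user-space approximation built on strtoul(). It is not the kernel implementation: parse_ulong_strict() is a made-up name, it handles base 10 only, and it mirrors just the main conventions (0 on success, -EINVAL for malformed input, -ERANGE on overflow, at most one trailing newline accepted).

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Strictly parses a base-10 unsigned long; *res is written only on success. */
int parse_ulong_strict(const char *s, unsigned long *res)
{
    char *end;
    unsigned long val;

    /* strtoul() itself would accept leading spaces and '-', so reject early. */
    if (!isdigit((unsigned char)s[0]))
        return -EINVAL;

    errno = 0;
    val = strtoul(s, &end, 10);
    if (errno == ERANGE)
        return -ERANGE;

    if (*end == '\n')         /* tolerate a single trailing newline */
        end++;
    if (*end != '\0')         /* anything else after the digits is an error */
        return -EINVAL;

    *res = val;
    return 0;
}
```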
     
  • Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
    This was because exe_file was needed only for /proc//exe. Since we
    will need the exe_file functionality also for core dumps (so core name can
    contain full binary path), build this functionality always into the
    kernel.

    To achieve that, move it out of proc FS to kernel/, where in fact it
    should belong. By doing that we can make dup_mm_exe_file static. Also we
    can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor
    'unsigned int'. This patch fixes such misuse.

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
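
Why the narrower types are a misuse can be shown with a small sketch. vm_flags_sketch_t is a hypothetical typedef used only here; the check passes a flags value through 'unsigned int' and reports whether it came back unchanged. On platforms where 'unsigned long' is wider than 'unsigned int' (typical 64-bit Linux), any flag bit above the narrower type's range is lost.

```c
#include <limits.h>

/* Hypothetical stand-in for a dedicated flags typedef. */
typedef unsigned long vm_flags_sketch_t;

/* Returns 1 if flags survives a round trip through 'unsigned int'. */
int survives_unsigned_int(vm_flags_sketch_t flags)
{
    unsigned int narrowed = (unsigned int)flags;

    return (vm_flags_sketch_t)narrowed == flags;
}

/*
 * Returns the lowest flag bit that does not fit in 'unsigned int',
 * or 0 on platforms where both types have the same width.
 */
vm_flags_sketch_t first_high_flag(void)
{
    if (sizeof(vm_flags_sketch_t) > sizeof(unsigned int))
        return (vm_flags_sketch_t)1 << (sizeof(unsigned int) * CHAR_BIT);
    return 0;
}
```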
     

26 May, 2011

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd:
    net: fix get_net_ns_by_fd for !CONFIG_NET_NS
    ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
    ns: Declare sys_setns in syscalls.h
    net: Allow setting the network namespace by fd
    ns proc: Add support for the ipc namespace
    ns proc: Add support for the uts namespace
    ns proc: Add support for the network namespace.
    ns: Introduce the setns syscall
    ns: proc files for namespace naming policy.

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (89 commits)
    bonding: documentation and code cleanup for resend_igmp
    bonding: prevent deadlock on slave store with alb mode (v3)
    net: hold rtnl again in dump callbacks
    Add Fujitsu 1000base-SX PCI ID to tg3
    bnx2x: protect sequence increment with mutex
    sch_sfq: fix peek() implementation
    isdn: netjet - blacklist Digium TDM400P
    via-velocity: don't annotate MAC registers as packed
    xen: netfront: hold RTNL when updating features.
    sctp: fix memory leak of the ASCONF queue when free asoc
    net: make dev_disable_lro use physical device if passed a vlan dev (v2)
    net: move is_vlan_dev into public header file (v2)
    bug.h: Fix build with CONFIG_PRINTK disabled.
    wireless: fix fatal kernel-doc error + warning in mac80211.h
    wireless: fix cfg80211.h new kernel-doc warnings
    iwlagn: dbg_fixed_rate only used when CONFIG_MAC80211_DEBUGFS enabled
    dst: catch uninitialized metrics
    be2net: hash key for rss-config cmd not set
    bridge: initialize fake_rtable metrics
    net: fix __dst_destroy_metrics_generic()
    ...

    Fix up trivial conflicts in drivers/staging/brcm80211/brcmfmac/wl_cfg80211.c

    Linus Torvalds
     

25 May, 2011

4 commits

  • In show_numa_map() we collect statistics into a numa_maps structure.
    Since the number of NUMA nodes can be very large, this structure is not a
    candidate for stack allocation.

    Instead of going thru a kmalloc()+kfree() cycle each time show_numa_map()
    is invoked, perform the allocation just once when /proc/pid/numa_maps is
    opened.

    Performing the allocation when numa_maps is opened, and thus before a
    reference to the target task's mm is taken, eliminates a potential
    stalemate condition in the oom-killer as originally described by Hugh
    Dickins:

    ... imagine what happens if the system is out of memory, and the mm
    we're looking at is selected for killing by the OOM killer: while
    we wait in __get_free_page for more memory, no memory is freed
    from the selected mm because it cannot reach exit_mmap while we hold
    that reference.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
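
The "allocate once at open, reuse on every show" pattern from the commit above can be sketched as follows. The structure and function names are hypothetical and the per-page node lookup is reduced to an array of node ids; the sketch only shows that the scratch buffer lives as long as the open file and is reset, not reallocated, for each mapping.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-open state holding a scratch buffer sized by node count. */
struct numa_file_sketch {
    size_t nr_nodes;
    unsigned long *node;    /* allocated once in open, freed in release */
};

/* Open: the only place that allocates. nr_nodes must be greater than 0. */
struct numa_file_sketch *numa_file_open(size_t nr_nodes)
{
    struct numa_file_sketch *f;

    if (nr_nodes == 0)
        return NULL;
    f = malloc(sizeof(*f));
    if (!f)
        return NULL;
    f->node = calloc(nr_nodes, sizeof(*f->node));
    if (!f->node) {
        free(f);
        return NULL;
    }
    f->nr_nodes = nr_nodes;
    return f;
}

/* Show: called once per mapping; resets and reuses the buffer. */
void numa_file_show(struct numa_file_sketch *f,
                    const int *page_nodes, size_t nr_pages)
{
    size_t i;

    memset(f->node, 0, f->nr_nodes * sizeof(*f->node));
    for (i = 0; i < nr_pages; i++) {
        if (page_nodes[i] >= 0 && (size_t)page_nodes[i] < f->nr_nodes)
            f->node[page_nodes[i]]++;
    }
}

/* Release: frees what open allocated. */
void numa_file_release(struct numa_file_sketch *f)
{
    free(f->node);
    free(f);
}
```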
     
  • Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps
    there is no need to export struct proc_maps_private to the world. Move it
    to fs/proc/internal.h instead.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
    issues.

    - Having the show() operation "miles away" from the corresponding
    seq_file iteration operations is a maintenance burden.

    - The need to export ad hoc info like struct proc_maps_private is
    eliminated.

    - The implementation of show_numa_map() can be improved in a simple
    manner by cooperating with the other seq_file operations (start,
    stop, etc) -- something that would be messy to do without this
    change.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Spotted-by: Nathan Lynch
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman