30 May, 2011

1 commit


27 May, 2011

9 commits

  • This change introduces a few of the less controversial /proc and
    /proc/sys interfaces for tile, along with sysfs attributes for
    various things that were originally proposed as /proc/tile files.
    It also adjusts the "hardwall" proc API.

    Arnd Bergmann reviewed the initial arch/tile submission, which
    included a complete set of all the /proc/tile and /proc/sys/tile
    knobs that we had added in a somewhat ad hoc way during initial
    development, and provided feedback on where most of them should go.

    One knob turned out to be similar enough to the existing
    /proc/sys/debug/exception-trace that it was re-implemented to use
    that model instead.

    Another knob was /proc/tile/grid, which reported the "grid" dimensions
    of a tile chip (e.g. 8x8 processors = 64-core chip). Arnd suggested
    looking at sysfs for that, so this change moves that information
    to a pair of sysfs attributes (chip_width and chip_height) in the
    /sys/devices/system/cpu directory. We also put the "chip_serial"
    and "chip_revision" information from our old /proc/tile/board file
    as attributes in /sys/devices/system/cpu.

    Other information collected via hypervisor APIs is now placed in
    /sys/hypervisor. We create a /sys/hypervisor/type file (holding the
    constant string "tilera") to be parallel with the Xen use of
    /sys/hypervisor/type holding "xen". We create three top-level files,
    "version" (the hypervisor's own version), "config_version" (the
    version of the configuration file), and "hvconfig" (the contents of
    the configuration file). The remaining information from our old
    /proc/tile/board and /proc/tile/switch files becomes an attribute
    group appearing under /sys/hypervisor/board/.

    In addition, after some feedback from Arnd Bergmann on the previous
    version of this patch, the /proc/tile/hardwall file is split up into
    two conceptual parts. First, a directory /proc/tile/hardwall/ which
    contains one file per active hardwall, each file named after the
    hardwall's ID and holding a cpulist that says which cpus are enclosed by
    the hardwall. Second, a /proc/PID file "hardwall" that is either
    empty (for non-hardwall-using processes) or contains the hardwall ID.

    Finally, this change pushes the /proc/sys/tile/unaligned_fixup/
    directory, with knobs controlling the kernel code for handling the
    fixup of unaligned exceptions.

    Reviewed-by: Arnd Bergmann
    Signed-off-by: Chris Metcalf

    Chris Metcalf
     
  • The balloon driver in a Xen guest frees guest pages and marks them as
    mmio. When the kernel crashes and the crash kernel attempts to read the
    oldmem via /proc/vmcore a read from ballooned pages will generate 100%
    load in dom0 because Xen asks qemu-dm for the page content. Since the
    reads come in as 8byte requests each ballooned page is tried 512 times.

    With this change a hook can be registered which checks whether the given
    pfn is really ram. The hook has to return a value > 0 for ram pages, a
    value < 0 on error (because the hypercall is not known) and 0 for non-ram
    pages.

    This will reduce the time to read /proc/vmcore. Without this change a
    512M guest with 128M crashkernel region needs 200 seconds to read it, with
    this change it takes just 2 seconds.

    Signed-off-by: Olaf Hering
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Olaf Hering
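
The return-value convention of the hook described in the commit above (> 0 for RAM, 0 for non-RAM, < 0 on error) can be sketched in user space roughly as follows. Every name here (register_pfn_is_ram, page_is_ram, read_old_page, example_balloon_hook) is hypothetical and not taken from the kernel source, and treating an error as "still RAM" is an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hook type: > 0 means RAM, 0 means not RAM, < 0 means error. */
typedef int (*pfn_is_ram_fn)(unsigned long pfn);

static pfn_is_ram_fn pfn_is_ram_hook; /* NULL until a hook is registered */

/* Register a hook; returns -1 if one is already installed. */
int register_pfn_is_ram(pfn_is_ram_fn fn)
{
    if (pfn_is_ram_hook)
        return -1;
    pfn_is_ram_hook = fn;
    return 0;
}

/* A page counts as RAM unless a registered hook returns exactly 0.
 * Assumption of this sketch: an error (< 0) falls back to "read it". */
int page_is_ram(unsigned long pfn)
{
    if (!pfn_is_ram_hook)
        return 1;
    return pfn_is_ram_hook(pfn) != 0;
}

/* Copy one page, or zero-fill it when it is known not to be RAM. */
void read_old_page(unsigned long pfn, const unsigned char *src,
                   unsigned char *dst, size_t len)
{
    if (page_is_ram(pfn))
        memcpy(dst, src, len);
    else
        memset(dst, 0, len);
}

/* Example hook (made up): pretend pfns 100..199 were ballooned out. */
int example_balloon_hook(unsigned long pfn)
{
    return (pfn >= 100 && pfn < 200) ? 0 : 1;
}
```

With such a hook installed, a reader can skip the expensive per-request lookup for ballooned pages and hand back zeroes instead, which is the effect the commit message attributes to the change.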
     
  • Currently, pagemap_read() mishandles three error and corner cases.

    (1) If the ppos parameter is wrong, the mm refcount is leaked.
    (2) If the count parameter is 0, the mm refcount is leaked too.
    (3) If the current task is sleeping in kmalloc() while the system
    is out of memory and the oom-killer kills the task associated with
    the proc file, the held mm refcount prevents that task from freeing
    its memory, and the system may hang.

    Cc: Hugh Dickins
    Cc: Jovi Zhang
    Acked-by: Hugh Dickins
    Cc: Stephen Wilson
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
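
The three cases listed in the commit above share one remedy: validate arguments and allocate before taking the mm reference, and release the reference on every path. A minimal user-space model of that ordering is sketched below. struct fake_mm, mm_get, mm_put and model_pagemap_read are hypothetical stand-ins, and the 8-byte entry size is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdlib.h>

struct fake_mm { int users; };  /* stand-in for an mm reference count */

static void mm_get(struct fake_mm *mm) { mm->users++; }
static void mm_put(struct fake_mm *mm) { mm->users--; }

/* Returns the byte count on success, 0 for an empty read, < 0 on error.
 * The reference count is balanced on every return path. */
long model_pagemap_read(struct fake_mm *mm, unsigned long pos, size_t count)
{
    void *buf;
    long ret;

    /* Cheap argument checks first: no reference is held yet,
     * so returning early cannot leak one (cases 1 and 2). */
    if (pos % 8 || count % 8)
        return -1;
    if (count == 0)
        return 0;

    /* Allocate before pinning the mm, so the caller never waits in the
     * allocator while holding the reference (case 3). */
    buf = malloc(count);
    if (!buf)
        return -2;

    mm_get(mm);
    ret = (long)count;   /* the real work would fill buf here */
    mm_put(mm);

    free(buf);
    return ret;
}
```

The point of the ordering is that nothing between mm_get() and mm_put() can block on memory or return early.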
     
  • It would be better to call check_mem_permission after __get_free_page in
    mem_write, matching what mem_read does.

    Hugh Dickins explained the reason.

    check_mem_permission gets a reference to the mm. If we __get_free_page
    after check_mem_permission, imagine what happens if the system is out
    of memory, and the mm we're looking at is selected for killing by the
    OOM killer: while we wait in __get_free_page for more memory, no memory
    is freed from the selected mm because it cannot reach exit_mmap while
    we hold that reference.

    Reported-by: Jovi Zhang
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Reviewed-by: Stephen Wilson
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • There is a macro for the max size kmalloc can allocate, so use it instead
    of a hardcoded number.

    Signed-off-by: Yuanhan Liu
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yuanhan Liu
     
  • No need for this local array to be writable, so mark it const.

    Signed-off-by: Mike Frysinger
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Convert fs/proc/ from strict_strto*() to kstrto*() functions.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
    This was because exe_file was needed only for /proc//exe. Since we
    will need the exe_file functionality also for core dumps (so the core name
    can contain the full binary path), build this functionality into the
    kernel unconditionally.

    To achieve that, move it out of proc FS into kernel/, where it in fact
    belongs. By doing that we can make dup_mm_exe_file static. Also we
    can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor
    'unsigned int'. This patch fixes such misuse.

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

26 May, 2011

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd:
    net: fix get_net_ns_by_fd for !CONFIG_NET_NS
    ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
    ns: Declare sys_setns in syscalls.h
    net: Allow setting the network namespace by fd
    ns proc: Add support for the ipc namespace
    ns proc: Add support for the uts namespace
    ns proc: Add support for the network namespace.
    ns: Introduce the setns syscall
    ns: proc files for namespace naming policy.

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (89 commits)
    bonding: documentation and code cleanup for resend_igmp
    bonding: prevent deadlock on slave store with alb mode (v3)
    net: hold rtnl again in dump callbacks
    Add Fujitsu 1000base-SX PCI ID to tg3
    bnx2x: protect sequence increment with mutex
    sch_sfq: fix peek() implementation
    isdn: netjet - blacklist Digium TDM400P
    via-velocity: don't annotate MAC registers as packed
    xen: netfront: hold RTNL when updating features.
    sctp: fix memory leak of the ASCONF queue when free asoc
    net: make dev_disable_lro use physical device if passed a vlan dev (v2)
    net: move is_vlan_dev into public header file (v2)
    bug.h: Fix build with CONFIG_PRINTK disabled.
    wireless: fix fatal kernel-doc error + warning in mac80211.h
    wireless: fix cfg80211.h new kernel-doc warnings
    iwlagn: dbg_fixed_rate only used when CONFIG_MAC80211_DEBUGFS enabled
    dst: catch uninitialized metrics
    be2net: hash key for rss-config cmd not set
    bridge: initialize fake_rtable metrics
    net: fix __dst_destroy_metrics_generic()
    ...

    Fix up trivial conflicts in drivers/staging/brcm80211/brcmfmac/wl_cfg80211.c

    Linus Torvalds
     

25 May, 2011

5 commits

  • In show_numa_map() we collect statistics into a numa_maps structure.
    Since the number of NUMA nodes can be very large, this structure is not a
    candidate for stack allocation.

    Instead of going through a kmalloc()+kfree() cycle each time show_numa_map()
    is invoked, perform the allocation just once when /proc/pid/numa_maps is
    opened.

    Performing the allocation when numa_maps is opened, and thus before a
    reference to the target task's mm is taken, eliminates a potential
    stalemate condition in the oom-killer as originally described by Hugh
    Dickins:

    ... imagine what happens if the system is out of memory, and the mm
    we're looking at is selected for killing by the OOM killer: while
    we wait in __get_free_page for more memory, no memory is freed
    from the selected mm because it cannot reach exit_mmap while we hold
    that reference.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps
    there is no need to export struct proc_maps_private to the world. Move it
    to fs/proc/internal.h instead.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
    issues.

    - Having the show() operation "miles away" from the corresponding
    seq_file iteration operations is a maintenance burden.

    - The need to export ad hoc info like struct proc_maps_private is
    eliminated.

    - The implementation of show_numa_map() can be improved in a simple
    manner by cooperating with the other seq_file operations (start,
    stop, etc) -- something that would be messy to do without this
    change.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Spotted-by: Nathan Lynch
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • John W. Linville
     

17 May, 2011

1 commit


11 May, 2011

4 commits

  • Acked-by: Daniel Lezcano
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Acked-by: Daniel Lezcano
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Implementing file descriptors for the network namespace
    is simple and straightforward.

    Acked-by: David S. Miller
    Acked-by: Daniel Lezcano
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Create files under /proc//ns/ to allow controlling the
    namespaces of a process.

    This addresses three specific problems that can make namespaces hard to
    work with.
    - Namespaces require a dedicated process to pin them in memory.
    - It is not possible to use a namespace unless you are the child
    of the original creator.
    - Namespaces don't have names that userspace can use to talk about
    them.

    The namespace files under /proc//ns/ can be opened and the
    file descriptor can be used to talk about a specific namespace, and
    to keep the specified namespace alive.

    A namespace can be kept alive by either holding the file descriptor
    open or bind mounting the file someplace else. aka:
    mount --bind /proc/self/ns/net /some/filesystem/path
    mount --bind /proc/self/fd/ /some/filesystem/path

    This allows namespaces to be named with userspace policy.

    It requires additional support to make use of these file descriptors
    and that will be coming in the following patches.

    Acked-by: Daniel Lezcano
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
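
As a rough illustration of the commit above, the sketch below opens a namespace file and returns the descriptor, which (per the commit message) keeps the namespace alive for as long as it stays open. The helper names ns_path and open_ns are hypothetical, and "self" with "net" is simply the path that appears in the message's mount example.

```c
#include <fcntl.h>
#include <stdio.h>

/* Build "/proc/<who>/ns/<name>" into buf.
 * Returns 0 on success, -1 if the path does not fit. */
int ns_path(char *buf, size_t size, const char *who, const char *name)
{
    int n = snprintf(buf, size, "/proc/%s/ns/%s", who, name);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Open the namespace file read-only. The caller owns the descriptor and
 * closes it when the namespace no longer needs to be pinned.
 * Returns -1 on failure, with errno set by open(). */
int open_ns(const char *who, const char *name)
{
    char path[64];

    if (ns_path(path, sizeof(path), who, name) != 0)
        return -1;
    return open(path, O_RDONLY);
}
```

A caller might use open_ns("self", "net") and keep the result around. Whether the open succeeds depends on the running kernel providing these files.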
     

10 May, 2011

1 commit

  • The Linux kernel excludes the guard page when performing mlock on a VMA
    with a down-growing stack. However, some architectures have an up-growing
    stack, and the guard page should be excluded from locking in this case too.

    This patch fixes lvm2 on PA-RISC (and possibly other architectures with an
    up-growing stack). lvm2 calculates the number of used pages when locking and
    when unlocking and reports an internal error if the numbers mismatch.

    [ Patch changed fairly extensively to also fix /proc//maps for the
    grows-up case, and to move things around a bit to clean it all up and
    share the infrastructure with the /proc bits.

    Tested on ia64 that has both grow-up and grow-down segments - Linus ]

    Signed-off-by: Mikulas Patocka
    Tested-by: Tony Luck
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

19 Apr, 2011

1 commit


31 Mar, 2011

1 commit


28 Mar, 2011

1 commit

  • When m_start returns an error, the seq_file logic will still call m_stop
    with that error entry, so we'd better make sure that we check it before
    using it as a vma.

    Introduced by commit ec6fd8a4355c ("report errors in /proc/*/*map*
    sanely"), which replaced NULL with various ERR_PTR() cases.

    (On ia64, you happen to get an unaligned fault instead of a page fault,
    since the address used is generally some random error code like -EPERM)

    Reported-by: Anca Emanuel
    Reported-by: Tony Luck
    Cc: Al Viro
    Cc: Américo Wang
    Cc: Stephen Wilson
    Signed-off-by: Linus Torvalds

    Linus Torvalds
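
The failure mode in the commit above comes from a stop callback treating an encoded error as a real pointer. The sketch below re-creates the pointer-encoded-error idea in user space to show the check. err_ptr, is_err, MAX_ERRNO_SKETCH, struct fake_vma and sketch_m_stop are illustrative names defined here, not the kernel's own helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption of this sketch: the top 4095 addresses encode error codes. */
#define MAX_ERRNO_SKETCH 4095

/* Encode a negative error code as a pointer value. */
static inline void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* True if the pointer value lies in the range reserved for errors. */
static inline int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

struct fake_vma { int stopped; };

/* stop() receives whatever start() returned, including an encoded error,
 * so it must check before dereferencing. Returns 1 if it touched the vma. */
int sketch_m_stop(void *v)
{
    struct fake_vma *vma = v;

    if (vma == NULL || is_err(vma))
        return 0;
    vma->stopped = 1;
    return 1;
}
```

Without the is_err() test, the store to vma->stopped would hit an address such as (void *)-1, which matches the fault described in the message.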
     

24 Mar, 2011

14 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    deal with races in /proc/*/{syscall,stack,personality}
    proc: enable writing to /proc/pid/mem
    proc: make check_mem_permission() return an mm_struct on success
    proc: hold cred_guard_mutex in check_mem_permission()
    proc: disable mem_write after exec
    mm: implement access_remote_vm
    mm: factor out main logic of access_process_vm
    mm: use mm_struct to resolve gate vma's in __get_user_pages
    mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
    mm: arch: make in_gate_area take an mm_struct instead of a task_struct
    mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
    x86: mark associated mm when running a task in 32 bit compatibility mode
    x86: add context tag to mark mm when running a task in 32-bit compatibility mode
    auxv: require the target to be tracable (or yourself)
    close race in /proc/*/environ
    report errors in /proc/*/*map* sanely
    pagemap: close races with suid execve
    make sessionid permissions in /proc/*/task/* match those in /proc/*
    fix leaks in path_lookupat()

    Fix up trivial conflicts in fs/proc/base.c

    Linus Torvalds
     
  • After the previous cleanup in proc_get_sb() the global proc_mnt has no
    reason to exist, so kill it.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Daniel Lezcano
    Cc: Alexey Dobriyan
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Reorganize proc_get_sb() so it can be called before the struct pid of the
    first process is allocated.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Daniel Lezcano
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • While mm->start_stack was protected from cross-uid viewing (commit
    f83ce3e6b02d5 ("proc: avoid information leaks to non-privileged
    processes")), the start_code and end_code values were not. This would
    allow the text location of a PIE binary to leak, defeating ASLR.

    Note that the value "1" is used instead of "0" for a protected value since
    "ps", "killall", and likely other readers of /proc/pid/stat, take
    start_code of "0" to mean a kernel thread and will misbehave. Thanks to
    Brad Spengler for pointing this out.

    Addresses CVE-2011-0726

    Signed-off-by: Kees Cook
    Cc:
    Cc: Alexey Dobriyan
    Cc: David Howells
    Cc: Eugene Teo
    Cc: Martin Schwidefsky
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
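
The choice of "1" rather than "0" as the masked value in the commit above can be illustrated with a tiny sketch. Both function names are hypothetical. The second function only models the reader behaviour that the commit message attributes to tools like "ps".

```c
/* Value to report in the start_code field: the real address when the
 * viewer is permitted to see it, otherwise the placeholder 1. */
unsigned long reported_start_code(int permitted, unsigned long start_code)
{
    return permitted ? start_code : 1UL;
}

/* Model of a reader that takes start_code == 0 to mean "kernel thread". */
int reader_sees_kernel_thread(unsigned long reported)
{
    return reported == 0;
}
```

Because the placeholder is non-zero, a masked user process is not mistaken for a kernel thread, while its real text address stays hidden.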
     
  • 1. namelen is declared "unsigned short" which hints for "maybe space savings".
    Indeed in 2.4 struct proc_dir_entry looked like:

    struct proc_dir_entry {
    unsigned short low_ino;
    unsigned short namelen;

    Now low_ino is "unsigned int", so any savings have been gone for a long time.
    There are not so many "struct proc_dir_entry" instances that its size is
    worth worrying about, anyway.

    2. converting from unsigned short to int/unsigned int can only create
    problems, we better play it safe.

    Space is not really conserved, because of natural alignment for the next
    field. sizeof(struct proc_dir_entry) remains the same.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
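
The alignment argument in the commit above can be checked with a reduced layout. The two structs below are simplified, hypothetical stand-ins that keep only the first fields and assume a pointer follows namelen. On common ABIs the narrower field is followed by padding, so both layouts have the same size.

```c
#include <stddef.h>

/* Reduced stand-in with a 16-bit namelen followed by a pointer. */
struct pde_short {
    unsigned int low_ino;
    unsigned short namelen;   /* padding follows, up to pointer alignment */
    const char *name;
};

/* Same layout with namelen widened to unsigned int. */
struct pde_wide {
    unsigned int low_ino;
    unsigned int namelen;
    const char *name;
};

size_t pde_short_size(void) { return sizeof(struct pde_short); }
size_t pde_wide_size(void)  { return sizeof(struct pde_wide); }

/* Offset of the field after namelen in each layout. */
size_t pde_short_name_offset(void) { return offsetof(struct pde_short, name); }
size_t pde_wide_name_offset(void)  { return offsetof(struct pde_wide, name); }
```

If the sizes and the offset of name agree on a given target, the narrower type saves nothing there.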
     
  • [root@wei 1]# cat /proc/1/mem
    cat: /proc/1/mem: No such process

    error code -ESRCH is wrong in this situation. Return -EPERM instead.

    Signed-off-by: Jovi Zhang
    Reviewed-by: KOSAKI Motohiro
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jovi Zhang
     
  • The current code fails to print the "[heap]" marking if the heap is split
    into multiple mappings.

    Fix the check so that the marking is displayed in all possible cases:
    1. vma matches exactly the heap
    2. the heap vma is merged e.g. with bss
    3. the heap vma is split e.g. due to locked pages

    Test cases. In all cases, the process should have mapping(s) with
    [heap] marking:

    (1) vma matches exactly the heap

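    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main (void)
    {
            /* grow the heap by one page, then idle so maps can be inspected */
            if (sbrk(4096) != (void *)-1) {
                    printf("check /proc/%d/maps\n", (int)getpid());
                    while (1)
                            sleep(1);
            }
            return 0;
    }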

    # ./test1
    check /proc/553/maps
    [1] + Stopped ./test1
    # cat /proc/553/maps | head -4
    00008000-00009000 r-xp 00000000 01:00 3113640 /test1
    00010000-00011000 rw-p 00000000 01:00 3113640 /test1
    00011000-00012000 rw-p 00000000 00:00 0 [heap]
    4006f000-40070000 rw-p 00000000 00:00 0

    (2) the heap vma is merged

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* initialized data and bss placed just below the heap */
    char foo[4096] = "foo";
    char bar[4096];

    int main (void)
    {
            if (sbrk(4096) != (void *)-1) {
                    printf("check /proc/%d/maps\n", (int)getpid());
                    while (1)
                            sleep(1);
            }
            return 0;
    }

    # ./test2
    check /proc/556/maps
    [2] + Stopped ./test2
    # cat /proc/556/maps | head -4
    00008000-00009000 r-xp 00000000 01:00 3116312 /test2
    00010000-00012000 rw-p 00000000 01:00 3116312 /test2
    00012000-00014000 rw-p 00000000 00:00 0 [heap]
    4004a000-4004b000 rw-p 00000000 00:00 0

    (3) the heap vma is split (this fails without the patch)

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    int main (void)
    {
            /* grow the heap, lock future mappings, then grow it again */
            if ((sbrk(4096) != (void *)-1) && !mlockall(MCL_FUTURE) &&
                (sbrk(4096) != (void *)-1)) {
                    printf("check /proc/%d/maps\n", (int)getpid());
                    while (1)
                            sleep(1);
            }
            return 0;
    }

    # ./test3
    check /proc/559/maps
    [1] + Stopped ./test3
    # cat /proc/559/maps|head -4
    00008000-00009000 r-xp 00000000 01:00 3119108 /test3
    00010000-00011000 rw-p 00000000 01:00 3119108 /test3
    00011000-00012000 rw-p 00000000 00:00 0 [heap]
    00012000-00013000 rw-p 00000000 00:00 0 [heap]

    It looks like the bug has been there forever, and since it only results in
    some information missing from a procfile, it does not fulfil the -stable
    "critical issue" criteria.

    Signed-off-by: Aaro Koskinen
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaro Koskinen
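
The three cases in the commit above come down to the difference between "the VMA covers the whole heap" and "the VMA overlaps the heap". The sketch below contrasts the two predicates. The exact expressions and names are assumptions for illustration, not quoted from the patch, and the comparisons are inclusive at both ends.

```c
/* A mapping as a half-open address range [start, end). */
struct range {
    unsigned long start;
    unsigned long end;
};

/* Coverage-style check: true only if the VMA spans all of [start_brk, brk].
 * A heap that is split across several VMAs fails this for each piece. */
int covers_heap(struct range vma, unsigned long start_brk, unsigned long brk)
{
    return vma.start <= start_brk && vma.end >= brk;
}

/* Overlap-style check: true if the VMA touches any part of [start_brk, brk],
 * which also holds for exact, merged, and split heap VMAs. */
int overlaps_heap(struct range vma, unsigned long start_brk, unsigned long brk)
{
    return vma.start <= brk && vma.end >= start_brk;
}
```

Taking the addresses from test case (3) and assuming the heap runs from 0x11000 to 0x13000, each of the two one-page mappings fails the coverage check but passes the overlap check.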
     
  • This file is readable by the task owner. Hide kernel addresses from
    unprivileged users, but leave them the function names and offsets.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Kees Cook
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • All of those are rw-r--r-- and all are broken for suid - if you open
    a file before the target does suid-root exec, you will still be able
    to access it. For personality it's not a big deal, but for syscall
    and stack it's a real problem.

    Fix: check that the task is traceable for you at the time of read().

    Signed-off-by: Al Viro

    Al Viro
     
  • With recent changes there is no longer a security hazard with writing to
    /proc/pid/mem. Remove the #ifdef.

    Signed-off-by: Stephen Wilson
    Signed-off-by: Al Viro

    Stephen Wilson
     
  • This change allows us to take advantage of access_remote_vm(), which in turn
    eliminates a security issue with the mem_write() implementation.

    The previous implementation of mem_write() was insecure since the target task
    could exec a setuid-root binary between the permission check and the actual
    write. Holding a reference to the target mm_struct eliminates this
    vulnerability.

    Signed-off-by: Stephen Wilson
    Signed-off-by: Al Viro

    Stephen Wilson
     
  • Avoid a potential race when a task execs and we get a new ->mm but check
    against the old credentials in ptrace_may_access().

    Holding of the mutex is implemented by factoring out the body of the code into a
    helper function __check_mem_permission(). Performing this factorization now
    simplifies upcoming changes and minimizes churn in the diffs.

    Signed-off-by: Stephen Wilson
    Signed-off-by: Al Viro

    Stephen Wilson
     
  • This change makes mem_write() observe the same constraints as mem_read(). This
    is particularly important for mem_write as an accidental leak of the fd across
    an exec could result in arbitrary modification of the target process' memory.
    IOW, /proc/pid/mem is implicitly close-on-exec.

    Signed-off-by: Stephen Wilson
    Signed-off-by: Al Viro

    Stephen Wilson
     
  • Morally, the presence of a gate vma is more an attribute of a particular mm than
    a particular task. Moreover, dropping the dependency on task_struct will help
    make both existing and future operations on mm's more flexible and convenient.

    Signed-off-by: Stephen Wilson
    Reviewed-by: Michel Lespinasse
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Al Viro

    Stephen Wilson