24 Mar, 2011

40 commits

  • When dmesg_restrict is set to 1, CAP_SYS_ADMIN is needed to read the kernel
    ring buffer. But a root user without CAP_SYS_ADMIN is able to reset
    dmesg_restrict to 0.

    This is an issue when e.g. LXC (Linux Containers) is used and the complete
    user space is running without CAP_SYS_ADMIN. An unprivileged and jailed
    root user can bypass the dmesg_restrict protection.

    With this patch writing to dmesg_restrict is only allowed when root has
    CAP_SYS_ADMIN.

    Signed-off-by: Richard Weinberger
    Acked-by: Dan Rosenberg
    Acked-by: Serge E. Hallyn
    Cc: Eric Paris
    Cc: Kees Cook
    Cc: James Morris
    Cc: Eugene Teo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • Add boundaries of allowed input ranges for: dirty_expire_centisecs,
    drop_caches, overcommit_memory, page-cluster and panic_on_oom.
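
    A bounded tunable of this kind might look like the sketch below. The
    names struct int_tunable and tunable_write() are hypothetical, and the
    actual limits chosen for each sysctl are not stated here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical description of one integer tunable with an allowed range. */
struct int_tunable {
    int value;
    int min;
    int max;
};

/* Store val only if it lies inside [min, max]; otherwise return -EINVAL. */
int tunable_write(struct int_tunable *t, int val)
{
    assert(t != NULL && t->min <= t->max);
    if (val < t->min || val > t->max)
        return -EINVAL;
    t->value = val;
    return 0;
}
```

    Out-of-range input is rejected and the stored value is left untouched.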

    Signed-off-by: Petr Holasek
    Acked-by: Dave Young
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Holasek
     
  • Drop dead code.

    Signed-off-by: Denis Kirjanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Kirjanov
     
  • Since the for loop already checks table->procname, drop the useless
    table->procname checks inside the loop body.
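
    The pattern can be illustrated with a simplified, hypothetical table
    type (struct entry and count_entries() are not kernel names):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry; the array is terminated by procname == NULL. */
struct entry {
    const char *procname;
};

/* Count entries. The loop condition already guarantees procname != NULL. */
size_t count_entries(const struct entry *table)
{
    size_t n = 0;

    assert(table != NULL);
    for (; table->procname; table++) {
        /* No "if (!table->procname) continue;" is needed here. */
        n++;
    }
    return n;
}
```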

    Signed-off-by: Denis Kirjanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Kirjanov
     
  • If rio is not a switch then "rswitch" is null.

    Signed-off-by: Dan Carpenter
    Cc: Matt Porter
    Cc: Kumar Gala
    Signed-off-by: Alexandre Bounine
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • Removes resource reservation from the common subsystem initialization code
    and makes it part of mport driver initialization. This resolves a conflict
    with resource reservation by device-specific mport drivers.

    Signed-off-by: Alexandre Bounine
    Cc: Kumar Gala
    Cc: Matt Porter
    Cc: Li Yang
    Cc: Thomas Moll
    Cc: Micha Nelissen
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Changes mport ID and host destination ID assignment to implement a unified
    method common to all mport drivers. Makes the "riohdid=" kernel command line
    parameter common for all architectures, with support for more than one host
    destination ID assignment.

    Signed-off-by: Alexandre Bounine
    Cc: Kumar Gala
    Cc: Matt Porter
    Cc: Li Yang
    Cc: Thomas Moll
    Cc: Micha Nelissen
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Subsystem initialization sequence modified to support presence of multiple
    RapidIO controllers in the system. The new sequence is compatible with
    initialization of PCI devices.

    Signed-off-by: Alexandre Bounine
    Cc: Kumar Gala
    Cc: Matt Porter
    Cc: Li Yang
    Cc: Thomas Moll
    Cc: Micha Nelissen
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • 1. Add an option to include RapidIO support if the PCI is available.
    2. Add FSL_RIO configuration option to enable controller selection.
    3. Add RapidIO support option into x86 and MIPS architectures.

    Signed-off-by: Alexandre Bounine
    Acked-by: Kumar Gala
    Cc: Matt Porter
    Cc: Li Yang
    Cc: Thomas Moll
    Cc: Micha Nelissen
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • This set of patches eliminates RapidIO dependency on PowerPC architecture
    and makes it available to other architectures (x86 and MIPS). It also
    enables support of new platform independent RapidIO controllers such as
    PCI-to-SRIO and PCI Express-to-SRIO.

    This patch:

    Extend number of mport callback functions to eliminate direct linking of
    architecture specific mport operations.

    Signed-off-by: Alexandre Bounine
    Cc: Kumar Gala
    Cc: Matt Porter
    Cc: Li Yang
    Cc: Thomas Moll
    Cc: Micha Nelissen
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Add RapidIO documentation files as it was discussed earlier (see thread
    http://marc.info/?l=linux-kernel&m=129202338918062&w=2)

    Signed-off-by: Alexandre Bounine
    Cc: Kumar Gala
    Cc: Matt Porter
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Add new sysfs attributes.

    1. Routing information required to reach the RIO device:
       destid - device destination ID (real for endpoint, route for switch)
       hopcount - hopcount for maintenance requests (switches only)

    2. Device linking information:
       lprev - name of the device that precedes the given device in the
       enumeration or discovery order (displayed along with the port to
       which it is attached).
       lnext - names of devices (with corresponding port numbers) that are
       attached to the given device as next in the enumeration or
       discovery order (switches only)

    Signed-off-by: Alexandre Bounine
    Cc: Kumar Gala
    Cc: Matt Porter
    Cc: Li Yang
    Cc: Thomas Moll
    Cc: Micha Nelissen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • Reduce the lines of code and simplify the logic.

    Signed-off-by: Changli Gao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Changli Gao
     
  • Convert calls to func_enter on leaving a function to func_exit.

    The semantic patch that fixes this problem is as follows:
    (http://coccinelle.lip6.fr/)

    // <smpl>
    @@
    @@

    - func_enter();
    + func_exit();
    return...;
    // </smpl>

    Signed-off-by: Julia Lawall
    Cc: Roger Wolff
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • put_tty_driver calls tty_driver_kref_put on its argument, and then
    tty_driver_kref_put calls kref_put on the address of a field of this
    argument. kref_put checks for NULL, but in this case the field is likely
    to have some offset and so the result of taking its address will not be
    NULL. Labels are added to be able to skip over the call to put_tty_driver
    when the argument will be NULL.

    The semantic match that finds this problem is as follows:
    (http://coccinelle.lip6.fr/)

    // <smpl>
    @@
    expression *x;
    @@

    *if (x == NULL)
    { ...
    * put_tty_driver(x);
    ...
    return ...;
    }
    // </smpl>
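
    The reasoning about the field offset, and the label-based fix, can be
    sketched in user space. All names below (struct driver_model,
    driver_put(), driver_setup()) are hypothetical stand-ins, not the tty
    API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins: the refcount is not the first field. */
struct kref_model { int refcount; };
struct driver_model {
    int magic;
    struct kref_model kref;  /* lives at a non-zero offset */
};

static int puts_done;        /* counts put calls, for illustration */

static void driver_put(struct driver_model *d)
{
    /* A NULL check on &d->kref cannot help: that address is d + offset. */
    assert(d != NULL);
    if (--d->kref.refcount == 0)
        free(d);
    puts_done++;
}

/* fail_step: 0 = success, 1 = allocation fails, 2 = a later step fails. */
struct driver_model *driver_setup(int fail_step)
{
    struct driver_model *d = NULL;

    if (fail_step == 1)
        goto out;            /* d is NULL: skip the put entirely */
    d = calloc(1, sizeof(*d));
    if (d == NULL)
        goto out;
    d->kref.refcount = 1;
    if (fail_step == 2)
        goto out_put;        /* d is valid: drop the reference */
    return d;

out_put:
    driver_put(d);
    d = NULL;
out:
    return d;
}
```

    Separate labels let the error path that runs before allocation bypass
    the put call, which is the shape of the fix described above.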

    Signed-off-by: Julia Lawall
    Cc: Torben Hohn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • Add smd_pkt driver which provides device interface to smd packet ports.

    Signed-off-by: Niranjana Vishwanathapura
    Cc: Brian Swetland
    Cc: Greg KH
    Cc: Alan Cox
    Cc: David Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Niranjana Vishwanathapura
     
  • commit d2478521afc2022 ("char/ipmi: fix OOPS caused by
    pnp_unregister_driver on unregistered driver") introduced a section
    mismatch by calling __exit cleanup_ipmi_si from __devinit init_ipmi_si.

    Remove __exit annotation from cleanup_ipmi_si.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • While mm->start_stack was protected from cross-uid viewing (commit
    f83ce3e6b02d5 ("proc: avoid information leaks to non-privileged
    processes")), the start_code and end_code values were not. This would
    allow the text location of a PIE binary to leak, defeating ASLR.

    Note that the value "1" is used instead of "0" for a protected value since
    "ps", "killall", and likely other readers of /proc/pid/stat, take
    start_code of "0" to mean a kernel thread and will misbehave. Thanks to
    Brad Spengler for pointing this out.
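
    The choice of placeholder can be sketched as follows. The function
    names and the "start_code == 0 means kernel thread" heuristic are
    assumptions made for illustration, modelled on the description above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Value reported for start_code/end_code. has_mm is false for kernel
 * threads; permitted models the cross-uid permission check. The
 * placeholder for a protected value is 1, not 0.
 */
unsigned long reported_addr(bool has_mm, bool permitted, unsigned long real)
{
    if (!has_mm)
        return 0;
    return permitted ? real : 1UL;
}

/* How a reader such as ps might classify a task (assumed heuristic). */
bool looks_like_kernel_thread(unsigned long start_code)
{
    return start_code == 0;
}
```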

    Addresses CVE-2011-0726

    Signed-off-by: Kees Cook
    Cc: Alexey Dobriyan
    Cc: David Howells
    Cc: Eugene Teo
    Cc: Martin Schwidefsky
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • 1. namelen is declared "unsigned short" which hints for "maybe space savings".
    Indeed in 2.4 struct proc_dir_entry looked like:

    struct proc_dir_entry {
    unsigned short low_ino;
    unsigned short namelen;

    Now low_ino is "unsigned int", so any savings have been gone for a long
    time. "struct proc_dir_entry" is not numerous enough to worry about its
    size anyway.

    2. converting from unsigned short to int/unsigned int can only create
    problems, we better play it safe.

    Space is not really conserved, because of natural alignment for the next
    field. sizeof(struct proc_dir_entry) remains the same.
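
    The alignment argument can be checked with two simplified layouts. This
    sketch assumes the field is widened to unsigned int and that a pointer
    follows it; both structs are hypothetical reductions of the real one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layouts: only the leading fields are modelled. */
struct pde_short {
    unsigned int low_ino;
    unsigned short namelen;  /* 2 bytes, followed by padding */
    const char *name;        /* pointer forces natural alignment */
};

struct pde_wide {
    unsigned int low_ino;
    unsigned int namelen;    /* occupies the former padding */
    const char *name;
};

/* Size difference between the two layouts; 0 on common ABIs. */
size_t pde_size_delta(void)
{
    return sizeof(struct pde_wide) - sizeof(struct pde_short);
}
```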

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • [root@wei 1]# cat /proc/1/mem
    cat: /proc/1/mem: No such process

    The error code -ESRCH is wrong in this situation. Return -EPERM instead.

    Signed-off-by: Jovi Zhang
    Reviewed-by: KOSAKI Motohiro
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jovi Zhang
     
  • The current code fails to print the "[heap]" marking if the heap is split
    into multiple mappings.

    Fix the check so that the marking is displayed in all possible cases:
    1. vma matches exactly the heap
    2. the heap vma is merged e.g. with bss
    3. the heap vma is split e.g. due to locked pages
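
    One way to cover all three cases is an overlap test between the mapping
    and the brk area. The sketch below is an illustration under that
    assumption; struct range and is_heap_mapping() are hypothetical names
    and the kernel's condition may be written differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical address range [start, end). */
struct range {
    unsigned long start;
    unsigned long end;
};

/* A mapping is marked [heap] if it overlaps or touches [start_brk, brk]. */
bool is_heap_mapping(struct range vma, unsigned long start_brk, unsigned long brk)
{
    assert(vma.start < vma.end && start_brk <= brk);
    return vma.start <= brk && vma.end >= start_brk;
}
```

    An exact-match test (vma.start == start_brk && vma.end == brk) would
    handle only case 1; the overlap test also accepts merged and split
    mappings.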

    Test cases. In all cases, the process should have mapping(s) with
    [heap] marking:

    (1) vma matches exactly the heap

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main (void)
    {
            if (sbrk(4096) != (void *)-1) {
                    printf("check /proc/%d/maps\n", (int)getpid());
                    while (1)
                            sleep(1);
            }
            return 0;
    }

    # ./test1
    check /proc/553/maps
    [1] + Stopped ./test1
    # cat /proc/553/maps | head -4
    00008000-00009000 r-xp 00000000 01:00 3113640 /test1
    00010000-00011000 rw-p 00000000 01:00 3113640 /test1
    00011000-00012000 rw-p 00000000 00:00 0 [heap]
    4006f000-40070000 rw-p 00000000 00:00 0

    (2) the heap vma is merged

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    char foo[4096] = "foo";
    char bar[4096];

    int main (void)
    {
            if (sbrk(4096) != (void *)-1) {
                    printf("check /proc/%d/maps\n", (int)getpid());
                    while (1)
                            sleep(1);
            }
            return 0;
    }

    # ./test2
    check /proc/556/maps
    [2] + Stopped ./test2
    # cat /proc/556/maps | head -4
    00008000-00009000 r-xp 00000000 01:00 3116312 /test2
    00010000-00012000 rw-p 00000000 01:00 3116312 /test2
    00012000-00014000 rw-p 00000000 00:00 0 [heap]
    4004a000-4004b000 rw-p 00000000 00:00 0

    (3) the heap vma is split (this fails without the patch)

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    int main (void)
    {
            if ((sbrk(4096) != (void *)-1) && !mlockall(MCL_FUTURE) &&
                (sbrk(4096) != (void *)-1)) {
                    printf("check /proc/%d/maps\n", (int)getpid());
                    while (1)
                            sleep(1);
            }
            return 0;
    }

    # ./test3
    check /proc/559/maps
    [1] + Stopped ./test3
    # cat /proc/559/maps|head -4
    00008000-00009000 r-xp 00000000 01:00 3119108 /test3
    00010000-00011000 rw-p 00000000 01:00 3119108 /test3
    00011000-00012000 rw-p 00000000 00:00 0 [heap]
    00012000-00013000 rw-p 00000000 00:00 0 [heap]

    It looks like the bug has been there forever, and since it only results in
    some information missing from a procfile, it does not fulfil the -stable
    "critical issue" criteria.

    Signed-off-by: Aaro Koskinen
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaro Koskinen
     
  • This file is readable by the task owner. Hide kernel addresses from
    unprivileged users, but leave them the function names and offsets.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Kees Cook
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Changing cpuset->mems/cpuset->cpus should be protected by
    callback_mutex.

    cpuset_clone() doesn't follow this rule. That is OK because it's
    called when creating and initializing a cgroup, but we'd better
    hold the lock to avoid subtle breakage in the future.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Those functions that use NODEMASK_ALLOC() can't propagate errno
    to users, but will fail silently.

    Fix it by using a static nodemask_t variable for each function;
    those variables are protected by cgroup_mutex.

    [akpm@linux-foundation.org: fix comment spelling, strengthen cgroup_lock comment]
    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • oldcs->mems_allowed is not modified during cpuset_attach(), so we don't
    have to copy it to a buffer allocated by NODEMASK_ALLOC(). Just pass it
    to cpuset_migrate_mm().

    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • It's not necessary to copy cpuset->mems_allowed to a buffer allocated by
    NODEMASK_ALLOC(). Just pass it to nodelist_scnprintf().

    As spotted by Paul, a side effect is that we fix a bug: the function could
    return -ENOMEM, but the caller doesn't expect a negative return value.
    Therefore change the return value of cpuset_sprintf_cpulist() and
    cpuset_sprintf_memlist() from int to size_t.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • When a memcg is oom and current has already received a SIGKILL, then give
    it access to memory reserves with a higher scheduling priority so that it
    may quickly exit and free its memory.

    This is identical to the global oom killer and is done even before
    checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
    selected is guaranteed to have come from userspace; the thread only needs
    access to memory reserves to exit and thus we don't unnecessarily panic
    the machine until the kernel has no last resort to free memory.

    Signed-off-by: David Rientjes
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • fs/fuse/dev.c::fuse_try_move_page() does

    (1) remove a page by ->steal()
    (2) re-add the page to page cache
    (3) link the page to LRU if it was not on LRU at (1)

    This implies the page is _on_ the LRU when it's added to the radix-tree.
    So the page is added to the memory cgroup while it's on the LRU, because
    the LRU is lazy and no one flushes it.

    This is the same behavior as SwapCache and needs special care:
    - remove the page from the LRU before overwriting pc->mem_cgroup.
    - add the page to the LRU after overwriting pc->mem_cgroup.

    We also need to take care of the pagevec.

    If PageLRU(page) is set before we add the PCG_USED bit, the page will not
    be added to memcg's LRU (for a short period). So, regardless of the
    PageLRU(page) value before commit_charge(), we need to check PageLRU(page)
    after commit_charge().

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30432

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Miklos Szeredi
    Cc: Balbir Singh
    Reported-by: Daniel Poelzleithner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • KAMEZAWA Hiroyuki noted that free_pages_cgroup doesn't have to check for
    PageReserved because we never store the array on reserved pages (neither
    alloc_pages_exact nor vmalloc use those pages).

    So we can replace the check by a BUG_ON.

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently we are allocating a single page_cgroup array per memory section
    (stored in mem_section->base) when CONFIG_SPARSEMEM is selected. This is a
    correct but memory-inefficient solution because the allocated memory
    (unless we fall back to vmalloc) is not kmalloc friendly:

    - 32b - 16384 entries (20B per entry) fit into 327680B so the
    524288B slab cache is used
    - 32b with PAE - 131072 entries with 2621440B fit into 4194304B
    - 64b - 32768 entries (40B per entry) fit into 2097152 cache

    This is ~37% wasted space per memory section and it sums up for the whole
    memory. On an x86_64 machine it is something like 6MB per 1GB of RAM.
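
    The figures above can be reproduced with a short calculation. This is a
    sketch that assumes the array is served from the next power-of-two
    sized cache; round_up_pow2() and slab_waste() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two that is >= n, for n > 0. */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    assert(n > 0);
    while (p < n)
        p <<= 1;
    return p;
}

/* Bytes lost when an array is served from a power-of-two sized cache. */
size_t slab_waste(size_t entries, size_t entry_size)
{
    size_t need = entries * entry_size;

    return round_up_pow2(need) - need;
}
```

    For the 64-bit case, 32768 entries of 40B need 1310720B and waste
    786432B of the 2097152B cache, i.e. 37.5%. With 4k pages a section of
    32768 pages covers 128MB, so 1GB holds 8 sections and 8 * 786432B = 6MB.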

    We can reduce the internal fragmentation by using alloc_pages_exact which
    allocates PAGE_SIZE aligned blocks so we will get down to
    Cc: Dave Hansen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mm/memcontrol.c: In function 'mem_cgroup_force_empty':
    mm/memcontrol.c:2280: warning: 'flags' may be used uninitialized in this function

    It's a false positive.

    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The statistic counters are in units of pages, so there is no reason to make
    them 64-bit wide on 32-bit machines.

    Make them native words. Since they are signed, this leaves 31 bits on
    32-bit machines, which can represent roughly 8TB assuming a page size of
    4k.
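
    The 8TB estimate follows from 2^31 pages of 4096 bytes each, which is
    2^43 bytes. A small sketch of the arithmetic (max_bytes() is a
    hypothetical helper):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes representable by a signed counter of `bits` bits counting pages. */
uint64_t max_bytes(unsigned int bits, uint64_t page_size)
{
    uint64_t max_pages;

    assert(bits >= 2 && bits <= 32);
    /* One bit is the sign; the remaining bits count pages. */
    max_pages = (UINT64_C(1) << (bits - 1)) - 1;
    return max_pages * page_size;
}
```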

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Greg Thelen
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • For increasing and decreasing per-cpu cgroup usage counters it makes sense
    to use signed types, as single per-cpu values might go negative during
    updates. But this is not the case for only-ever-increasing event
    counters.

    All the counters have been signed 64-bit so far, which was enough to count
    events even with the sign bit wasted.

    This patch:
    - divides s64 counters into signed usage counters and unsigned
    monotonically increasing event counters.
    - converts unsigned event counters into 'unsigned long' rather than
    'u64'. This matches the type used by the /proc/vmstat event counters.

    The next patch narrows the signed usage counters type (on 32-bit CPUs,
    that is).
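
    Why usage counters must stay signed while event counters need not can be
    seen in a two-CPU model. Everything below (struct percpu_stat, charge(),
    the CPU count) is a hypothetical illustration, not the memcg code.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_MODEL 2  /* hypothetical CPU count for the sketch */

/* Signed per-CPU usage deltas and unsigned per-CPU event counts. */
struct percpu_stat {
    long usage[NR_CPUS_MODEL];
    unsigned long events[NR_CPUS_MODEL];
};

/* Apply a charge (pages > 0) or uncharge (pages < 0) on one CPU. */
void charge(struct percpu_stat *s, int cpu, long pages)
{
    assert(s != NULL && cpu >= 0 && cpu < NR_CPUS_MODEL);
    s->usage[cpu] += pages;  /* a single slot may go negative */
    s->events[cpu]++;        /* events only ever increase */
}

long total_usage(const struct percpu_stat *s)
{
    long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
        sum += s->usage[cpu];
    return sum;
}

unsigned long total_events(const struct percpu_stat *s)
{
    unsigned long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
        sum += s->events[cpu];
    return sum;
}
```

    Charging 4 pages on CPU 0 and uncharging 3 on CPU 1 leaves CPU 1's slot
    at -3 while the summed usage is 1; the event total is 2 and never
    decreases.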

    Signed-off-by: Johannes Weiner
    Signed-off-by: Greg Thelen
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There is no clear pattern when we pass a page count and when we pass a
    byte count that is a multiple of PAGE_SIZE.

    We never charge or uncharge subpage quantities, so convert it all to page
    counts.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We never uncharge subpage quantities.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We never keep subpage quantities in the per-cpu stock.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We have two charge cancelling functions: one takes a page count, the other
    a page size. The second one just divides the parameter by PAGE_SIZE and
    then calls the first one. This is trivial, no need for an extra function.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The reclaim_param_lock is only taken around single reads and writes to
    integer variables and is thus superfluous. Drop it.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_cgroup_zoneinfo() will never return NULL for a charged page, so remove
    the check for it in mem_cgroup_get_reclaim_stat_from_page().

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In struct page_cgroup, we have a full word for flags but only a few are
    reserved. Use the remaining upper bits to encode, depending on
    configuration, the node or the section, to enable page_cgroup-to-page
    lookups without a direct pointer.

    This saves a full word for every page in a system with memory cgroups
    enabled.
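
    The encoding idea can be sketched as plain bit packing. The flag width
    and helper names below are assumptions for illustration; the real layout
    of the flags word is not described here.

```c
#include <assert.h>
#include <limits.h>

#define NR_FLAG_BITS 8  /* hypothetical number of low bits kept for flags */
#define FLAG_MASK    ((1UL << NR_FLAG_BITS) - 1)

/* Store an array id (node or section number) above the flag bits. */
unsigned long pack_id(unsigned long flags, unsigned long id)
{
    assert((flags & ~FLAG_MASK) == 0);
    assert(id <= (ULONG_MAX >> NR_FLAG_BITS));
    return flags | (id << NR_FLAG_BITS);
}

/* Recover the id; with it, the owning array and thus the page can be found. */
unsigned long unpack_id(unsigned long word)
{
    return word >> NR_FLAG_BITS;
}

/* Recover the flags without disturbing the id. */
unsigned long unpack_flags(unsigned long word)
{
    return word & FLAG_MASK;
}
```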

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner