20 Oct, 2008

40 commits

  • The cell_edac driver is setting the edac_mode field of the csrow's to an
    incorrect value, causing the sysfs show routine for that field to go out
    of an array bound and Oopsing the kernel when used.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Doug Thompson
    Cc: [2.6.27.x, 2.6.26.x. 2.6.25.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • This is only compile tested, because I do not own appropriate hardware.

    Signed-off-by: Andre Haupt
    Cc: Jim Cromie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andre Haupt
     
  • Increase the range of various posix message queue limits.

    Posix gives the message queue user the ability to 'trade off' the maximum
    size of messages with the number of possible messages that can be 'in
    flight'. Linux currently makes this trade off more restrictive than it
    needs to be.

    In particular, the maximum message size today can be made no smaller than
    8192. This greatly restricts those applications that would like to have
    the ability to post large numbers of very small messages.

    So this task lowers the limit that the maximum message size can be set to,
    from 8192 to 128. It also lowers the limit that the maximum #number of
    messages in flight can be set to, from 10 to 1.

    With these changes the message queue user can make better trade offs
    between #messages and message size, in order to get everything to fit
    within the setrlimit(RLIMIT_MSGQUEUE) limit for that particular user.

    This patch also applies the values in

    /proc/sys/fs/mqueue/msg_max
    /proc/sys/fs/mqueue/msgsize_max

    as the defaults for the max #messages allowed and the max message size
    allowed, respectively, for those applications that do not supply these.
    Previously, the defaults were hardwired to 10 and 8192, respectively.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Joe Korty
    Cc: Al Viro
    Cc: Manfred Spraul
    Cc: Nadia Derbey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Korty
     
  • Add the symbols 'vmlist' and offset 'vm_struct.addr' to the vmcoreinfo[1]
    data for i386 vmalloc translation.

    makedumpfile[2] needs VMALLOC_START value for distinguishing a vmalloc
    address or not, because it should choose suitable translation method. If
    applying this patch, makedumpfile will be able to take VMALLOC_START value
    from 'vmlist.addr'.

    vmcoreinfo[1]:
    The vmcoreinfo data has the minimum debugging information only for dump
    filtering. makedumpfile[2] uses it to distinguish unnecessary pages and
    creates a small dumpfile.

    makedumpfile[2]:
    dump filtering command
    https://sourceforge.net/projects/makedumpfile/

    Signed-off-by: Ken'ichi Ohmichi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken'ichi Ohmichi
     
  • elfcore header memory needs to be reserved in a crash kernel. This means
    that the relevant code should be protected by CONFIG_CRASH_DUMP rather
    than CONFIG_PROC_VMCORE.

    Signed-off-by: Simon Horman
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Horman
     
  • The usage of elfcorehdr_addr has changed recently such that being set to
    ELFCORE_ADDR_MAX is used by is_kdump_kernel() to indicate if the code is
    executing in a kernel executed as a crash kernel.

    However, arch/ia64/kernel/setup.c:reserve_elfcorehdr will rest
    elfcorehdr_addr to ELFCORE_ADDR_MAX on error, which means any subsequent
    calls to is_kdump_kernel() will return 0, even though they should return
    1.

    Ok, at this point in time there are no subsequent calls, but I think its
    fair to say that there is ample scope for error or at the very least
    confusion.

    This patch add an extra state, ELFCORE_ADDR_ERR, which indicates that
    elfcorehdr_addr was passed on the command line, and thus execution is
    taking place in a crashdump kernel, but vmcore can't be used for some
    reason. This is tested for using is_vmcore_usable() and set using
    vmcore_unusable(). A subsequent patch makes use of this new code.

    To summarise, the states that elfcorehdr_addr can now be in are as follows:

    ELFCORE_ADDR_MAX: not a crashdump kernel
    ELFCORE_ADDR_ERR: crashdump kernel but vmcore is unusable
    any other value: crash dump kernel and vmcore is usable

    Signed-off-by: Simon Horman
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Horman
     
  • o Make use of is_kdump_kernel() rather than checking elfcorehdr_addr directly.

    o Remove CONFIG_CRASH_DUMP as is_kdump_kernel() is safe to call anywhere

    o Remove CONFIG_PROC_FS as it is bogus, the check
    should occur regardless of if CONFIG_PROC_FS is set or not.

    Signed-off-by: Simon Horman
    Acked-by: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Horman
     
  • IA64, PPC and SH also support the elfcorehdr command line.

    Signed-off-by: Simon Horman
    Acked-by: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Horman
     
  • o elfcorehdr_addr is used by not only the code under CONFIG_PROC_VMCORE
    but also by the code which is not inside CONFIG_PROC_VMCORE. For
    example, is_kdump_kernel() is used by powerpc code to determine if
    kernel is booting after a panic then use previous kernel's TCE table.
    So even if CONFIG_PROC_VMCORE is not set in second kernel, one should be
    able to correctly determine that we are booting after a panic and setup
    calgary iommu accordingly.

    o So remove the assumption that elfcorehdr_addr is under
    CONFIG_PROC_VMCORE.

    o Move definition of elfcorehdr_addr to arch dependent crash files.
    (Unfortunately crash dump does not have an arch independent file
    otherwise that would have been the best place).

    o kexec.c is not the right place as one can Have CRASH_DUMP enabled in
    second kernel without KEXEC being enabled.

    o I don't see sh setup code parsing the command line for
    elfcorehdr_addr. I am wondering how does vmcore interface work on sh.
    Anyway, I am atleast defining elfcoredhr_addr so that compilation is not
    broken on sh.

    Signed-off-by: Vivek Goyal
    Acked-by: "Eric W. Biederman"
    Acked-by: Simon Horman
    Acked-by: Paul Mundt
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • Now that wait_task_inactive(task, state) checks task->state == state,
    we can simplify the code and make this debugging check more robust.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This adds a kconfig option to change the /proc/PID/coredump_filter default.
    Fedora has been carrying a trivial patch to change the hard-wired value for
    this default, since Fedora 8. The default default can't change safely
    because there are old GDB versions out there (all before 6.7) that are
    confused by the core dump files created by the MMF_DUMP_ELF_HEADERS setting.

    Signed-off-by: Roland McGrath
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Alan Cox
    Cc: Andi Kleen
    Cc: KOSAKI Motohiro
    Cc: Kawai Hidehiro
    Cc: Ingo Molnar
    Cc: David Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • If the coredumping is multi-threaded, format_corename() appends .%pid to
    the corename. This was needed before the proper multi-thread core dump
    support, now all the threads in the mm go into a single unified core file.

    Remove this special case, it is not even documented and we have "%p"
    and core_uses_pid.

    Signed-off-by: Oleg Nesterov
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Alan Cox
    Cc: Roland McGrath
    Cc: Andi Kleen
    Cc: La Monte Yarroll
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • ptrace_untrace() can now become static.

    Signed-off-by: Adrian Bunk
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • bitmap_scnprintf_len() is not used now, so we remove it.

    Otherwise we have to maintain it and make its return
    value always equal to bitmap_scnprintf()'s return value.

    Signed-off-by: Lai Jiangshan
    Cc: Alexey Dobriyan
    Cc: Paul Menage
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • 1) seq_file excepts that m->count == m->size when it's buf is full,
    so current code will causes bugs when buf is overflow.

    2) There is not too good that cpuset accesses struct seq_file's
    fields directly.

    Signed-off-by: Lai Jiangshan
    Cc: Alexey Dobriyan
    Acked-by: Paul Menage
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • seq_cpumask_list(), seq_nodemask_list() are very like seq_cpumask(),
    seq_nodemask(), but they print human readable string.

    Signed-off-by: Lai Jiangshan
    Cc: Alexey Dobriyan
    Cc: Paul Menage
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • "m->count + len < m->size" is true commonly, so bitmap_scnprintf()
    is commonly called. this fix saves a call to bitmap_scnprintf_len().

    Signed-off-by: Lai Jiangshan
    Cc: Alexey Dobriyan
    Cc: Paul Menage
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • Remove the use of int cpus_nonempty variable from 'update_flag' function.

    Signed-off-by: Md.Rakib H. Mullick
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rakib Mullick
     
  • Allocate all page_cgroup at boot and remove page_cgroup poitner from
    struct page. This patch adds an interface as

    struct page_cgroup *lookup_page_cgroup(struct page*)

    All FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported.

    Remove page_cgroup pointer reduces the amount of memory by
    - 4 bytes per PAGE_SIZE.
    - 8 bytes per PAGE_SIZE
    if memory controller is disabled. (even if configured.)

    On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
    On my x86-64 server with 48GB of memory, this saves 96MB of memory.
    I think this reduction makes sense.

    By pre-allocation, kmalloc/kfree in charge/uncharge are removed.
    This means
    - we're not necessary to be afraid of kmalloc faiulre.
    (this can happen because of gfp_mask type.)
    - we can avoid calling kmalloc/kfree.
    - we can avoid allocating tons of small objects which can be fragmented.
    - we can know what amount of memory will be used for this extra-lru handling.

    I added printk message as

    "allocated %ld bytes of page_cgroup"
    "please try cgroup_disable=memory option if you don't want"

    maybe enough informative for users.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch makes page_cgroup->flags to be atomic_ops and define functions
    (and macros) to access it.

    Before trying to modify memory resource controller, this atomic operation
    on flags is necessary. Most of flags in this patch is for LRU and modfied
    under mz->lru_lock but we'll add another flags which is not for LRU soon.
    For example, we'll place LOCK bit on flags field. We need atomic
    operation to modify LRU bit without LOCK.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Some obvious optimization to memcg.

    I found mem_cgroup_charge_statistics() is a little big (in object) and
    does unnecessary address calclation. This patch is for optimization to
    reduce the size of this function.

    And res_counter_charge() is 'likely' to succeed.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There are not-on-LRU pages which can be mapped and they are not worth to
    be accounted. (becasue we can't shrink them and need dirty codes to
    handle specical case) We'd like to make use of usual objrmap/radix-tree's
    protcol and don't want to account out-of-vm's control pages.

    When special_mapping_fault() is called, page->mapping is tend to be NULL
    and it's charged as Anonymous page. insert_page() also handles some
    special pages from drivers.

    This patch is for avoiding to account special pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch tries to make page->mapping to be NULL before
    mem_cgroup_uncharge_cache_page() is called.

    "page->mapping == NULL" is a good check for "whether the page is still
    radix-tree or not". This patch also adds BUG_ON() to
    mem_cgroup_uncharge_cache_page();

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • While page-cache's charge/uncharge is done under page_lock(), swap-cache
    isn't. (anonymous page is charged when it's newly allocated.)

    This patch moves do_swap_page()'s charge() call under lock. I don't see
    any bad problem *now* but this fix will be good for future for avoiding
    unnecessary racy state.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Since we introduced rcu for read side, spin_lock is used only for update.
    But we always hold cgroup_lock() when update, so spin_lock() is not need.

    Additional cleanup:
    1) include linux/rcupdate.h explicitly
    2) remove unused variable cur_devcgroup in devcgroup_update_access()

    Signed-off-by: Lai Jiangshan
    Acked-by: "Serge E. Hallyn"
    Cc: Paul Menage
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • Signed-off-by: Li Zefan
    Acked-by: Serge Hallyn
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • This saves 40 bytes on my x86_32 box.

    Signed-off-by: Li Zefan
    Acked-by: Serge Hallyn
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • The choice of real/dummy declaration for cgroup_mm_owner_callbacks()
    shouldn't be based on CONFIG_MM_OWNER, but on CONFIG_CGROUPS. Otherwise
    kernel/exit.c fails to compile when something other than a cgroups
    controller selects CONFIG_MM_OWNER

    Signed-off-by: Paul Menage
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Rather than pre-generating the entire text for the "tasks" file each
    time the file is opened, we instead just generate/update the array of
    process ids and use a seq_file to report these to userspace. All open
    file handles on the same "tasks" file can share a pid array, which may
    be updated any time that no thread is actively reading the array. By
    sharing the array, the potential for userspace to DoS the system by
    opening many handles on the same "tasks" file is removed.

    [Based on a patch by Lai Jiangshan, extended to use seq_file]

    Signed-off-by: Paul Menage
    Reviewed-by: Lai Jiangshan
    Cc: Serge Hallyn
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • put_css_set_taskexit may be called when find_css_set is called on other
    cpu. And the race will occur:

    put_css_set_taskexit side find_css_set side

    |
    atomic_dec_and_test(&kref->refcount) |
    /* kref->refcount = 0 */ |
    ....................................................................
    | read_lock(&css_set_lock)
    | find_existing_css_set
    | get_css_set
    | read_unlock(&css_set_lock);
    ....................................................................
    __release_css_set |
    ....................................................................
    | /* use a released css_set */
    |

    [put_css_set is the same. But in the current code, all put_css_set are
    put into cgroup mutex critical region as the same as find_css_set.]

    [akpm@linux-foundation.org: repair comments]
    [menage@google.com: eliminate race in css_set refcounting]
    Signed-off-by: Lai Jiangshan
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • A corrupted extent for the extent file itself may try to get an impossible
    extent, causing a deadlock if I see it correctly.

    Check the inode number after the first_blocks checks and fail if it's the
    extent file, as according to the spec the extent file should have no
    extent for itself.

    Signed-off-by: Eric Sesterhenn
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sesterhenn
     
  • hfsplus: O_LARGEFILE checking is missing

    Addresses http://bugzilla.kernel.org/show_bug.cgi?id=8490

    From: Alan Cox
    Reported-by: didier
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • A very large directory with many read failures (either due to storage
    problems, or due to invalid size & blocks from corruption) will generate a
    printk storm as the filesystem continues to try to read all the blocks.
    This flood of messages can tie up the box until it is complete - which may
    be a very long time, especially for very large corrupted values.

    This is fixed by only reporting the corruption once each time we try to
    read the directory.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"
    Cc: Eugene Teo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • For blocksize < pagesize we need to remove blocks that got allocated in
    block_write_begin() if we fail with ENOSPC for later blocks.
    block_write_begin() internally does this if it allocated page locally.
    This makes sure we don't have blocks outside inode.i_size during ENOSPC.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • This fixes a bug where readdir() would return a directory entry twice
    if there was a hash collision in an hash tree indexed directory.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Eugene Dashevsky
    Signed-off-by: Mike Snitzer
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eugene Dashevsky
     
  • In ordered mode, if a file data buffer being dirtied exists in the
    committing transaction, we write the buffer to the disk, move it from the
    committing transaction to the running transaction, then dirty it. But we
    don't have to remove the buffer from the committing transaction when the
    buffer couldn't be written out, otherwise it would miss the error and the
    committing transaction would not abort.

    This patch adds an error check before removing the buffer from the
    committing transaction.

    Signed-off-by: Hidehiro Kawai
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • If the journal doesn't abort when it gets an IO error in file data blocks,
    the file data corruption will spread silently. Because most of
    applications and commands do buffered writes without fsync(), they don't
    notice the IO error. It's scary for mission critical systems. On the
    other hand, if the journal aborts whenever it gets an IO error in file
    data blocks, the system will easily become inoperable. So this patch
    introduces a filesystem option to determine whether it aborts the journal
    or just call printk() when it gets an IO error in file data.

    If you mount a ext3 fs with data_err=abort option, it aborts on file data
    write error. If you mount it with data_err=ignore, it doesn't abort, just
    call printk(). data_err=ignore is the default.

    Signed-off-by: Hidehiro Kawai
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • We could run into ENOSPC error on ext3, even when there is free blocks on
    the filesystem.

    The problem is triggered in the case the goal block group has 0 free
    blocks , and the rest block groups are skipped due to the check of
    "free_blocks < windowsz/2". Current code could fall back to non
    reservation allocation to prevent early ENOSPC after examing all the block
    groups with reservation on , but this code was bypassed if the reservation
    window is turned off already, which is true in this case.

    This patch fixed two issues:
    1) We don't need to turn off block reservation if the goal block group has
    0 free blocks left and continue search for the rest of block groups.

    Current code the intention is to turn off the block reservation if the
    goal allocation group has a few (some) free blocks left (not enough for
    make the desired reservation window),to try to allocation in the goal
    block group, to get better locality. But if the goal blocks have 0 free
    blocks, it should leave the block reservation on, and continues search for
    the next block groups,rather than turn off block reservation completely.

    2) we don't need to check the window size if the block reservation is off.

    The problem was originally found and fixed in ext4.

    Signed-off-by: Mingming Cao
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
     
  • When trying to resize a ext3 fs and you run out of reserved gdt blocks,
    you get an error that doesn't actually tell you what went wrong, it just
    says that the gdb it picked is not correct, which is the case since you
    don't have any reserved gdt blocks left. This patch adds a check to make
    sure you have reserved gdt blocks to use, and if not prints out a more
    relevant error.

    Signed-off-by: Josef Bacik
    Cc:
    Cc: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     
  • Currently, original metadata buffers are dirtied when they are unfiled
    whether the journal has aborted or not. Eventually these buffers will be
    written-back to the filesystem by pdflush. This means some metadata
    buffers are written to the filesystem without journaling if the journal
    aborts. So if both journal abort and system crash happen at the same
    time, the filesystem would become inconsistent state. Additionally,
    replaying journaled metadata can overwrite the latest metadata on the
    filesystem partly. Because, if the journal aborts, journaled metadata are
    preserved and replayed during the next mount not to lose uncheckpointed
    metadata. This would also break the consistency of the filesystem.

    This patch prevents original metadata buffers from being dirtied on abort
    by clearing BH_JBDDirty flag from those buffers. Thus, no metadata
    buffers are written to the filesystem without journaling.

    Signed-off-by: Hidehiro Kawai
    Acked-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai