28 Oct, 2010

1 commit

  • When dirty_ratio or dirty_bytes is written the other parameter is disabled
    and set to 0 (in dirty_bytes_handler() / dirty_ratio_handler()).

    We do the same for dirty_background_ratio and dirty_background_bytes.

    However, the sysctl documentation says that the counterpart becomes
    a function of the old value, which is not correct.

    Clarify the documentation so that it reports the actual behaviour.

    Reviewed-by: Greg Thelen
    Acked-by: David Rientjes
    Signed-off-by: Andrea Righi
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
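
As a rough illustration of the behaviour this commit documents, here is a minimal sketch in Python (a toy model, not the kernel handlers). The class name `SysctlPair` and its methods are hypothetical, and the starting ratio of 20 is an arbitrary value chosen for the example.

```python
class SysctlPair:
    """Toy model of a ratio/bytes sysctl pair such as
    dirty_ratio/dirty_bytes or dirty_background_ratio/dirty_background_bytes.
    Hypothetical helper for illustration only."""

    def __init__(self, ratio=20, nbytes=0):
        # Arbitrary starting values for the example.
        self.ratio = ratio
        self.nbytes = nbytes

    def write_ratio(self, value):
        # Writing the ratio disables the bytes counterpart (it reads as 0).
        self.ratio = value
        self.nbytes = 0

    def write_bytes(self, value):
        # Writing the bytes value disables the ratio counterpart.
        self.nbytes = value
        self.ratio = 0


dirty = SysctlPair()
dirty.write_bytes(8192)
print(dirty.ratio, dirty.nbytes)
```

The point of the model is that the counterpart is simply reset to 0 and is not recomputed from the value that was written.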
     

10 Aug, 2010

1 commit

  • The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
    very helpful in diagnosing why a user's task has been killed.
    It emits useful information, such as each eligible thread's memory usage,
    that can help determine why the system is oom, so it should be enabled by
    default.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

28 Jun, 2010

1 commit


25 May, 2010

2 commits

  • …hen it should be reclaimed

    The kernel applies some heuristics when deciding if memory should be
    compacted or reclaimed to satisfy a high-order allocation. One of these
    is based on the fragmentation. If the index is below 500, memory will not
    be compacted. This choice is arbitrary and not based on data. To help
    optimise the system and set a sensible default for this value, this patch
    adds a sysctl extfrag_threshold. The kernel will only compact memory if
    the fragmentation index is above the extfrag_threshold.

    [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
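
The threshold check described in this commit can be sketched as follows. This is a simplified Python model under the assumption that the decision reduces to a single comparison; the function name `should_compact` is hypothetical, and the default of 500 is the value quoted in the commit message.

```python
def should_compact(fragmentation_index, extfrag_threshold=500):
    """Return True if memory would be compacted rather than reclaimed.

    Simplified model: compaction is attempted only when the fragmentation
    index is above extfrag_threshold. Any special index values the real
    kernel may use are not modelled here.
    """
    return fragmentation_index > extfrag_threshold


print(should_compact(300))   # below the threshold
print(should_compact(800))   # above the threshold
```

Lowering the threshold in this model makes compaction more likely, raising it makes reclaim more likely.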
     
  • Add a proc file /proc/sys/vm/compact_memory. When an arbitrary value is
    written to the file, all zones are compacted. The expected user of such a
    trigger is a job scheduler that prepares the system before the target
    application runs.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
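
A job scheduler could trigger compaction along these lines. This is a minimal Python sketch; the helper name `trigger_compaction` is hypothetical, and writing to the real file is assumed to need root privileges and a kernel built with compaction support.

```python
from pathlib import Path

# Path named in the commit message.
COMPACT_MEMORY = Path("/proc/sys/vm/compact_memory")


def trigger_compaction(path=COMPACT_MEMORY):
    """Write a value to the trigger file to request compaction of all zones.

    Per the commit message the value written is arbitrary; "1" is used here.
    The path parameter exists only so the helper can be tried on a scratch file.
    """
    Path(path).write_text("1\n")
```

A scheduler would call `trigger_compaction()` before launching the target application and treat a `PermissionError` or `FileNotFoundError` as "compaction not available".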
     

16 May, 2010

1 commit

  • With RPS inclusion, skb timestamping is not consistent in RX path.

    If netif_receive_skb() is used, it's deferred after RPS dispatch.

    If netif_rx() is used, it's done before RPS dispatch.

    This can give strange tcpdump timestamp results.

    I think timestamping should be done as soon as possible in the receive
    path, to get meaningful values (ie timestamps taken at the time packet
    was delivered by NIC driver to our stack), even if NAPI already can
    defer timestamping a bit (RPS can help to reduce the gap)

    Tom Herbert prefers to sample timestamps after RPS dispatch. In case
    sampling is expensive (HPET/acpi_pm on x86), this makes sense.

    Let admins switch from one mode to another, using a new
    sysctl, /proc/sys/net/core/netdev_tstamp_prequeue

    Its default value (1), means timestamps are taken as soon as possible,
    before backlog queueing, giving accurate timestamps.

    Setting it to 0 samples timestamps when processing the backlog,
    after RPS dispatch, to lower the load of the pre-RPS cpu.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Mar, 2010

1 commit

  • Presently, if panic_on_oom=2, the whole system panics even if the oom
    happened in some special situation (such as cpuset, mempolicy, ...). So
    panic_on_oom=2 means panic_on_oom_always.

    Currently, memcg doesn't check the panic_on_oom flag. This patch adds a check.

    BTW, how is it useful?

    kdump+panic_on_oom=2 is the last tool to investigate what happens in an
    oom-ed system. When a task is killed, the system recovers and there will
    be few hints as to what happened. In a mission critical system, oom should
    never happen. So panic_on_oom=2+kdump is useful to avoid the next OOM by
    getting precise information via a snapshot.

    TODO:
    - For memcg, since it is for isolating the system's memory usage, oom-notifier
    and freeze_at_oom (or rest_at_oom) should be implemented. Then, a management
    daemon can do similar jobs (as kdump) or take a snapshot per cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Nick Piggin
    Reviewed-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

12 Dec, 2009

1 commit


10 Dec, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits)
    tree-wide: fix misspelling of "definition" in comments
    reiserfs: fix misspelling of "journaled"
    doc: Fix a typo in slub.txt.
    inotify: remove superfluous return code check
    hdlc: spelling fix in find_pvc() comment
    doc: fix regulator docs cut-and-pasteism
    mtd: Fix comment in Kconfig
    doc: Fix IRQ chip docs
    tree-wide: fix assorted typos all over the place
    drivers/ata/libata-sff.c: comment spelling fixes
    fix typos/grammos in Documentation/edac.txt
    sysctl: add missing comments
    fs/debugfs/inode.c: fix comment typos
    sgivwfb: Make use of ARRAY_SIZE.
    sky2: fix sky2_link_down copy/paste comment error
    tree-wide: fix typos "couter" -> "counter"
    tree-wide: fix typos "offest" -> "offset"
    fix kerneldoc for set_irq_msi()
    spidev: fix double "of of" in comment
    comment typo fix: sybsystem -> subsystem
    ...

    Linus Torvalds
     

04 Dec, 2009

1 commit

  • That is "success", "unknown", "through", "performance", "[re|un]mapping"
    , "access", "default", "reasonable", "[con]currently", "temperature"
    , "channel", "[un]used", "application", "example","hierarchy", "therefore"
    , "[over|under]flow", "contiguous", "threshold", "enough" and others.

    Signed-off-by: André Goddard Rosa
    Signed-off-by: Jiri Kosina

    André Goddard Rosa
     

19 Nov, 2009

1 commit


09 Nov, 2009

1 commit


24 Sep, 2009

3 commits

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • Introduce core pipe limiting sysctl.

    Since we can dump cores to pipe, rather than directly to the filesystem,
    we create a condition in which a user can create a very high load on the
    system simply by running bad applications.

    If the pipe reader specified in core_pattern is poorly written, we can
    have lots of outstanding resources and processes in the system.

    This sysctl introduces an ability to limit that resource consumption.
    core_pipe_limit defines how many in-flight dumps may be run in parallel;
    dumps beyond this value are skipped and a note is made in the kernel log.
    A special value of 0 in core_pipe_limit means that an unlimited number of
    core dumps may be handled (this is the default value).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
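
The limiting rule can be sketched as a small counter. This is a toy Python model, not the kernel implementation; the class `CorePipeLimiter` and its method names are hypothetical.

```python
class CorePipeLimiter:
    """Toy model of core_pipe_limit: caps the number of in-flight piped dumps."""

    def __init__(self, core_pipe_limit=0):
        self.core_pipe_limit = core_pipe_limit  # 0 means unlimited (the default)
        self.in_flight = 0
        self.skipped = 0  # stands in for the note made in the kernel log

    def try_start_dump(self):
        """Return True if a new dump may start, False if it is skipped."""
        if self.core_pipe_limit and self.in_flight >= self.core_pipe_limit:
            self.skipped += 1
            return False
        self.in_flight += 1
        return True

    def finish_dump(self):
        """Record that one in-flight dump has completed."""
        if self.in_flight > 0:
            self.in_flight -= 1


limiter = CorePipeLimiter(core_pipe_limit=2)
results = [limiter.try_start_dump() for _ in range(3)]
print(results, limiter.skipped)
```

With a limit of 2, the third concurrent dump in this model is skipped until one of the first two finishes.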
     
  • In "documentation: update Documentation/filesystem/proc.txt and
    Documentation/sysctls" (commit 760df93ec) we merged /proc/sys/fs
    documentation in Documentation/sysctl/fs.txt and
    Documentation/filesystem/proc.txt, but stale file-nr definition
    remained.

    This patch adds back the right file-nr definition for the 2.6 kernel.

    Signed-off-by: Xiaotian Feng
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     

23 Sep, 2009

1 commit

  • When syslog is not possible and no serial/net console is available,
    it is hard to read the printk messages, for example oops/panic/warning
    messages in the shutdown phase.

    Add a printk delay feature that makes each printk message delay by some
    milliseconds.

    The delay is set through the proc/sysctl interface: /proc/sys/kernel/printk_delay

    The value ranges from 0 to 10000; the default value is 0.

    [akpm@linux-foundation.org: fix a few things]
    Signed-off-by: Dave Young
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

22 Sep, 2009

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    trivial: fix typo in aic7xxx comment
    trivial: fix comment typo in drivers/ata/pata_hpt37x.c
    trivial: typo in kernel-parameters.txt
    trivial: fix typo in tracing documentation
    trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
    trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
    trivial: remove unnecessary semicolons
    trivial: Fix duplicated word "options" in comment
    trivial: kbuild: remove extraneous blank line after declaration of usage()
    trivial: improve help text for mm debug config options
    trivial: doc: hpfall: accept disk device to unload as argument
    trivial: doc: hpfall: reduce risk that hpfall can do harm
    trivial: SubmittingPatches: Fix reference to renumbered step
    trivial: fix typos "man[ae]g?ment" -> "management"
    trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
    trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
    trivial: fix missing printk space in amd_k7_smp_check
    trivial: fix typo s/ketymap/keymap/ in comment
    trivial: fix typo "to to" in multiple files
    trivial: fix typos in comments s/DGBU/DBGU/
    ...

    Linus Torvalds
     
  • Reported-by: Christian Thaeter
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

21 Sep, 2009

1 commit

  • The documentation for /proc/sys/kernel/* does not yet mention the possible
    value 2 for randomize-va-space. While there, do some reformatting, fix
    grammar problems and clarify the correlations between
    randomize-va-space, the kernel parameter "norandmaps" and the
    CONFIG_COMPAT_BRK option.

    Signed-off-by: Horst Schirmeier
    Signed-off-by: Jiri Kosina

    Horst Schirmeier
     

16 Sep, 2009

1 commit

  • Add the high level memory handler that poisons pages
    that got corrupted by hardware (typically by a two bit flip in a DIMM
    or a cache) on the Linux level. The goal is to prevent everyone
    from accessing these pages in the future.

    This is done at the VM level by marking a page hwpoisoned
    and doing the appropriate action based on the type of page
    it is.

    The code that does this is portable and lives in mm/memory-failure.c

    To quote the overview comment:

    High level machine check handler. Handles pages reported by the
    hardware as being corrupted usually due to a 2bit ECC memory or cache
    failure.

    This focuses on pages detected as corrupted in the background.
    When the current CPU tries to consume corruption the currently
    running process can just be killed directly instead. This implies
    that if the error cannot be handled for some reason it's safe to
    just ignore it because no corruption has been consumed yet. Instead
    when that happens another machine check will happen.

    Handles page cache pages in various states. The tricky part
    here is that we can access any page asynchronous to other VM
    users, because memory failures could happen anytime and anywhere,
    possibly violating some of their assumptions. This is why this code
    has to be extremely careful. Generally it tries to use normal locking
    rules, as in get the standard locks, even if that means the
    error handling takes potentially a long time.

    Some of the operations here are somewhat inefficient and have non
    linear algorithmic complexity, because the data structures have not
    been optimized for this case. This is in particular the case
    for the mapping from a vma to a process. Since this case is expected
    to be rare we hope we can get away with this.

    There are in principle two strategies to kill processes on poison:
    - just unmap the data and wait for an actual reference before
    killing
    - kill as soon as corruption is detected.
    Both have advantages and disadvantages and should be used
    in different situations. Right now both are implemented and can
    be switched with a new sysctl vm.memory_failure_early_kill
    The default is early kill.

    The patch does some rmap data structure walking on its own to collect
    processes to kill. This is unusual because normally all rmap data structure
    knowledge is in rmap.c only. I put it here for now to keep
    everything together and rmap knowledge has been seeping out anyways

    Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
    Nick Piggin (who did a lot of great work) and others.

    Cc: npiggin@suse.de
    Cc: riel@redhat.com
    Signed-off-by: Andi Kleen
    Acked-by: Rik van Riel
    Reviewed-by: Hidehiro Kawai

    Andi Kleen
     

11 Sep, 2009

1 commit


17 Jun, 2009

2 commits

  • A bug was brought to my attention against a distro kernel but it affects
    mainline and I believe problems like this have been reported in various
    guises on the mailing lists although I don't have specific examples at the
    moment.

    The reported problem was that malloc() stalled for a long time (minutes in
    some cases) if a large tmpfs mount was occupying a large percentage of
    memory overall. The pages did not get cleaned or reclaimed by
    zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
    are uselessly scanned frequently, making the CPU spin at near 100%.

    This patchset intends to address that bug and bring the behaviour of
    zone_reclaim() more in line with expectations which were noticed during
    investigation. It is based on top of mmotm and takes advantage of
    Kosaki's work with respect to zone_reclaim().

    Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
    scan should go ahead. The broken heuristic is what was causing the
    malloc() stall as it uselessly scanned the LRU constantly. Currently,
    zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
    could not deal with tmpfs pages at all. This fixes up the heuristic so
    that an unnecessary scan is more likely to be correctly avoided.

    Patch 2 notes that zone_reclaim() returning a failure automatically means
    the zone is marked full. This is not always true. It could have
    failed because the GFP mask or zone_reclaim_mode were unsuitable.

    Patch 3 introduces a counter zreclaim_failed that will increment each
    time the zone_reclaim scan-avoidance heuristics fail. If that
    counter is rapidly increasing, then zone_reclaim_mode should be
    set to 0 as a temporary resolution and a bug reported because
    the scan-avoidance heuristic is still broken.

    This patch:

    On NUMA machines, the administrator can configure zone_reclaim_mode, which
    is a more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile but the
    problem is that the heuristic is not being properly applied and is
    basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of
    proper detection can manifest as high CPU usage as the LRU list is scanned
    uselessly.

    Historically, once enabled it was depending on NR_FILE_PAGES which may
    include swapcache pages that the reclaim_mode cannot deal with. Patch
    vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
    Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
    pages that were not file-backed such as swapcache and made a calculation
    based on the inactive, active and mapped files. This is far superior when
    zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
    reasonable starting figure.

    This patch alters how zone_reclaim() works out how many pages it might be
    able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in
    the reclaim_mode it will either consider NR_FILE_PAGES as potential
    candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
    swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
    then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
    not set, then NR_FILE_MAPPED are not.

    [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
    [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
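
The candidate-page estimate described under "This patch" can be sketched in Python as below. This is an illustrative model only: the function name and parameter names are hypothetical, the bit values assumed for RECLAIM_ZONE, RECLAIM_WRITE and RECLAIM_SWAP are 1, 2 and 4, and the clamping at zero reflects the underflow fix credited in the commit trailers.

```python
# Assumed bit values for zone_reclaim_mode (illustration only).
RECLAIM_ZONE = 1 << 0
RECLAIM_WRITE = 1 << 1
RECLAIM_SWAP = 1 << 2


def reclaimable_pagecache(mode, nr_file_pages, nr_inactive_file,
                          nr_active_file, nr_file_mapped, nr_file_dirty):
    """Estimate how many pages zone_reclaim() might be able to reclaim."""
    if mode & RECLAIM_SWAP:
        # Swap is allowed, so every page-cache page is a potential candidate.
        candidates = nr_file_pages
    else:
        # Only unmapped file pages count; clamp to avoid underflow.
        file_lru = nr_inactive_file + nr_active_file
        candidates = max(file_lru - nr_file_mapped, 0)

    excluded = 0
    if not (mode & RECLAIM_WRITE):
        # Dirty pages cannot be written out, so they are not candidates.
        excluded += nr_file_dirty

    # Never report fewer than zero candidates.
    return candidates - min(excluded, candidates)


print(reclaimable_pagecache(RECLAIM_ZONE, 1000, 300, 200, 100, 50))
```

If the estimate is small, the scan is unlikely to be worthwhile, which is the situation the malloc() stall report describes.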
     
  • ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
    pages_min, pages_low or pages_high is used as the zone watermark when
    allocating the pages. Two branches in the allocator hotpath determine
    which watermark to use.

    This patch uses the flags as an array index into a watermark array that is
    indexed with WMARK_* defines accessed via helpers. All call sites that
    use zone->pages_* are updated to use the helpers for accessing the values
    and the array offsets for setting.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
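
The idea of replacing branches with an array lookup can be sketched as follows. This is a Python illustration rather than the kernel code; the mask `ALLOC_WMARK_MASK`, the extra flag `ALLOC_OTHER` and the helper `watermark_for` are hypothetical names introduced for the example.

```python
# Array indices for the per-zone watermark array.
WMARK_MIN, WMARK_LOW, WMARK_HIGH = 0, 1, 2

# The allocation flags reuse the array indices in their low bits.
ALLOC_WMARK_MIN = WMARK_MIN
ALLOC_WMARK_LOW = WMARK_LOW
ALLOC_WMARK_HIGH = WMARK_HIGH
ALLOC_WMARK_MASK = 0x03   # hypothetical mask selecting the index bits
ALLOC_OTHER = 0x10        # hypothetical unrelated flag


def watermark_for(zone_watermarks, alloc_flags):
    """Pick the watermark by indexing instead of branching on each flag."""
    return zone_watermarks[alloc_flags & ALLOC_WMARK_MASK]


zone_watermarks = [10, 20, 30]  # min, low, high (example page counts)
print(watermark_for(zone_watermarks, ALLOC_WMARK_LOW | ALLOC_OTHER))
```

Masking off the unrelated bits leaves only the index, so the hot path needs no conditional to choose between the three watermarks.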
     

13 Jun, 2009

1 commit


22 May, 2009

1 commit


15 May, 2009

1 commit

  • This reverts commit fafd688e4c0c34da0f3de909881117d374e4c7af.

    Work is progressing to switch away from pdflush as the process backing
    for flushing out dirty data. So it seems pointless to add more knobs
    to control pdflush threads. The original author of the patch did not
    have any specific use cases for adding the knobs, so we can easily
    revert this before 2.6.30 to avoid having to maintain this API
    forever.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 May, 2009

1 commit


03 May, 2009

1 commit

  • Avoid setting less than two pages for vm_dirty_bytes: this is necessary to
    avoid potential division by 0 (like the following) in get_dirty_limits().

    [ 49.951610] divide error: 0000 [#1] PREEMPT SMP
    [ 49.952195] last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/uevent
    [ 49.952195] CPU 1
    [ 49.952195] Modules linked in: pcspkr
    [ 49.952195] Pid: 3064, comm: dd Not tainted 2.6.30-rc3 #1
    [ 49.952195] RIP: 0010:[] [] get_dirty_limits+0xe9/0x2c0
    [ 49.952195] RSP: 0018:ffff88001de03a98 EFLAGS: 00010202
    [ 49.952195] RAX: 00000000000000c0 RBX: ffff88001de03b80 RCX: 28f5c28f5c28f5c3
    [ 49.952195] RDX: 0000000000000000 RSI: 00000000000000c0 RDI: 0000000000000000
    [ 49.952195] RBP: ffff88001de03ae8 R08: 0000000000000000 R09: 0000000000000000
    [ 49.952195] R10: ffff88001ddda9a0 R11: 0000000000000001 R12: 0000000000000001
    [ 49.952195] R13: ffff88001fbc8218 R14: ffff88001de03b70 R15: ffff88001de03b78
    [ 49.952195] FS: 00007fe9a435b6f0(0000) GS:ffff8800025d9000(0000) knlGS:0000000000000000
    [ 49.952195] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 49.952195] CR2: 00007fe9a39ab000 CR3: 000000001de38000 CR4: 00000000000006e0
    [ 49.952195] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 49.952195] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 49.952195] Process dd (pid: 3064, threadinfo ffff88001de02000, task ffff88001ddda250)
    [ 49.952195] Stack:
    [ 49.952195] ffff88001fa0de00 ffff88001f2dbd70 ffff88001f9fe800 000080b900000000
    [ 49.952195] 00000000000000c0 ffff8800027a6100 0000000000000400 ffff88001fbc8218
    [ 49.952195] 0000000000000000 0000000000000600 ffff88001de03bb8 ffffffff802d3ed7
    [ 49.952195] Call Trace:
    [ 49.952195] [] balance_dirty_pages_ratelimited_nr+0x1d7/0x3f0
    [ 49.952195] [] ? ext3_writeback_write_end+0x9e/0x120
    [ 49.952195] [] generic_file_buffered_write+0x12f/0x330
    [ 49.952195] [] __generic_file_aio_write_nolock+0x26d/0x460
    [ 49.952195] [] ? generic_file_aio_write+0x52/0xd0
    [ 49.952195] [] generic_file_aio_write+0x69/0xd0
    [ 49.952195] [] ext3_file_write+0x26/0xc0
    [ 49.952195] [] do_sync_write+0xf1/0x140
    [ 49.952195] [] ? get_lock_stats+0x2a/0x60
    [ 49.952195] [] ? autoremove_wake_function+0x0/0x40
    [ 49.952195] [] vfs_write+0xcb/0x190
    [ 49.952195] [] sys_write+0x50/0x90
    [ 49.952195] [] system_call_fastpath+0x16/0x1b
    [ 49.952195] Code: 00 00 00 2b 05 09 1c 17 01 48 89 c6 49 0f af f4 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 d0 48 89 f0 f7 f7 41 8b 95 ac 01 00 00 48 89 c7 49 0f af d4 48 c1 ea 02
    [ 49.952195] RIP [] get_dirty_limits+0xe9/0x2c0
    [ 49.952195] RSP
    [ 50.096523] ---[ end trace 008d7aa02f244d7b ]---

    Signed-off-by: Andrea Righi
    Cc: Peter Zijlstra
    Cc: David Rientjes
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     

14 Apr, 2009

1 commit


07 Apr, 2009

1 commit

  • Add /proc entries to give the admin the ability to control the minimum and
    maximum number of pdflush threads. This allows finer control of pdflush
    on both large and small machines.

    The rationale is simply one size does not fit all. Admins on large and/or
    small systems may want to tune the min/max pdflush thread count to best
    suit their needs. Right now the min/max is hardcoded to 2/8. While
    probably a fair estimate for smaller machines, large machines with large
    numbers of CPUs and large numbers of filesystems/block devices may benefit
    from larger numbers of threads working on different block devices.

    Even if the background flushing algorithm is radically changed, it is
    still likely that multiple threads will be involved and admins would still
    desire finer control on the min/max other than to have to recompile the
    kernel.

    The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
    '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.

    The minimum value for nr_pdflush_threads_min is 1 and the maximum value is
    the current value of nr_pdflush_threads_max. This minimum is required
    since additional thread creation is performed in a pdflush thread itself.

    The minimum value for nr_pdflush_threads_max is the current value of
    nr_pdflush_threads_min and the maximum value can be 1000.

    Documentation/sysctl/vm.txt is also updated.

    [akpm@linux-foundation.org: fix comment, fix whitespace, use __read_mostly]
    Signed-off-by: Peter W Morreale
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter W Morreale
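
The interlocking ranges of the two knobs can be sketched as a pair of checks. This is a Python model of the rules stated above; the function names are hypothetical.

```python
PDFLUSH_THREADS_MAX_LIMIT = 1000  # upper bound quoted in the commit message


def valid_threads_min(new_min, current_max):
    """nr_pdflush_threads_min must lie in [1, current nr_pdflush_threads_max]."""
    return 1 <= new_min <= current_max


def valid_threads_max(new_max, current_min):
    """nr_pdflush_threads_max must lie in [current nr_pdflush_threads_min, 1000]."""
    return current_min <= new_max <= PDFLUSH_THREADS_MAX_LIMIT


# Starting from the previously hardcoded 2/8 defaults:
print(valid_threads_min(4, current_max=8))
print(valid_threads_max(1, current_min=2))
```

Because each bound depends on the current value of the other knob, raising the minimum above 8 in this model requires raising the maximum first.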
     

03 Apr, 2009

3 commits

  • The previous description of the system parameters in /proc/sys/net/unix/ is
    wrong (or missing). Simply add a new description of unix_dgram_qlen
    according to the latest kernel.

    Signed-off-by: Li Xiaodong
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Xiaodong
     
  • Now /proc/sys is described in many places and much information is
    redundant. This patch updates proc.txt and moves the /proc/sys
    description out to the files in Documentation/sysctls.

    Details are:

    merge
    - 2.1 /proc/sys/fs - File system data
    - 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
    - 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface
    with Documentation/sysctls/fs.txt.

    remove
    - 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
    since it's not better than Documentation/binfmt_misc.txt.

    merge
    - 2.3 /proc/sys/kernel - general kernel parameters
    with Documentation/sysctls/kernel.txt

    remove
    - 2.5 /proc/sys/dev - Device specific parameters
    since it's obsolete; sysfs is used now.

    remove
    - 2.6 /proc/sys/sunrpc - Remote procedure calls
    since it's not better than Documentation/sysctls/sunrpc.txt.

    move
    - 2.7 /proc/sys/net - Networking stuff
    - 2.9 Appletalk
    - 2.10 IPX
    to newly created Documentation/sysctls/net.txt.

    remove
    - 2.8 /proc/sys/net/ipv4 - IPV4 settings
    since it's not better than Documentation/networking/ip-sysctl.txt.

    add
    - Chapter 3 Per-Process Parameters
    to describe /proc//xxx parameters.

    Signed-off-by: Shen Feng
    Cc: Randy Dunlap
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shen Feng
     
  • Implement a sysctl file that disables module-loading system-wide since
    there is no longer a viable way to remove CAP_SYS_MODULE after the system
    bounding capability set was removed in 2.6.25.

    Value can only be set to "1", and is tested only if standard capability
    checks allow CAP_SYS_MODULE. Given existing /dev/mem protections, this
    should allow administrators a one-way method to block module loading
    after initial boot-time module loading has finished.

    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Kees Cook
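
The one-way nature of this switch can be modelled as a latch. This is a Python sketch under the assumptions stated in the commit message; the class `ModulesDisabledLatch` and its method names are hypothetical.

```python
class ModulesDisabledLatch:
    """Toy model of a one-way sysctl that blocks module loading once set."""

    def __init__(self):
        self.value = 0  # module loading allowed after boot

    def write(self, value):
        # Only "1" is accepted, so the latch can never be cleared again.
        if value != 1:
            raise ValueError("value can only be set to 1")
        self.value = 1

    def may_load_module(self, has_cap_sys_module):
        # The latch is consulted only if the capability check passes first.
        if not has_cap_sys_module:
            return False
        return self.value == 0


latch = ModulesDisabledLatch()
print(latch.may_load_module(has_cap_sys_module=True))
latch.write(1)
print(latch.may_load_module(has_cap_sys_module=True))
```

An administrator would flip the latch once boot-time module loading has finished; in this model nothing short of a reboot (a new object) re-enables loading.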
     

16 Jan, 2009

1 commit

  • Update Documentation/sysctl/vm.txt and Documentation/filesystems/proc.txt.
    More specifically, the section on /proc/sys/vm in
    Documentation/filesystems/proc.txt was removed and a link to
    Documentation/sysctl/vm.txt added.

    Most of the verbiage from proc.txt was simply moved into vm.txt, with new
    additional text for "swappiness" and "stat_interval".

    Signed-off-by: Peter W Morreale
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter W Morreale
     

08 Jan, 2009

1 commit

  • NOMMU mmap allocates a piece of memory for an mmap that's rounded up in size to
    the nearest power-of-2 number of pages. Currently it then discards the excess
    pages back to the page allocator, making that memory available for use by other
    things. This can, however, cause a greater amount of fragmentation.

    To counter this, a sysctl is added in order to fine-tune the trimming
    behaviour. The default behaviour remains to trim pages aggressively, while
    this can either be disabled completely or set to a higher page-granular
    watermark in order to have finer-grained control.

    vm region vm_top bits taken from an earlier patch by David Howells.

    Signed-off-by: Paul Mundt
    Signed-off-by: David Howells
    Tested-by: Mike Frysinger

    Paul Mundt
     

07 Jan, 2009

1 commit

  • This change introduces two new sysctls to /proc/sys/vm:
    dirty_background_bytes and dirty_bytes.

    dirty_background_bytes is the counterpart to dirty_background_ratio and
    dirty_bytes is the counterpart to dirty_ratio.

    With growing memory capacities of individual machines, it's no longer
    sufficient to specify dirty thresholds as a percentage of the amount of
    dirtyable memory over the entire system.

    dirty_background_bytes and dirty_bytes specify quantities of memory, in
    bytes, that represent the dirty limits for the entire system. If either
    of these values is set, its value represents the amount of dirty memory
    that is needed to commence either background or direct writeback.

    When a `bytes' or `ratio' file is written, its counterpart becomes a
    function of the written value. For example, if dirty_bytes is written to
    be 8192, 8K of memory is required to commence direct writeback.
    dirty_ratio is then functionally equivalent to 8K / the amount of
    dirtyable memory:

    dirtyable_memory = free pages + mapped pages + file cache

    dirty_background_bytes = dirty_background_ratio * dirtyable_memory
    -or-
    dirty_background_ratio = dirty_background_bytes / dirtyable_memory

    AND

    dirty_bytes = dirty_ratio * dirtyable_memory
    -or-
    dirty_ratio = dirty_bytes / dirtyable_memory

    Only one of dirty_background_bytes and dirty_background_ratio may be
    specified at a time, and only one of dirty_bytes and dirty_ratio may be
    specified. When one sysctl is written, the other appears as 0 when read.

    The `bytes' files operate on a page size granularity since dirty limits
    are compared with ZVC values, which are in page units.

    Prior to this change, the minimum dirty_ratio was 5 as implemented by
    get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
    written value between 0 and 100. This restriction is maintained, but
    dirty_bytes has a lower limit of only one page.

    Also prior to this change, the dirty_background_ratio could not equal or
    exceed dirty_ratio. This restriction is maintained in addition to
    restricting dirty_background_bytes. If either background threshold equals
    or exceeds that of the dirty threshold, it is implicitly set to half the
    dirty threshold.

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
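
The threshold rules in this commit message can be worked through with a small Python sketch. It is a simplification that computes in bytes rather than pages and takes ratios as percentages; the function name `dirty_thresholds` and its parameters are hypothetical.

```python
def dirty_thresholds(dirtyable_bytes, dirty_ratio=0, dirty_bytes=0,
                     dirty_background_ratio=0, dirty_background_bytes=0):
    """Return (background_threshold, dirty_threshold) in bytes.

    Only one of each bytes/ratio pair is expected to be non-zero.
    """
    if dirty_bytes:
        dirty = dirty_bytes
    else:
        # The minimum effective dirty_ratio of 5 is maintained.
        dirty = max(dirty_ratio, 5) * dirtyable_bytes // 100

    if dirty_background_bytes:
        background = dirty_background_bytes
    else:
        background = dirty_background_ratio * dirtyable_bytes // 100

    # A background threshold at or above the dirty threshold is
    # implicitly set to half the dirty threshold.
    if background >= dirty:
        background = dirty // 2

    return background, dirty


print(dirty_thresholds(1_000_000, dirty_ratio=20, dirty_background_ratio=10))
print(dirty_thresholds(1_000_000, dirty_bytes=8192, dirty_background_ratio=10))
```

In the second call the ratio-derived background threshold would exceed the 8192-byte dirty threshold, so the model halves the dirty threshold instead.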
     

30 Oct, 2008

1 commit


11 Oct, 2008

1 commit

  • We need to add a flag for all code that is in the drivers/staging/
    directory to prevent all other kernel developers from worrying about
    issues here, and to notify users that the drivers might not be as good
    as they are normally used to.

    Based on code from Andreas Gruenbacher and Jeff Mahoney to provide a
    TAINT flag for the support level of a kernel module in the Novell
    enterprise kernel release.

    This is the kernel portion of this feature, the ability for the flag to
    be set needs to be done in the build process and will happen in a
    follow-up patch.

    Cc: Andreas Gruenbacher
    Cc: Jeff Mahoney
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

23 Sep, 2008

1 commit


27 Jul, 2008

1 commit