02 Feb, 2006

3 commits

  • If large amounts of zone memory are used by empty slabs then zone_reclaim
    becomes uneffective. This patch shakes the slab a bit.

    The problem with this patch is that the slab reclaim is not containable to a
    zone. Thus slab reclaim may affect the whole system and be extremely slow.
    This also means that we cannot determine how many pages were freed in this
    zone. Thus we need to go off node for at least one allocation.

    The functionality is disabled by default.

    We could modify the shrinkers to take a zone parameter but that would be quite
    invasive. Better ideas are welcome.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • In some situations one may want zone_reclaim to behave differently. For
    example a process writing large amounts of memory will spew unto other nodes
    to cache the writes if many pages in a zone become dirty. This may impact the
    performance of processes running on other nodes.

    Allowing writes during reclaim puts a stop to that behavior and throttles the
    process by restricting the pages to the local zone.

    Similarly one may want to contain processes to local memory by enabling
    regular swap behavior during zone_reclaim. Off node memory allocation can
    then be controlled through memory policies and cpusets.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently the zone_reclaim code has a fixed window of 30 seconds of off node
    allocations should a local zone have no unused pagecache pages left. Reclaim
    will be attempted again after this timeout period to avoid repeated useless
    scans for memory. This is also useful to established sufficiently large off
    node allocation chunks to relieve the local node.

    It may be beneficial to adjust that time period for some special situations.
    For example if memory use was exceeding node capacity one may want to give up
    for longer periods of time. If memory spikes intermittendly then one may want
    to shorten the time period to reduce the number of off node allocations.

    This patch allows just that....

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

19 Jan, 2006

1 commit

  • proc support for zone reclaim

    This patch creates a proc entry /proc/sys/vm/zone_reclaim_mode that may be
    used to override the automatic determination of the zone reclaim made on
    bootup.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

09 Jan, 2006

2 commits

  • As recently there has been lot of traffic on the right values for batch and
    high water marks for per_cpu_pagelists. This patch makes these two
    variables configurable through /proc interface.

    A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry
    controls the fraction of pages at most in each zone that are allocated for
    each per cpu page list. The min value for this is 8. It means that we
    don't allow more than 1/8th of pages in each zone to be allocated in any
    single per_cpu_pagelist.

    The batch value of each per cpu pagelist is also updated as a result. It
    is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)

    Signed-off-by: Rohit Seth
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rohit Seth
     
  • Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to
    discard as much pagecache and/or reclaimable slab objects as it can. THis
    operation requires root permissions.

    It won't drop dirty data, so the user should run `sync' first.

    Caveats:

    a) Holds inode_lock for exorbitant amounts of time.

    b) Needs to be taught about NUMA nodes: propagate these all the way through
    so the discarding can be controlled on a per-node basis.

    This is a debugging feature: useful for getting consistent results between
    filesystem benchmarks. We could possibly put it under a config option, but
    it's less than 300 bytes.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

24 Jun, 2005

1 commit

  • Add a new `suid_dumpable' sysctl:

    This value can be used to query and set the core dump mode for setuid
    or otherwise protected/tainted binaries. The modes are

    0 - (default) - traditional behaviour. Any process which has changed
    privilege levels or is execute only will not be dumped

    1 - (debug) - all processes dump core when possible. The core dump is
    owned by the current user and no security is applied. This is intended
    for system debugging situations only. Ptrace is unchecked.

    2 - (suidsafe) - any binary which normally would not be dumped is dumped
    readable by root only. This allows the end user to remove such a dump but
    not access it directly. For security reasons core dumps in this mode will
    not overwrite one another or other files. This mode is appropriate when
    adminstrators are attempting to debug problems in a normal environment.

    (akpm:

    > > +EXPORT_SYMBOL(suid_dumpable);
    >
    > EXPORT_SYMBOL_GPL?

    No problem to me.

    > > if (current->euid == current->uid && current->egid == current->gid)
    > > current->mm->dumpable = 1;
    >
    > Should this be SUID_DUMP_USER?

    Actually the feedback I had from last time was that the SUID_ defines
    should go because its clearer to follow the numbers. They can go
    everywhere (and there are lots of places where dumpable is tested/used
    as a bool in untouched code)

    > Maybe this should be renamed to `dump_policy' or something. Doing that
    > would help us catch any code which isn't using the #defines, too.

    Fair comment. The patch was designed to be easy to maintain for Red Hat
    rather than for merging. Changing that field would create a gigantic
    diff because it is used all over the place.

    )

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds