19 Oct, 2007

4 commits

  • After going through the kernels sysctl tables several times it has become
    clear that code review and testing is just not effective in prevent
    problematic sysctl tables from being used in the stable kernel. I certainly
    can't seem to fix the problems as fast as they are introduced.

    Therefore this patch adds sysctl_check_table which is called when a sysctl
    table is registered and checks to see if we have a problematic sysctl table.

    The biggest part of the code is the table of valid binary sysctl entries, but
    since we have frozen our set of binary sysctls this table should not need to
    change, and it makes it much easier to detect when someone unintentionally
    adds a new binary sysctl value.

    As best as I can determine all of the several hundred errors spewed on boot up
    now are legitimate.

    [bunk@kernel.org: kernel/sysctl_check.c must #include ]
    Signed-off-by: Eric W. Biederman
    Cc: Alexey Dobriyan
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Grumble. These numbers should have been in sysctl.h from the beginning if we
    ever expected anyone to use them. Oh well put them there now so we can find
    them and make maintenance easier.

    Signed-off-by: Eric W. Biederman
    Acked-by: Samuel Ortiz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • There as been no easy way to wrap the default sysctl strategy routine except
    for returning 0. Which is not always what we want. The few instances I have
    seen that want different behaviour have written their own version of
    sysctl_data. While not too hard it is unnecessary code and has the potential
    for extra bugs.

    So to make these situations easier and make that part of sysctl more symetric
    I have factord sysctl_data out of do_sysctl_strategy and exported as a
    function everyone can use.

    Further having sysctl_data be an explicit function makes checking for badly
    formed sysctl tables much easier.

    Signed-off-by: Eric W. Biederman
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • In sysctl.h the typedef struct ctl_table ctl_table violates coding style isn't
    needed and is a bit of a nuisance because it makes it harder to recognize
    ctl_table is a type name.

    So this patch removes it from the generic sysctl code. Hopefully I will have
    enough energy to send the rest of my patches will follow and to remove it from
    the rest of the kernel.

    Signed-off-by: Eric W. Biederman
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

01 Aug, 2007

1 commit

  • commit eab03ac7bd3e0da99eb9dc068772a85a5e3f3577 aka
    "[PATCH] Get rid of /proc/sys/proc" was good commit except strace(1) compile
    breakage it introduced:

    system.c:1581: error: 'CTL_PROC' undeclared here (not in a function)

    So, add dummy enum back.

    Signed-off-by: Alexey Dobriyan
    Cc: Stephen Hemminger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

26 Apr, 2007

4 commits


25 Apr, 2007

1 commit


15 Feb, 2007

13 commits

  • It isn't needed anymore, all of the users are gone, and all of the ctl_table
    initializers have been converted to use explicit names of the fields they are
    initializing.

    [akpm@osdl.org: NTFS fix]
    Signed-off-by: Eric W. Biederman
    Acked-by: Stephen Smalley
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Add a parent entry into the ctl_table so you can walk the list of parents and
    find the entire path to a ctl_table entry.

    Signed-off-by: Eric W. Biederman
    Cc: Stephen Smalley
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • With this change the sysctl inodes can be cached and nothing needs to be done
    when removing a sysctl table.

    For a cost of 2K code we will save about 4K of static tables (when we remove
    de from ctl_table) and 70K in proc_dir_entries that we will not allocate, or
    about half that on a 32bit arch.

    The speed feels about the same, even though we can now cache the sysctl
    dentries :(

    We get the core advantage that we don't need to have a 1 to 1 mapping between
    ctl table entries and proc files. Making it possible to have /proc/sys vary
    depending on the namespace you are in. The currently merged namespaces don't
    have an issue here but the network namespace under /proc/sys/net needs to have
    different directories depending on which network adapters are visible. By
    simply being a cache different directories being visible depending on who you
    are is trivial to implement.

    [akpm@osdl.org: fix uninitialised var]
    [akpm@osdl.org: fix ARM build]
    [bunk@stusta.de: make things static]
    Signed-off-by: Eric W. Biederman
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The current logic to walk through the list of sysctl table headers is slightly
    painful and implement in a way it cannot be used by code outside sysctl.c

    I am in the process of implementing a version of the sysctl proc support that
    instead of using the proc generic non-caching monster, just uses the existing
    sysctl data structure as backing store for building the dcache entries and for
    doing directory reads. To use the existing data structures however I need a
    way to get at them.

    [akpm@osdl.org: warning fix]
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The semantic effect of insert_at_head is that it would allow new registered
    sysctl entries to override existing sysctl entries of the same name. Which is
    pain for caching and the proc interface never implemented.

    I have done an audit and discovered that none of the current users of
    register_sysctl care as (excpet for directories) they do not register
    duplicate sysctl entries.

    So this patch simply removes the support for overriding existing entries in
    the sys_sysctl interface since no one uses it or cares and it makes future
    enhancments harder.

    Signed-off-by: Eric W. Biederman
    Acked-by: Ralf Baechle
    Acked-by: Martin Schwidefsky
    Cc: Russell King
    Cc: David Howells
    Cc: "Luck, Tony"
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Andi Kleen
    Cc: Jens Axboe
    Cc: Corey Minyard
    Cc: Neil Brown
    Cc: "John W. Linville"
    Cc: James Bottomley
    Cc: Jan Kara
    Cc: Trond Myklebust
    Cc: Mark Fasheh
    Cc: David Chinner
    Cc: "David S. Miller"
    Cc: Patrick McHardy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • There are currently no users in the kernel for CTL_ANY and it only has effect
    on the binary interface which is practically unused.

    So this complicates sysctl lookups for no good reason so just remove it.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • ocfs2 was did not have the binary number it uses under CTL_FS registered in
    sysctl.h. Register it to avoid future conflicts, and change the name of the
    definition to be in line with the rest of the sysctl numbers.

    Signed-off-by: Eric W. Biederman
    Acked-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Signed-off-by: Eric W. Biederman
    Cc: "John W. Linville"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • We need to have the the definition of all top level sysctl directories
    registers in sysctl.h so we don't conflict by accident and cause abi problems.

    Signed-off-by: Eric W. Biederman
    Acked-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Signed-off-by: Eric W. Biederman
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Signed-off-by: Eric W. Biederman
    Cc: Trond Myklebust
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

09 Feb, 2007

1 commit


13 Dec, 2006

1 commit


11 Dec, 2006

1 commit


03 Dec, 2006

5 commits

  • This is in response to a request sent earlier by Eric W. Biederman
    and replaces all sysctl numbers for net.dccp.default with CTL_UNNUMBERED.

    It has been tested to compile and to work.

    Commiter note: I've removed the use of CTL_UNNUMBERED, not setting .ctl_name
    sets it to 0, that is the what CTL_UNNUMBERED is, reason is
    to avoid unneeded source code cluttering.

    Signed-off-by: Gerrit Renker
    Signed-off-by: Ian McDonald
    Signed-off-by: Arnaldo Carvalho de Melo

    Gerrit Renker
     
  • This one got lost on the way from Ian to Gerrit to me, fix it.

    Signed-off-by: Ian McDonald
    Signed-off-by: Arnaldo Carvalho de Melo

    Ian McDonald
     
  • This adds 3 sysctls which govern the retransmission behaviour of DCCP control
    packets (3way handshake, feature negotiation).

    It removes 4 FIXMEs from the code.

    The close resemblance of sysctl variables to their TCP analogues is emphasised
    not only by their name, but also by giving them the same initial values.
    This is useful since there is not much practical experience with DCCP yet.

    Furthermore, with regard to the previous patch, it is now possible to limit
    the number of keepalive-Responses by setting net.dccp.default.request_retries
    (also a bit like in TCP).

    Lastly, added documentation of all existing DCCP sysctls.

    Signed-off-by: Gerrit Renker
    Signed-off-by: Arnaldo Carvalho de Melo

    Gerrit Renker
     
  • Allow normal users to only choose among a restricted set of congestion
    control choices. The default is reno and what ever has been configured
    as default. But the policy can be changed by administrator at any time.

    For example, to allow any choice:
    cp /proc/sys/net/ipv4/tcp_available_congestion_control \
    /proc/sys/net/ipv4/tcp_allowed_congestion_control

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • Create /proc/sys/net/ipv4/tcp_available_congestion_control
    that reflects currently available TCP choices.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

06 Nov, 2006

2 commits

  • This patch takes the CTL_UNNUMBERD concept from NFS and makes it available to
    all new sysctl users.

    At the same time the sysctl binary interface maintenance documentation is
    updated to mention and to describe what is needed to successfully maintain the
    sysctl binary interface.

    Signed-off-by: Eric W. Biederman
    Acked-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Since it is becoming clear that there are just enough users of the binary
    sysctl interface that completely removing the binary interface from the kernel
    will not be an option for foreseeable future, we need to find a way to address
    the sysctl maintenance issues.

    The basic problem is that sysctl requires one central authority to allocate
    sysctl numbers, or else conflicts and ABI breakage occur. The proc interface
    to sysctl does not have that problem, as names are not densely allocated.

    By not terminating a sysctl table until I have neither a ctl_name nor a
    procname, it becomes simple to add sysctl entries that don't show up in the
    binary sysctl interface. Which allows people to avoid allocating a binary
    sysctl value when not needed.

    I have audited the kernel code and in my reading I have not found a single
    sysctl table that wasn't terminated by a completely zero filled entry. So
    this change in behavior should not affect anything.

    I think this mechanism eases the pain enough that combined with a little
    disciple we can solve the reoccurring sysctl ABI breakage.

    Signed-off-by: Eric W. Biederman
    Acked-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

27 Sep, 2006

1 commit

  • * 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (225 commits)
    [PATCH] Don't set calgary iommu as default y
    [PATCH] i386/x86-64: New Intel feature flags
    [PATCH] x86: Add a cumulative thermal throttle event counter.
    [PATCH] i386: Make the jiffies compares use the 64bit safe macros.
    [PATCH] x86: Refactor thermal throttle processing
    [PATCH] Add 64bit jiffies compares (for use with get_jiffies_64)
    [PATCH] Fix unwinder warning in traps.c
    [PATCH] x86: Allow disabling early pci scans with pci=noearly or disallowing conf1
    [PATCH] x86: Move direct PCI scanning functions out of line
    [PATCH] i386/x86-64: Make all early PCI scans dependent on CONFIG_PCI
    [PATCH] Don't leak NT bit into next task
    [PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder
    [PATCH] Fix some broken white space in ia32_signal.c
    [PATCH] Initialize argument registers for 32bit signal handlers.
    [PATCH] Remove all traces of signal number conversion
    [PATCH] Don't synchronize time reading on single core AMD systems
    [PATCH] Remove outdated comment in x86-64 mmconfig code
    [PATCH] Use string instructions for Core2 copy/clear
    [PATCH] x86: - restore i8259A eoi status on resume
    [PATCH] i386: Split multi-line printk in oops output.
    ...

    Linus Torvalds
     

26 Sep, 2006

3 commits

  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if the freeing of unmapped file backed pages is not enough to free
    enough pages to allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the nodes reclaimable slab pages
    (patch depends on earlier one that implements that counter)
    are more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not node
    specific. So we simply calculate what point in the slab we want to reach
    (current per node slab use minus the number of pages that neeed to be
    allocated) and then repeately run the global reclaim until that is
    unsuccessful or we have reached the limit. I hope we will have zone based
    slab reclaim at some point which will make that easier.

    The default for the min_slab_ratio is 5%

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • To quote Alan Cox:

    The default Linux behaviour on an NMI of either memory or unknown is to
    continue operation. For many environments such as scientific computing
    it is preferable that the box is taken out and the error dealt with than
    an uncorrected parity/ECC error get propogated.

    A small number of systems do generate NMI's for bizarre random reasons
    such as power management so the default is unchanged. In other respects
    the new proc/sys entry works like the existing panic controls already in
    that directory.

    This is separate to the edac support - EDAC allows supported chipsets to
    handle ECC errors well, this change allows unsupported cases to at least
    panic rather than cause problems further down the line.

    Signed-off-by: Don Zickus
    Signed-off-by: Andi Kleen

    Don Zickus
     
  • Adds a new /proc/sys/kernel/nmi call that will enable/disable the nmi
    watchdog.

    Signed-off-by: Don Zickus
    Signed-off-by: Andi Kleen

    Don Zickus
     

23 Sep, 2006

2 commits

  • We do not always need proxy NDP functionality even we
    enable forwarding.

    Signed-off-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    YOSHIFUJI Hideaki
     
  • Add support for the Commercial IP Security Option (CIPSO) to the IPv4
    network stack. CIPSO has become a de-facto standard for
    trusted/labeled networking amongst existing Trusted Operating Systems
    such as Trusted Solaris, HP-UX CMW, etc. This implementation is
    designed to be used with the NetLabel subsystem to provide explicit
    packet labeling to LSM developers.

    The CIPSO/IPv4 packet labeling works by the LSM calling a NetLabel API
    function which attaches a CIPSO label (IPv4 option) to a given socket;
    this in turn attaches the CIPSO label to every packet leaving the
    socket without any extra processing on the outbound side. On the
    inbound side the individual packet's sk_buff is examined through a
    call to a NetLabel API function to determine if a CIPSO/IPv4 label is
    present and if so the security attributes of the CIPSO label are
    returned to the caller of the NetLabel API function.

    Signed-off-by: Paul Moore
    Signed-off-by: David S. Miller

    Paul Moore
     

04 Jul, 2006

1 commit

  • It turns out that it is advantageous to leave a small portion of unmapped file
    backed pages if all of a zone's pages (or almost all pages) are allocated and
    so the page allocator has to go off-node.

    This allows recently used file I/O buffers to stay on the node and
    reduces the times that zone reclaim is invoked if file I/O occurs
    when we run out of memory in a zone.

    The problem is that zone reclaim runs too frequently when the page cache is
    used for file I/O (read write and therefore unmapped pages!) alone and we have
    almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped
    pages. File I/O will use these pages for the next read/write requests and the
    unmapped pages increase. After the zone has filled up again zone reclaim will
    remove it again after only 32 pages. This cycle is too inefficient and there
    are potentially too many zone reclaim cycles.

    With the 1% boundary we may still remove all unmapped pages for file I/O in
    zone reclaim pass. However. it will take a large number of read and writes
    to get back to 1% again where we trigger zone reclaim again.

    The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
    second timeout.

    [akpm@osdl.org: rename the /proc file and the variable]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter