16 Jan, 2009

1 commit

  • Often the cause of kernel unaligned access warnings is not
    obvious from just the ip displayed in the warning. This adds
    the option via proc to dump the stack in addition to the warning.
    The default is off (just display the 1 line warning). To enable
    the stack to be shown: echo 1 > /proc/sys/kernel/unaligned-dump-stack

    Signed-off-by: Doug Chapman
    Signed-off-by: Tony Luck

    Doug Chapman
     

14 Jan, 2009

1 commit


08 Jan, 2009

1 commit

  • NOMMU mmap allocates a piece of memory for an mmap that's rounded up in size to
    the nearest power-of-2 number of pages. Currently it then discards the excess
    pages back to the page allocator, making that memory available for use by other
    things. This can, however, cause a greater amount of fragmentation.

    To counter this, a sysctl is added in order to fine-tune the trimming
    behaviour. The default behaviour remains to trim pages aggressively, while
    this can either be disabled completely or set to a higher page-granular
    watermark in order to have finer-grained control.
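
The trimming policy described above can be sketched as a toy model. This is not the kernel code: `round_up_pow2`, `pages_to_free` and the `trim_watermark` parameter are hypothetical names, and the reading of the watermark (0 disables trimming, otherwise the excess is returned only once it reaches the watermark) is an assumption drawn from the description.

```python
def round_up_pow2(n_pages):
    """Smallest power of two that is >= n_pages (n_pages must be >= 1)."""
    return 1 << (n_pages - 1).bit_length()


def pages_to_free(requested_pages, trim_watermark=1):
    """Return how many excess pages would go back to the page allocator.

    trim_watermark == 0  -> never trim (assumed meaning of "disabled")
    trim_watermark == 1  -> trim any excess (the aggressive default)
    trim_watermark  > 1  -> trim only if the excess reaches the watermark
    """
    allocated = round_up_pow2(requested_pages)
    excess = allocated - requested_pages
    if trim_watermark == 0:
        return 0
    if excess >= trim_watermark:
        return excess
    return 0
```

Under these assumptions, a 5-page request is backed by 8 pages and 3 are returned by default, while a watermark of 4 would keep all 8 pages together.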

    vm region vm_top bits taken from an earlier patch by David Howells.

    Signed-off-by: Paul Mundt
    Signed-off-by: David Howells
    Tested-by: Mike Frysinger

    Paul Mundt
     

07 Jan, 2009

1 commit

  • This change introduces two new sysctls to /proc/sys/vm:
    dirty_background_bytes and dirty_bytes.

    dirty_background_bytes is the counterpart to dirty_background_ratio and
    dirty_bytes is the counterpart to dirty_ratio.

    With growing memory capacities of individual machines, it's no longer
    sufficient to specify dirty thresholds as a percentage of the amount of
    dirtyable memory over the entire system.

    dirty_background_bytes and dirty_bytes specify quantities of memory, in
    bytes, that represent the dirty limits for the entire system. If either
    of these values is set, its value represents the amount of dirty memory
    that is needed to commence either background or direct writeback.

    When a `bytes' or `ratio' file is written, its counterpart becomes a
    function of the written value. For example, if dirty_bytes is written to
    be 8192, 8K of memory is required to commence direct writeback.
    dirty_ratio is then functionally equivalent to 8K / the amount of
    dirtyable memory:

    dirtyable_memory = free pages + mapped pages + file cache

    dirty_background_bytes = dirty_background_ratio * dirtyable_memory
    -or-
    dirty_background_ratio = dirty_background_bytes / dirtyable_memory

    AND

    dirty_bytes = dirty_ratio * dirtyable_memory
    -or-
    dirty_ratio = dirty_bytes / dirtyable_memory

    Only one of dirty_background_bytes and dirty_background_ratio may be
    specified at a time, and only one of dirty_bytes and dirty_ratio may be
    specified. When one sysctl is written, the other appears as 0 when read.

    The `bytes' files operate on a page size granularity since dirty limits
    are compared with ZVC values, which are in page units.

    Prior to this change, the minimum dirty_ratio was 5 as implemented by
    get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
    written value between 0 and 100. This restriction is maintained, but
    dirty_bytes has a lower limit of only one page.

    Also prior to this change, the dirty_background_ratio could not equal or
    exceed dirty_ratio. This restriction is maintained in addition to
    restricting dirty_background_bytes. If either background threshold equals
    or exceeds that of the dirty threshold, it is implicitly set to half the
    dirty threshold.
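
The rules above can be summarised in a small model. This is a sketch under stated assumptions, not the kernel's get_dirty_limits(): the function name `dirty_thresholds`, the 4096-byte page size, and rounding a byte value up to whole pages are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size, in bytes


def dirty_thresholds(dirtyable_pages, dirty_ratio=0, dirty_bytes=0,
                     background_ratio=0, background_bytes=0):
    """Return (background, dirty) thresholds, both in pages.

    For each pair, a non-zero bytes value takes precedence over the ratio,
    mirroring "only one of the two may be specified at a time".
    """
    def limit(ratio, nbytes, min_ratio):
        if nbytes:
            # bytes files work at page granularity; round up (assumption),
            # which also gives the one-page lower limit
            return -(-nbytes // PAGE_SIZE)
        return max(ratio, min_ratio) * dirtyable_pages // 100

    dirty = limit(dirty_ratio, dirty_bytes, 5)          # dirty_ratio floor of 5
    background = limit(background_ratio, background_bytes, 0)
    if background >= dirty:
        # background may not equal or exceed dirty: use half of dirty
        background = dirty // 2
    return background, dirty
```

For example, with 1000 dirtyable pages, `dirty_bytes=8192` gives a dirty threshold of 2 pages, and a background ratio of 10 (100 pages) is then cut down to 1 page.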

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

29 Dec, 2008

2 commits

  • Conflicts:
    arch/sparc64/kernel/idprom.c

    David S. Miller
     
  • …el/git/tip/linux-2.6-tip

    * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
    sched, trace: update trace_sched_wakeup()
    tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
    Revert "x86: disable X86_PTRACE_BTS"
    ring-buffer: prevent false positive warning
    ring-buffer: fix dangling commit race
    ftrace: enable format arguments checking
    x86, bts: memory accounting
    x86, bts: add fork and exit handling
    ftrace: introduce tracing_reset_online_cpus() helper
    tracing: fix warnings in kernel/trace/trace_sched_switch.c
    tracing: fix warning in kernel/trace/trace.c
    tracing/ring-buffer: remove unused ring_buffer size
    trace: fix task state printout
    ftrace: add not to regex on filtering functions
    trace: better use of stack_trace_enabled for boot up code
    trace: add a way to enable or disable the stack tracer
    x86: entry_64 - introduce FTRACE_ frame macro v2
    tracing/ftrace: add the printk-msg-only option
    tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
    x86, bts: correctly report invalid bts records
    ...

    Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
    being already partly merged by the SH merge.

    Linus Torvalds
     

18 Dec, 2008

1 commit

  • Impact: enhancement to stack tracer

    The stack tracer is currently either on when configured in or
    off when it is not. It cannot be disabled once it is configured in
    (other than by disabling the function tracer that it uses).

    This patch adds a way to enable or disable the stack tracer at
    run time. It defaults off on bootup, but a kernel parameter 'stacktrace'
    has been added to enable it on bootup.

    A new sysctl has been added "kernel.stack_tracer_enabled" to let
    the user enable or disable the stack tracer at run time.
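
As an illustration of how a dotted sysctl name such as "kernel.stack_tracer_enabled" corresponds to a file under /proc/sys, here is a minimal sketch. The helper names `sysctl_path` and `write_sysctl` are hypothetical, and the simple dot-to-slash mapping ignores names whose components themselves contain dots.

```python
def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its path under /proc/sys."""
    return root + "/" + name.replace(".", "/")


def write_sysctl(name, value, root="/proc/sys"):
    """Write a value to a sysctl file (normally requires root privileges)."""
    with open(sysctl_path(name, root), "w") as f:
        f.write(f"{value}\n")
```

On a kernel carrying this patch, `write_sysctl("kernel.stack_tracer_enabled", 1)` would be expected to turn the stack tracer on at run time.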

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

05 Dec, 2008

1 commit

  • Add a sysctl to tweak the RSS limit used to decide when to grow
    the TSB for an address space.

    In order to avoid expensive divides and multiplies, only simple
    positive and negative powers of two are supported.

    The function computed takes the number of TSB translations that will
    fit at one time in the TSB of a given size, and either adds or
    subtracts a percentage of entries. This final value is the
    RSS limit.

    See tsb_size_to_rss_limit().
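
The computation described above can be sketched with shifts only. This is an illustrative model rather than the code of tsb_size_to_rss_limit(): the name `tsb_rss_limit`, the `ratio_shift` parameter with its sign convention, and the 16-byte entry size are assumptions.

```python
def tsb_rss_limit(tsb_bytes, ratio_shift, entry_bytes=16):
    """RSS limit for a TSB of the given size.

    entries is the number of translations that fit in the TSB. A negative
    ratio_shift subtracts entries / 2**|shift|, a non-negative one adds
    entries / 2**shift, so no multiply or divide by the ratio is needed.
    """
    entries = tsb_bytes // entry_bytes
    if ratio_shift < 0:
        return entries - (entries >> -ratio_shift)
    return entries + (entries >> ratio_shift)
```

With an 8192-byte TSB (512 entries under the assumed entry size), a shift of -2 yields a limit of 384 (75% of the entries) and a shift of 1 yields 768 (150%).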

    Signed-off-by: David S. Miller

    David S. Miller
     

04 Dec, 2008

2 commits


02 Dec, 2008

1 commit

  • It has been thought that the per-user file descriptors limit would also
    limit the resources that a normal user can request via the epoll
    interface. Vegard Nossum reported a very simple program (a modified
    version attached) that can make a normal user request a pretty large
    amount of kernel memory, well within its maximum number of fds. To
    solve this problem, default limits are now imposed, and /proc-based
    configuration has been introduced. A new directory has been created,
    named /proc/sys/fs/epoll/, and inside it there are two configuration
    points:

    max_user_instances = Maximum number of devices - per user

    max_user_watches = Maximum number of "watched" fds - per user

    The current default for "max_user_watches" limits the memory used by epoll
    to store "watches" to 1/32 of the amount of low RAM. As an example, a
    256MB 32-bit machine will have "max_user_watches" set to roughly 90000.
    That should be enough to not break existing heavy epoll users. The
    default value for "max_user_instances" is set to 128, which should be
    enough too.

    This also changes the userspace, because a new error code can now come out
    from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
    listed, so that should be ok.
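
The arithmetic behind the default can be sketched as follows. The message does not state the per-watch memory cost, so `per_watch_cost` is a placeholder parameter and the value 93 used in the usage note is an assumption picked only so the result lands near the quoted figure; `default_max_user_watches` is a hypothetical name.

```python
def default_max_user_watches(low_ram_bytes, per_watch_cost):
    """Number of watches that fit in 1/32 of low RAM.

    per_watch_cost is the assumed number of bytes of kernel memory
    consumed by one watched fd.
    """
    budget = low_ram_bytes // 32
    return budget // per_watch_cost
```

For a 256MB machine the budget is 8MB, so `default_max_user_watches(256 * 1024 * 1024, 93)` gives 90200, consistent with "roughly 90000".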

    [akpm@linux-foundation.org: use get_current_user()]
    Signed-off-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: Cyrill Gorcunov
    Reported-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

14 Nov, 2008

1 commit

  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Cc: linux-audit@redhat.com
    Cc: containers@lists.linux-foundation.org
    Cc: linux-mm@kvack.org
    Signed-off-by: James Morris

    David Howells
     

04 Nov, 2008

1 commit


31 Oct, 2008

1 commit


27 Oct, 2008

2 commits

  • Impact: add (default-off) dump-trace-on-oops flag

    Currently, ftrace is set up to dump its contents to the console if the
    kernel panics or oopses. This can be annoying if you have trace data in
    the buffers and you experience an oops, but the trace data is old or
    static.

    Usually you want ftrace to dump its contents only when you are debugging
    your system and have set up ftrace to trace the events leading to
    an oops.

    This patch adds a control variable called "ftrace_dump_on_oops" that will
    enable the ftrace dump to console on oops. This variable is off by default,
    but a developer can enable it either through the kernel command line
    by adding "ftrace_dump_on_oops" or at run time by setting (or disabling)
    /proc/sys/kernel/ftrace_dump_on_oops.

    v2:

    Replaced /** with /* as Randy explained that kernel-doc does
    not yet handle variables.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     
  • Ingo Molnar
     

24 Oct, 2008

1 commit

  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: disable the hrtick for now
    sched: revert back to per-rq vruntime
    sched: fair scheduler should not resched rt tasks
    sched: optimize group load balancer
    sched: minor fast-path overhead reduction
    sched: fix the wrong mask_len, cleanup
    sched: kill unused scheduler decl.
    sched: fix the wrong mask_len
    sched: only update rq->clock while holding rq->lock

    Linus Torvalds
     

22 Oct, 2008

1 commit


21 Oct, 2008

1 commit

  • Due to confusion between the ftrace infrastructure and the gcc profiling
    tracer "ftrace", this patch renames the config options from FTRACE to
    FUNCTION_TRACER. The other two names that are offspring of FTRACE,
    DYNAMIC_FTRACE and FTRACE_MCOUNT_RECORD, will stay the same.

    This patch was generated mostly by script, and partially by hand.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

20 Oct, 2008

2 commits

  • This patch adds a function to scan individual or all zones' unevictable
    lists and move any pages that have become evictable onto the respective
    zone's inactive list, where shrink_inactive_list() will deal with them.

    Adds a sysctl to scan all nodes, and per-node attributes to scan individual
    nodes' zones.

    Kosaki: If an evictable page is found on the unevictable LRU when writing to
    /proc/sys/vm/scan_unevictable_pages, print the filename and file offset of
    these pages.

    [akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
    [kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • I noticed that tg_shares_up() unconditionally takes rq-locks for all cpus
    in the sched_domain. This hurts.

    We need the rq-locks whenever we change the weight of the per-cpu group sched
    entities. To alleviate this a little, only change the weight when the new
    weight is at least shares_thresh away from the old value.

    This avoids the rq-lock for the top level entries, since those will never
    be re-weighted, and fuzzes the lower level entries a little to gain performance
    in semi-stable situations.
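
The thresholding idea can be shown with a small model. This is only a sketch: `maybe_update_weight` is a hypothetical name, and the comparison follows the wording "at least shares_thresh away", which may differ from the exact comparison used in the kernel.

```python
def maybe_update_weight(old_weight, new_weight, shares_thresh):
    """Return (weight, changed).

    The weight is replaced only when it moved by at least shares_thresh;
    in the kernel, that is the case where the rq-lock would be taken.
    """
    if abs(new_weight - old_weight) >= shares_thresh:
        return new_weight, True
    return old_weight, False
```

Small fluctuations such as 1024 to 1026 with a threshold of 4 leave the weight untouched, so no lock would be needed in that case.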

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

17 Oct, 2008

3 commits

  • This patch adds the CONFIG_AIO option, which allows support for
    asynchronous I/O operations to be removed. These operations are not
    necessarily used by applications, particularly on embedded devices. As
    this is a size-reduction option, it depends on CONFIG_EMBEDDED. It saves
    ~7 kilobytes of kernel code/data:

       text    data     bss     dec     hex filename
    1115067  119180  217088 1451335  162547 vmlinux
    1108025  119048  217088 1444161  160941 vmlinux.new
      -7042    -132       0   -7174   -1C06 +/-

    This patch was originally written by Matt Mackall and is part of the
    Linux Tiny project.

    [randy.dunlap@oracle.com: build fix]
    Signed-off-by: Thomas Petazzoni
    Cc: Benjamin LaHaise
    Cc: Zach Brown
    Signed-off-by: Matt Mackall
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Petazzoni
     
  • The name and nlen parameters passed to the ->strategy hook are unused, so
    remove them. In general, a ->strategy hook should know what it's doing and
    should not do anything tricky for which, say, a pointer to the original
    userspace array (name) may be needed.

    Signed-off-by: Alexey Dobriyan
    Acked-by: David S. Miller [ networking bits ]
    Cc: Ralf Baechle
    Cc: David Howells
    Cc: Matt Mackall
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • It's somewhat unlikely that it happens, but right now a race window
    between interrupts or machine checks or oopses could corrupt the tainted
    bitmap because it is modified in a non atomic fashion.

    Convert the taint variable to an unsigned long and use only atomic bit
    operations on it.

    Unfortunately this means the intvec sysctl functions cannot be used on it
    anymore.

    It turned out the taint sysctl handler could actually be simplified a bit
    (since it only increases capabilities) so this patch actually removes
    code.

    [akpm@linux-foundation.org: remove unneeded include]
    Signed-off-by: Andi Kleen
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

15 Oct, 2008

1 commit

  • * 'for-2.6.28' of git://linux-nfs.org/~bfields/linux: (59 commits)
    svcrdma: Fix IRD/ORD polarity
    svcrdma: Update svc_rdma_send_error to use DMA LKEY
    svcrdma: Modify the RPC reply path to use FRMR when available
    svcrdma: Modify the RPC recv path to use FRMR when available
    svcrdma: Add support to svc_rdma_send to handle chained WR
    svcrdma: Modify post recv path to use local dma key
    svcrdma: Add a service to register a Fast Reg MR with the device
    svcrdma: Query device for Fast Reg support during connection setup
    svcrdma: Add FRMR get/put services
    NLM: Remove unused argument from svc_addsock() function
    NLM: Remove "proto" argument from lockd_up()
    NLM: Always start both UDP and TCP listeners
    lockd: Remove unused fields in the nlm_reboot structure
    lockd: Add helper to sanity check incoming NOTIFY requests
    lockd: change nlmclnt_grant() to take a "struct sockaddr *"
    lockd: Adjust nlmsvc_lookup_host() to accomodate AF_INET6 addresses
    lockd: Adjust nlmclnt_lookup_host() signature to accomodate non-AF_INET
    lockd: Support non-AF_INET addresses in nlm_lookup_host()
    NLM: Convert nlm_lookup_host() to use a single argument
    svcrdma: Add Fast Reg MR Data Types
    ...

    Linus Torvalds
     

14 Oct, 2008

1 commit

  • * 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc:
    proc: remove kernel.maps_protect
    proc: remove now unneeded ADDBUF macro
    [PATCH] proc: show personality via /proc/pid/personality
    [PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock()
    proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig
    proc: make grab_header() static
    proc: remove unused get_dma_list()
    proc: remove dummy vmcore_open()
    proc: proc_sys_root tweak
    proc: fix return value of proc_reg_open() in "too late" case

    Fixed up trivial conflict in removed file arch/sparc/include/asm/dma_32.h

    Linus Torvalds
     

10 Oct, 2008

1 commit

  • After commit 831830b5a2b5d413407adf380ef62fe17d6fcbf2 aka
    "restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace"
    the sysctl stopped being relevant because that commit moved security checks
    from ->show time to ->start time (mm_for_maps()).

    Signed-off-by: Alexey Dobriyan
    Acked-by: Kees Cook

    Alexey Dobriyan
     

30 Sep, 2008

1 commit

  • This patch adds the CONFIG_FILE_LOCKING option, which allows support for
    advisory locks to be removed. With this support removed, the flock()
    system call, the F_GETLK, F_SETLK and F_SETLKW operations of fcntl()
    and NFS support are disabled. These features are not necessarily needed
    on embedded systems. It saves ~11 KB of kernel code and data:

       text    data     bss     dec     hex filename
    1125436  118764  212992 1457192  163c28 vmlinux.old
    1114299  118564  212992 1445855  160fdf vmlinux
     -11137    -200       0  -11337   -2C49 +/-

    This patch was originally written by Matt Mackall and is part of the
    Linux Tiny project.

    Signed-off-by: Thomas Petazzoni
    Signed-off-by: Matt Mackall
    Cc: matthew@wil.cx
    Cc: linux-fsdevel@vger.kernel.org
    Cc: mpm@selenic.com
    Cc: akpm@linux-foundation.org
    Signed-off-by: J. Bruce Fields

    Thomas Petazzoni
     

17 Sep, 2008

1 commit


12 Sep, 2008

2 commits


05 Sep, 2008

1 commit

  • We should've set refcount on the root sysctl table; otherwise we'll blow
    up the first time we get down to zero dynamically registered sysctl
    tables.

    Signed-off-by: Al Viro
    Tested-by: James Bottomley
    Signed-off-by: Linus Torvalds

    Al Viro
     

28 Jul, 2008

1 commit

  • try_attach() should walk into the matching subdirectory, not the first one...

    Signed-off-by: Al Viro
    Tested-by: Valdis.Kletnieks@vt.edu
    Tested-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Al Viro
     

27 Jul, 2008

5 commits

  • * kill nameidata * argument; map the 3 bits in ->flags anybody cares
    about to new MAY_... ones and pass with the mask.
    * kill redundant gfs2_iop_permission()
    * sanitize ecryptfs_permission()
    * fix remaining places where ->permission() instances might barf on new
    MAY_... found in mask.

    The obvious next target in that direction is permission(9)

    folded fix for nfs_permission() breakage from Miklos Szeredi

    Signed-off-by: Al Viro

    Al Viro
     
  • * keep references to ctl_table_head and ctl_table in /proc/sys inodes
    * grab the former during operations, use the latter for access to
    entry if that succeeds
    * have ->d_compare() check if table should be seen for one who does lookup;
    that allows us to avoid flipping inodes - if we have the same name resolve
    to different things, we'll just keep several dentries and ->d_compare()
    will reject the wrong ones.
    * have ->lookup() and ->readdir() scan the table of our inode first, then
    walk all ctl_table_header and scan ->attached_by for those that are
    attached to our directory.
    * implement ->getattr().
    * get rid of insane amounts of tree-walking
    * get rid of the need to know dentry in ->permission() and of the contortions
    induced by that.

    Signed-off-by: Al Viro

    Al Viro
     
  • In a sense, that's the heart of the series. It's based on the following
    property of the trees we are actually asked to add: they can be split into
    stem that is already covered by registered trees and crown that is entirely
    new. IOW, if a/b and a/c/d are introduced by our tree, then a/c is also
    introduced by it.

    That allows us to associate a tree and a table entry with each node in the
    union; while directory nodes might be covered by many trees, only one will
    cover the node by its crown. And that will allow much saner logic for
    /proc/sys in the next patches. This patch introduces the data structures
    needed to keep track of that.

    When adding a sysctl table, we find a "parent" one. Which is to say,
    find the deepest node on its stem that already is present in one of the
    tables from our table set or its ancestor sets. That table will be our
    parent and that node in it - attachment point. Add our table to list
    anchored in parent, have it refer the parent and contents of attachment
    point. Also remember where its crown lives.
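
The stem/crown split can be illustrated with a toy model in which directories are tuples of path components. This is a sketch only: `attachment_point` and the set-of-tuples representation are hypothetical and do not reflect the kernel's ctl_table structures or its handling of table sets.

```python
def attachment_point(registered_dirs, new_path):
    """Split new_path into (stem, crown).

    registered_dirs: set of tuples, each a directory already present in
                     some registered table, e.g. ("net", "ipv4").
    new_path:        tuple of directory names, from the root, that the
                     new table wants to populate.
    The stem is the deepest prefix that already exists; its last node is
    the attachment point. The crown is the remainder, which is entirely new.
    """
    depth = 0
    while depth < len(new_path) and new_path[:depth + 1] in registered_dirs:
        depth += 1
    return new_path[:depth], new_path[depth:]
```

For instance, with `("net",)` and `("net", "ipv4")` already registered, adding `("net", "ipv4", "foo", "bar")` attaches at `net/ipv4` and introduces `foo/bar` as its crown.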

    Signed-off-by: Al Viro

    Al Viro
     
  • Refcount the sucker; instead of freeing it by the end of unregistration
    just drop the refcount and free only when it hits zero. Make sure that
    we _always_ make ->unregistering non-NULL in start_unregistering().

    That allows anybody to get a reference to such puppy, preventing its
    freeing and reuse. It does *not* block unregistration. Anybody who
    holds such a reference can
    * try to grab a "use" reference (ctl_head_grab()); that will
    succeed if and only if it hadn't entered unregistration yet. If it
    succeeds, we can use it in all normal ways until we release the "use"
    reference (with ctl_head_finish()). Note that this relies on having
    ->unregistering become non-NULL in all cases when one starts to unregister
    the sucker.
    * keep pointers to ctl_table entries; they *can* be freed if
    the entire thing is unregistered. However, if ctl_head_grab() succeeds,
    we know that unregistration had not happened (and will not happen until
    ctl_head_finish()) and such pointers can be used safely.

    IOW, now we can have inodes under /proc/sys keep references to ctl_table
    entries, protecting them with references to ctl_table_header and
    grabbing the latter for the duration of operations that require access
    to ctl_table. That won't cause deadlocks, since unregistration will not
    be stopped by merely keeping a reference to ctl_table_header.
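
The two kinds of reference can be modelled in a few lines. This is a single-threaded sketch with hypothetical names (`Header`, `get`, `put`, `grab`, `finish`, `unregister`); the real code uses locking and waits for active users to drain, which the model replaces with an error.

```python
class Header:
    """Toy model of a refcounted header with separate "use" references."""

    def __init__(self):
        self.count = 1              # plain references (registration holds one)
        self.used = 0               # active "use" references
        self.unregistering = False
        self.freed = False

    def get(self):
        """Take a plain reference; never blocks unregistration."""
        self.count += 1

    def put(self):
        """Drop a plain reference; free only when the count hits zero."""
        self.count -= 1
        if self.count == 0:
            self.freed = True

    def grab(self):
        """Try to take a "use" reference; fails once unregistration began."""
        if self.unregistering:
            return False
        self.used += 1
        return True

    def finish(self):
        """Release a "use" reference taken with grab()."""
        self.used -= 1

    def unregister(self):
        """Mark as unregistering and drop the registration's reference."""
        if self.used:
            # the real code would wait for users here; the model refuses
            raise RuntimeError("header still in use")
        self.unregistering = True
        self.put()
```

A holder of a plain reference keeps the object from being freed after `unregister()`, but any later `grab()` fails, so entries are never used after unregistration.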

    Signed-off-by: Al Viro

    Al Viro
     
  • New object: set of sysctls [currently - root and per-net-ns].
    Contains: pointer to parent set, list of tables and "should I see this set?"
    method (->is_seen(set)).
    Current lists of tables are subsumed by that; net-ns contains such a beast.
    ->lookup() for ctl_table_root returns pointer to ctl_table_set instead of
    that to ->list of that ctl_table_set.

    [folded compile fixes by rdd for configs without sysctl]

    Signed-off-by: Al Viro

    Al Viro
     

26 Jul, 2008

1 commit

  • All ratelimit users share the same jiffies and burst params, so some messages
    (callbacks) will be lost.

    For example:
    a calls printk_ratelimit(5 * HZ, 1)
    b calls printk_ratelimit(5 * HZ, 1) before the 5*HZ timeout of a; b will then
    be suppressed.

    - rewrite __ratelimit, and use a ratelimit_state as parameter. Thanks to
    Andrew for the hints.

    - Add WARN_ON_RATELIMIT, update rcupreempt.h

    - remove __printk_ratelimit

    - use __ratelimit in net_ratelimit
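
The effect of giving each caller its own state can be sketched as follows. This is a simplified model, not the kernel's __ratelimit(): `RatelimitState` and `ratelimit` are illustrative names, time is passed in explicitly instead of read from jiffies, and the exact window-boundary comparison is an assumption.

```python
class RatelimitState:
    """Per-caller rate limit state: at most burst events per interval."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.begin = None     # start of the current window
        self.printed = 0      # events allowed in the current window
        self.missed = 0       # events suppressed in the current window


def ratelimit(rs, now):
    """Return True if the caller owning rs may emit its message at time now."""
    if rs.begin is None or now - rs.begin >= rs.interval:
        # start a new window
        rs.begin = now
        rs.printed = 0
        rs.missed = 0
    if rs.printed < rs.burst:
        rs.printed += 1
        return True
    rs.missed += 1
    return False
```

With separate states for callers a and b, a message from b inside a's window is no longer suppressed, which is the problem described in the example above.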

    Signed-off-by: Dave Young
    Cc: "David S. Miller"
    Cc: "Paul E. McKenney"
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

25 Jul, 2008

1 commit

  • Add basic support for more than one hstate in hugetlbfs. This is the key
    to supporting multiple hugetlbfs page sizes at once.

    - Rather than a single hstate, we now have an array, with an iterator
    - default_hstate continues to be the struct hstate which we use by default
    - Add functions for architectures to register new hstates

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen