17 Apr, 2015

1 commit

  • File /proc/sys/kernel/threads-max controls the maximum number of threads
    that can be created using fork().

    [akpm@linux-foundation.org: fix typo, per Guenter]
    Signed-off-by: Heinrich Schuchardt
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heinrich Schuchardt
     

16 Apr, 2015

1 commit

  • Currently, pages which are marked as unevictable are protected from
    compaction, but not from other types of migration. The POSIX real time
    extension explicitly states that mlock() will prevent a major page
    fault, but the spirit of this is that mlock() should give a process the
    ability to control sources of latency, including minor page faults.
    However, the mlock manpage only explicitly says that a locked page will
    not be written to swap and this can cause some confusion. The
    compaction code today does not give a developer who wants to avoid swap
    but wants to have large contiguous areas available any method to achieve
    this state. This patch introduces a sysctl for controlling compaction
    behavior with respect to the unevictable lru. Users who demand no page
    faults after a page is present can set compact_unevictable_allowed to 0
    and users who need the large contiguous areas can enable compaction on
    locked memory by leaving the default value of 1.

    To illustrate this problem I wrote a quick test program that mmaps a
    large number of 1MB files filled with random data. These maps are
    created locked and read only. Then every other mmap is unmapped and I
    attempt to allocate huge pages to the static huge page pool. When the
    compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
    after fragmenting memory. When the value is set to 1, allocations
    succeed.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     

15 Apr, 2015

1 commit

  • With the current user interface of the watchdog mechanism it is only
    possible to disable or enable both lockup detectors at the same time.
    This series introduces new kernel parameters and changes the semantics of
    some existing kernel parameters, so that the hard lockup detector and the
    soft lockup detector can be disabled or enabled individually. With this
    series applied, the user interface is as follows.

    - parameters in /proc/sys/kernel

    . soft_watchdog
    This is a new parameter to control and examine the run state of
    the soft lockup detector.

    . nmi_watchdog
    The semantics of this parameter have changed. It can now be used
    to control and examine the run state of the hard lockup detector.

    . watchdog
    This parameter is still available to control the run state of both
    lockup detectors at the same time. If this parameter is examined,
    it shows the logical OR of soft_watchdog and nmi_watchdog.

    . watchdog_thresh
    The semantics of this parameter are not affected by the patch.

    - kernel command line parameters

    . nosoftlockup
    The semantics of this parameter have changed. It can now be used
    to disable the soft lockup detector at boot time.

    . nmi_watchdog=0 or nmi_watchdog=1
    Disable or enable the hard lockup detector at boot time. The patch
    introduces '=1' as a new option.

    . nowatchdog
    The semantics of this parameter are not affected by the patch. It
    is still available to disable both lockup detectors at boot time.
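
    The relationship between the three /proc/sys/kernel parameters can be
    modelled as a two-bit mask. This is only an illustrative sketch: the
    enum names, the struct and the helper functions are hypothetical and
    are not the kernel's identifiers.

```c
#include <stdbool.h>

/* Hypothetical bit names for this sketch; not taken from the kernel. */
enum {
    SOFT_WATCHDOG_BIT = 1 << 0,   /* soft lockup detector */
    NMI_WATCHDOG_BIT  = 1 << 1,   /* hard lockup detector */
};

struct watchdog_state {
    unsigned int enabled;         /* mask of the bits above */
};

/* Writing soft_watchdog changes only the soft lockup detector. */
void set_soft_watchdog(struct watchdog_state *s, bool on)
{
    if (on)
        s->enabled |= SOFT_WATCHDOG_BIT;
    else
        s->enabled &= ~SOFT_WATCHDOG_BIT;
}

/* Writing nmi_watchdog changes only the hard lockup detector. */
void set_nmi_watchdog(struct watchdog_state *s, bool on)
{
    if (on)
        s->enabled |= NMI_WATCHDOG_BIT;
    else
        s->enabled &= ~NMI_WATCHDOG_BIT;
}

/* Writing watchdog changes both detectors at the same time. */
void set_watchdog(struct watchdog_state *s, bool on)
{
    set_soft_watchdog(s, on);
    set_nmi_watchdog(s, on);
}

int get_soft_watchdog(const struct watchdog_state *s)
{
    return (s->enabled & SOFT_WATCHDOG_BIT) != 0;
}

int get_nmi_watchdog(const struct watchdog_state *s)
{
    return (s->enabled & NMI_WATCHDOG_BIT) != 0;
}

/* Reading watchdog yields the logical OR of the two detectors. */
int get_watchdog(const struct watchdog_state *s)
{
    return get_soft_watchdog(s) || get_nmi_watchdog(s);
}
```

    With this model, disabling only the hard lockup detector leaves
    get_watchdog() at 1 for as long as the soft lockup detector runs.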

    Also, remove the proc_dowatchdog() function which is no longer needed.

    [dzickus@redhat.com: wrote changelog]
    [dzickus@redhat.com: update documentation for kernel params and sysctl]
    Signed-off-by: Ulrich Obergfell
    Signed-off-by: Don Zickus
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     

12 Feb, 2015

3 commits

  • Merge second set of updates from Andrew Morton:
    "More of MM"

    * emailed patches from Andrew Morton: (83 commits)
    mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
    mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
    vmstat: Reduce time interval to stat update on idle cpu
    mm/page_owner.c: remove unnecessary stack_trace field
    Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files
    mm: incorporate read-only pages into transparent huge pages
    vmstat: do not use deferrable delayed work for vmstat_update
    mm: more aggressive page stealing for UNMOVABLE allocations
    mm: always steal split buddies in fallback allocations
    mm: when stealing freepages, also take pages created by splitting buddy page
    mincore: apply page table walker on do_mincore()
    mm: /proc/pid/clear_refs: avoid split_huge_page()
    mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)
    mempolicy: apply page table walker on queue_pages_range()
    arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()
    memcg: cleanup preparation for page table walk
    numa_maps: remove numa_maps->vma
    numa_maps: fix typo in gather_hugetbl_stats
    pagemap: use walk->vma instead of calling find_vma()
    clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()
    ...

    Linus Torvalds
     
  • Dave noticed that an unprivileged process can allocate a significant
    amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
    oom-killer and memory cgroup. The trick is to allocate a lot of PMD page
    tables. The Linux kernel doesn't account PMD tables to the process, only
    PTE tables.

    The test case below uses a few tricks to allocate a lot of PMD page tables
    while keeping VmRSS and VmPTE low. oom_score for the process will be 0.

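    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
            char *addr = NULL;
            unsigned long i;

            /* Disable THP so the touched page is mapped through a PMD table. */
            prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
            for (i = 0; i < NR_PUD; i++) {
                    /* Ask for the next 1 GiB region after the previous one. */
                    addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                                MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
                    if (addr == MAP_FAILED) {
                            perror("mmap");
                            break;
                    }
                    /* Touch one byte so the page tables get populated. */
                    *addr = 'x';
                    /* Unmap the touched 2 MiB, keeping VmRSS and VmPTE low. */
                    munmap(addr, PMD_SIZE);
                    /* Map the hole again without touching it. */
                    if (mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                             MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0) == MAP_FAILED) {
                            perror("re-mmap");
                            exit(1);
                    }
            }
            /* One 4096-byte PMD table per successfully mapped region. */
            printf("PID %d consumed %lu KiB in PMD page tables\n",
                   getpid(), i * 4096 >> 10);
            return pause();
    }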

    The patch addresses the issue by accounting PMD tables to the process
    the same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(). But there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by
    accounting the table to all processes that share it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the
    sanity check on exit(2).

    Accounting only happens on configurations where the PMD page table level
    is present (PMD is not folded). As with nr_ptes, we use a per-mm counter.
    The counter value is used to calculate the baseline for the badness score
    by the oom-killer.
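
    The arithmetic behind the ">500 MiB" figure, and the idea of adding the
    PMD counter to the badness baseline, can be sketched as follows. The
    struct and function names are hypothetical, and the assumption that the
    baseline is a plain sum of the per-mm counters is a simplification made
    for this sketch, not a copy of the kernel code.

```c
#include <stdint.h>

/* KiB held in page tables when each of nr_tables tables is table_bytes big. */
uint64_t page_tables_kib(uint64_t nr_tables, uint64_t table_bytes)
{
    return (nr_tables * table_bytes) >> 10;
}

/* Hypothetical per-mm counters, all expressed in pages. */
struct mm_counters {
    uint64_t rss_pages;      /* resident pages */
    uint64_t swap_entries;   /* swapped-out pages */
    uint64_t nr_ptes;        /* PTE page tables */
    uint64_t nr_pmds;        /* PMD page tables (the newly added counter) */
};

/*
 * Assumed baseline for the badness score: memory the task pins, now
 * including its PMD tables. Without nr_pmds the test program from the
 * changelog would contribute almost nothing here.
 */
uint64_t badness_baseline(const struct mm_counters *mm)
{
    return mm->rss_pages + mm->swap_entries + mm->nr_ptes + mm->nr_pmds;
}
```

    For the test program above, page_tables_kib(130000, 4096) gives
    520000 KiB, which is a little under 508 MiB.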

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Pull documentation updates from Jonathan Corbet:
    "Highlights this time around include:

    - A thrashing of SubmittingPatches to bring it out of the "send
    everything to Linus" era of kernel development.

    - A new document on completions from Nicholas McGuire

    - Lots of typo fixes, formatting improvements, corrections, build
    fixes, and more"

    * tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (35 commits)
    Documentation: Fix the wrong command `echo -1 > set_ftrace_pid` for cleaning the filter.
    can-doc: Fixed a wrong filepath in can.txt
    Documentation: Fix trivial typo in comment.
    kgdb,docs: Fix typo and minor style issues
    Documentation: add description for FTRACE probe status
    doc: brief user documentation for completion
    Documentation/misc-devices/mei: Fix indentation of embedded code.
    Documentation/misc-devices/mei: Fix indentation of enumeration.
    Documentation/misc-devices/mei: Fix spacing around parentheses.
    Documentation/misc-devices/mei: Fix formatting of headings.
    Documentation: devicetree: Fix double words in Doumentation/devicetree
    Documentation: mm: Fix typo in vm.txt
    lockstat: Add documentation on contention and contenting points
    Documentation: fix blackfin gptimers-example build errors
    Fixes column alignment in table of contents entry 1.9 in Documentation/filesystems/proc.txt
    CodingStyle: enable emacs display of trailing whitespace
    DocBook: Do not exceed argument list limit
    gpio: board.txt: Fix the gpio name example
    Documentation/SubmittingPatches: unify whitespace/tabs for the DCO
    MAINTAINERS: Add the docs-next git tree to the maintainer entry
    ...

    Linus Torvalds
     

11 Feb, 2015

1 commit

  • Pull networking updates from David Miller:

    1) More iov_iter conversion work from Al Viro.

    [ The "crypto: switch af_alg_make_sg() to iov_iter" commit was
    wrong, and this pull actually adds an extra commit on top of the
    branch I'm pulling to fix that up, so that the pre-merge state is
    ok. - Linus ]

    2) Various optimizations to the ipv4 forwarding information base trie
    lookup implementation. From Alexander Duyck.

    3) Remove sock_iocb altogether, from Christoph Hellwig.

    4) Allow congestion control algorithm selection via routing metrics.
    From Daniel Borkmann.

    5) Make ipv4 uncached route list per-cpu, from Eric Dumazet.

    6) Handle rfs hash collisions more gracefully, also from Eric Dumazet.

    7) Add xmit_more support to r8169, e1000, and e1000e drivers. From
    Florian Westphal.

    8) Transparent Ethernet Bridging support for GRO, from Jesse Gross.

    9) Add BPF packet actions to packet scheduler, from Jiri Pirko.

    10) Add support for unique flow IDs to openvswitch, from Joe Stringer.

    11) New NetCP ethernet driver, from Muralidharan Karicheri and Wingman
    Kwok.

    12) More sanely handle out-of-window dupacks, which can result in
    serious ACK storms. From Neal Cardwell.

    13) Various rhashtable bug fixes and enhancements, from Herbert Xu,
    Patrick McHardy, and Thomas Graf.

    14) Support xmit_more in be2net, from Sathya Perla.

    15) Group Policy extensions for vxlan, from Thomas Graf.

    16) Remote Checksum Offload support for vxlan, from Tom Herbert.

    17) Like ipv4, support lockless transmit over ipv6 UDP sockets. From
    Vlad Yasevich.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1494+1 commits)
    crypto: fix af_alg_make_sg() conversion to iov_iter
    ipv4: Namespecify TCP PMTU mechanism
    i40e: Fix for stats init function call in Rx setup
    tcp: don't include Fast Open option in SYN-ACK on pure SYN-data
    openvswitch: Only set TUNNEL_VXLAN_OPT if VXLAN-GBP metadata is set
    ipv6: Make __ipv6_select_ident static
    ipv6: Fix fragment id assignment on LE arches.
    bridge: Fix inability to add non-vlan fdb entry
    net: Mellanox: Delete unnecessary checks before the function call "vunmap"
    cxgb4: Add support in cxgb4 to get expansion rom version via ethtool
    ethtool: rename reserved1 memeber in ethtool_drvinfo for expansion ROM version
    net: dsa: Remove redundant phy_attach()
    IB/mlx4: Reset flow support for IB kernel ULPs
    IB/mlx4: Always use the correct port for mirrored multicast attachments
    net/bonding: Fix potential bad memory access during bonding events
    tipc: remove tipc_snprintf
    tipc: nl compat add noop and remove legacy nl framework
    tipc: convert legacy nl stats show to nl compat
    tipc: convert legacy nl net id get to nl compat
    tipc: convert legacy nl net id set to nl compat
    ...

    Linus Torvalds
     

03 Feb, 2015

1 commit

  • Tx timestamps are looped onto the error queue on top of an skb. This
    mechanism leaks packet headers to processes unless the no-payload
    option SOF_TIMESTAMPING_OPT_TSONLY is set.

    Add a sysctl that optionally drops looped timestamps with data. This
    only affects processes without CAP_NET_RAW.

    The policy is checked when timestamps are generated in the stack.
    It is possible for timestamps with data to be reported after the
    sysctl is set, if these were queued internally earlier.

    No vulnerability is immediately known that exploits knowledge
    gleaned from packet headers, but it may still be preferable to allow
    administrators to lock down this path at the cost of possible
    breakage of legacy applications.
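
    The policy described above reduces to a small predicate. The following
    is a sketch only: the function and parameter names are hypothetical, and
    the sysctl is represented as a boolean that is true when timestamps with
    data are allowed for everyone.

```c
#include <stdbool.h>

/*
 * Decide whether a looped tx timestamp may be queued for a socket.
 *
 * tsonly           - the socket set SOF_TIMESTAMPING_OPT_TSONLY, so no
 *                    packet data is attached to the timestamp
 * allow_data       - hypothetical name for the sysctl state: true means
 *                    timestamps with data are allowed for all sockets
 * has_cap_net_raw  - the socket was opened with CAP_NET_RAW
 */
bool may_loop_tx_timestamp(bool tsonly, bool allow_data, bool has_cap_net_raw)
{
    /* Nothing can leak when there is no payload, or the admin allows it. */
    if (tsonly || allow_data)
        return true;

    /* Otherwise only privileged sockets still get timestamps with data. */
    return has_cap_net_raw;
}
```

    Because the check runs when the timestamp is generated, entries queued
    before the sysctl was changed are not affected by it.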

    Signed-off-by: Willem de Bruijn

    ----

    Changes
    (v1 -> v2)
    - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
    (rfc -> v1)
    - document the sysctl in Documentation/sysctl/net.txt
    - fix access control race: read .._OPT_TSONLY only once,
    use same value for permission check and skb generation.
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

29 Jan, 2015

1 commit


22 Dec, 2014

1 commit

  • This adds a new taint flag to indicate when the kernel or a kernel
    module has been live patched. This will provide a clean indication in
    bug reports that live patching was used.

    Additionally, if the crash occurs in a live patched function, the live
    patch module will appear beside the patched function in the backtrace.

    Signed-off-by: Seth Jennings
    Acked-by: Josh Poimboeuf
    Reviewed-by: Miroslav Benes
    Reviewed-by: Petr Mladek
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Jiri Kosina

    Seth Jennings
     

14 Dec, 2014

1 commit

  • SysV can be abused to allocate locked kernel memory. For most systems, a
    small limit doesn't make sense, see the discussion with regards to SHMMAX.

    Therefore: increase MSGMNI to the maximum supported.

    And: If we ignore the risk of locking too much memory, then an automatic
    scaling of MSGMNI doesn't make sense. Therefore the logic can be removed.

    The code preserves auto_msgmni to avoid breaking any user space applications
    that expect that the value exists.

    Notes:
    1) If an administrator must limit the memory allocations, then he can set
    MSGMNI as necessary.

    Or he can disable sysv entirely (as e.g. done by Android).

    2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
    to control latency vs. throughput:
    If MSGMNB is large, then msgsnd() just returns and more messages can be queued
    before a task switch to a task that calls msgrcv() is forced.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

12 Dec, 2014

1 commit

  • Pull networking updates from David Miller:

    1) New offloading infrastructure and example 'rocker' driver for
    offloading of switching and routing to hardware.

    This work was done by a large group of dedicated individuals, not
    limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
    Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu

    2) Start making the networking operate on IOV iterators instead of
    modifying iov objects in-situ during transfers. Thanks to Al Viro
    and Herbert Xu.

    3) A set of new netlink interfaces for the TIPC stack, from Richard
    Alpe.

    4) Remove unnecessary looping during ipv6 routing lookups, from Martin
    KaFai Lau.

    5) Add PAUSE frame generation support to gianfar driver, from Matei
    Pavaluca.

    6) Allow for larger reordering levels in TCP, which are easily
    achievable in the real world right now, from Eric Dumazet.

    7) Add a variant of napi_schedule that doesn't need to disable cpu
    interrupts, from Eric Dumazet.

    8) Use a doubly linked list to optimize neigh_parms_release(), from
    Nicolas Dichtel.

    9) Various enhancements to the kernel BPF verifier, and allow eBPF
    programs to actually be attached to sockets. From Alexei
    Starovoitov.

    10) Support TSO/LSO in sunvnet driver, from David L Stevens.

    11) Allow controlling ECN usage via routing metrics, from Florian
    Westphal.

    12) Remote checksum offload, from Tom Herbert.

    13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
    driver, from Thomas Lendacky.

    14) Add MPLS support to openvswitch, from Simon Horman.

    15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
    Klassert.

    16) Do gro flushes on a per-device basis using a timer, from Eric
    Dumazet. This tries to resolve the conflicting goals between the
    desired handling of bulk vs. RPC-like traffic.

    17) Allow userspace to ask for the CPU on which a packet was
    received/steered, via SO_INCOMING_CPU. From Eric Dumazet.

    18) Limit GSO packets to half the current congestion window, from Eric
    Dumazet.

    19) Add a generic helper so that all drivers set their RSS keys in a
    consistent way, from Eric Dumazet.

    20) Add xmit_more support to enic driver, from Govindarajulu
    Varadarajan.

    21) Add VLAN packet scheduler action, from Jiri Pirko.

    22) Support configurable RSS hash functions via ethtool, from Eyal
    Perry.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
    Fix race condition between vxlan_sock_add and vxlan_sock_release
    net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
    net/mlx4: Add support for A0 steering
    net/mlx4: Refactor QUERY_PORT
    net/mlx4_core: Add explicit error message when rule doesn't meet configuration
    net/mlx4: Add A0 hybrid steering
    net/mlx4: Add mlx4_bitmap zone allocator
    net/mlx4: Add a check if there are too many reserved QPs
    net/mlx4: Change QP allocation scheme
    net/mlx4_core: Use tasklet for user-space CQ completion events
    net/mlx4_core: Mask out host side virtualization features for guests
    net/mlx4_en: Set csum level for encapsulated packets
    be2net: Export tunnel offloads only when a VxLAN tunnel is created
    gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
    cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
    net: fec: only enable mdio interrupt before phy device link up
    net: fec: clear all interrupt events to support i.MX6SX
    net: fec: reset fep link status in suspend function
    net: sock: fix access via invalid file descriptor
    net: introduce helper macro for_each_cmsghdr
    ...

    Linus Torvalds
     

11 Dec, 2014

1 commit

  • There have been several times where I have had to rebuild a kernel to
    cause a panic when hitting a WARN() in the code in order to get a crash
    dump from a system. Sometimes this is easy to do, other times (such as
    in the case of a remote admin) it is not trivial to send new images to
    the user.

    A much easier method would be a switch to change the WARN() over to a
    panic. This makes debugging easier in that I can now test the actual
    image the WARN() was seen on and I do not have to engage in remote
    debugging.

    This patch adds a panic_on_warn kernel parameter and a
    /proc/sys/kernel/panic_on_warn sysctl. When either is set, panic() is
    called in the warn_slowpath_common() path. The function will still print
    out the location of the warning.

    An example of the panic_on_warn output:

    The first line below is from the WARN_ON() to output the WARN_ON()'s
    location. After that the panic() output is displayed.

    WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
    Kernel panic - not syncing: panic_on_warn set ...

    CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57
    Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
    0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
    0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
    ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
    Call Trace:
    [] dump_stack+0x46/0x58
    [] panic+0xd0/0x204
    [] ? init_dummy+0x1f/0x30 [dummy_module]
    [] warn_slowpath_common+0xd0/0xd0
    [] ? dummy_greetings+0x40/0x40 [dummy_module]
    [] warn_slowpath_null+0x1a/0x20
    [] init_dummy+0x1f/0x30 [dummy_module]
    [] do_one_initcall+0xd4/0x210
    [] ? __vunmap+0xc2/0x110
    [] load_module+0x16a9/0x1b30
    [] ? store_uevent+0x70/0x70
    [] ? copy_module_from_fd.isra.44+0x129/0x180
    [] SyS_finit_module+0xa6/0xd0
    [] system_call_fastpath+0x12/0x17

    Successfully tested by me.

    hpa said: There is another very valid use for this: many operators would
    rather have a machine shut down than be potentially compromised, either
    functionally or security-wise.

    Signed-off-by: Prarit Bhargava
    Cc: Jonathan Corbet
    Cc: Rusty Russell
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Masami Hiramatsu
    Acked-by: Yasuaki Ishimatsu
    Cc: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prarit Bhargava
     

17 Nov, 2014

1 commit

  • RSS (Receive Side Scaling) typically uses a Toeplitz hash and a 40- or
    52-byte RSS key.

    Some drivers use a constant (and well-known) key, some drivers use a random
    key per port, making bonding setups hard to tune. Well-known keys increase
    the attack surface, considering that the number of queues is usually a
    power of two.

    This patch provides infrastructure to help drivers do the right thing.

    netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
    even if they provide ethtool -X support to let the user redefine the key
    later.

    A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
    RSS key even for drivers not providing ethtool -x support, in case some
    applications want to precisely set up flows to match some RX queues.

    Tested:

    myhost:~# cat /proc/sys/net/core/netdev_rss_key
    11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72

    myhost:~# ethtool -x eth0
    RX flow hash indirection table for eth0 with 8 RX ring(s):
    0: 0 1 2 3 4 5 6 7
    RSS hash key:
    11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41
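
    For applications that want to predict which RX queue a flow lands on,
    the Toeplitz hash itself is short. The sketch below is a generic
    textbook formulation, not the code of any particular driver; the
    function name is hypothetical, and the caller is assumed to pass the key
    bytes (for example parsed from netdev_rss_key) and the input tuple in
    network byte order.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash: for every input bit that is set (most significant bit
 * first), XOR in the 32-bit window of the key that starts at that bit
 * position. Key bytes beyond key_len are treated as zero.
 */
uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                       const uint8_t *data, size_t data_len)
{
    uint32_t window = 0;
    uint32_t hash = 0;

    /* The initial window is the first 32 bits of the key. */
    for (size_t i = 0; i < 4; i++)
        window = (window << 8) | (i < key_len ? key[i] : 0);

    for (size_t i = 0; i < data_len; i++) {
        /* Key byte whose bits slide into the window during this byte. */
        uint8_t next = (i + 4 < key_len) ? key[i + 4] : 0;

        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            window = (window << 1) | ((next >> bit) & 1u);
        }
    }
    return hash;
}
```

    How the hash is then reduced to a queue index (typically through the
    indirection table shown by ethtool -x) depends on the driver.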

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Nov, 2014

1 commit

  • Use the more common dynamic_debug capable net_dbg_ratelimited
    and remove the LIMIT_NETDEBUG macro.

    All messages are still ratelimited.

    Some KERN_<LEVEL> uses are changed to KERN_DEBUG.

    This may have some negative impact on messages that were
    emitted at KERN_INFO that are now not enabled at all unless
    DEBUG is defined or dynamic_debug is enabled. Even so,
    these messages are now _not_ emitted by default.

    This also eliminates the use of the net_msg_warn sysctl
    "/proc/sys/net/core/warnings". For backward compatibility,
    the sysctl is not removed, but it has no function. The extern
    declaration of net_msg_warn is removed from sock.h and made
    static in net/core/sysctl_net_core.c.

    Miscellanea:

    o Update the sysctl documentation
    o Remove the embedded uses of pr_fmt
    o Coalesce format fragments
    o Realign arguments

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

14 Oct, 2014

1 commit

  • format_corename() can only pass the leader's pid to the core handler,
    but there is no simple way to figure out which thread originated the
    coredump.

    As Jan explains, this also means that there is no simple way to create
    the backtrace of the crashed process:

    As programs are mostly compiled with implicit gcc -fomit-frame-pointer
    one needs program's .eh_frame section (equivalently PT_GNU_EH_FRAME
    segment) or .debug_frame section. .debug_frame usually is present only
    in separate debug info files usually not even installed on the system.
    While .eh_frame is a part of the executable/library (and it is even
    always mapped for C++ exceptions unwinding) it no longer has to be
    present anywhere on the disk as the program could be upgraded in the
    meantime and the running instance has its executable file already
    unlinked from disk.

    One possibility is to echo 0x3f >/proc/*/coredump_filter and dump all
    the file-backed memory including the executable's .eh_frame section.
    But that can create huge core files, for example even due to mmapped
    data files.

    Another possibility would be to read .eh_frame from /proc/PID/mem at the
    core_pattern handler time of the core dump. For the backtrace one needs
    to read the register state first which can be done from core_pattern
    handler:

    ptrace(PTRACE_SEIZE, tid, 0, PTRACE_O_TRACEEXIT)
    close(0); // close pipe fd to resume the sleeping dumper
    waitpid(); // should report EXIT
    PTRACE_GETREGS or other requests

    The remaining problem is how to get the 'tid' value of the crashed
    thread. It could be read from the first NT_PRSTATUS note of the core
    file but that makes the core_pattern handler complicated.

    Unfortunately %t is already used so this patch uses %i/%I.
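
    To show where the new specifier fits, here is a user-space sketch of
    core_pattern expansion that understands only %p (pid), %i (tid) and %%.
    It is not the kernel's format_corename(); the function name is
    hypothetical, and dropping unknown specifiers and a lone trailing '%'
    is an assumption of this sketch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Expand pattern into out. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if out is too small.
 */
int expand_core_pattern(const char *pattern, int pid, int tid,
                        char *out, size_t out_len)
{
    size_t used = 0;

    if (out_len == 0)
        return -1;

    for (const char *p = pattern; *p != '\0'; p++) {
        char piece[32];
        int n = 0;

        if (*p != '%') {
            piece[0] = *p;          /* ordinary character, copied as-is */
            n = 1;
        } else {
            p++;
            if (*p == '\0')
                break;              /* lone trailing '%': stop here */
            switch (*p) {
            case 'p':               /* pid of the dumping process */
                n = snprintf(piece, sizeof(piece), "%d", pid);
                break;
            case 'i':               /* tid of the thread that crashed */
                n = snprintf(piece, sizeof(piece), "%d", tid);
                break;
            case '%':               /* literal percent sign */
                piece[0] = '%';
                n = 1;
                break;
            default:                /* unknown specifier: dropped */
                n = 0;
                break;
            }
        }

        if (used + (size_t)n >= out_len)
            return -1;
        memcpy(out + used, piece, (size_t)n);
        used += (size_t)n;
    }

    out[used] = '\0';
    return (int)used;
}
```

    A handler registered with a pattern such as "|/path/to/handler %p %i"
    would receive the tid as an argument and could pass it to PTRACE_SEIZE.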

    Automatic Bug Reporting Tool (https://github.com/abrt/abrt/wiki/overview)
    is experimenting with this. It is using the elfutils
    (https://fedorahosted.org/elfutils/) unwinder for generating the
    backtraces. Apart from not needing matching executables as mentioned
    above, another advantage is that we can get the backtrace without saving
    the core (which might be quite large) to disk.

    [mmilata@redhat.com: final paragraph of changelog]
    Signed-off-by: Jan Kratochvil
    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Denys Vlasenko
    Cc: Jan Kratochvil
    Cc: Mark Wielaard
    Cc: Martin Milata
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 Sep, 2014

1 commit

  • TIPC name table updates are distributed asynchronously in a cluster,
    entailing a risk of certain race conditions. E.g., if two nodes
    simultaneously issue conflicting (overlapping) publications, this may
    not be detected until both publications have reached a third node, in
    which case one of the publications will be silently dropped on that
    node. Hence, we end up with an inconsistent name table.

    In most cases this conflict is just a temporary race, e.g., one
    node is issuing a publication under the assumption that a previous,
    conflicting, publication has already been withdrawn by the other node.
    However, because of the (rtt related) distributed update delay, this
    may not yet hold true on all nodes. The symptom of this failure is a
    syslog message: "tipc: Cannot publish {%u,%u,%u}, overlap error".

    In this commit we add a resiliency queue at the receiving end of
    the name table distributor. When insertion of an arriving publication
    fails, we retain it in this queue for a short amount of time, assuming
    that another update will arrive very soon and clear the conflict. If this
    happens, we insert the publication, otherwise we drop it.

    The (configurable) retention value defaults to 2000 ms. Knowing from
    experience that the situation described above is extremely rare, there
    is no risk that the queue will accumulate any large number of items.
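
    The decision taken for each entry in the resiliency queue can be
    sketched as a pure function. The names below are hypothetical and the
    sketch ignores locking and timers; it only illustrates the
    insert/keep/drop choice described above.

```c
#include <stdbool.h>
#include <stdint.h>

enum defer_action {
    DEFER_INSERT,   /* conflict cleared: insert the publication */
    DEFER_KEEP,     /* still conflicting, retention time not yet over */
    DEFER_DROP,     /* still conflicting after the retention time */
};

/*
 * conflict_cleared - a retried insertion into the name table succeeded
 * queued_at_ms     - time at which the publication was deferred
 * now_ms           - current time, same clock as queued_at_ms
 * retention_ms     - configurable retention value (2000 ms by default)
 */
enum defer_action deferred_publication_action(bool conflict_cleared,
                                              uint64_t queued_at_ms,
                                              uint64_t now_ms,
                                              uint64_t retention_ms)
{
    if (conflict_cleared)
        return DEFER_INSERT;
    if (now_ms - queued_at_ms >= retention_ms)
        return DEFER_DROP;
    return DEFER_KEEP;
}
```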

    Signed-off-by: Erik Hugne
    Signed-off-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Erik Hugne
     

09 Aug, 2014

1 commit

  • This taint flag will be set if the system has ever entered a softlockup
    state. Similar to TAINT_WARN it is useful to know whether or not the
    system has been in a softlockup state when debugging.
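
    When triaging a report, the taint value can be decoded with a few bit
    tests. The sketch below handles only the two flags mentioned here and
    uses the bit positions from the kernel's documented taint table (bit 9
    for TAINT_WARN, letter 'W'; bit 14 for the softlockup flag, letter
    'L'). The function name is hypothetical, and reading the mask from
    /proc/sys/kernel/tainted is left to the caller.

```c
#include <stddef.h>

#define TAINT_WARN_BIT        9    /* 'W': a WARN() was hit */
#define TAINT_SOFTLOCKUP_BIT 14    /* 'L': a soft lockup occurred */

/*
 * Write the letters for the flags set in mask into out (NUL-terminated).
 * Returns the number of letters written.
 */
size_t taint_letters(unsigned long mask, char *out, size_t out_len)
{
    size_t n = 0;

    if (out_len == 0)
        return 0;
    if ((mask & (1UL << TAINT_WARN_BIT)) && n + 1 < out_len)
        out[n++] = 'W';
    if ((mask & (1UL << TAINT_SOFTLOCKUP_BIT)) && n + 1 < out_len)
        out[n++] = 'L';
    out[n] = '\0';
    return n;
}
```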

    [akpm@linux-foundation.org: apply the taint before calling panic()]
    Signed-off-by: Josh Hunt
    Cc: Jason Baron
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Hunt
     

24 Jun, 2014

2 commits

  • A 'softlockup' is defined as a bug that causes the kernel to loop in
    kernel mode for more than a predefined period of time, without giving
    other tasks a chance to run.

    Currently, upon detection of this condition by the per-cpu watchdog
    task, debug information (including a stack trace) is sent to the system
    log.

    On some occasions, we have observed that the "victim" rather than the
    actual "culprit" (i.e. the owner/holder of the contended resource) is
    reported to the user. Often this information has proven to be
    insufficient to assist debugging efforts.

    To avoid loss of useful debug information, for architectures which
    support NMI, this patch makes it possible to improve soft lockup
    reporting. This is accomplished by issuing an NMI to each cpu to obtain
    a stack trace.

    If NMI is not supported we just fall back to the old method. A sysctl
    and a boot-time parameter are available to toggle this feature.

    [dzickus@redhat.com: add CONFIG_SMP in certain areas]
    [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
    [mq@suse.cz: fix warning]
    Signed-off-by: Aaron Tomlin
    Signed-off-by: Don Zickus
    Cc: David S. Miller
    Cc: Mateusz Guzik
    Cc: Oleg Nesterov
    Signed-off-by: Jan Moskyto Matejka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Tomlin
     
  • Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.
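
    The accepted range described above can be sketched as follows. This is
    a Python model of the rule, not the kernel handler; the function name
    and the constant name are made up for illustration, and only the
    values 0, 8 and EINVAL come from the text above.

```python
import errno

# Lower bound stated in the commit message; the constant name is hypothetical.
MIN_PERCPU_PAGELIST_FRACTION = 8


def check_percpu_pagelist_fraction(value: int) -> int:
    """Return 0 if the written value is acceptable, -EINVAL otherwise."""
    if value == 0:
        # 0 restores the kernel default behaviour.
        return 0
    if value < MIN_PERCPU_PAGELIST_FRACTION:
        # Values in [1, 7] (and anything below) are rejected.
        return -errno.EINVAL
    return 0
```

    Because 0 is handled before any division takes place, a zero-length
    or zero-valued write can no longer reach the divide.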

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 Jun, 2014

1 commit

  • When writing to a sysctl string, each write, regardless of VFS position,
    begins writing the string from the start. This means the contents of
    the last write to the sysctl controls the string contents instead of the
    first:

    open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
    write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
    write(1, "/bin/true", 9) = 9
    close(1) = 0

    $ cat /proc/sys/kernel/modprobe
    /bin/true

    Expected behaviour would be to have the sysctl be "AAAA..." capped at
    maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
    contents of the second write. Similarly, multiple short writes would
    not append to the sysctl.

    The old behavior is unlike regular POSIX files enough that doing audits
    of software that interact with sysctls can end up in unexpected or
    dangerous situations. For example, "as long as the input starts with a
    trusted path" turns out to be an insufficient filter, as what must also
    happen is for the input to be entirely contained in a single write
    syscall -- not a common consideration, especially for high level tools.

    This provides kernel.sysctl_writes_strict as a way to make this behavior
    act in a less surprising manner for strings, and disallows non-zero file
    position when writing numeric sysctls (similar to what is already done
    when reading from non-zero file positions). For now, the default (0) is
    to warn about non-zero file position use, but retain the legacy
    behavior. Setting this to -1 disables the warning, and setting this to
    1 enables the file position respecting behavior.
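
    The difference between the legacy and the strict mode can be modelled
    with a short sketch. It is a simplification written for this
    description, not the kernel code: the helper names are hypothetical,
    newline handling is ignored, and one byte of maxlen is assumed to be
    reserved for the terminating NUL.

```python
KMOD_PATH_LEN = 256  # maxlen quoted above for kernel.modprobe


def sysctl_string_write(current, data, pos, maxlen, strict):
    """Model a single write() of `data` at file position `pos`."""
    capacity = maxlen - 1  # assume one byte is kept for the NUL terminator
    if not strict:
        # Legacy: the file position is ignored, every write starts over.
        return data[:capacity]
    if pos > len(current):
        # Strict: a write past the end of the stored string changes nothing.
        return current
    # Strict: continue from the file position, capped at the buffer size.
    return (current[:pos] + data)[:capacity]


def replay(writes, maxlen, strict):
    """Apply consecutive writes the way a sequential writer would."""
    value, pos = "", 0
    for data in writes:
        value = sysctl_string_write(value, data, pos, maxlen, strict)
        pos += len(data)  # write() reports the full length as consumed
    return value
```

    Replaying the two writes from the trace above gives "/bin/true" in
    legacy mode and a run of 255 "A" characters in strict mode, while two
    short writes "/bin" and "/true" append only in strict mode.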

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: move misplaced hunk, per Randy]
    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

05 Jun, 2014

2 commits

  • Existing description is worded in a way which almost encourages setting of
    vfs_cache_pressure above 100, possibly way above it.

    Users are left in the dark as to what this numeric value is - an int? a
    percentage? what the scale is?

    As a result, we are getting reports about noticeable performance
    degradation from users who have set vfs_cache_pressure to ridiculously
    high values - because they thought there is no downside to it.

    Via code inspection it's obvious that this value is treated as a
    percentage. This patch changes text to reflect this fact, and adds a
    cautionary paragraph advising against setting vfs_cache_pressure sky high.
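
    To make the "percentage" reading concrete, here is a minimal sketch of
    the scaling, assuming the count of reclaimable objects is simply
    multiplied by vfs_cache_pressure and divided by 100. The function name
    is illustrative.

```python
def scaled_reclaimable(count: int, vfs_cache_pressure: int) -> int:
    """Scale a count of reclaimable cache objects by the percentage."""
    return count * vfs_cache_pressure // 100
```

    With 1000 reclaimable objects, a setting of 100 reports 1000, a
    setting of 50 reports 500, and a setting of 1000 reports 10000, which
    is why very large values make the caches shrink far more aggressively
    than intended.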

    Signed-off-by: Denys Vlasenko
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
  • When it was introduced, zone_reclaim_mode made sense as NUMA distances
    punished and workloads were generally partitioned to fit into a NUMA
    node. NUMA machines are now common but few of the workloads are
    NUMA-aware and it's routine to see major performance degradation due to
    zone_reclaim_mode being enabled but relatively few can identify the
    problem.

    Those that require zone_reclaim_mode are likely to be able to detect
    when it needs to be enabled and tune appropriately so let's have a
    sensible default for the bulk of users.

    This patch (of 2):

    zone_reclaim_mode causes processes to prefer reclaiming memory from
    local node instead of spilling over to other nodes. This made sense
    initially when NUMA machines were almost exclusively HPC and the
    workload was partitioned into nodes. The NUMA penalties were
    sufficiently high to justify reclaiming the memory. On current machines
    and workloads it is often the case that zone_reclaim_mode destroys
    performance but not all users know how to detect this. Favour the
    common case and disable it by default. Users that are sophisticated
    enough to know they need zone_reclaim_mode will detect it.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Zhang Yanfei
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

08 Apr, 2014

1 commit

  • As sysctl_hung_task_timeout_sec is unsigned long, when this value is
    larger than LONG_MAX/HZ, the function schedule_timeout_interruptible in
    watchdog will return immediately without sleeping and will print:

    schedule_timeout: wrong timeout value ffffffffffffff83

    and then the function watchdog will call schedule_timeout_interruptible
    again and again. The screen will be filled with

    "schedule_timeout: wrong timeout value ffffffffffffff83"

    This patch adds a check and correction in sysctl, so that the function
    schedule_timeout_interruptible always gets a valid parameter.
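
    The overflow can be reproduced with plain arithmetic. The sketch below
    assumes a 64-bit long and HZ=250; both are assumptions made for the
    example, and the helper names are hypothetical.

```python
LONG_MAX = 2**63 - 1  # assumption: 64-bit signed long
HZ = 250              # assumption: timer frequency of the example kernel


def to_signed_long(x: int) -> int:
    """Reinterpret a 64-bit pattern as a signed long."""
    x &= 2**64 - 1
    return x - 2**64 if x > LONG_MAX else x


def timeout_in_jiffies(timeout_secs: int, hz: int = HZ) -> int:
    """Timeout as seen by a callee taking a signed long number of jiffies."""
    return to_signed_long(timeout_secs * hz)


def max_hung_task_timeout(hz: int = HZ) -> int:
    """Largest timeout in seconds that stays non-negative in jiffies."""
    return LONG_MAX // hz
```

    The value ffffffffffffff83 from the message above is -125 when read as
    a signed long, and any timeout one second above LONG_MAX/HZ already
    turns negative in this model.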

    Signed-off-by: Liu Hua
    Tested-by: Satoru Takeuchi
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Hua
     

07 Apr, 2014

1 commit

  • Pull module updates from Rusty Russell:
    "Nothing major: the stricter permissions checking for sysfs broke a
    staging driver; fix included. Greg KH said he'd take the patch but
    hadn't as the merge window opened, so it's included here to avoid
    breaking build"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    staging: fix up speakup kobject mode
    Use 'E' instead of 'X' for unsigned module taint flag.
    VERIFY_OCTAL_PERMISSIONS: stricter checking for sysfs perms.
    kallsyms: fix percpu vars on x86-64 with relocation.
    kallsyms: generalize address range checking
    module: LLVMLinux: Remove unused function warning from __param_check macro
    Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE
    module: remove MODULE_GENERIC_TABLE
    module: allow multiple calls to MODULE_DEVICE_TABLE() per module
    module: use pr_cont

    Linus Torvalds
     

04 Apr, 2014

1 commit

  • There is plenty of anecdotal evidence and a load of blog posts
    suggesting that using "drop_caches" periodically keeps your system
    running in "tip top shape". Perhaps adding some kernel documentation
    will increase the amount of accurate data on its use.

    If we are not shrinking caches effectively, then we have real bugs.
    Using drop_caches will simply mask the bugs and make them harder to
    find, but certainly does not fix them, nor is it an appropriate
    "workaround" to limit the size of the caches. On the contrary, there
    have been bug reports on issues that turned out to be misguided use of
    cache dropping.

    Dropping caches is a very drastic and disruptive operation that is good
    for debugging and running tests, but if it creates bug reports from
    production use, kernel developers should be aware of its use.

    Add a bit more documentation about it, a syslog message to track down
    abusers, and vmstat drop counters to help analyze problem reports.

    [akpm@linux-foundation.org: checkpatch fixes]
    [hannes@cmpxchg.org: add runtime suppression control]
    Signed-off-by: Dave Hansen
    Signed-off-by: Michal Hocko
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

01 Apr, 2014

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Bigger changes:

    - sched/idle restructuring: they are WIP preparation for deeper
    integration between the scheduler and idle state selection, by
    Nicolas Pitre.

    - add NUMA scheduling pseudo-interleaving, by Rik van Riel.

    - optimize cgroup context switches, by Peter Zijlstra.

    - RT scheduling enhancements, by Thomas Gleixner.

    The rest is smaller changes, non-urgent fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
    sched: Clean up the task_hot() function
    sched: Remove double calculation in fix_small_imbalance()
    sched: Fix broken setscheduler()
    sparc64, sched: Remove unused sparc64_multi_core
    sched: Remove unused mc_capable() and smt_capable()
    sched/numa: Move task_numa_free() to __put_task_struct()
    sched/fair: Fix endless loop in idle_balance()
    sched/core: Fix endless loop in pick_next_task()
    sched/fair: Push down check for high priority class task into idle_balance()
    sched/rt: Fix picking RT and DL tasks from empty queue
    trace: Replace hardcoding of 19 with MAX_NICE
    sched: Guarantee task priority in pick_next_task()
    sched/idle: Remove stale old file
    sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
    cpuidle/arm64: Remove redundant cpuidle_idle_call()
    cpuidle/powernv: Remove redundant cpuidle_idle_call()
    sched, nohz: Exclude isolated cores from load balancing
    sched: Fix select_task_rq_fair() description comments
    workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
    sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
    ...

    Linus Torvalds
     

13 Mar, 2014

1 commit

  • Users have reported being unable to trace non-signed modules loaded
    within a kernel supporting module signature.

    This is caused by tracepoint.c:tracepoint_module_coming() refusing to
    take into account tracepoints sitting within force-loaded modules
    (TAINT_FORCED_MODULE). The reason for this check, in the first place, is
    that a force-loaded module may have a struct module incompatible with
    the layout expected by the kernel, and can thus cause a kernel crash
    upon forced load of that module on a kernel with CONFIG_TRACEPOINTS=y.

    Tracepoints, however, specifically accept TAINT_OOT_MODULE and
    TAINT_CRAP, since those modules do not lead to the "very likely system
    crash" issue cited above for force-loaded modules.

    With kernels having CONFIG_MODULE_SIG=y (signed modules), a non-signed
    module is tainted re-using the TAINT_FORCED_MODULE taint flag.
    Unfortunately, this means that Tracepoints treat that module as a
    force-loaded module, and thus silently refuse to consider any tracepoint
    within this module.

    Since an unsigned module does not fit within the "very likely system
    crash" category of tainting, add a new TAINT_UNSIGNED_MODULE taint flag
    to specifically address this taint behavior, and accept those modules
    within Tracepoints. We use the letter 'X' as a taint flag character for
    a module being loaded that doesn't know how to sign its name (proposed
    by Steven Rostedt).

    Also add the missing 'O' entry to trace event show_module_flags() list
    for the sake of completeness.

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Steven Rostedt
    NAKed-by: Ingo Molnar
    CC: Thomas Gleixner
    CC: David Howells
    CC: Greg Kroah-Hartman
    Signed-off-by: Rusty Russell

    Mathieu Desnoyers
     

27 Feb, 2014

1 commit


02 Feb, 2014

1 commit


01 Feb, 2014

1 commit

  • Pull core debug changes from Ingo Molnar:
    "This contains mostly kernel debugging related updates:

    - make hung_task detection more configurable to distros
    - add final bits for x86 UV NMI debugging, with related KGDB changes
    - update the mailing-list of MAINTAINERS entries I'm involved with"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    hung_task: Display every hung task warning
    sysctl: Add neg_one as a standard constraint
    x86/uv/nmi, kgdb/kdb: Fix UV NMI handler when KDB not configured
    x86/uv/nmi: Fix Sparse warnings
    kgdb/kdb: Fix no KDB config problem
    MAINTAINERS: Restore "L: linux-kernel@vger.kernel.org" entries

    Linus Torvalds
     

31 Jan, 2014

1 commit


30 Jan, 2014

1 commit

  • Prior to commit fe35004fbf9e ("mm: avoid swapping out with
    swappiness==0"), with swappiness set to 0 the reclaim code could still
    evict recently used user anonymous memory to swap even though there was
    a significant amount of RAM used for page cache.

    The behaviour of setting swappiness to 0 has since changed. When set,
    the reclaim code does not initiate swap until the amount of free pages
    and file-backed pages is less than the high water mark in a zone.
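
    As a rough sketch of the rule just described (a simplified model of
    the condition, not the reclaim code; the function and parameter names
    are illustrative):

```python
def may_initiate_swap(swappiness, free_pages, file_pages, high_wmark_pages):
    """Model whether reclaim in a zone may start swapping anonymous memory."""
    if swappiness != 0:
        # Non-zero swappiness: swapping is weighed against file reclaim
        # as usual (not modelled here).
        return True
    # swappiness == 0: swap only once free + file-backed pages drop
    # below the zone's high water mark.
    return free_pages + file_pages < high_wmark_pages
```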

    Let's update the documentation to reflect this.

    [akpm@linux-foundation.org: remove comma, per Randy]
    Signed-off-by: Aaron Tomlin
    Acked-by: Rik van Riel
    Acked-by: Bryn M. Reeves
    Cc: Satoru Moriya
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Tomlin
     

28 Jan, 2014

1 commit

  • Excessive migration of pages can hurt the performance of workloads
    that span multiple NUMA nodes. However, it turns out that the
    p->numa_migrate_deferred knob is a really big hammer, which does
    reduce migration rates, but does not actually help performance.

    Now that the second stage of the automatic numa balancing code
    has stabilized, it is time to replace the simplistic migration
    deferral code with something smarter.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Peter Zijlstra
    Cc: Chegu Vinod
    Link: http://lkml.kernel.org/r/1390860228-21539-2-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

25 Jan, 2014

1 commit

  • When khungtaskd detects hung tasks, it prints out
    backtraces from a number of those tasks.

    Limiting the number of backtraces being printed
    out can result in the user not seeing the information
    necessary to debug the issue. The hung_task_warnings
    sysctl controls this feature.

    This patch makes it possible for hung_task_warnings
    to accept a special value to print an unlimited
    number of backtraces when khungtaskd detects hung
    tasks.

    The special value is -1. To use this value it is
    necessary to change types from ulong to int.
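
    The intended semantics of the counter can be sketched like this. It is
    a Python model with a hypothetical helper name; only the meaning of -1
    and the signed type come from the text above.

```python
def report_hung_task(warnings_left: int):
    """Return (should_print, new_warnings_left) for one detected hung task."""
    if warnings_left == 0:
        # Budget exhausted: stay silent.
        return False, 0
    if warnings_left > 0:
        # Finite budget: consume one warning.
        warnings_left -= 1
    # Negative (-1) means unlimited and is never decremented.
    return True, warnings_left
```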

    Signed-off-by: Aaron Tomlin
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: oleg@redhat.com
    Link: http://lkml.kernel.org/r/1390239253-24030-3-git-send-email-atomlin@redhat.com
    [ Build warning fix. ]
    Signed-off-by: Ingo Molnar

    Aaron Tomlin
     

24 Jan, 2014

1 commit

  • For general-purpose (i.e. distro) kernel builds it makes sense to build
    with CONFIG_KEXEC to allow end users to choose what kind of things they
    want to do with kexec. However, in the face of trying to lock down a
    system with such a kernel, there needs to be a way to disable kexec_load
    (much like module loading can be disabled). Without this, it is too easy
    for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM
    and modules_disabled are set. With this change, it is still possible to
    load an image for use later, then disable kexec_load so the image (or lack
    of image) can't be altered.

    The intention is for using this in environments where "perfect"
    enforcement is hard. Without a verified boot, along with verified
    modules, and along with verified kexec, this is trying to give a system a
    better chance to defend itself (or at least grow the window of
    discoverability) against attack in the face of a privilege escalation.

    In my mind, I consider several boot scenarios:

    1) Verified boot of read-only verified root fs loading fd-based
    verification of kexec images.
    2) Secure boot of writable root fs loading signed kexec images.
    3) Regular boot loading kexec (e.g. kcrash) image early and locking it.
    4) Regular boot with no control of kexec image at all.

    1 and 2 don't exist yet, but will soon once the verified kexec series has
    landed. 4 is the state of things now. The gap between 2 and 4 is too
    large, so this change creates scenario 3, a middle-ground above 4 when 2
    and 1 are not possible for a system.

    Signed-off-by: Kees Cook
    Acked-by: Rik van Riel
    Cc: Vivek Goyal
    Cc: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

22 Jan, 2014

1 commit

  • Some applications that run on HPC clusters are designed around the
    availability of RAM and the overcommit ratio is fine tuned to get the
    maximum usage of memory without swapping. With growing memory, the
    1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
    for these workloads (on a 2TB machine it represents no less than 20GB).

    This patch adds the new overcommit_kbytes sysctl variable that allows a
    much finer grain.
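
    The granularity argument can be checked with a small calculation. The
    sketch below assumes the commit limit is either a percentage of RAM or
    an absolute number of kilobytes, plus swap; huge pages are ignored and
    the function name is illustrative.

```python
def commit_limit_kb(total_ram_kb, swap_kb, overcommit_ratio=50,
                    overcommit_kbytes=0):
    """Approximate commit limit in kB under strict overcommit accounting."""
    if overcommit_kbytes:
        # Absolute setting takes the place of the percentage.
        allowed = overcommit_kbytes
    else:
        allowed = total_ram_kb * overcommit_ratio // 100
    return allowed + swap_kb


TWO_TB_KB = 2 * 1024**3  # 2 TiB expressed in kB

# One step of overcommit_ratio on a 2TB machine, in kB.
ratio_step_kb = (commit_limit_kb(TWO_TB_KB, 0, overcommit_ratio=51)
                 - commit_limit_kb(TWO_TB_KB, 0, overcommit_ratio=50))
```

    Here ratio_step_kb comes out at a little over 20GB, whereas
    overcommit_kbytes lets the limit move in steps of a single kilobyte.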

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

17 Dec, 2013

1 commit

  • Commit 887c290e (sched/numa: Decide whether to favour task or group weights
    based on swap candidate relationships) dropped the check against
    sysctl_numa_balancing_settle_count; this patch removes the sysctl.

    Signed-off-by: Wanpeng Li
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Naoya Horiguchi
    Link: http://lkml.kernel.org/r/1386833006-6600-1-git-send-email-liwanp@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

13 Nov, 2013

2 commits

  • Some setuid binaries will allow reading of files which have read
    permission by the real user id. This is problematic with files which
    use %pK because the file access permission is checked at open() time,
    but the kptr_restrict setting is checked at read() time. If a setuid
    binary opens a %pK file as an unprivileged user, and then elevates
    permissions before reading the file, then kernel pointer values may be
    leaked.

    This happens for example with the setuid pppd application on Ubuntu 12.04:

    $ head -1 /proc/kallsyms
    00000000 T startup_32

    $ pppd file /proc/kallsyms
    pppd: In file /proc/kallsyms: unrecognized option 'c1000000'

    This will only leak the pointer value from the first line, but other
    setuid binaries may leak more information.

    Fix this by adding a check that in addition to the current process having
    CAP_SYSLOG, that effective user and group ids are equal to the real ids.
    If a setuid binary reads the contents of a file which uses %pK then the
    pointer values will be printed as NULL if the real user is unprivileged.
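
    A minimal sketch of the resulting decision, written as a Python model
    rather than the kernel code. The function name is hypothetical, and
    the handling of kptr_restrict values other than 1 reflects the usual
    meaning of the sysctl (0 shows pointers, 2 always hides them), which
    this commit message does not spell out.

```python
def may_print_kernel_pointer(kptr_restrict, has_cap_syslog,
                             uid, euid, gid, egid):
    """Model whether %pK prints the real pointer value."""
    if kptr_restrict == 0:
        # Default: no restriction.
        return True
    if kptr_restrict == 1:
        # CAP_SYSLOG alone is not enough; the effective ids must also
        # match the real ids, which is false for a setuid binary run by
        # an unprivileged user.
        return has_cap_syslog and uid == euid and gid == egid
    # Any stricter setting hides the pointer from everyone.
    return False
```

    For the pppd case above, the process has euid 0 but the real uid of
    the invoking user, so the ids differ and the pointer is hidden.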

    Update the sysctl documentation to reflect the changes, and also correct
    the documentation to state the kptr_restrict=0 is the default.

    This is only a temporary solution to the issue. The correct solution is
    to do the permission check at open() time on files, and to replace %pK
    with a function which checks the open() time permission. %pK uses in
    printk should be removed since no sane permission check can be done, and
    instead protected by using dmesg_restrict.

    Signed-off-by: Ryan Mallon
    Cc: Kees Cook
    Cc: Alexander Viro
    Cc: Joe Perches
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryan Mallon
     
  • Now dirty_background_ratio/dirty_ratio contains a percentage of total
    available memory, which contains free pages and reclaimable pages. The
    number of these pages is not equal to the number of total system memory.
    But they are described as a percentage of total system memory in
    Documentation/sysctl/vm.txt. So we need to fix them to avoid
    misunderstanding.
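
    The distinction matters for the resulting thresholds. A small sketch,
    assuming "available memory" is just free pages plus reclaimable pages
    as described above (the function name is illustrative, and the default
    ratios of 10 and 20 are the usual kernel defaults):

```python
def dirty_thresholds(free_pages, reclaimable_pages,
                     dirty_background_ratio=10, dirty_ratio=20):
    """Return (background, hard) dirty page thresholds in pages."""
    # The percentages apply to available memory, not to total RAM.
    available = free_pages + reclaimable_pages
    background = available * dirty_background_ratio // 100
    hard = available * dirty_ratio // 100
    return background, hard
```

    On a machine with 10000 pages of RAM of which only 4000 are free or
    reclaimable, the thresholds are 400 and 800 pages, not 1000 and 2000.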

    Signed-off-by: Zheng Liu
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zheng Liu