02 Mar, 2013

1 commit

  • Pull new ARC architecture from Vineet Gupta:
    "Initial ARC Linux port with some fixes on top for 3.9-rc1:

    I would like to introduce the Linux port to ARC Processors (from
    Synopsys) for 3.9-rc1. The patch-set has been discussed on the public
    lists since Nov and has received a fair bit of review, especially from
    Arnd, tglx, Al and other subsystem maintainers for DeviceTree, kgdb...

    The arch bits are in arch/arc, some asm-generic changes (acked by
    Arnd), a minor change to PARISC (acked by Helge).

    The series is a touch bigger for a new port for 2 main reasons:

    1. It enables a basic kernel in first sub-series and adds
    ptrace/kgdb/.. later

    2. Some of the fallout of review (DeviceTree support,
    multi-platform-image support) was added on top of orig series,
    primarily to record the revision history.

    This updated pull request additionally contains

    - fixes due to our GNU tools catching up with the new syscall/ptrace
    ABI

    - some (minor) cross-arch Kconfig updates."

    * tag 'arc-v3.9-rc1-late' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (82 commits)
    ARC: split elf.h into uapi and export it for userspace
    ARC: Fixup the current ABI version
    ARC: gdbserver using regset interface possibly broken
    ARC: Kconfig cleanup tracking cross-arch Kconfig pruning in merge window
    ARC: make a copy of flat DT
    ARC: [plat-arcfpga] DT arc-uart bindings change: "baud" => "current-speed"
    ARC: Ensure CONFIG_VIRT_TO_BUS is not enabled
    ARC: Fix pt_orig_r8 access
    ARC: [3.9] Fallout of hlist iterator update
    ARC: 64bit RTSC timestamp hardware issue
    ARC: Don't fiddle with non-existent caches
    ARC: Add self to MAINTAINERS
    ARC: Provide a default serial.h for uart drivers needing BASE_BAUD
    ARC: [plat-arcfpga] defconfig for fully loaded ARC Linux
    ARC: [Review] Multi-platform image #8: platform registers SMP callbacks
    ARC: [Review] Multi-platform image #7: SMP common code to use callbacks
    ARC: [Review] Multi-platform image #6: cpu-to-dma-addr optional
    ARC: [Review] Multi-platform image #5: NR_IRQS defined by ARC core
    ARC: [Review] Multi-platform image #4: Isolate platform headers
    ARC: [Review] Multi-platform image #3: switch to board callback
    ...

    Linus Torvalds
     

28 Feb, 2013

1 commit

  • The existing SUID_DUMP_* defines duplicate the newer SUID_DUMPABLE_*
    defines introduced in 54b501992dd2 ("coredump: warn about unsafe
    suid_dumpable / core_pattern combo"). Remove the new ones, and use the
    prior values instead.

    Signed-off-by: Kees Cook
    Reported-by: Chen Gang
    Cc: Alexander Viro
    Cc: Alan Cox
    Cc: "Eric W. Biederman"
    Cc: Doug Ledford
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

26 Feb, 2013

1 commit

  • Pull module update from Rusty Russell:
    "The sweeping change is to make add_taint() explicitly indicate whether
    to disable lockdep, but it's a mechanical change."

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    MODSIGN: Add option to not sign modules during modules_install
    MODSIGN: Add -s option to sign-file
    MODSIGN: Specify the hash algorithm on sign-file command line
    MODSIGN: Simplify Makefile with a Kconfig helper
    module: clean up load_module a little more.
    modpost: Ignore ARC specific non-alloc sections
    module: constify within_module_*
    taint: add explicit flag to show whether lock dep is still OK.
    module: printk message when module signature fail taints kernel.

    Linus Torvalds
     

24 Feb, 2013

1 commit

  • When calculating amount of dirtyable memory, min_free_kbytes should be
    subtracted because it is not intended for dirty pages.

    Addresses http://bugs.debian.org/695182

    [akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
    [akpm@linux-foundation.org: fix min() warning]
    Signed-off-by: Paul Szabo
    Acked-by: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Szabo
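
    The calculation described in this commit can be illustrated with a
    minimal sketch. It is a simplified model, not the kernel code: the
    function and parameter names are hypothetical, a 4 KB page size is
    assumed, and the real computation involves further terms.

```python
def dirtyable_pages(free_pages, reclaimable_pages, min_free_kbytes, page_size_kb=4):
    """Estimate pages that may hold dirty data (simplified, hypothetical model).

    The min_free_kbytes reserve is subtracted because it is not intended
    for dirty pages. page_size_kb=4 is an assumption for illustration.
    """
    reserve_pages = min_free_kbytes // page_size_kb
    total = free_pages + reclaimable_pages
    # Subtract at most `total`, so the result never goes negative.
    return total - min(total, reserve_pages)


# Example: 1000 free + 500 reclaimable pages, 400 kB reserved (100 pages).
print(dirtyable_pages(1000, 500, 400))
```

    Clamping with min() keeps the result non-negative on systems where
    the reserve exceeds the currently free and reclaimable memory.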
     

22 Feb, 2013

1 commit


16 Feb, 2013

1 commit

  • PARISC defines /proc/sys/kernel/unaligned-trap to runtime toggle
    unaligned access emulation.

    The exact mechanics of enabling/disabling are still arch specific, but
    we can make the sysctl usable by other arches.

    Signed-off-by: Vineet Gupta
    Acked-by: Helge Deller
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn

    Vineet Gupta
     

08 Feb, 2013

2 commits

  • Add a /proc/sys/kernel scheduler knob named
    sched_rr_timeslice_ms that allows global changing of the
    SCHED_RR timeslice value. User visible value is in milliseconds
    but is stored as jiffies. Setting to 0 (zero) resets to the
    default (currently 100ms).

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094704.13751796@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
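
    The milliseconds-to-jiffies behaviour of this knob can be sketched as
    follows. This is a model under stated assumptions, not the kernel
    implementation: HZ = 250 is an assumed tick rate, the rounding-up
    conversion and the rejection of negative values are assumptions, and
    all function names are hypothetical. Only the 100 ms default and the
    "0 resets to default" rule come from the commit message.

```python
HZ = 250                        # assumed tick rate; real kernels vary
DEFAULT_RR_TIMESLICE_MS = 100   # default stated in the commit message


def msecs_to_jiffies(ms, hz=HZ):
    """Convert milliseconds to jiffies, rounding up (assumed rounding)."""
    return (ms * hz + 999) // 1000


def jiffies_to_msecs(jiffies, hz=HZ):
    """Convert jiffies back to the user-visible millisecond value."""
    return jiffies * 1000 // hz


def store_rr_timeslice(ms, hz=HZ):
    """Return the stored timeslice in jiffies for a value written by the user."""
    if ms < 0:
        # Assumption: negative input is rejected in this sketch.
        raise ValueError("timeslice must be >= 0")
    if ms == 0:
        # Writing 0 resets to the default.
        ms = DEFAULT_RR_TIMESLICE_MS
    return msecs_to_jiffies(ms, hz)


print(jiffies_to_msecs(store_rr_timeslice(0)))
```

    Because the value is stored in jiffies, a written value that is not
    a multiple of the tick length may read back differently.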
     
  • Move the sysctl-related bits from include/linux/sched.h into
    a new file: include/linux/sched/sysctl.h. Then update source
    files requiring access to those bits by including the new
    header file.

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     

21 Jan, 2013

1 commit


10 Jan, 2013

1 commit

  • IA64 defines /proc/sys/kernel/ignore-unaligned-usertrap to control
    verbose warnings on unaligned access emulation.

    Although the exact mechanics of what to do with sysctl (ignore/shout)
    are arch specific, this change enables the sysctl to be usable cross-arch.

    Signed-off-by: Vineet Gupta
    Cc: Fenghua Yu
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Tony Luck

    Vineet Gupta
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

11 Dec, 2012

4 commits

  • The PTE scanning rate and fault rates are two of the biggest sources of
    system CPU overhead with automatic NUMA placement. Ideally a proper policy
    would detect if a workload was properly placed, schedule and adjust the
    PTE scanning rate accordingly. We do not track the necessary information
    to do that but we at least know if we migrated or not.

    This patch scans slower if a page was not migrated as the result of a
    NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
    now higher than the previous default. Once every minute it will reset
    the scanner in case of phase changes.

    This is hilariously crude and the numbers are arbitrary. Workloads will
    converge quite slowly in comparison to what a proper policy should be able
    to do. On the plus side, we will chew up less CPU for workloads that have
    no need for automatic balancing.

    Signed-off-by: Mel Gorman

    Mel Gorman
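
    The back-off described in this commit can be sketched roughly as
    below. The function names, the 10 ms step and the minimum period are
    assumptions for illustration. The commit message itself only states
    that scanning slows down up to a maximum when a hinting fault does
    not migrate, and that the scanner resets about once a minute.

```python
RESET_INTERVAL_MS = 60_000  # "once every minute", per the commit message


def next_scan_period(period_ms, migrated, period_max_ms, step_ms=10):
    """Lengthen the scan period when a NUMA hinting fault did not migrate.

    step_ms is an assumed increment; the period is capped at period_max_ms.
    """
    if migrated:
        return period_ms
    return min(period_max_ms, period_ms + step_ms)


def maybe_reset(period_ms, ms_since_reset, period_min_ms):
    """Drop back to the fastest scan rate once per reset interval."""
    if ms_since_reset >= RESET_INTERVAL_MS:
        return period_min_ms
    return period_ms


period = 100
for _ in range(5):
    period = next_scan_period(period, migrated=False, period_max_ms=130)
print(period)
```

    The periodic reset lets a task whose access pattern changes phase be
    scanned quickly again, instead of staying at the slowest rate.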
     
  • Add a 1 second delay before starting to scan the working set of
    a task and starting to balance it amongst nodes.

    [ note that before the constant per task WSS sampling rate patch
    the initial scan would happen much later still, in effect that
    patch caused this regression. ]

    The theory is that short-run tasks benefit very little from NUMA
    placement: they come and go, and they better stick to the node
    they were started on. As tasks mature and rebalance to other CPUs
    and nodes, so does their NUMA placement have to change and so
    does it start to matter more and more.

    In practice this change fixes an observable kbuild regression:

    # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

    !NUMA:
    45.291088843 seconds time elapsed ( +- 0.40% )
    45.154231752 seconds time elapsed ( +- 0.36% )

    +NUMA, no slow start:
    46.172308123 seconds time elapsed ( +- 0.30% )
    46.343168745 seconds time elapsed ( +- 0.25% )

    +NUMA, 1 sec slow start:
    45.224189155 seconds time elapsed ( +- 0.25% )
    45.160866532 seconds time elapsed ( +- 0.17% )

    and it also fixes an observable perf bench (hackbench) regression:

    # perf stat --null --repeat 10 perf bench sched messaging

    -NUMA: 0.246225691 seconds time elapsed ( +- 1.31% )
    +NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% )
    +NUMA 1sec delay: 0.248076230 seconds time elapsed ( +- 1.35% )

    The implementation is simple and straightforward, most of the patch
    deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable
    knob.

    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    [ Wrote the changelog, ran measurements, tuned the default. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Peter Zijlstra
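
    The size of the kbuild regression quoted in this commit can be worked
    out from the timings in the message. The helper below is a
    hypothetical convenience function that averages the two runs of each
    configuration. The timing values are copied from the commit message.

```python
from statistics import mean


def relative_change(baseline, candidate):
    """Relative change of the mean of `candidate` versus the mean of `baseline`."""
    base = mean(baseline)
    return (mean(candidate) - base) / base


# kbuild timings in seconds, as quoted in the commit message.
no_numa = [45.291088843, 45.154231752]
numa_no_slow_start = [46.172308123, 46.343168745]
numa_slow_start = [45.224189155, 45.160866532]

print(f"no slow start: {relative_change(no_numa, numa_no_slow_start):+.2%}")
print(f"1 sec slow start: {relative_change(no_numa, numa_slow_start):+.2%}")
```

    With these figures the build is a little over 2% slower without the
    slow start, and within about 0.1% of the !NUMA baseline with the
    1 second delay, which is inside the quoted measurement noise.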
     
  • Previously, to probe the working set of a task, we'd use
    a very simple and crude method: mark all of its address
    space PROT_NONE.

    That method has various (obvious) disadvantages:

    - it samples the working set at dissimilar rates,
    giving some tasks a sampling quality advantage
    over others.

    - creates performance problems for tasks with very
    large working sets

    - over-samples processes with large address spaces but
    which only very rarely execute

    Improve that method by keeping a rotating offset into the
    address space that marks the current position of the scan,
    and advance it by a constant rate (in a CPU cycles execution
    proportional manner). If the offset reaches the last mapped
    address of the mm then it starts over at the first
    address.

    The per-task nature of the working set sampling functionality in this tree
    allows such constant rate, per task, execution-weight proportional sampling
    of the working set, with an adaptive sampling interval/frequency that
    goes from once per 100ms up to just once per 8 seconds. The current
    sampling volume is 256 MB per interval.

    As tasks mature and converge their working set, so does the
    sampling rate slow down to just a trickle, 256 MB per 8
    seconds of CPU time executed.

    This, beyond being adaptive, also rate-limits rarely
    executing systems and does not over-sample on overloaded
    systems.

    [ In AutoNUMA speak, this patch deals with the effective sampling
    rate of the 'hinting page fault'. AutoNUMA's scanning is
    currently rate-limited, but it is also fundamentally
    single-threaded, executing in the knuma_scand kernel thread,
    so the limit in AutoNUMA is global and does not scale up with
    the number of CPUs, nor does it scan tasks in an execution
    proportional manner.

    So the idea of rate-limiting the scanning was first implemented
    in the AutoNUMA tree via a global rate limit. This patch goes
    beyond that by implementing an execution rate proportional
    working set sampling rate that is not implemented via a single
    global scanning daemon. ]

    [ Dan Carpenter pointed out a possible NULL pointer dereference in the
    first version of this patch. ]

    Based-on-idea-by: Andrea Arcangeli
    Bug-Found-By: Dan Carpenter
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    [ Wrote changelog and fixed bug. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Peter Zijlstra
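
    The rotating-offset scan described in this commit can be modelled
    with a short sketch. It assumes a single contiguous address range
    [start, end), whereas a real mm consists of several mappings. The
    function name is hypothetical, and the choice to wrap around within
    one interval is an assumption. The 256 MB volume per interval is
    taken from the commit message.

```python
SCAN_CHUNK = 256 * 1024 * 1024  # 256 MB per interval, per the commit message


def scan_window(offset, start, end, chunk=SCAN_CHUNK):
    """Return (ranges, new_offset) for one scan interval.

    `ranges` lists the [lo, hi) spans to mark in this interval. The offset
    advances by a constant amount and restarts at `start` after reaching
    `end`. A single contiguous range is an assumption of this sketch.
    """
    if not (start <= offset < end):
        offset = start
    ranges = []
    # Never scan more than one full pass over the range per interval.
    remaining = min(chunk, end - start)
    while remaining > 0:
        stop = min(end, offset + remaining)
        ranges.append((offset, stop))
        remaining -= stop - offset
        # Wrap to the first address once the last one is reached.
        offset = start if stop == end else stop
    return ranges, offset


ranges, offset = scan_window(900, 0, 1000, chunk=300)
print(ranges, offset)
```

    Because every interval covers the same volume regardless of address
    space size, a task with a huge mapping is not penalised by a full
    PROT_NONE sweep on each pass.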
     
  • NOTE: This patch is based on "sched, numa, mm: Add fault driven
    placement and migration policy" but as it throws away all the policy
    to just leave a basic foundation I had to drop the signed-offs-by.

    This patch creates a bare-bones method for setting PTEs pte_numa in the
    context of the scheduler that when faulted later will be faulted onto the
    node the CPU is running on. In itself this does nothing useful but any
    placement policy will fundamentally depend on receiving hints on placement
    from fault context and doing something intelligent about it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Peter Zijlstra
     

29 Nov, 2012

1 commit


09 Oct, 2012

1 commit

  • Introduce SYSCTL_EXCEPTION_TRACE config option and select it in the
    architectures requiring support for the "exception-trace" debug_table
    entry in kernel/sysctl.c.

    Signed-off-by: Catalin Marinas
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

06 Oct, 2012

1 commit

  • Adds an expert Kconfig option, CONFIG_COREDUMP, which allows disabling of
    core dump. This saves approximately 2.6k in the compiled kernel, and
    complements CONFIG_ELF_CORE, which now depends on it.

    CONFIG_COREDUMP also disables coredump-related sysctls, except for
    suid_dumpable and related functions, which are necessary for ptrace.

    [akpm@linux-foundation.org: fix binfmt_aout.c build]
    Signed-off-by: Alex Kelly
    Reviewed-by: Josh Triplett
    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Kelly
     

02 Oct, 2012

1 commit

  • Pull arm64 support from Catalin Marinas:
    "Linux support for the 64-bit ARM architecture (AArch64)

    Features currently supported:
    - 39-bit address space for user and kernel (each)
    - 4KB and 64KB page configurations
    - Compat (32-bit) user applications (ARMv7, EABI only)
    - Flattened Device Tree (mandated for all AArch64 platforms)
    - ARM generic timers"

    * tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (35 commits)
    arm64: ptrace: remove obsolete ptrace request numbers from user headers
    arm64: Do not set the SMP/nAMP processor bit
    arm64: MAINTAINERS update
    arm64: Build infrastructure
    arm64: Miscellaneous header files
    arm64: Generic timers support
    arm64: Loadable modules
    arm64: Miscellaneous library functions
    arm64: Performance counters support
    arm64: Add support for /proc/sys/debug/exception-trace
    arm64: Debugging support
    arm64: Floating point and SIMD
    arm64: 32-bit (compat) applications support
    arm64: User access library functions
    arm64: Signal handling support
    arm64: VDSO support
    arm64: System calls handling
    arm64: ELF definitions
    arm64: SMP support
    arm64: DMA mapping API
    ...

    Linus Torvalds
     

17 Sep, 2012

1 commit

  • This patch allows setting of the show_unhandled_signals variable via
    /proc/sys/debug/exception-trace. The default value is currently 1
    showing unhandled user faults (undefined instructions, data aborts) and
    invalid signal stack frames.

    Signed-off-by: Catalin Marinas
    Acked-by: Tony Lindgren
    Acked-by: Arnd Bergmann
    Acked-by: Nicolas Pitre
    Acked-by: Olof Johansson
    Acked-by: Santosh Shilimkar

    Catalin Marinas
     

04 Sep, 2012

1 commit

  • Unlike others, sched_migration_cost, sched_time_avg and
    sched_shares_window don't have a time unit as a suffix. Add them.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1345083330-19486-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
     

02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

01 Aug, 2012

1 commit

  • Since per-BDI flusher threads were introduced in 2.6, the pdflush
    mechanism is not used any more. But the old interface exported through
    /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

    For backward compatibility, print a warning via printk and return 2 to
    notify users that the interface has been removed.

    Signed-off-by: Wanpeng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

31 Jul, 2012

2 commits

  • register_sysctl_table() is a strange function, as it makes internal
    allocations (a header) to register a sysctl_table. This header is a
    handle to the table that is created, and can be used to unregister the
    table. But if the table is permanent and never unregistered, the header
    acts the same as a static variable.

    Unfortunately, this allocation of memory that is never expected to be
    freed fools kmemleak into thinking that we have leaked memory. For those
    sysctl tables that are never unregistered, and have no pointer referencing
    them, kmemleak will think that these are memory leaks:

    unreferenced object 0xffff880079fb9d40 (size 192):
    comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x73/0x98
    [] kmemleak_alloc_recursive.constprop.42+0x16/0x18
    [] __kmalloc+0x107/0x153
    [] kzalloc.constprop.8+0xe/0x10
    [] __register_sysctl_paths+0xe1/0x160
    [] register_sysctl_paths+0x1b/0x1d
    [] register_sysctl_table+0x18/0x1a
    [] sysctl_init+0x10/0x14
    [] proc_sys_init+0x2f/0x31
    [] proc_root_init+0xa5/0xa7
    [] start_kernel+0x3d0/0x40a
    [] x86_64_start_reservations+0xae/0xb2
    [] x86_64_start_kernel+0x102/0x111
    [] 0xffffffffffffffff

    The sysctl_base_table used by sysctl itself is one such instance that
    registers the table to never be unregistered.

    Use kmemleak_not_leak() to suppress the kmemleak false positive.

    Signed-off-by: Steven Rostedt
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • When suid_dumpable=2, detect unsafe core_pattern settings and warn when
    they are seen.

    Signed-off-by: Kees Cook
    Suggested-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Alan Cox
    Cc: "Eric W. Biederman"
    Cc: Doug Ledford
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

30 Jul, 2012

1 commit

  • This adds symlink and hardlink restrictions to the Linux VFS.

    Symlinks:

    A long-standing class of security issues is the symlink-based
    time-of-check-time-of-use race, most commonly seen in world-writable
    directories like /tmp. The common method of exploitation of this flaw
    is to cross privilege boundaries when following a given symlink (i.e. a
    root process follows a symlink belonging to another user). For a likely
    incomplete list of hundreds of examples across the years, please see:
    http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp

    The solution is to permit symlinks to only be followed when outside
    a sticky world-writable directory, or when the uid of the symlink and
    follower match, or when the directory owner matches the symlink's owner.

    Some pointers to the history of earlier discussion that I could find:

    1996 Aug, Zygo Blaxell
    http://marc.info/?l=bugtraq&m=87602167419830&w=2
    1996 Oct, Andrew Tridgell
    http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
    1997 Dec, Albert D Cahalan
    http://lkml.org/lkml/1997/12/16/4
    2005 Feb, Lorenzo Hernández García-Hierro
    http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
    2010 May, Kees Cook
    https://lkml.org/lkml/2010/5/30/144

    Past objections and rebuttals could be summarized as:

    - Violates POSIX.
    - POSIX didn't consider this situation and it's not useful to follow
    a broken specification at the cost of security.
    - Might break unknown applications that use this feature.
    - Applications that break because of the change are easy to spot and
    fix. Applications that are vulnerable to symlink ToCToU by not having
    the change aren't. Additionally, no applications have yet been found
    that rely on this behavior.
    - Applications should just use mkstemp() or O_CREAT|O_EXCL.
    - True, but applications are not perfect, and new software is written
    all the time that makes these mistakes; blocking this flaw at the
    kernel is a single solution to the entire class of vulnerability.
    - This should live in the core VFS.
    - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
    - This should live in an LSM.
    - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)

    Hardlinks:

    On systems that have user-writable directories on the same partition
    as system files, a long-standing class of security issues is the
    hardlink-based time-of-check-time-of-use race, most commonly seen in
    world-writable directories like /tmp. The common method of exploitation
    of this flaw is to cross privilege boundaries when following a given
    hardlink (i.e. a root process follows a hardlink created by another
    user). Additionally, an issue exists where users can "pin" a potentially
    vulnerable setuid/setgid file so that an administrator will not actually
    upgrade a system fully.

    The solution is to permit hardlinks to only be created when the user is
    already the existing file's owner, or if they already have read/write
    access to the existing file.

    Many Linux users are surprised when they learn they can link to files
    they have no access to, so this change appears to follow the doctrine
    of "least surprise". Additionally, this change does not violate POSIX,
    which states "the implementation may require that the calling process
    has permission to access the existing file"[1].

    This change is known to break some implementations of the "at" daemon,
    though the version used by Fedora and Ubuntu has been fixed[2] for
    a while. Otherwise, the change has been undisruptive while in use in
    Ubuntu for the last 1.5 years.

    [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
    [2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279

    This patch is based on the patches in Openwall and grsecurity, along with
    suggestions from Al Viro. I have added a sysctl to enable the protected
    behavior, and documentation.

    Signed-off-by: Kees Cook
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kees Cook
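
    The two rules stated in this commit message can be written down as
    predicates. This is a sketch of the policy as worded above, not the
    kernel's implementation: the function names and parameters are
    hypothetical, and any additional conditions the real code applies
    (for example capability overrides) are left out.

```python
import stat


def may_follow_symlink(dir_mode, dir_uid, link_uid, follower_uid):
    """Symlink rule as worded in the commit message (simplified sketch)."""
    sticky_world_writable = bool(dir_mode & stat.S_ISVTX) and bool(dir_mode & stat.S_IWOTH)
    if not sticky_world_writable:
        return True   # outside a sticky world-writable directory
    if follower_uid == link_uid:
        return True   # follower owns the symlink
    if dir_uid == link_uid:
        return True   # directory owner owns the symlink
    return False


def may_create_hardlink(user_uid, file_uid, has_read_write_access):
    """Hardlink rule as worded in the commit message (simplified sketch)."""
    return user_uid == file_uid or has_read_write_access


# A root process (uid 0) meeting another user's symlink in a /tmp-like directory.
print(may_follow_symlink(0o1777, dir_uid=0, link_uid=1000, follower_uid=0))
```

    The first predicate blocks the classic case of a privileged process
    following an attacker's symlink in /tmp, while leaving symlinks in
    ordinary directories untouched.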
     

05 Apr, 2012

1 commit

  • Commit bfdc0b4 adds code to restrict access to dmesg_restrict,
    however, it incorrectly alters kptr_restrict rather than
    dmesg_restrict.

    The original patch from Richard Weinberger
    (https://lkml.org/lkml/2011/3/14/362) alters dmesg_restrict as
    expected, and so the patch seems to have been misapplied.

    This adds the CAP_SYS_ADMIN check to both dmesg_restrict and
    kptr_restrict, since both are sensitive.

    Reported-by: Phillip Lougher
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Richard Weinberger
    Cc: stable@vger.kernel.org
    Signed-off-by: James Morris

    Kees Cook
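
    The intended behaviour after this fix can be sketched as a small
    model in which both sysctls are guarded by the same capability
    check. The table, function name and boolean return value are
    hypothetical. How the kernel reports the refusal to user space is
    not described in the commit message and is not modelled here.

```python
# Both knobs are sensitive, so both require CAP_SYS_ADMIN to change.
PRIVILEGED_SYSCTLS = {"dmesg_restrict", "kptr_restrict"}


def write_sysctl(table, name, value, has_cap_sys_admin):
    """Apply a write to `table`; return False if the caller lacks the capability."""
    if name in PRIVILEGED_SYSCTLS and not has_cap_sys_admin:
        return False
    table[name] = value
    return True


table = {"dmesg_restrict": 1, "kptr_restrict": 1}
print(write_sysctl(table, "dmesg_restrict", 0, has_cap_sys_admin=False), table)
```

    Listing both names in one set reflects the point of the fix: the
    check must cover dmesg_restrict as well as kptr_restrict.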
     

29 Mar, 2012

5 commits

  • Merge third batch of patches from Andrew Morton:
    - Some MM stragglers
    - core SMP library cleanups (on_each_cpu_mask)
    - Some IPI optimisations
    - kexec
    - kdump
    - IPMI
    - the radix-tree iterator work
    - various other misc bits.

    "That'll do for -rc1. I still have ~10 patches for 3.4, will send
    those along when they've baked a little more."

    * emailed from Andrew Morton: (35 commits)
    backlight: fix typo in tosa_lcd.c
    crc32: add help text for the algorithm select option
    mm: move hugepage test examples to tools/testing/selftests/vm
    mm: move slabinfo.c to tools/vm
    mm: move page-types.c from Documentation to tools/vm
    selftests/Makefile: make `run_tests' depend on `all'
    selftests: launch individual selftests from the main Makefile
    radix-tree: use iterators in find_get_pages* functions
    radix-tree: rewrite gang lookup using iterator
    radix-tree: introduce bit-optimized iterator
    fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
    nbd: rename the nbd_device variable from lo to nbd
    pidns: add reboot_pid_ns() to handle the reboot syscall
    sysctl: use bitmap library functions
    ipmi: use locks on watchdog timeout set on reboot
    ipmi: simplify locking
    ipmi: fix message handling during panics
    ipmi: use a tasklet for handling received messages
    ipmi: increase KCS timeouts
    ipmi: decrease the IPMI message transaction time in interrupt mode
    ...

    Linus Torvalds
     
  • Use bitmap_set() instead of using set_bit() for each bit. This conversion
    is valid because the bitmap is private in the function call and atomic
    bitops were unnecessary.

    This also includes a minor change:
    - Use bitmap_copy() for shorter typing

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
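
    The equivalence this conversion relies on can be shown with a small
    model that uses a Python integer as the bitmap. The function names
    echo the kernel helpers but these are illustrative stand-ins with
    assumed signatures, not the kernel API.

```python
def set_bits_one_by_one(bitmap, start, nbits):
    """Set each bit individually, as a loop over set_bit() would."""
    for bit in range(start, start + nbits):
        bitmap |= 1 << bit
    return bitmap


def bitmap_set(bitmap, start, nbits):
    """Set a contiguous run of bits in one step using a mask."""
    mask = ((1 << nbits) - 1) << start
    return bitmap | mask


print(bin(bitmap_set(0, 4, 3)))
```

    Both forms yield the same bitmap. The range form is only a safe
    replacement when no other context can touch the bitmap at the same
    time, which is the condition the commit message cites.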
     
  • …m/linux/kernel/git/dhowells/linux-asm_system

    Pull "Disintegrate and delete asm/system.h" from David Howells:
    "Here are a bunch of patches to disintegrate asm/system.h into a set of
    separate bits to relieve the problem of circular inclusion
    dependencies.

    I've built all the working defconfigs from all the arches that I can
    and made sure that they don't break.

    The reason for these patches is that I recently encountered a circular
    dependency problem that came about when I produced some patches to
    optimise get_order() by rewriting it to use ilog2().

    This uses bitops - and on the SH arch asm/bitops.h drags in
    asm-generic/get_order.h by a circuitous route involving asm/system.h.
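
    For readers unfamiliar with the optimisation mentioned here, the
    following is a rough userspace sketch of computing a page order from a
    base-2 logarithm. The sketch_* names and the 4 KiB page size are
    assumptions made for illustration; this is not the patch that was
    actually written.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Floor of log2(v) for v > 0; a simple stand-in for the kernel's ilog2(). */
static int sketch_ilog2(unsigned long v)
{
    int r = 0;

    assert(v != 0);
    while (v >>= 1)
        r++;
    return r;
}

/* Smallest order such that (1 << order) pages can hold `size` bytes. */
static int sketch_get_order(unsigned long size)
{
    unsigned long pages_minus_1;

    assert(size != 0);
    pages_minus_1 = (size - 1) >> SKETCH_PAGE_SHIFT;
    if (pages_minus_1 == 0)
        return 0; /* fits in a single page */
    return sketch_ilog2(pages_minus_1) + 1;
}
```

    The point of the dependency problem is visible even here: the order
    helper needs a bit-operation helper, so their headers must be includable
    in that direction without a loop.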

    The main difficulty seems to be asm/system.h. It holds a number of
    low level bits with no/few dependencies that are commonly used (eg.
    memory barriers) and a number of bits with more dependencies that
    aren't used in many places (eg. switch_to()).

    These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

    Move memory barriers here. This is already done for MIPS and Alpha.

    (2) asm/switch_to.h

    Move switch_to() and related stuff here.

    (3) asm/exec.h

    Move arch_align_stack() here. Other process execution related bits
    could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

    Move xchg() and cmpxchg() here as they're full word atomic ops and
    frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

    Move die() and related bits.

    (6) asm/auxvec.h

    Move AT_VECTOR_SIZE_ARCH here.

    Other arch headers are created as needed on a per-arch basis."
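
    As an illustration of item (4), here is a minimal userspace sketch of
    xchg()/cmpxchg()-style primitives and of a typical atomic helper built
    on top of them. It uses C11 <stdatomic.h> rather than any kernel header,
    and all sketch_* names are hypothetical. The one convention taken from
    the kernel is that the compare-and-exchange returns the previous value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Returns the value that was stored in *ptr before the operation. */
static long sketch_cmpxchg(atomic_long *ptr, long old, long new_val)
{
    long expected = old;

    assert(ptr != NULL);
    /* On failure, `expected` is overwritten with the current value. */
    atomic_compare_exchange_strong(ptr, &expected, new_val);
    return expected;
}

/* Unconditionally stores new_val and returns the previous value. */
static long sketch_xchg(atomic_long *ptr, long new_val)
{
    return atomic_exchange(ptr, new_val);
}

/* Retry loop built on cmpxchg: add `a` unless the value equals `unless`.
 * Returns 1 if the addition happened, 0 otherwise. */
static int sketch_add_unless(atomic_long *ptr, long a, long unless)
{
    long cur = atomic_load(ptr);

    while (cur != unless) {
        long prev = sketch_cmpxchg(ptr, cur, cur + a);

        if (prev == cur)
            return 1; /* our update won */
        cur = prev;   /* someone else changed it; retry with the new value */
    }
    return 0;
}
```

    Because helpers like the last one only need the exchange primitives,
    keeping those primitives in a small header of their own avoids pulling
    in unrelated declarations.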

    Fixed up some conflicts from other header file cleanups and moving code
    around that has happened in the meantime, so David's testing is somewhat
    weakened by that. We'll find out anything that got broken and fix it.

    * tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
    Delete all instances of asm/system.h
    Remove all #inclusions of asm/system.h
    Add #includes needed to permit the removal of asm/system.h
    Move all declarations of free_initmem() to linux/mm.h
    Disintegrate asm/system.h for OpenRISC
    Split arch_align_stack() out from asm-generic/system.h
    Split the switch_to() wrapper out of asm-generic/system.h
    Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
    Create asm-generic/barrier.h
    Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
    Disintegrate asm/system.h for Xtensa
    Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
    Disintegrate asm/system.h for Tile
    Disintegrate asm/system.h for Sparc
    Disintegrate asm/system.h for SH
    Disintegrate asm/system.h for Score
    Disintegrate asm/system.h for S390
    Disintegrate asm/system.h for PowerPC
    Disintegrate asm/system.h for PA-RISC
    Disintegrate asm/system.h for MN10300
    ...

    Linus Torvalds
     
  • Remove all #inclusions of asm/system.h preparatory to splitting and killing
    it. Performed with the following command:

    perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

    Signed-off-by: David Howells

    David Howells
     
  • Disintegrate asm/system.h for Sparc.

    Signed-off-by: David Howells
    cc: sparclinux@vger.kernel.org

    David Howells
     

24 Mar, 2012

1 commit

  • Pull sysctl updates from Eric Biederman:

    - Rewrite of sysctl for speed and clarity.

    Insert/remove/lookup in sysctl are all now O(log N) operations, and
    are no longer bottlenecks in the process of adding and removing
    network devices.

    sysctl is now focused on being a filesystem instead of system call
    and the code can all be found in fs/proc/proc_sysctl.c. Hopefully
    this means the code is now approachable.

    Much thanks is owed to Lucian Grijincu for keeping at this until
    something was found that was usable.

    - The recent proc_sys_poll oops found by the fuzzer during hibernation
    is fixed.
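
    To make the cost claim above concrete, here is a small userspace sketch
    of name lookup in an ordered index. The series itself indexes sysctl
    directories with rbtrees; this sketch swaps in a sorted array with
    bsearch() purely to keep the example short, so it shows the logarithmic
    lookup but not the tree insertion and removal. All sketch_* names are
    hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct sketch_entry {
    const char *name; /* entry name within one directory */
    int value;        /* placeholder payload */
};

/* Order entries by name, as an ordered index would. */
static int sketch_cmp(const void *a, const void *b)
{
    const struct sketch_entry *ea = a;
    const struct sketch_entry *eb = b;

    return strcmp(ea->name, eb->name);
}

/* Sort once up front (a tree would keep this order on every insert). */
static void sketch_sort(struct sketch_entry *tab, size_t n)
{
    qsort(tab, n, sizeof(*tab), sketch_cmp);
}

/* Binary search by name: O(log N) comparisons per lookup. */
static const struct sketch_entry *sketch_lookup(const struct sketch_entry *tab,
                                                size_t n, const char *name)
{
    struct sketch_entry key = { name, 0 };

    assert(tab != NULL && name != NULL);
    return bsearch(&key, tab, n, sizeof(*tab), sketch_cmp);
}
```

    With a linear list, each lookup walks every entry; with an ordered
    index, adding or finding one of many similarly named entries stays
    cheap as the directory grows.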

    * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl: (36 commits)
    sysctl: protect poll() in entries that may go away
    sysctl: Don't call sysctl_follow_link unless we are a link.
    sysctl: Comments to make the code clearer.
    sysctl: Correct error return from get_subdir
    sysctl: An easier to read version of find_subdir
    sysctl: fix memset parameters in setup_sysctl_set()
    sysctl: remove an unused variable
    sysctl: Add register_sysctl for normal sysctl users
    sysctl: Index sysctl directories with rbtrees.
    sysctl: Make the header lists per directory.
    sysctl: Move sysctl_check_dups into insert_header
    sysctl: Modify __register_sysctl_paths to take a set instead of a root and an nsproxy
    sysctl: Replace root_list with links between sysctl_table_sets.
    sysctl: Add sysctl_print_dir and use it in get_subdir
    sysctl: Stop requiring explicit management of sysctl directories
    sysctl: Add a root pointer to ctl_table_set
    sysctl: Rewrite proc_sys_readdir in terms of first_entry and next_entry
    sysctl: Rewrite proc_sys_lookup introducing find_entry and lookup_entry.
    sysctl: Normalize the root_table data structure.
    sysctl: Factor out insert_header and erase_header
    ...

    Linus Torvalds
     

14 Feb, 2012

1 commit


25 Jan, 2012

3 commits

  • Move the core sysctl code from kernel/sysctl.c and kernel/sysctl_check.c
    into fs/proc/proc_sysctl.c.

    Currently sysctl maintenance is hampered by the sysctl implementation
    being split across 3 files with artificial layering between them.
    Consolidate the entire sysctl implementation into 1 file so that
    it is easier to see what is going on and, hopefully, simpler to
    maintain.

    For functions that are now only used in fs/proc/proc_sysctl.c remove
    their declarations from sysctl.h and make them static in fs/proc/proc_sysctl.c.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Simplify the code by treating the base sysctl table like any other
    sysctl table and register it with register_sysctl_table.

    To ensure this table is registered early enough to avoid problems
    call sysctl_init from proc_sys_init.

    Rename sysctl_net.c:sysctl_init() to net_sysctl_init() to avoid
    name conflicts now that kernel/sysctl.c:sysctl_init() is no longer
    static.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • - In sysctl.h, move the functions that are only available when
    CONFIG_SYSCTL is defined inside #ifdef CONFIG_SYSCTL.

    - Move the stub function definitions for !CONFIG_SYSCTL
    into sysctl.h and make them static inlines.
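
    The header pattern described in these two points looks roughly like the
    sketch below. The SKETCH_CONFIG_FEATURE macro, the types and the
    function names are all hypothetical; they only mirror the shape of a
    header that declares real functions when a feature is configured in and
    supplies static inline stubs when it is not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types standing in for a table and its registration handle. */
struct sketch_table { const char *name; };
struct sketch_header { const struct sketch_table *table; };

#ifdef SKETCH_CONFIG_FEATURE
/* Feature built in: real implementations live in a separate .c file. */
struct sketch_header *sketch_register(const struct sketch_table *table);
void sketch_unregister(struct sketch_header *header);
#else
/* Feature compiled out: stubs cost nothing and need no #ifdef in callers. */
static inline struct sketch_header *sketch_register(const struct sketch_table *table)
{
    (void)table;
    return NULL;
}

static inline void sketch_unregister(struct sketch_header *header)
{
    (void)header;
}
#endif

/* A caller compiles unchanged in both configurations.
 * Returns 1 if something was registered, 0 otherwise. */
static int sketch_driver_init(const struct sketch_table *table)
{
    struct sketch_header *h;

    assert(table != NULL);
    h = sketch_register(table);
    sketch_unregister(h);
    return h != NULL;
}
```

    Keeping the stubs in the header as static inlines means the disabled
    configuration needs no extra object file, and the compiler can discard
    the calls entirely.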

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

05 Dec, 2011

1 commit

  • Currently, a message is merely printed when a stack overflow is
    detected, which is not sufficient for systems that need high
    reliability. In general the overflow may already have corrupted
    data, and reading that data can cause further corruption unless
    the system stops.

    This patch adds the sysctl parameter
    kernel.panic_on_stackoverflow. When it is set, the kernel panics
    on detecting an overflow of the kernel, IRQ or exception stacks
    (but not the user stack). It is disabled by default.

    Signed-off-by: Mitsuo Hayasaka
    Cc: yrl.pp-manager.tt@hitachi.com
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Link: http://lkml.kernel.org/r/20111129060836.11076.12323.stgit@ltc219.sdl.hitachi.co.jp
    Signed-off-by: Ingo Molnar

    Mitsuo Hayasaka
     

01 Nov, 2011

2 commits

  • Quoth Andrew:

    - Most of MM. Still waiting for the powerpc guys to get off their
    butts and review some threaded hugepages patches.

    - alpha

    - vfs bits

    - drivers/misc

    - a few core kernel tweaks

    - printk() features

    - MAINTAINERS updates

    - backlight merge

    - leds merge

    - various lib/ updates

    - checkpatch updates

    * akpm: (127 commits)
    epoll: fix spurious lockdep warnings
    checkpatch: add a --strict check for utf-8 in commit logs
    kernel.h/checkpatch: mark strict_strto and simple_strto as obsolete
    llist-return-whether-list-is-empty-before-adding-in-llist_add-fix
    wireless: at76c50x: follow rename pack_hex_byte to hex_byte_pack
    fat: follow rename pack_hex_byte() to hex_byte_pack()
    security: follow rename pack_hex_byte() to hex_byte_pack()
    kgdb: follow rename pack_hex_byte() to hex_byte_pack()
    lib: rename pack_hex_byte() to hex_byte_pack()
    lib/string.c: fix strim() semantics for strings that have only blanks
    lib/idr.c: fix comment for ida_get_new_above()
    lib/percpu_counter.c: enclose hotplug only variables in hotplug ifdef
    lib/bitmap.c: quiet sparse noise about address space
    lib/spinlock_debug.c: print owner on spinlock lockup
    lib/kstrtox: common code between kstrto*() and simple_strto*() functions
    drivers/leds/leds-lp5521.c: check if reset is successful
    leds: turn the blink_timer off before starting to blink
    leds: save the delay values after a successful call to blink_set()
    drivers/leds/leds-gpio.c: use gpio_get_value_cansleep() when initializing
    drivers/leds/leds-lm3530.c: add __devexit_p where needed
    ...

    Linus Torvalds
     
  • Userspace needs to know the highest valid capability of the running
    kernel, which right now cannot reliably be retrieved from the header files
    only. The fact that this value cannot be determined properly right now
    creates various problems for libraries compiled on newer header files
    which are run on older kernels. They assume capabilities are available
    which actually aren't. libcap-ng is one example. And we ran into the
    same problem with systemd too.

    Now the capability is exported in /proc/sys/kernel/cap_last_cap.
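
    A library could consume the exported value along the lines of the
    sketch below. The /proc/sys/kernel/cap_last_cap path comes from this
    commit; the sketch_* helpers and the fallback of returning -1 when the
    file is missing (for example on a kernel that predates this change) are
    assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Reads a single integer from `path`.
 * Returns the value, or -1 if the file is missing or unparsable. */
static int sketch_read_last_cap(const char *path)
{
    FILE *f;
    int val = -1;

    assert(path != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

/* A capability number is usable only if the running kernel knows it. */
static int sketch_cap_supported(int cap, int last_cap)
{
    return last_cap >= 0 && cap >= 0 && cap <= last_cap;
}
```

    A caller would pass "/proc/sys/kernel/cap_last_cap" as the path and
    check each capability against the result at run time, instead of
    trusting the highest value in the headers it was compiled against.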

    [akpm@linux-foundation.org: make cap_last_cap const, per Ulrich]
    Signed-off-by: Dan Ballard
    Cc: Randy Dunlap
    Cc: Ingo Molnar
    Cc: Lennart Poettering
    Cc: Kay Sievers
    Cc: Ulrich Drepper
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Ballard