16 Aug, 2014

1 commit

  • The commit

    4982223e51e8 module: set nx before marking module MODULE_STATE_COMING.

    introduced a regression: if a module fails to parse its arguments or
    if mod_sysfs_setup fails, then the module's memory will be freed
    while still read-only. Anything that reuses that memory will crash
    as soon as it tries to write to it.

    Cc: stable@vger.kernel.org # v3.16
    Cc: Rusty Russell
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Rusty Russell

    Andy Lutomirski
     

15 Aug, 2014

4 commits

  • Pull more ACPI and power management updates from Rafael Wysocki:
    "These are a couple of regression fixes, cpuidle menu governor
    optimizations, fixes for ACPI processor and battery drivers,
    hibernation fix to avoid problems related to the e820 memory map,
    fixes for a few cpufreq drivers and a new version of the suspend
    profiling tool analyze_suspend.py.

    Specifics:

    - Fix for an ACPI-based device hotplug regression introduced in 3.14
    that causes a kernel panic to trigger when memory hot-remove is
    attempted with CONFIG_ACPI_HOTPLUG_MEMORY unset from Tang Chen

    - Fix for a cpufreq regression introduced in 3.16 that triggers a
    "sleeping function called from invalid context" bug in
    dev_pm_opp_init_cpufreq_table() from Stephen Boyd

    - ACPI battery driver fix for a warning message added in 3.16 that
    prints silly stuff sometimes from Mariusz Ceier

    - Hibernation fix for safer handling of mismatches in the e820 memory
    map between the configurations during image creation and during the
    subsequent restore from Chun-Yi Lee

    - ACPI processor driver fix to handle CPU hotplug notifications
    correctly during system suspend/resume from Lan Tianyu

    - Series of four cpuidle menu governor cleanups that also should
    speed it up a bit from Mel Gorman

    - Fixes for the speedstep-smi, integrator, cpu0 and arm_big_little
    cpufreq drivers from Hans Wennborg, Himangi Saraogi, Markus
    Pargmann and Uwe Kleine-König

    - Version 3.0 of the analyze_suspend.py suspend profiling tool from
    Todd E Brandt"

    * tag 'pm+acpi-3.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI / battery: Fix warning message in acpi_battery_get_state()
    PM / tools: analyze_suspend.py: update to v3.0
    cpufreq: arm_big_little: fix module license spec
    cpufreq: speedstep-smi: fix decimal printf specifiers
    ACPI / hotplug: Check scan handlers in acpi_scan_hot_remove()
    cpufreq: OPP: Avoid sleeping while atomic
    cpufreq: cpu0: Do not print error message when deferring
    cpufreq: integrator: Use set_cpus_allowed_ptr
    PM / hibernate: avoid unsafe pages in e820 reserved regions
    ACPI / processor: Make acpi_cpu_soft_notify() process CPU FROZEN events
    cpuidle: menu: Lookup CPU runqueues less
    cpuidle: menu: Call nr_iowait_cpu less times
    cpuidle: menu: Use ktime_to_us instead of reinventing the wheel
    cpuidle: menu: Use shifts when calculating averages where possible

    Linus Torvalds
     
  • Benjamin Herrenschmidt pointed out that I further missed modifying
    update_vsyscall after the wall_to_mono value was changed to a
    timespec64. This causes issues on powerpc32, which expects a 32bit
    timespec.

    This patch fixes the problem by properly converting from a timespec64 to
    a timespec before passing the value on to the arch-specific vsyscall
    logic.
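
    As a rough illustration of the conversion described above, the sketch
    below narrows a timespec with 64-bit seconds into one with 32-bit seconds,
    field by field. The struct and function names (demo_timespec64,
    demo_timespec32, demo_narrow) are hypothetical stand-ins for this example
    and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a timespec with a 64-bit seconds field. */
struct demo_timespec64 {
    int64_t tv_sec;
    long    tv_nsec;
};

/* Hypothetical stand-in for the timespec a 32-bit consumer expects. */
struct demo_timespec32 {
    int32_t tv_sec;
    long    tv_nsec;
};

/*
 * Convert field by field instead of reinterpreting the memory: the two
 * layouts differ, so handing the wide struct to code that expects the
 * narrow one would make it read the fields at the wrong offsets.
 */
static struct demo_timespec32 demo_narrow(struct demo_timespec64 ts64)
{
    struct demo_timespec32 ts32;

    /* This sketch only handles values that fit in 32 bits. */
    assert(ts64.tv_sec >= INT32_MIN && ts64.tv_sec <= INT32_MAX);

    ts32.tv_sec = (int32_t)ts64.tv_sec;
    ts32.tv_nsec = ts64.tv_nsec;
    return ts32;
}
```

    A caller would build the wide value, call demo_narrow() once, and pass
    only the narrow result on to the code that expects the 32-bit layout.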

    [ Thomas is currently on vacation, but reviewed it and wanted me to send
    this fix on to you directly. ]

    Cc: LKML
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Benjamin Herrenschmidt
    Reported-by: Benjamin Herrenschmidt
    Reviewed-by: Thomas Gleixner
    Signed-off-by: John Stultz
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • Pull more powerpc updates from Ben Herrenschmidt:
    "Here are some more powerpc bits for 3.17, essentially fixes.

    The biggest series, also aimed at -stable, is from Aneesh and is the
    result of weeks and weeks of debugging to find out why the heck our THP
    implementation was occasionally triggering multi-hit errors in our
    level 1 TLB. It ended up being a combination of issues including
    subtleties as to how we should invalidate those special 'MPSS' pages
    we use to allow the use of 16M pages inside 4K/64K "base page size"
    segments (you really have to love our MMU !)

    Another interesting one in the "OMG" category is the series from
    Michael adding memory barriers to spin_is_locked(). That's also the
    result of many days of debugging to figure out why the semaphore code
    would occasionally crash in ways that made no sense. It ended up
    being some creative lock stacking that was defeated by the fact that
    our locks allow a load inside the locked section to be re-ordered with
    the load of the lock value itself (I'm still of two minds about whether
    to kill that once and for all by putting a heavier barrier back into
    our lock implementation...). The fixes come with a long explanation
    in the cset comments, feel free to read it if you feel like having a
    headache today"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (25 commits)
    powerpc/thp: Add tracepoints to track hugepage invalidate
    powerpc/mm: Use read barrier when creating real_pte
    powerpc/thp: Use ACCESS_ONCE when loading pmdp
    powerpc/thp: Invalidate with vpn in loop
    powerpc/thp: Handle combo pages in invalidate
    powerpc/thp: Invalidate old 64K based hash page mapping before insert of 4k pte
    powerpc/thp: Don't recompute vsid and ssize in loop on invalidate
    powerpc/thp: Add write barrier after updating the valid bit
    powerpc: reorder per-cpu NUMA information's initialization
    powerpc/perf/hv-24x7: Use kmem_cache_free
    powerpc/pseries/hvcserver: Fix endian issue in hvcs_get_partner_info
    powerpc: Hard disable interrupts in xmon
    powerpc: remove duplicate definition of TEXASR_FS
    powerpc/pseries: Avoid deadlock on removing ddw
    powerpc/pseries: Failure on removing device node
    powerpc/boot: Use correct zlib types for comparison
    powerpc/powernv: Interface to register/unregister opal dump region
    printk: Add function to return log buffer address and size
    powerpc: Add POWER8 features to CPU_FTRS_POSSIBLE/ALWAYS
    powerpc/ppc476: Disable BTAC
    ...

    Linus Torvalds
     
  • Pull seccomp fix from James Morris.

    BUG(!spin_is_locked()) really doesn't work very well in UP
    configurations without any actual spinlock state. Which is very much
    why we have that "assert_spin_locked()" function for this.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
    seccomp: Replace BUG(!spin_is_locked()) with assert_spin_lock

    Linus Torvalds
     

13 Aug, 2014

1 commit

  • Platforms like IBM Power Systems support service-processor-assisted
    dump. It provides an interface to add memory regions to
    be captured when the system crashes.

    During initialization or at run time we can add kernel memory regions
    to be collected.

    Presently we don't have a way to get the log buffer base address
    and size. This patch adds support for returning the log buffer address
    and size.

    Signed-off-by: Vasant Hegde
    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: Andrew Morton

    Vasant Hegde
     

12 Aug, 2014

3 commits

  • * pm-sleep:
    PM / hibernate: avoid unsafe pages in e820 reserved regions

    * pm-cpufreq:
    cpufreq: arm_big_little: fix module license spec
    cpufreq: speedstep-smi: fix decimal printf specifiers
    cpufreq: OPP: Avoid sleeping while atomic
    cpufreq: cpu0: Do not print error message when deferring
    cpufreq: integrator: Use set_cpus_allowed_ptr

    * pm-cpuidle:
    cpuidle: menu: Lookup CPU runqueues less
    cpuidle: menu: Call nr_iowait_cpu less times
    cpuidle: menu: Use ktime_to_us instead of reinventing the wheel
    cpuidle: menu: Use shifts when calculating averages where possible

    Rafael J. Wysocki
     
  • Current upstream kernel hangs with mips and powerpc targets in
    uniprocessor mode if SECCOMP is configured.

    Bisect points to commit dbd952127d11 ("seccomp: introduce writer locking").
    Turns out that code such as
    BUG_ON(!spin_is_locked(&list_lock));
    can not be used in uniprocessor mode because spin_is_locked() always
    returns false in this configuration, and that assert_spin_locked()
    exists for that very purpose and must be used instead.
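
    The difference can be sketched with a toy model. Everything below
    (DEMO_SMP, struct demo_spinlock and the demo_* helpers) is hypothetical
    and only mimics the behaviour described above; it is not the kernel's
    spinlock implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: 0 mimics a uniprocessor build that keeps no lock state. */
#ifndef DEMO_SMP
#define DEMO_SMP 0
#endif

struct demo_spinlock {
#if DEMO_SMP
    bool locked;
#else
    char unused;   /* UP model: nothing records whether the lock is held */
#endif
};

static void demo_spin_lock(struct demo_spinlock *l)
{
#if DEMO_SMP
    l->locked = true;
#else
    (void)l;
#endif
}

/* On the UP model there is no state to inspect, so this is always false. */
static bool demo_spin_is_locked(struct demo_spinlock *l)
{
#if DEMO_SMP
    return l->locked;
#else
    (void)l;
    return false;
#endif
}

/* True when the assertion holds; trivially true on the UP model. */
static bool demo_assert_spin_locked(struct demo_spinlock *l)
{
#if DEMO_SMP
    return l->locked;
#else
    (void)l;
    return true;
#endif
}

/* A helper that must only be called with the lock held. */
static void demo_needs_lock(struct demo_spinlock *l)
{
    assert(demo_assert_spin_locked(l));   /* safe on both models */
}
```

    With DEMO_SMP set to 0, a check written as "fail unless
    demo_spin_is_locked()" would fire even right after demo_spin_lock(),
    while demo_needs_lock() passes.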

    Fixes: dbd952127d11 ("seccomp: introduce writer locking")
    Cc: Kees Cook
    Signed-off-by: Guenter Roeck
    Signed-off-by: Kees Cook

    Guenter Roeck
     
  • Pull vfs updates from Al Viro:
    "Stuff in here:

    - acct.c fixes and general rework of mnt_pin mechanism. That allows
    to go for delayed-mntput stuff, which will permit mntput() on deep
    stack without worrying about stack overflows - fs shutdown will
    happen on shallow stack. IOW, we can do Eric's umount-on-rmdir
    series without introducing tons of stack overflows on new mntput()
    call chains it introduces.
    - Bruce's d_splice_alias() patches
    - more Miklos' rename() stuff.
    - a couple of regression fixes (stable fodder, in the end of branch)
    and a fix for API idiocy in iov_iter.c.

    There definitely will be another pile, maybe even two. I'd like to
    get Eric's series in this time, but even if we miss it, it'll go right
    in the beginning of for-next in the next cycle - the tricky part of
    prereqs is in this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    fix copy_tree() regression
    __generic_file_write_iter(): fix handling of sync error after DIO
    switch iov_iter_get_pages() to passing maximal number of pages
    fs: mark __d_obtain_alias static
    dcache: d_splice_alias should detect loops
    exportfs: update Exporting documentation
    dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
    dcache: remove unused d_find_alias parameter
    dcache: d_obtain_alias callers don't all want DISCONNECTED
    dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
    dcache: d_splice_alias mustn't create directory aliases
    dcache: close d_move race in d_splice_alias
    dcache: move d_splice_alias
    namei: trivial fix to vfs_rename_dir comment
    VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
    cifs: support RENAME_NOREPLACE
    hostfs: support rename flags
    shmem: support RENAME_EXCHANGE
    shmem: support RENAME_NOREPLACE
    btrfs: add RENAME_NOREPLACE
    ...

    Linus Torvalds
     

11 Aug, 2014

1 commit

  • Pull module updates from Rusty Russell:
    "This finally applies the stricter sysfs perms checking we pulled out
    before last merge window. A few stragglers are fixed (thanks
    linux-next!)"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    arch/powerpc/platforms/powernv/opal-dump.c: fix world-writable sysfs files
    arch/powerpc/platforms/powernv/opal-elog.c: fix world-writable sysfs files
    drivers/video/fbdev/s3c2410fb.c: don't make debug world-writable.
    ARM: avoid ARM binutils leaking ELF local symbols
    scripts: modpost: Remove numeric suffix pattern matching
    scripts: modpost: fix compilation warning
    sysfs: disallow world-writable files.
    module: return bool from within_module*()
    module: add within_module() function
    modules: Fix build error in moduleloader.h

    Linus Torvalds
     

10 Aug, 2014

3 commits

  • Pull trace file read iterator fixes from Steven Rostedt:
    "This contains a fix for two long standing bugs. Both of which are
    rarely ever hit, and require the user to do something that users
    rarely do. It took a few special test cases to even trigger this bug,
    and one of them was just one test in the process of finishing up as
    another one started.

    Both bugs have to do with the ring buffer iterator rb_iter_peek(), but
    one is more indirect than the other.

    The first bug fix is simply an increase in the safety net loop counter.
    The counter makes sure that the rb_iter_peek() only iterates the
    number of times we expect it can, and no more. Well, there was one
    way it could iterate one more than we expected, and that caused the
    ring buffer to shutdown with a nasty warning. The fix was simply to
    up that counter by one.

    The other bug has to do with rb_iter_reset() (called by
    rb_iter_peek()). This happens when a user reads both the trace_pipe
    and trace files. The trace_pipe is a consuming read and does not use
    the ring buffer iterator, but the trace file is not a consuming read
    and does use the ring buffer iterator. When the trace file is being
    read, if it detects that a consuming read occurred, it resets the
    iterator and starts over. But the reset code that does this
    (rb_iter_reset()), checks if the reader_page is linked to the ring
    buffer or not, and will look into the ring buffer itself if it is not.
    This is wrong, as it should always try to read the reader page first.
    Not to mention, the code that looked into the ring buffer did it
    wrong, and used the header_page "read" offset to start reading on that
    page. That offset is bogus for pages in the writable ring buffer, and
    was corrupting the iterator, and it would start returning bogus
    events"

    * tag 'trace-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ring-buffer: Always reset iterator to reader page
    ring-buffer: Up rb_iter_peek() loop count to 3

    Linus Torvalds
     
  • Pull namespace updates from Eric Biederman:
    "This is a bunch of small changes built against 3.16-rc6. The most
    significant change for users is the first patch which makes setns
    dramatically faster by removing unneeded rcu handling.

    The next chunk of changes are so that "mount -o remount,.." will not
    allow the user namespace root to drop flags on a mount set by the
    system wide root. Aka this forces read-only mounts to stay read-only,
    no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
    mounts to stay no-exec and it prevents unprivileged users from messing
    with a mount's atime settings. I have included my test case as the
    last patch in this series so people performing backports can verify
    this change works correctly.
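
    The rule described in this paragraph can be sketched as a bitmask check.
    The flag names and the demo_remount_allowed() helper are hypothetical and
    chosen for this illustration; they are not the kernel's MNT_* flags or its
    do_remount() logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's values. */
enum {
    DEMO_RDONLY = 1 << 0,
    DEMO_NODEV  = 1 << 1,
    DEMO_NOSUID = 1 << 2,
    DEMO_NOEXEC = 1 << 3,
};

/*
 * cur:       flags set on the mount right now
 * locked:    subset of flags that were set by the system-wide root
 * requested: flags the remount asks for
 *
 * Returns true if the remount keeps every locked flag that is currently
 * set, i.e. it does not drop a flag it is not allowed to drop.
 */
static bool demo_remount_allowed(unsigned cur, unsigned locked,
                                 unsigned requested)
{
    unsigned must_keep = cur & locked;

    return (requested & must_keep) == must_keep;
}
```

    For example, with a read-only, no-dev mount whose read-only flag is
    locked, a request for DEMO_RDONLY alone is accepted, while a request for
    DEMO_NODEV alone is refused because it would drop the locked flag.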

    The next change fixes a bug in NFS that was discovered while auditing
    nsproxy users for the first optimization. Today you can oops the
    kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
    with pid namespaces. I rebased and fixed the build of the
    !CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given
    that no one to my knowledge bases anything on my tree fixing the typo
    in place seems more responsible than requiring a typo-fix to be
    backported as well.

    The last change is a small semantic cleanup introducing
    /proc/thread-self and pointing /proc/mounts and /proc/net at it. This
    prevents several kinds of problematic corner cases. It is a
    user-visible change so it has a minute chance of causing regressions
    so the change to /proc/mounts and /proc/net are individual one line
    commits that can be trivially reverted. Unfortunately I lost and
    could not find the email of the original reporter so he is not
    credited. From at least one perspective this change to /proc/net is a
    regression fix to allow pthread /proc/net uses that were broken by
    the introduction of the network namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
    proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
    proc: Implement /proc/thread-self to point at the directory of the current thread
    proc: Have net show up under /proc/<tgid>/task/<tid>
    NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
    mnt: Add tests for unprivileged remount cases that have found to be faulty
    mnt: Change the default remount atime from relatime to the existing value
    mnt: Correct permission checks in do_remount
    mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
    mnt: Only change user settable mount flags in remount
    namespaces: Use task_lock and not rcu to protect nsproxy

    Linus Torvalds
     
  • Pull arch signal handling cleanup from Richard Weinberger:
    "This patch series moves all remaining archs to the get_signal(),
    signal_setup_done() and sigsp() functions.

    Currently these archs use open coded variants of the said functions.
    Further, unused parameters get removed from get_signal_to_deliver(),
    tracehook_signal_handler() and signal_delivered().

    At the end of the day we save around 500 lines of code."

    * 'signal-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc: (43 commits)
    powerpc: Use sigsp()
    openrisc: Use sigsp()
    mn10300: Use sigsp()
    mips: Use sigsp()
    microblaze: Use sigsp()
    metag: Use sigsp()
    m68k: Use sigsp()
    m32r: Use sigsp()
    hexagon: Use sigsp()
    frv: Use sigsp()
    cris: Use sigsp()
    c6x: Use sigsp()
    blackfin: Use sigsp()
    avr32: Use sigsp()
    arm64: Use sigsp()
    arc: Use sigsp()
    sas_ss_flags: Remove nested ternary if
    Rip out get_signal_to_deliver()
    Clean up signal_delivered()
    tracehook_signal_handler: Remove sig, info, ka and regs
    ...

    Linus Torvalds
     

09 Aug, 2014

27 commits

  • This is the final piece of the puzzle of verifying kernel image signature
    during kexec_file_load() syscall.

    This patch calls into PE file routines to verify the signature of bzImage. If
    the signature is valid, kexec_file_load() succeeds; otherwise it fails.

    Two new config options have been introduced. First one is
    CONFIG_KEXEC_VERIFY_SIG. This option enforces that kernel has to be
    validly signed otherwise kernel load will fail. If this option is not
    set, no signature verification will be done. Only exception will be when
    secureboot is enabled. In that case signature verification should be
    automatically enforced when secureboot is enabled. But that will happen
    when secureboot patches are merged.

    Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option
    enables signature verification support on bzImage. If this option is not
    set and previous one is set, kernel image loading will fail because kernel
    does not have support to verify signature of bzImage.
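
    The interaction of the two options can be summarised as a small decision
    function. This is only a sketch of the behaviour stated above: the names
    are hypothetical, and the secureboot exception is left out because the
    text says it is not merged yet.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_result { DEMO_LOAD_OK, DEMO_LOAD_FAIL };

/*
 * verify_sig:  models CONFIG_KEXEC_VERIFY_SIG being set
 * bzimage_sig: models CONFIG_KEXEC_BZIMAGE_VERIFY_SIG being set
 * sig_valid:   whether the image's signature would verify
 */
static enum demo_result demo_kexec_load(bool verify_sig, bool bzimage_sig,
                                        bool sig_valid)
{
    if (!verify_sig)
        return DEMO_LOAD_OK;      /* no signature verification is done */
    if (!bzimage_sig)
        return DEMO_LOAD_FAIL;    /* enforced, but no bzImage verifier built */
    return sig_valid ? DEMO_LOAD_OK : DEMO_LOAD_FAIL;
}
```

    The second branch is the case called out in the text: enforcement is on,
    but the kernel has no way to verify a bzImage signature, so loading fails
    regardless of the image.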

    I tested these patches with both "pesign" and "sbsign" signed bzImages.

    I used signing_key.priv key and signing_key.x509 cert for signing as
    generated during kernel build process (if module signing is enabled).

    Used following method to sign bzImage.

    pesign
    ======
    - Convert DER format cert to PEM format cert
    openssl x509 -in signing_key.x509 -inform DER -out signing_key.x509.PEM -outform PEM

    - Generate a .p12 file from existing cert and private key file
    openssl pkcs12 -export -out kernel-key.p12 -inkey signing_key.priv -in signing_key.x509.PEM

    - Import .p12 file into pesign db
    pk12util -i /tmp/kernel-key.p12 -d /etc/pki/pesign

    - Sign bzImage
    pesign -i /boot/vmlinuz-3.16.0-rc3+ -o /boot/vmlinuz-3.16.0-rc3+.signed.pesign -c "Glacier signing key - Magrathea" -s

    sbsign
    ======
    sbsign --key signing_key.priv --cert signing_key.x509.PEM --output /boot/vmlinuz-3.16.0-rc3+.signed.sbsign /boot/vmlinuz-3.16.0-rc3+

    Patch details:

    Well, all the hard work is done in previous patches. Now the bzImage loader
    just has to call into that code and verify whether the bzImage signature is
    valid or not.

    Also create two config options. First one is CONFIG_KEXEC_VERIFY_SIG.
    This option enforces that kernel has to be validly signed otherwise kernel
    load will fail. If this option is not set, no signature verification will
    be done. Only exception will be when secureboot is enabled. In that case
    signature verification should be automatically enforced when secureboot is
    enabled. But that will happen when secureboot patches are merged.

    Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option
    enables signature verification support on bzImage. If this option is not
    set and previous one is set, kernel image loading will fail because kernel
    does not have support to verify signature of bzImage.

    Signed-off-by: Vivek Goyal
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Cc: Matt Fleming
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • This patch adds support for loading a kexec on panic (kdump) kernel using
    the new system call.

    It prepares ELF headers for memory areas to be dumped and for saved cpu
    registers. Also prepares the memory map for second kernel and limits its
    boot to reserved areas only.

    Signed-off-by: Vivek Goyal
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • This is loader specific code which can load bzImage and set it up for
    64bit entry. This does not take care of 32bit entry or real mode entry.

    32bit mode entry can be implemented if somebody needs it.

    Signed-off-by: Vivek Goyal
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • Load purgatory code in RAM and relocate it based on the location.
    Relocation code has been inspired by module relocation code and purgatory
    relocation code in kexec-tools.

    Also compute the checksums of loaded kexec segments and store them in
    purgatory.

    Arch independent code provides this functionality so that arch dependent
    bootloaders can make use of it.

    Helper functions are provided to get/set symbol values in purgatory which
    are used by bootloaders later to set things like stack and entry point of
    second kernel etc.

    Signed-off-by: Vivek Goyal
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • Previous patch provided the interface definition and this patch provides
    the implementation of the new syscall.

    Previously segment list was prepared in user space. Now user space just
    passes kernel fd, initrd fd and command line and kernel will create a
    segment list internally.

    This patch contains the generic part of the code. Actual segment preparation
    and loading is done by the arch and image specific loader, which comes in
    the next patch.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vivek Goyal
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • This is the new syscall kexec_file_load() declaration/interface. I have
    reserved the syscall number only for x86_64 so far. Other architectures
    (including i386) can reserve syscall number when they enable the support
    for this new syscall.

    Signed-off-by: Vivek Goyal
    Cc: Michael Kerrisk
    Cc: Borislav Petkov
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • I have added two more functions to walk through resources.

    Currently walk_system_ram_range() deals with pfn and /proc/iomem can
    contain partial pages. By dealing in pfn, callback function loses the
    info that last page of a memory range is a partial page and not the full
    page. So I implemented walk_system_ram_res() which returns u64 values to
    callback functions and now it properly returns the start and end address.
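
    To see why a pfn-only view loses the partial page, consider the sketch
    below. It assumes 4 KiB pages and uses hypothetical helper names; it
    illustrates the arithmetic only and is not the kernel's
    walk_system_ram_range() code.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12                          /* assumes 4 KiB pages */
#define DEMO_PAGE_SIZE  (UINT64_C(1) << DEMO_PAGE_SHIFT)

/* Whole pages fully contained in the inclusive byte range [start, end]. */
static uint64_t demo_whole_pages(uint64_t start, uint64_t end)
{
    uint64_t first_pfn = (start + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
    uint64_t end_pfn = (end + 1) >> DEMO_PAGE_SHIFT;    /* exclusive */

    return end_pfn > first_pfn ? end_pfn - first_pfn : 0;
}

/* Bytes of the range that a pfn-only description cannot express. */
static uint64_t demo_lost_bytes(uint64_t start, uint64_t end)
{
    uint64_t total = end - start + 1;

    assert(end >= start);
    return total - demo_whole_pages(start, end) * DEMO_PAGE_SIZE;
}
```

    For the range 0x1000 to 0x37ff, two whole pages are reported and the
    trailing 2048 bytes of the partial third page are invisible to a callback
    that only sees pfns. Passing byte addresses as u64 values keeps that
    information.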

    walk_system_ram_range() uses find_next_system_ram() to find the next ram
    resource. This in turn only travels through siblings of top level child
    and does not traverse all the nodes of the resource tree. I also
    need another function where I can walk through all the resources, for
    example to figure out where the "GART" aperture is or where ACPI memory
    is.

    So I wrote another function walk_iomem_res() which walks through all
    /proc/iomem resources and returns matches as asked by caller. Caller can
    specify "name" of resource, start and end and flags.

    Got rid of find_next_system_ram_res() and instead implemented more generic
    find_next_iomem_res() which can be used to traverse top level children
    only based on an argument.

    Signed-off-by: Vivek Goyal
    Cc: Yinghai Lu
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • kimage_normal_alloc() and kimage_crash_alloc() do a lot of similar
    things and differ only a little. So instead of having two separate
    functions create a common function kimage_alloc_init() and pass it the
    "flags" argument which tells whether it is normal kexec or kexec_on_panic.
    And this function should be able to deal with both the cases.

    This consolidation also helps later where we can use a common function
    kimage_file_alloc_init() to handle normal and crash cases for new file
    based kexec syscall.

    Signed-off-by: Vivek Goyal
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • Previously do_kimage_alloc() will allocate a kimage structure, copy
    segment list from user space and then do the segment list sanity
    verification.

    Break down this function in 3 parts. do_kimage_alloc_init() to do actual
    allocation and basic initialization of kimage structure.
    copy_user_segment_list() to copy segment list from user space and
    sanity_check_segment_list() to verify the sanity of segment list as passed
    by user space.

    In later patches, I need to only allocate kimage and not copy segment list
    from user space. So breaking down in smaller functions enables re-use of
    code at other places.

    Signed-off-by: Vivek Goyal
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • Let's use the more common "unusable".

    This patch was originally written and posted by Boris. I am including it
    in this patch series.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Vivek Goyal
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • This patch series does not do kernel signature verification yet. I plan
    to post another patch series for that. Since distributions are already
    signing PE/COFF bzImage with PKCS7 signatures, I plan to parse and verify
    those signatures.

    Primary goal of this patchset is to prepare groundwork so that kernel
    image can be signed and signatures be verified during kexec load. This
    should help with two things.

    - It should allow kexec/kdump on secureboot enabled machines.

    - In general it can help even without secureboot. By being able to verify
    kernel image signature in kexec, it should help with avoiding module
    signing restrictions. Matthew Garrett showed how to boot into a custom
    kernel, modify first kernel's memory and then jump back to old kernel and
    bypass any policy one wants to.

    This patch (of 15):

    Kexec wants to use bin2c and it wants to use it really early in the build
    process. See arch/x86/purgatory/ code in later patches.

    So move bin2c in scripts/basic so that it can be built very early and
    be usable by arch/x86/purgatory/

    Signed-off-by: Vivek Goyal
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
    that you can pass to mmap(). It can support sealing and avoids any
    connection to user-visible mount-points. Thus, it's not subject to quotas
    on mounted file-systems, but can be used like malloc()'ed memory, but with
    a file-descriptor to it.

    memfd_create() returns the raw shmem file, so calls like ftruncate() can
    be used to modify the underlying inode. Also calls like fstat() will
    return proper information and mark the file as regular file. If you want
    sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
    supported (like on all other regular files).

    Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
    subject to a filesystem size limit. It is still properly accounted to
    memcg limits, though, and to the same overcommit or no-overcommit
    accounting as all user memory.
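
    A minimal usage sketch follows, assuming a Linux system whose kernel and
    headers provide the memfd_create syscall number. It invokes the raw
    syscall so it does not rely on a libc wrapper; the helper name
    demo_memfd_with_message() and the file name "demo" are made up for this
    example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create an anonymous memory file of `size` bytes and copy `msg` into it
 * through a shared mapping. Returns the file descriptor, or -1 on failure.
 */
static int demo_memfd_with_message(const char *msg, size_t size)
{
    char *p;
    int fd;

    assert(strlen(msg) < size);

    fd = (int)syscall(SYS_memfd_create, "demo", 0);
    if (fd < 0)
        return -1;

    /* ftruncate() works because the fd refers to the raw shmem file. */
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }

    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    strcpy(p, msg);      /* write through the mapping */
    munmap(p, size);     /* the data stays reachable through the fd */
    return fd;
}
```

    The returned descriptor can then be inspected with fstat(), read with
    pread(), mapped again, or passed to another process, without any
    filesystem path being involved.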

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     
  • This patch (of 6):

    The i_mmap_writable field counts existing writable mappings of an
    address_space. To allow drivers to prevent new writable mappings, make
    this counter signed and prevent new writable mappings if it is negative.
    This is modelled after i_writecount and DENYWRITE.

    This will be required by the shmem-sealing infrastructure to prevent any
    new writable mappings after the WRITE seal has been set. In case there
    exists a writable mapping, this operation will fail with EBUSY.

    Note that we rely on the fact that iff you already own a writable mapping,
    you can increase the counter without using the helpers. This is the same
    thing we do for i_writecount.
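
    The counter semantics can be modelled in userspace as follows. This is
    only a sketch: the model_* names are hypothetical, C11 atomics stand in
    for the kernel's atomic helpers, and the -EPERM return for a denied
    mapping is an assumption (the text above only specifies EBUSY for the
    deny operation):

```c
#include <errno.h>
#include <stdatomic.h>

/*
 * Userspace model of the signed counter:
 *   > 0  number of existing writable mappings
 *   < 0  new writable mappings are denied
 *   = 0  neither
 */
struct mapping_model {
    atomic_int writable;
};

/* Take a writable-mapping reference unless the counter is negative. */
int model_map_writable(struct mapping_model *m)
{
    int v = atomic_load(&m->writable);

    do {
        if (v < 0)
            return -EPERM;      /* assumed error code */
    } while (!atomic_compare_exchange_weak(&m->writable, &v, v + 1));
    return 0;
}

/* Drop a writable-mapping reference. */
void model_unmap_writable(struct mapping_model *m)
{
    atomic_fetch_sub(&m->writable, 1);
}

/* Deny new writable mappings; fails if any writable mapping exists. */
int model_deny_writable(struct mapping_model *m)
{
    int v = atomic_load(&m->writable);

    do {
        if (v > 0)
            return -EBUSY;
    } while (!atomic_compare_exchange_weak(&m->writable, &v, v - 1));
    return 0;
}

/* Undo a successful deny. */
void model_allow_writable(struct mapping_model *m)
{
    atomic_fetch_add(&m->writable, 1);
}
```

    The two conditional operations mirror each other: one only increments
    a non-negative counter, the other only decrements a non-positive one.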

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     
  • Signed-off-by: Ionut Alexa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ionut Alexa
     
  • This is a small set of patches our team has had kicking around for a few
    versions internally that fixes tasks getting hung on shm_exit when there
    are many threads hammering it at once.

    Anton wrote a simple test to cause the issue:

    http://ozlabs.org/~anton/junkcode/bust_shm_exit.c

    Before applying this patchset, this test code will cause either hanging
    tracebacks or pthread out of memory errors.

    After this patchset, it will still produce output like:

    root@somehost:~# ./bust_shm_exit 1024 160
    ...
    INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
    INFO: Stall ended before state dump start
    ...

    But the task will continue to run along happily, so we consider this an
    improvement over hanging, even if it's a bit noisy.

    This patch (of 3):

    exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
    walks every shared memory segment in the namespace. Thus the amount of
    work is related to the number of shm segments in the namespace, not the
    number of segments that might need to be cleaned.

    In addition, this occurs after the task has been notified the thread has
    exited, so the number of tasks waiting for the ns shm rwsem can grow
    without bound until memory is exhausted.

    Add a list to the task struct of all shmids allocated by this task. Init
    the list head in copy_process. Use the ns->rwsem for locking. Add
    segments after id is added, remove before removing from id.

    On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
    to handling of semaphore undo.

    I chose a define for the init sequence since it's a simple list init,
    otherwise it would require a function call to avoid include loops between
    the semaphore code and the task struct. Converting the list_del to
    list_del_init for the unshare cases would remove the exit followed by
    init, but I left it to blow up if not initialized.
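
    The idea of the per-task list can be sketched in plain C. This is a
    simplified model, not the kernel code: the struct and function names
    are hypothetical, a hand-rolled circular list stands in for list_head,
    and the ns->rwsem locking is left out entirely:

```c
#include <stddef.h>

/* One shared memory segment, linked into its creator's list. */
struct seg {
    int orphaned;
    struct seg *prev, *next;
};

/* Per-task state: a circular list with a sentinel head. */
struct task_model {
    struct seg head;
};

/* Start with an empty list (the sentinel points at itself). */
void task_init(struct task_model *t)
{
    t->head.prev = t->head.next = &t->head;
}

/* Record a newly created segment on the creating task's list. */
void task_add_seg(struct task_model *t, struct seg *s)
{
    s->orphaned = 0;
    s->next = t->head.next;
    s->prev = &t->head;
    t->head.next->prev = s;
    t->head.next = s;
}

/*
 * On exit, walk only the segments this task created and orphan them.
 * Returns how many segments were visited, which no longer depends on
 * the total number of segments in the namespace.
 */
size_t task_exit_shm(struct task_model *t)
{
    size_t visited = 0;
    struct seg *s = t->head.next;

    while (s != &t->head) {
        struct seg *next = s->next;

        s->orphaned = 1;
        s->prev = s->next = s;   /* unlink from the task's list */
        s = next;
        visited++;
    }
    task_init(t);                /* leave the list empty */
    return visited;
}
```

    With two tasks owning two and three segments respectively, the exit of
    the first task touches two segments and leaves the other three alone.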

    Signed-off-by: Milton Miller
    Signed-off-by: Jack Miller
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Miller
     
  • This taint flag will be set if the system has ever entered a softlockup
    state. Similar to TAINT_WARN it is useful to know whether or not the
    system has been in a softlockup state when debugging.
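
    From userspace, the taint mask can be inspected through
    /proc/sys/kernel/tainted. The sketch below assumes the new flag uses
    bit 14 of that mask; the bit index and the helper names are
    assumptions to be checked against the kernel headers in use:

```c
#include <stdio.h>

/* Assumed bit index of the softlockup taint flag. */
#define TAINT_SOFTLOCKUP_BIT 14

/* Returns 1 if the softlockup bit is set in `mask`, 0 otherwise. */
int mask_has_softlockup(unsigned long mask)
{
    return (int)((mask >> TAINT_SOFTLOCKUP_BIT) & 1UL);
}

/*
 * Returns 1 if the running kernel reports a softlockup taint, 0 if it
 * does not, and -1 if the taint mask cannot be read.
 */
int system_had_softlockup(void)
{
    unsigned long mask;
    int ok;
    FILE *f = fopen("/proc/sys/kernel/tainted", "r");

    if (!f)
        return -1;
    ok = fscanf(f, "%lu", &mask) == 1;
    fclose(f);
    return ok ? mask_has_softlockup(mask) : -1;
}
```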

    [akpm@linux-foundation.org: apply the taint before calling panic()]
    Signed-off-by: Josh Hunt
    Cc: Jason Baron
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Hunt
     
  • This fixes the checkpatch warning:

    WARNING: debugfs_remove(NULL) is safe this check is probably not required

    Signed-off-by: Fabian Frederick
    Cc: Peter Oberparleiter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • It's only used in fork.c:mm_init().

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • If a forking process has a thread calling (un)mmap (silly but still),
    the child process may have some of its mm's vm usage counters (total_vm
    and friends) screwed up, because currently they are copied from oldmm
    w/o holding any locks (memcpy in dup_mm).

    This patch moves the counters initialization to dup_mmap() to be called
    under oldmm->mmap_sem, which eliminates any possibility of race.

    Signed-off-by: Vladimir Davydov
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • mm->pinned_vm counts pages of mm's address space that were permanently
    pinned in memory by increasing their reference counter. The counter was
    introduced by commit bc3e53f682d9 ("mm: distinguish between mlocked and
    pinned pages"), while before it locked_vm had been used for such pages.

    Obviously, we should reset the counter on fork if !CLONE_VM, just like
    we do with locked_vm, but currently we don't. Let's fix it.

    This patch will fix the contents of /proc/pid/status:VmPin.
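
    One way to observe the value is to read the VmPin line from a status
    file. A small sketch, assuming the usual "VmPin: <n> kB" layout; the
    helper names are hypothetical:

```c
#include <stdio.h>

/*
 * Parse a single status line. Returns the value in kB if the line is
 * the VmPin line, -1 otherwise.
 */
long parse_vmpin_kb(const char *line)
{
    long kb;

    /* Whitespace in the format matches any run of spaces or tabs. */
    if (sscanf(line, "VmPin: %ld kB", &kb) == 1)
        return kb;
    return -1;
}

/* Scan a status file such as /proc/self/status for the VmPin line. */
long read_vmpin_kb(const char *path)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (kb < 0 && fgets(line, sizeof(line), f))
        kb = parse_vmpin_kb(line);
    fclose(f);
    return kb;
}
```

    Calling read_vmpin_kb("/proc/self/status") in a freshly forked child
    that has pinned nothing itself should then report 0 with this fix.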

    ib_umem_get[infiniband] and perf_mmap still check pinned_vm against
    RLIMIT_MEMLOCK. It's left from the times when pinned pages were accounted
    under locked_vm, but today it looks wrong. It isn't clear how we should
    deal with it.

    We still have some drivers accounting pinned pages under mm->locked_vm -
    this is what commit bc3e53f682d9 was fighting against. It's
    infiniband/usnic and vfio.

    Signed-off-by: Vladimir Davydov
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Christoph Lameter
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hal Rosenstock
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • mm initialization on fork/exec is spread all over the place, which makes
    the code look inconsistent.

    We have mm_init(), which is supposed to init/nullify mm's internals, but
    it doesn't init all the fields it should:

    - on fork ->mmap,mm_rb,vmacache_seqnum,map_count,mm_cpumask,locked_vm
    are zeroed in dup_mmap();

    - on fork ->pmd_huge_pte is zeroed in dup_mm(), immediately before
    calling mm_init();

    - ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is
    called before mm_init() on both fork and exec;

    - ->context is initialized by init_new_context(), which is called after
    mm_init() on both fork and exec;

    Let's consolidate all the initializations in mm_init() to make the code
    look cleaner.

    Signed-off-by: Vladimir Davydov
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • proc_uid_seq_operations, proc_gid_seq_operations and
    proc_projid_seq_operations are only called in proc_id_map_open with
    seq_open as const struct seq_operations so we can constify the 3
    structures and update proc_id_map_open prototype.

       text    data     bss     dec     hex filename
       6817     404    1984    9205    23f5 kernel/user_namespace.o-before
       6913     308    1984    9205    23f5 kernel/user_namespace.o-after
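
    The 96 bytes that move from data to text are consistent with three
    tables of four pointers each on a 64-bit build, since size(1) counts
    read-only data under text. The pattern itself can be sketched with a
    hypothetical operations table (demo_ops and its members are
    illustrative and not the real seq_operations layout):

```c
/* Hypothetical operations table, loosely shaped like an iterator API. */
struct demo_ops {
    int (*start)(int pos);
    int (*next)(int pos);
};

static int demo_start(int pos)
{
    return pos;
}

static int demo_next(int pos)
{
    return pos + 1;
}

/* Declared const, so the table can live in read-only storage. */
static const struct demo_ops demo_uid_ops = {
    .start = demo_start,
    .next  = demo_next,
};

/*
 * The consumer takes a pointer to const, so it cannot modify the table
 * and const tables can be passed without a cast.
 */
int demo_walk(const struct demo_ops *ops, int steps)
{
    int pos = ops->start(0);

    for (int i = 0; i < steps; i++)
        pos = ops->next(pos);
    return pos;
}
```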

    Signed-off-by: Fabian Frederick
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Fixed coding style warnings and errors.

    Signed-off-by: Ionut Alexa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ionut Alexa
     
  • - Add pr_fmt
    - Coalesce formats
    - Use current pr_foo() functions instead of printk
    - Remove unnecessary "failed" display (already in log level).
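
    The pr_fmt mechanism can be illustrated with a userspace stand-in.
    The prefix string, the message text and the helper below are
    hypothetical; only the macro pattern matches what the kernel's
    pr_*() helpers do with pr_fmt:

```c
#include <stdio.h>

/* Prepended to every format string that goes through pr_fmt(). */
#define pr_fmt(fmt) "kprobes: " fmt

/* Userspace stand-in for pr_err(); the kernel version also sets a log level. */
#define pr_err(fmt, ...) fprintf(stderr, pr_fmt(fmt), ##__VA_ARGS__)

/*
 * Hypothetical helper that formats one prefixed message into `buf`.
 * Returns the snprintf() result.
 */
int format_register_failure(char *buf, size_t len, int err)
{
    return snprintf(buf, len, pr_fmt("register_kprobe returned %d\n"), err);
}
```

    Because the prefix is joined to the format by string literal
    concatenation, each call site only carries the message itself.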

    Signed-off-by: Fabian Frederick
    Cc: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Cc: "David S. Miller"
    Cc: Masami Hiramatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • __sprint_symbol() should restore original address when kallsyms_lookup()
    failed to find a symbol. It's reported when dumpstack shows an address in
    a dynamically allocated trampoline for ftrace.

    [ 1314.612287] [] dump_stack+0x45/0x56
    [ 1314.612290] [] ? meminfo_proc_open+0x30/0x30
    [ 1314.612293] [] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
    [ 1314.612306] [] 0xffffffffa00160c3

    You can see a difference in the hex address - c4 and c3. Fix it.
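
    The shape of the fix can be sketched with a userspace model. The
    lookup function, the address range and the demo_* names are
    hypothetical; the point is only that the adjustment applied before
    the lookup is undone when no symbol is found:

```c
#include <stdio.h>

/* Hypothetical lookup: knows one symbol covering [0x1000, 0x1100). */
static const char *demo_lookup(unsigned long addr, unsigned long *offset)
{
    if (addr >= 0x1000 && addr < 0x1100) {
        *offset = addr - 0x1000;
        return "demo_func";
    }
    return NULL;
}

/*
 * symbol_offset is the adjustment applied to the address before the
 * lookup (for example -1 for return addresses in a backtrace).
 */
int demo_sprint_symbol(char *buf, size_t len, unsigned long addr,
                       int symbol_offset)
{
    unsigned long offset;
    const char *name;

    addr += symbol_offset;
    name = demo_lookup(addr, &offset);
    if (!name)  /* restore the original address before printing it */
        return snprintf(buf, len, "0x%lx", addr - symbol_offset);

    offset -= symbol_offset;
    return snprintf(buf, len, "%s+0x%lx", name, offset);
}
```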

    Signed-off-by: Namhyung Kim
    Reported-by: Masami Hiramatsu
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Josh Poimboeuf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • It's not used anywhere today, so let's remove it.

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Pages are now uncharged at release time, and all sources of batched
    uncharges operate on lists of pages. Directly use those lists, and
    get rid of the per-task batching state.

    This also batches statistics accounting, in addition to the res
    counter charges, to reduce IRQ-disabling and re-enabling.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner