06 May, 2013

3 commits

  • Pull 'full dynticks' support from Ingo Molnar:
    "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
    kernel feature to the timer and scheduler subsystems: 'full dynticks',
    or CONFIG_NO_HZ_FULL=y.

    This feature extends the nohz variable-size timer tick feature from
    idle to busy CPUs (running at most one task) as well, potentially
    reducing the number of timer interrupts significantly.

    This feature got motivated by real-time folks and the -rt tree, but
    the general utility and motivation of full-dynticks runs wider than
    that:

    - HPC workloads get faster: CPUs running a single task should be able
    to utilize a maximum amount of CPU power. A periodic timer tick at
    HZ=1000 can cause a constant overhead of up to 1.0%. This feature
    removes that overhead - and speeds up the system by 0.5%-1.0% on
    typical distro configs even on modern systems.

    - Real-time workload latency reduction: CPUs running critical tasks
    should experience as little jitter as possible. The last remaining
    source of kernel-related jitter was the periodic timer tick.

    - A single task executing on a CPU is a pretty common situation,
    especially with an increasing number of cores/CPUs, so this feature
    helps desktop and mobile workloads as well.
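
    As a rough check of the "up to 1.0%" figure above, the overhead is simply
    the tick rate multiplied by the time spent per tick. The sketch below
    assumes a hypothetical cost of 10 microseconds per tick (a value chosen
    only because it reproduces the quoted 1.0% at HZ=1000, not a measurement);
    the function name is illustrative as well.

```c
#include <stdint.h>

/*
 * Share of CPU time spent handling timer ticks, in parts per million.
 * tick_cost_ns is an assumed per-tick cost, not a measured value.
 */
static uint64_t tick_overhead_ppm(uint64_t hz, uint64_t tick_cost_ns)
{
    /* hz ticks/s * ns/tick = ns of tick work per second of wall time.
     * One second is 1e9 ns, so dividing by 1000 yields parts per million. */
    return hz * tick_cost_ns / 1000;
}
```

    Under that assumption, HZ=1000 gives 10000 ppm (1.0%), while a 1 Hz
    residual tick gives 10 ppm.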

    The cost of the feature is mainly related to increased timer
    reprogramming overhead when a CPU switches its tick period, and thus
    slightly longer to-idle and from-idle latency.

    Configuration-wise a third mode of operation is added to the existing
    two NOHZ kconfig modes:

    - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
    as a config option. This is the traditional Linux periodic tick
    design: there's a HZ tick going on all the time, regardless of
    whether a CPU is idle or not.

    - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
    periodic tick when a CPU enters idle mode.

    - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
    tick when a CPU is idle, also slows the tick down to 1 Hz (one
    timer interrupt per second) when only a single task is running on a
    CPU.

    The .config behavior is compatible: existing !CONFIG_NO_HZ and
    CONFIG_NO_HZ=y settings get translated to the new values, without the
    user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
    default.

    This feature is based on a lot of infrastructure work that has been
    steadily going upstream in the last 2-3 cycles: related RCU support
    and non-periodic cputime support in particular is upstream already.

    This tree adds the final pieces and activates the feature. The pull
    request is marked RFC because:

    - it's marked 64-bit only at the moment - the 32-bit support patch is
    small but did not get ready in time.

    - it has a number of fresh commits that came in after the merge
    window. The overwhelming majority of commits are from before the
    merge window, but still some aspects of the tree are fresh and so I
    marked it RFC.

    - it's a pretty wide-reaching feature with lots of effects - and
    while the components have been in testing for some time, the full
    combination is still not very widely used. That it's default-off
    should reduce its regression abilities and obviously there are no
    known regressions with CONFIG_NO_HZ_FULL=y enabled either.

    - the feature is not completely idempotent: there is no 100%
    equivalent replacement for a periodic scheduler/timer tick. In
    particular there's ongoing work to map out and reduce its effects
    on scheduler load-balancing and statistics. This should not impact
    correctness though, there are no known regressions related to this
    feature at this point.

    - it's a pretty ambitious feature that with time will likely be
    enabled by most Linux distros, and we'd like you to make input on
    its design/implementation, if you dislike some aspect we missed.
    Without flaming us to crisp! :-)

    Future plans:

    - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
    the periodic tick altogether when there's a single busy task on a
    CPU. We'd first like 1 Hz to be exposed more widely before we go
    for the 0 Hz target though.

    - once we reach 0 Hz we can remove the periodic tick assumption from
    nr_running>=2 as well, by essentially interrupting busy tasks only
    as frequently as the sched_latency constraints require us to do -
    once every 4-40 msecs, depending on nr_running.

    I am personally leaning towards biting the bullet and doing this in
    v3.10, like the -rt tree this effort has been going on for too long -
    but the final word is up to you as usual.

    More technical details can be found in Documentation/timers/NO_HZ.txt"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    sched: Keep at least 1 tick per second for active dynticks tasks
    rcu: Fix full dynticks' dependency on wide RCU nocb mode
    nohz: Protect smp_processor_id() in tick_nohz_task_switch()
    nohz_full: Add documentation.
    cputime_nsecs: use math64.h for nsec resolution conversion helpers
    nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
    nohz: Reduce overhead under high-freq idling patterns
    nohz: Remove full dynticks' superfluous dependency on RCU tree
    nohz: Fix unavailable tick_stop tracepoint in dynticks idle
    nohz: Add basic tracing
    nohz: Select wide RCU nocb for full dynticks
    nohz: Disable the tick when irq resume in full dynticks CPU
    nohz: Re-evaluate the tick for the new task after a context switch
    nohz: Prepare to stop the tick on irq exit
    nohz: Implement full dynticks kick
    nohz: Re-evaluate the tick from the scheduler IPI
    sched: New helper to prevent from stopping the tick in full dynticks
    sched: Kick full dynticks CPU that have more than one task enqueued.
    perf: New helper to prevent full dynticks CPUs from stopping tick
    perf: Kick full dynticks CPU if events rotation is needed
    ...

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Misc fixes plus a small hw-enablement patch for Intel IB model 58
    uncore events"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel/lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL
    perf/x86/intel/lbr: Fix LBR filter
    perf/x86: Blacklist all MEM_*_RETIRED events for Ivy Bridge
    perf: Fix vmalloc ring buffer pages handling
    perf/x86/intel: Fix unintended variable name reuse
    perf/x86/intel: Add support for IvyBridge model 58 Uncore
    perf/x86/intel: Fix typo in perf_event_intel_uncore.c
    x86: Eliminate irq_mis_count counted in arch_irq_stat

    Linus Torvalds
     
  • Pull module updates from Rusty Russell:
    "We get rid of the general module prefix confusion with a binary config
    option, fix a remove/insert race which Never Happens, and (my
    favorite) handle the case when we have too many modules for a single
    commandline. Seriously, the kernel is full, please go away!"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    modpost: fix unwanted VMLINUX_SYMBOL_STR expansion
    X.509: Support parse long form of length octets in Authority Key Identifier
    module: don't unlink the module until we've removed all exposure.
    kernel: kallsyms: memory override issue, need check destination buffer length
    MODSIGN: do not send garbage to stderr when enabling modules signature
    modpost: handle huge numbers of modules.
    modpost: add -T option to read module names from file/stdin.
    modpost: minor cleanup.
    genksyms: pass symbol-prefix instead of arch
    module: fix symbol versioning with symbol prefixes
    CONFIG_SYMBOL_PREFIX: cleanup.

    Linus Torvalds
     

05 May, 2013

3 commits

  • Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • Pull second round of VFS updates from Al Viro:
    "Assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    xtensa simdisk: fix braino in "xtensa simdisk: switch to proc_create_data()"
    hostfs: use kmalloc instead of kzalloc
    hostfs: move HOSTFS_SUPER_MAGIC to
    hostfs: remove "will unlock" comment
    vfs: use list_move instead of list_del/list_add
    proc_devtree: Replace include linux/module.h with linux/export.h
    create_mnt_ns: unidiomatic use of list_add()
    fs: remove dentry_lru_prune()
    Removed unused typedef to avoid "unused local typedef" warnings.
    kill fs/read_write.h
    fs: Fix hang with BSD accounting on frozen filesystem
    sun3_scsi: add ->show_info()
    nubus: Kill nubus_proc_detach_device()
    more mode_t whack-a-mole...
    do_coredump(): don't wait for thaw if coredump has already been interrupted
    do_mount(): fix a leak introduced in 3.9 ("mount: consolidate permission checks")

    Linus Torvalds
     
  • When BSD process accounting is enabled and logs information to a
    filesystem which gets frozen, system easily becomes unusable because
    each attempt to account process information blocks. Thus e.g. every task
    gets blocked in exit.

    It seems better to drop accounting information (which can already happen
    when filesystem is running out of space) instead of locking system up.
    So we just skip the write if the filesystem is frozen.

    Reported-by: Nikola Ciprich
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

04 May, 2013

2 commits

  • The scheduler doesn't yet fully support environments
    with a single task running without a periodic tick.

    In order to ensure we still maintain the duties of scheduler_tick(),
    keep at least 1 tick per second.

    This makes sure that we keep the progression of various scheduler
    accounting and background maintenance even with a very low granularity.
    Examples include cpu load, sched average, CFS entity vruntime,
    avenrun and events such as load balancing, amongst other details
    handled in sched_class::task_tick().

    This limitation will be removed in the future once we get
    these individual items to work in full dynticks CPUs.

    Suggested-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Christoph Lameter
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • Commit 0637e029392386e6996f5d6574aadccee8315efa
    ("nohz: Select wide RCU nocb for full dynticks") intended
    to force CONFIG_RCU_NOCB_CPU_ALL=y when full dynticks is
    enabled.

    However this option is part of a choice menu and Kconfig's
    "select" instruction has no effect on such targets.

    Fix this by using reverse dependencies on the targets we
    don't want instead.

    Reviewed-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Cc: Christoph Lameter
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

03 May, 2013

2 commits

  • Pull drm updates from Dave Airlie:
    "This is the main drm pull request for 3.10.

    Weird bits:
    - OMAP drm changes required OMAP dss changes, in drivers/video, so I
    took them in here.
    - one more fbcon fix for font handover
    - VT switch avoidance in pm code
    - scatterlist helpers for gpu drivers - have acks from akpm

    Highlights:
    - qxl kms driver - driver for the spice qxl virtual GPU

    Nouveau:
    - fermi/kepler VRAM compression
    - GK110/nvf0 modesetting support.

    Tegra:
    - host1x core merged with 2D engine support

    i915:
    - vt switchless resume
    - more valleyview support
    - vblank fixes
    - modesetting pipe config rework

    radeon:
    - UVD engine support
    - SI chip tiling support
    - GPU registers initialisation from golden values.

    exynos:
    - device tree changes
    - fimc block support

    Otherwise:
    - bunches of fixes all over the place."

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (513 commits)
    qxl: update to new idr interfaces.
    drm/nouveau: fix build with nv50->nvc0
    drm/radeon: fix handling of v6 power tables
    drm/radeon: clarify family checks in pm table parsing
    drm/radeon: consolidate UVD clock programming
    drm/radeon: fix UPLL_REF_DIV_MASK definition
    radeon: add bo tracking debugfs
    drm/radeon: add new richland pci ids
    drm/radeon: add some new SI PCI ids
    drm/radeon: fix scratch reg handling for UVD fence
    drm/radeon: allocate SA bo in the requested domain
    drm/radeon: fix possible segfault when parsing pm tables
    drm/radeon: fix endian bugs in atom_allocate_fb_scratch()
    OMAPDSS: TFP410: return EPROBE_DEFER if the i2c adapter not found
    OMAPDSS: VENC: Add error handling for venc_probe_pdata
    OMAPDSS: HDMI: Add error handling for hdmi_probe_pdata
    OMAPDSS: RFBI: Add error handling for rfbi_probe_pdata
    OMAPDSS: DSI: Add error handling for dsi_probe_pdata
    OMAPDSS: SDI: Add error handling for sdi_probe_pdata
    OMAPDSS: DPI: Add error handling for dpi_probe_pdata
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "This fixes the cputime scaling overflow problems for good without
    having bad 32-bit overhead, and gets rid of the div64_u64_rem() helper
    as well."

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "math64: New div64_u64_rem helper"
    sched: Avoid prev->stime underflow
    sched: Do not account bogus utime
    sched: Avoid cputime scaling overflow

    Linus Torvalds
     

02 May, 2013

7 commits

  • The full dynticks tree needs the latest RCU and sched
    upstream updates in order to fix some dependencies.

    Merge a common upstream merge point that has these
    updates.

    Conflicts:
    include/linux/perf_event.h
    kernel/rcutree.h
    kernel/rcutree_plugin.h

    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Supply a function (proc_remove()) to remove a proc entry (and any subtree
    rooted there) by proc_dir_entry pointer rather than by name and (optionally)
    root dir entry pointer. This allows us to eliminate all remaining pde->name
    accesses outside of procfs.

    Signed-off-by: David Howells
    Acked-by: Grant Likely
    cc: linux-acpi@vger.kernel.org
    cc: openipmi-developer@lists.sourceforge.net
    cc: devicetree-discuss@lists.ozlabs.org
    cc: linux-pci@vger.kernel.org
    cc: netdev@vger.kernel.org
    cc: netfilter-devel@vger.kernel.org
    cc: alsa-devel@alsa-project.org
    Signed-off-by: Al Viro

    David Howells
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Split the proc namespace stuff out into linux/proc_ns.h.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: Serge E. Hallyn
    cc: Eric W. Biederman
    Signed-off-by: Al Viro

    David Howells
     
  • Supply accessor functions to set attributes in proc_dir_entry structs.

    The following are supplied: proc_set_size() and proc_set_user().

    Signed-off-by: David Howells
    Acked-by: Mauro Carvalho Chehab
    cc: linuxppc-dev@lists.ozlabs.org
    cc: linux-media@vger.kernel.org
    cc: netdev@vger.kernel.org
    cc: linux-wireless@vger.kernel.org
    cc: linux-pci@vger.kernel.org
    cc: netfilter-devel@vger.kernel.org
    cc: alsa-devel@alsa-project.org
    Signed-off-by: Al Viro

    David Howells
     
  • Pull networking updates from David Miller:
    "Highlights (1721 non-merge commits, this has to be a record of some
    sort):

    1) Add 'random' mode to team driver, from Jiri Pirko and Eric
    Dumazet.

    2) Make it so that any driver that supports configuration of multiple
    MAC addresses can provide the forwarding database add and del
    calls by providing a default implementation and hooking that up if
    the driver doesn't have an explicit set of handlers. From Vlad
    Yasevich.

    3) Support GSO segmentation over tunnels and other encapsulating
    devices such as VXLAN, from Pravin B Shelar.

    4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.

    5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
    Dukkipati.

    6) In the PHY layer, allow supporting wake-on-lan in situations where
    the PHY registers have to be written for it to be configured.

    Use it to support wake-on-lan in mv643xx_eth.

    From Michael Stapelberg.

    7) Significantly improve firewire IPV6 support, from YOSHIFUJI
    Hideaki.

    8) Allow multiple packets to be sent in a single transmission using
    network coding in batman-adv, from Martin Hundebøll.

    9) Add support for T5 cxgb4 chips, from Santosh Rastapur.

    10) Generalize the VXLAN forwarding tables so that there is more
    flexibility in configuring various aspects of the endpoints.
    From David Stevens.

    11) Support RSS and TSO in hardware over GRE tunnels in bnx2x driver,
    from Dmitry Kravkov.

    12) Zero copy support in nfnetlink_queue, from Eric Dumazet and Pablo
    Neira Ayuso.

    13) Start adding networking selftests.

    14) In situations of overload on the same AF_PACKET fanout socket, or
    per-cpu packet receive queue, minimize drop by distributing the
    load to other cpus/fanouts. From Willem de Bruijn and Eric
    Dumazet.

    15) Add support for new payload offset BPF instruction, from Daniel
    Borkmann.

    16) Convert several drivers over to module_platform_driver(), from
    Sachin Kamat.

    17) Provide a minimal BPF JIT image disassembler userspace tool, from
    Daniel Borkmann.

    18) Rewrite F-RTO implementation in TCP to match the final
    specification of it in RFC4138 and RFC5682. From Yuchung Cheng.

    19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
    you like netlink, so I implemented netlink dumping of netlink
    sockets.") From Andrey Vagin.

    20) Remove ugly passing of rtnetlink attributes into rtnl_doit
    functions, from Thomas Graf.

    21) Allow userspace to be able to see if a configuration change occurs
    in the middle of an address or device list dump, from Nicolas
    Dichtel.

    22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
    Frederic Sowa.

    23) Increase accuracy of packet length used by packet scheduler, from
    Jason Wang.

    24) Beginning set of changes to make ipv4/ipv6 fragment handling more
    scalable and less susceptible to overload and locking contention,
    from Jesper Dangaard Brouer.

    25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
    instead. From Hong Zhiguo.

    26) Optimize route usage in IPVS by avoiding reference counting where
    possible, from Julian Anastasov.

    27) Convert IPVS schedulers to RCU, also from Julian Anastasov.

    28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
    Eitzenberger.

    29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
    nfnetlink_log, and nfnetlink_queue. From Gao feng.

    30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.

    31) Support several new r8169 chips, from Hayes Wang.

    32) Support tokenized interface identifiers in ipv6, from Daniel
    Borkmann.

    33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.

    34) Add 802.1ad vlan offload support, from Patrick McHardy.

    35) Support mmap() based netlink communication, also from Patrick
    McHardy.

    36) Support HW timestamping in mlx4 driver, from Amir Vadai.

    37) Rationalize AF_PACKET packet timestamping when transmitting, from
    Willem de Bruijn and Daniel Borkmann.

    38) Bring parity to what's provided by /proc/net/packet socket dumping
    and the info provided by netlink socket dumping of AF_PACKET
    sockets. From Nicolas Dichtel.

    39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
    Poirier"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    filter: fix va_list build error
    af_unix: fix a fatal race with bit fields
    bnx2x: Prevent memory leak when cnic is absent
    bnx2x: correct reading of speed capabilities
    net: sctp: attribute printl with __printf for gcc fmt checks
    netlink: kconfig: move mmap i/o into netlink kconfig
    netpoll: convert mutex into a semaphore
    netlink: Fix skb ref counting.
    net_sched: act_ipt forward compat with xtables
    mlx4_en: fix a build error on 32bit arches
    Revert "bnx2x: allow nvram test to run when device is down"
    bridge: avoid OOPS if root port not found
    drivers: net: cpsw: fix kernel warn on cpsw irq enable
    sh_eth: use random MAC address if no valid one supplied
    3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
    tg3: fix to append hardware time stamping flags
    unix/stream: fix peeking with an offset larger than data in queue
    unix/dgram: fix peeking with an offset larger than data in queue
    unix/dgram: peek beyond 0-sized skbs
    openvswitch: Remove unneeded ovs_netdev_get_ifindex()
    ...

    Linus Torvalds
     

01 May, 2013

23 commits

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     
  • If we allocate a perf ring buffer with the size of a single (user)
    page, we will get memory corruption when releasing it in the
    rb_free_work function (for the CONFIG_PERF_USE_VMALLOC option).

    For single page sized ring buffer the page_order is -1 (because
    nr_pages is 0). This needs to be recognized in the rb_free_work
    function to release proper amount of pages.

    Adding data_page_nr function that returns number of allocated
    data pages. Customizing the rest of the code to use it.
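
    The edge case described above can be illustrated with a small user-space
    sketch. This is not the kernel's code: the struct, the helper names and
    the explicit guard are assumptions made for illustration. It only shows
    why a page order of -1 has to be handled before it is used to compute a
    page count.

```c
/* Hypothetical, simplified stand-in for the ring buffer bookkeeping. */
struct rb_sketch {
    int nr_pages;   /* data pages requested; 0 when only the user page exists */
    int page_order; /* log2 of the data page count, -1 when there are none */
};

/* Integer log2 that yields -1 for 0, mirroring the situation in the text. */
static int ilog2_sketch(unsigned int n)
{
    int order = -1;

    while (n) {
        n >>= 1;
        order++;
    }
    return order;
}

static struct rb_sketch rb_init_sketch(int nr_pages)
{
    struct rb_sketch rb;

    rb.nr_pages = nr_pages;
    rb.page_order = ilog2_sketch((unsigned int)nr_pages);
    return rb;
}

/* Number of data pages to release; never shifts by a negative order. */
static int data_page_nr_sketch(const struct rb_sketch *rb)
{
    if (rb->page_order < 0)
        return 0;
    return 1 << rb->page_order;
}
```

    A naive "1 << page_order" would be undefined for the single-page case,
    which is the kind of mistake a dedicated helper avoids.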

    Reported-by: Jan Stancek
    Original-patch-by: Peter Zijlstra
    Acked-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/20130319143509.GA1128@krava.brq.redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Merge third batch of fixes from Andrew Morton:
    "Most of the rest. I still have two large patchsets against AIO and
    IPC, but they're a bit stuck behind other trees and I'm about to
    vanish for six days.

    - random fixlets
    - inotify
    - more of the MM queue
    - show_stack() cleanups
    - DMI update
    - kthread/workqueue things
    - compat cleanups
    - epoll updates
    - binfmt updates
    - nilfs2
    - hfs
    - hfsplus
    - ptrace
    - kmod
    - coredump
    - kexec
    - rbtree
    - pids
    - pidns
    - pps
    - semaphore tweaks
    - some w1 patches
    - relay updates
    - core Kconfig changes
    - sysrq tweaks"

    * emailed patches from Andrew Morton : (109 commits)
    Documentation/sysrq: fix inconstistent help message of sysrq key
    ethernet/emac/sysrq: fix inconstistent help message of sysrq key
    sparc/sysrq: fix inconstistent help message of sysrq key
    powerpc/xmon/sysrq: fix inconstistent help message of sysrq key
    ARM/etm/sysrq: fix inconstistent help message of sysrq key
    power/sysrq: fix inconstistent help message of sysrq key
    kgdb/sysrq: fix inconstistent help message of sysrq key
    lib/decompress.c: fix initconst
    notifier-error-inject: fix module names in Kconfig
    kernel/sys.c: make prctl(PR_SET_MM) generally available
    UAPI: remove empty Kbuild files
    menuconfig: print more info for symbol without prompts
    init/Kconfig: re-order CONFIG_EXPERT options to fix menuconfig display
    kconfig menu: move Virtualization drivers near other virtualization options
    Kconfig: consolidate CONFIG_DEBUG_STRICT_USER_COPY_CHECKS
    relay: use macro PAGE_ALIGN instead of FIX_SIZE
    kernel/relay.c: move FIX_SIZE macro into relay.c
    kernel/relay.c: remove unused function argument actor
    drivers/w1/slaves/w1_ds2760.c: fix the error handling in w1_ds2760_add_slave()
    drivers/w1/slaves/w1_ds2781.c: fix the error handling in w1_ds2781_add_slave()
    ...

    Linus Torvalds
     
  • Currently the help message of /proc/sysrq-trigger highlights its
    upper-case characters, like below:

    SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E)
    memory-full-oom-kill(F) kill-all-tasks(I) ...

    This can mislead users into triggering sysrq with the upper-case
    character, which is inconsistent with the lower-case key that is
    actually registered.

    The inconsistent help message will cause even more confusion once the
    26 upper-case letters are put into use in the future.

    This patch fixes the power off sysrq key: "poweroff(o)"

    Signed-off-by: zhangwei(Jovi)
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhangwei(Jovi)
     
  • Currently the help message of /proc/sysrq-trigger highlights its upper-case
    characters, like below:

    SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E)
    memory-full-oom-kill(F) kill-all-tasks(I) ...

    This can mislead users into triggering sysrq with the upper-case
    character, which is inconsistent with the lower-case key that is
    actually registered.

    The inconsistent help message will cause even more confusion once the
    26 upper-case letters are put into use in the future.

    This patch fixes the kgdb sysrq key: "debug(g)"

    Signed-off-by: zhangwei(Jovi)
    Cc: Jason Wessel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhangwei(Jovi)
     
  • The purpose of this patch is to allow privileged processes to set
    their own per-memory memory-region fields:

    start_code, end_code, start_data, end_data, start_brk, brk,
    start_stack, arg_start, arg_end, env_start, env_end.

    This functionality is needed by any application or package that needs to
    reconstruct Linux processes, that is, to start them in any way other than
    by means of an "execve()" from an executable file. This includes:

    1. Restoring processes from a checkpoint-file (by all potential
    user-level checkpointing packages, not only CRIU's).
    2. Restarting processes on another node after process migration.
    3. Starting duplicated copies of a running process (for reliability
    and high-availability).
    4. Starting a process from an executable format that is not supported
    by Linux, thus requiring a "manual execve" by a user-level utility.
    5. Similarly, starting a process from a networked and/or encrypted
    executable that, for confidentiality, licensing or other reasons,
    may not be written to the local file-systems.

    The code that does that was already included in the Linux kernel by the
    CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this was
    enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which is
    normally disabled. The patch removes those ifdefs.
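
    From user space, the interface looks roughly like the sketch below. The
    prctl(2) option PR_SET_MM and the field selector PR_SET_MM_ARG_START are
    real constants from the kernel headers; the wrapper name and the choice
    of field are illustrative assumptions. The call is expected to fail with
    EPERM for processes that lack the required privilege.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Try to point this process's recorded arg_start at `addr`.
 * Returns 0 on success, or the positive errno value on failure
 * (for example EPERM when the caller is not privileged).
 */
static int set_mm_arg_start(unsigned long addr)
{
    if (prctl(PR_SET_MM, PR_SET_MM_ARG_START, addr, 0UL, 0UL) == 0)
        return 0;
    return errno;
}
```

    A restore tool would issue one such call per field (start_code, brk,
    env_start and so on) after rebuilding the address space by other means.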

    Signed-off-by: Amnon Shiloh
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amnon Shiloh
     
  • Macro FIX_SIZE is same as PAGE_ALIGN at present, so use PAGE_ALIGN
    instead.
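
    The equivalence can be checked in user space. The macro bodies below are
    assumptions written in the usual round-up-to-page style with a fixed
    4096-byte page; they are not copied from the kernel headers, and the
    SKETCH_ prefix marks every name as illustrative.

```c
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Assumed shape of the relay-private macro: round (x - 1) down, add a page. */
#define SKETCH_FIX_SIZE(x) ((((x) - 1) & SKETCH_PAGE_MASK) + SKETCH_PAGE_SIZE)

/* Assumed shape of the generic macro: add (page - 1), then round down. */
#define SKETCH_PAGE_ALIGN(x) (((x) + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK)

/* Returns 1 if both macros agree for every size in [1, max_size], else 0. */
static int macros_agree(unsigned long max_size)
{
    unsigned long x;

    for (x = 1; x <= max_size; x++) {
        if (SKETCH_FIX_SIZE(x) != SKETCH_PAGE_ALIGN(x))
            return 0;
    }
    return 1;
}
```

    Both forms round a size up to the next page boundary and leave exact
    multiples of the page size unchanged.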

    Thanks to Andrew for finding this.

    Signed-off-by: zhangwei(Jovi)
    Cc: Jens Axboe
    Cc: Al Viro
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhangwei(Jovi)
     
  • It's better to place FIX_SIZE macro in relay.c, instead of relay.h

    Signed-off-by: zhangwei(Jovi)
    Cc: Jens Axboe
    Cc: Al Viro
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhangwei(Jovi)
     
  • Currently argument `actor' is never used in the relay reading path, so
    remove it.

    Signed-off-by: zhangwei(Jovi)
    Cc: Jens Axboe
    Cc: Al Viro
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhangwei(Jovi)
     
  • Signed-off-by: liguang
    Cc: Jiri Kosina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    liguang
     
  • Signed-off-by: liguang
    Cc: Jiri Kosina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    liguang
     
  • Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
    simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.

    [akpm@linux-foundation.org: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
    Signed-off-by: Raphael S.Carvalho
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raphael S.Carvalho
     
  • find_next_offset() searches for an available "cleaned bit" in the
    respective pid bitmap (page) and returns the offset if found; otherwise
    it returns a value equal to BITS_PER_PAGE.

    For example, if find_next_offset() does not find any available bit,
    there is no point in calling mk_pid() (wasted CPU cycles).

    Therefore, it is better to call mk_pid() only after the check
    (offset < BITS_PER_PAGE) has succeeded. Another point: if (offset
    < BITS_PER_PAGE) fails, then mk_pid() would be called again afterwards
    anyway.

    [akpm@linux-foundation.org: simplify code]
    Signed-off-by: Raphael S. Carvalho
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raphael S. Carvalho
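
    The reordering described in this commit can be sketched in user space as
    follows. This is a simplified model, not the kernel's alloc_pidmap(): the
    demo_ names, the 64-bit "page" and the call counter are all assumptions
    made for illustration.

```c
#include <assert.h>

#define DEMO_BITS_PER_PAGE 64   /* tiny stand-in for a bitmap page */

int demo_mk_pid_calls;          /* counts calls, to show none are wasted */

/* Return the first clear bit at or after start, or DEMO_BITS_PER_PAGE. */
static int demo_find_next_offset(unsigned long long map, int start)
{
    for (int off = start; off < DEMO_BITS_PER_PAGE; off++) {
        if (!(map & (1ULL << off)))
            return off;
    }
    return DEMO_BITS_PER_PAGE;
}

/* Combine the page index and the bit offset into a pid number. */
static int demo_mk_pid(int page_index, int offset)
{
    demo_mk_pid_calls++;
    return page_index * DEMO_BITS_PER_PAGE + offset;
}

/* Allocate one pid from the page, or return -1 if the page is full. */
int demo_alloc(unsigned long long *map, int page_index, int start)
{
    int offset = demo_find_next_offset(*map, start);

    /* Check first: a full page means demo_mk_pid() is never reached. */
    if (offset >= DEMO_BITS_PER_PAGE)
        return -1;

    *map |= 1ULL << offset;
    return demo_mk_pid(page_index, offset);
}
```

    With a full bitmap, demo_alloc() returns -1 and demo_mk_pid_calls stays at
    zero, which is the saving the commit message describes.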
     
  • Simplify the logic of variable assignments.

    [akpm@linux-foundation.org: replace min_t with min, remove unneeded casts]
    Signed-off-by: Zhang Yanfei
    Cc: "Eric W. Biederman"
    Reviewed-by: Simon Horman
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The types of the following local variables:

    - ubytes/mbytes in kimage_load_crash_segment()/kimage_load_normal_segment()

    - r in vmcoreinfo_append_str()

    are wrong, so fix them.

    Signed-off-by: Zhang Yanfei
    Cc: "Eric W. Biederman"
    Cc: Simon Horman
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • There are 2 well known and ancient problems with coredump/signals, and a
    lot of related bug reports:

    - do_coredump() clears TIF_SIGPENDING but of course this can't help
    if, say, SIGCHLD comes after that.

    In this case the coredump can fail unexpectedly. See for example
    wait_for_dump_helper()->signal_pending() check but there are other
    reasons.

    - At the same time, dumping a huge core to slow media can take a
    lot of time/resources and there is no way to kill the coredumping
    task reliably. In particular this is not oom_kill-friendly.

    This patch tries to fix the first problem and prepares for the changes
    that follow.

    We add the new SIGNAL_GROUP_COREDUMP flag set by zap_threads() to indicate
    that this process dumps the core. prepare_signal() checks this flag and
    nacks any signal except SIGKILL.

    Note that this check tries to be conservative; in the long term we should
    probably treat the SIGNAL_GROUP_EXIT case equally, but this needs more
    discussion. See marc.info/?l=linux-kernel&m=120508897917439

    Notes:
    - recalc_sigpending() doesn't check SIGNAL_GROUP_COREDUMP.
    The patch assumes that dump_write/etc paths should never
    call it, but we can change it as well.

    - There is another source of TIF_SIGPENDING, freezer. This
    will be addressed separately.

    Signed-off-by: Oleg Nesterov
    Tested-by: Mandeep Singh Baines
    Cc: Ingo Molnar
    Cc: Neil Horman
    Cc: "Rafael J. Wysocki"
    Cc: Roland McGrath
    Cc: Tejun Heo
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
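
    The filtering rule described in this commit can be modelled in a few
    lines. This is a user-space sketch only: the demo_ types and the flag's
    bit value are hypothetical, and the real prepare_signal() also deals with
    stop/continue signals and ignored signals, which are left out here.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical bit value; the kernel defines its own SIGNAL_GROUP_COREDUMP. */
#define DEMO_SIGNAL_GROUP_COREDUMP 0x1u

/* Minimal stand-in for the per-process signal state. */
struct demo_signal_struct {
    unsigned int flags;
};

/* Return true if the signal should be queued, false if it is nacked. */
bool demo_prepare_signal(const struct demo_signal_struct *sig, int signo)
{
    /* While the process dumps core, only SIGKILL gets through. */
    if (sig->flags & DEMO_SIGNAL_GROUP_COREDUMP)
        return signo == SIGKILL;

    return true;
}
```

    In this model a SIGCHLD arriving during the dump is dropped before it can
    set a pending flag, while SIGKILL still reaches the dumping task.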
     
  • Callers of this function cannot determine whether the cleanup function
    was called when it returns -ENOMEM. Nobody is using it anymore, so let's
    remove it.

    Signed-off-by: Lucas De Marchi
    Cc: Oleg Nesterov
    Cc: David Howells
    Cc: James Morris
    Cc: Al Viro
    Cc: Tejun Heo
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lucas De Marchi
     
  • Use call_usermodehelper_setup() + call_usermodehelper_exec() instead of
    calling call_usermodehelper_fns(). If the latter returns -ENOMEM, the
    cleanup function may not have been called, in which case we would not
    free argv and module_name.

    Signed-off-by: Lucas De Marchi
    Cc: Oleg Nesterov
    Cc: David Howells
    Cc: James Morris
    Cc: Al Viro
    Cc: Tejun Heo
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lucas De Marchi
     
  • call_usermodehelper_setup() + call_usermodehelper_exec() need to be
    called instead of call_usermodehelper_fns() when the cleanup function
    must run even if an ENOMEM error occurs. With
    call_usermodehelper_fns(), the caller can't tell whether the cleanup
    function was called or not.

    [akpm@linux-foundation.org: export call_usermodehelper_setup() to modules]
    Signed-off-by: Lucas De Marchi
    Reviewed-by: Oleg Nesterov
    Cc: David Howells
    Cc: James Morris
    Cc: Al Viro
    Cc: Tejun Heo
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lucas De Marchi
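
    The ownership problem behind this change can be illustrated with a
    user-space model. Nothing below is kernel API: the demo_ functions, the
    simulate_enomem switch and the release counter are hypothetical, and the
    model assumes that the exec step always runs the cleanup once it has been
    handed the info structure.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct demo_info {
    void (*cleanup)(struct demo_info *info);
    void *data;
};

int demo_release_calls;   /* how many times a payload has been released */

static void demo_cleanup(struct demo_info *info)
{
    free(info->data);
    demo_release_calls++;
}

/* Setup step: returns NULL on allocation failure and leaves data untouched. */
struct demo_info *demo_setup(void *data, int simulate_enomem)
{
    struct demo_info *info = simulate_enomem ? NULL : malloc(sizeof(*info));

    if (!info)
        return NULL;
    info->cleanup = demo_cleanup;
    info->data = data;
    return info;
}

/* Exec step: owns info from here on and always runs the cleanup. */
int demo_exec(struct demo_info *info)
{
    info->cleanup(info);
    free(info);
    return 0;
}

/* Caller: because the steps are split, it knows exactly who frees data. */
int demo_run(int simulate_enomem)
{
    void *data = malloc(16);
    struct demo_info *info;

    if (!data)
        return -ENOMEM;

    info = demo_setup(data, simulate_enomem);
    if (!info) {
        /* Setup failed, so the cleanup never ran: release data here. */
        free(data);
        demo_release_calls++;
        return -ENOMEM;
    }
    return demo_exec(info);
}
```

    In both the success path and the simulated -ENOMEM path the payload is
    released exactly once, which a single combined call returning -ENOMEM
    could not guarantee.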
     
  • This patch adds a new ptrace request PTRACE_PEEKSIGINFO.

    This request is used to retrieve information about pending signals
    starting with the specified sequence number. siginfo_t structures are
    copied from the child into the buffer starting at "data".

    The argument "addr" is a pointer to struct ptrace_peeksiginfo_args.

    struct ptrace_peeksiginfo_args {
            u64 off;        /* from which siginfo to start */
            u32 flags;
            s32 nr;         /* how many siginfos to take */
    };

    "nr" has type "s32" because ptrace() returns "long", which has 32 bits on
    i386, and negative values are used for errors.

    Currently there is only one flag, PTRACE_PEEKSIGINFO_SHARED, for dumping
    signals from the process-wide queue. If this flag is not set, signals are
    read from the per-thread queue.

    The request PTRACE_PEEKSIGINFO returns the number of dumped signals. If a
    signal with the specified sequence number doesn't exist, ptrace returns
    zero. The request returns an error if no signal has been dumped.

    Errors:
    EINVAL - one or more specified flags are not supported or nr is negative
    EFAULT - buf or addr is outside your accessible address space.

    The resulting siginfo contains the kernel part of si_code, which is
    usually stripped, but is required for queuing the same siginfo back
    during restore of pending signals.

    This functionality is required for checkpointing pending signals. Pedro
    Alves suggested using it in "gdb" to peek at pending signals. gdb already
    uses PTRACE_GETSIGINFO to get the siginfo for the signal which was already
    dequeued. This functionality allows gdb to look at the pending signals
    which were not reported yet.

    The prototype of this code was developed by Oleg Nesterov.

    Signed-off-by: Andrew Vagin
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Cc: David Howells
    Cc: Dave Jones
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Pavel Emelyanov
    Cc: Linus Torvalds
    Cc: Pedro Alves
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
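
    A tracer could use the request roughly as sketched below. This is an
    illustrative example rather than code from the patch: the demo_ names are
    hypothetical, the argument struct is redeclared locally to mirror the
    layout given above, and the fallback value 0x4209 for PTRACE_PEEKSIGINFO
    is an assumption for C libraries that do not define it. The child blocks
    SIGUSR1, raises it so that it stays pending, then stops so that the
    parent can peek at the per-thread queue.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PTRACE_PEEKSIGINFO
#define PTRACE_PEEKSIGINFO 0x4209   /* assumed value for older C libraries */
#endif

/* Local mirror of struct ptrace_peeksiginfo_args as described above. */
struct demo_peeksiginfo_args {
    uint64_t off;    /* from which siginfo to start */
    uint32_t flags;  /* 0 selects the per-thread queue */
    int32_t nr;      /* how many siginfos to take */
};

/* Returns the number of siginfos copied into *out, or -1 on failure. */
long demo_peek_pending(siginfo_t *out)
{
    int status;
    long n = -1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        sigset_t set;

        sigemptyset(&set);
        sigaddset(&set, SIGUSR1);
        sigprocmask(SIG_BLOCK, &set, NULL);
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGUSR1);   /* blocked, so it stays pending */
        raise(SIGSTOP);   /* stop so the tracer can inspect us */
        _exit(0);
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;

    if (WIFSTOPPED(status)) {
        struct demo_peeksiginfo_args args = { .off = 0, .flags = 0, .nr = 1 };

        n = ptrace(PTRACE_PEEKSIGINFO, pid, &args, out);
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
    }
    return n;
}
```

    On a kernel with this patch, and in an environment where ptrace is
    permitted, the call should report one pending siginfo whose si_signo is
    SIGUSR1; on anything else it returns -1.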
     
  • Andrew Morton noted:

    akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
    SYSCALL_DEFINE1(alarm, unsigned int, seconds)
    SYSCALL_DEFINE0(getpid)
    SYSCALL_DEFINE0(getppid)
    SYSCALL_DEFINE0(getuid)
    SYSCALL_DEFINE0(geteuid)
    SYSCALL_DEFINE0(getgid)
    SYSCALL_DEFINE0(getegid)
    SYSCALL_DEFINE0(gettid)
    SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
    COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)

    Only one of those should be in kernel/timer.c. Who wrote this thing?

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stephen Rothwell
    Acked-by: Thomas Gleixner
    Cc: Guenter Roeck
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • Signed-off-by: Stephen Rothwell
    Cc: Thomas Gleixner
    Cc: Guenter Roeck
    Cc: Al Viro
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • The only use outside of kernel/timer.c was in kernel/compat.c, so move
    compat_sys_sysinfo() next to sys_sysinfo() in kernel/timer.c.

    Signed-off-by: Stephen Rothwell
    Cc: Thomas Gleixner
    Cc: Guenter Roeck
    Cc: Al Viro
    Acked-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell