20 Jul, 2012

3 commits


01 Jun, 2012

1 commit

  • commit 31a67102f4762df5544bc2dfb34a931233d2a5b2 upstream.

    During early boot, when the scheduler hasn't really been fully set up,
    we really can't do blocking allocations because with certain (dubious)
    configurations the "might_resched()" calls can actually result in
    scheduling events.

    We could just make such users always use GFP_ATOMIC, but quite often the
    code that does the allocation isn't really aware of the fact that the
    scheduler isn't up yet, and forcing that kind of random knowledge on the
    initialization code is just annoying and not good for anybody.

    And we actually have the 'gfp_allowed_mask' exactly for this reason:
    it's just that the kernel init sequence happens to set it to allow
    blocking allocations much too early.

    So move the 'gfp_allowed_mask' initialization from 'start_kernel()'
    (which is some of the earliest init code, and runs with preemption
    disabled for good reasons) into 'kernel_init()'. kernel_init() is run
    in the newly created thread that will become the 'init' process, as
    opposed to the early startup code that runs within the context of what
    will be the first idle thread.

    So by the time we reach 'kernel_init()', we know that the scheduler must
    be at least limping along, because we've already scheduled from the idle
    thread into the init thread.

    Reported-by: Steven Rostedt
    Cc: David Rientjes
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
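
The masking idea in the commit above can be sketched in user space. This is a simplified model, not the kernel's code: the DEMO_* flag values and the demo_* helpers are assumptions made up for illustration, and only the general shape (a global mask that strips blocking flags until the scheduler is known to work) comes from the commit message.

```c
#include <assert.h>

/* Illustrative flag bits; the kernel's real values differ. */
#define DEMO_GFP_WAIT 0x01u /* allocation may sleep */
#define DEMO_GFP_IO   0x02u /* allocation may start I/O */
#define DEMO_GFP_FS   0x04u /* allocation may call into the filesystem */
#define DEMO_GFP_HIGH 0x08u /* allocation may dip into reserves */
#define DEMO_GFP_BITS 0x0fu /* every flag this model knows about */

/* Early boot: strip everything that could block. */
#define DEMO_GFP_BOOT_MASK \
	(DEMO_GFP_BITS & ~(DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS))

/* Starts out restrictive, as it would during early boot. */
static unsigned int demo_gfp_allowed_mask = DEMO_GFP_BOOT_MASK;

/* Every allocation filters its requested flags through the current mask. */
static unsigned int demo_effective_gfp(unsigned int requested)
{
	assert((requested & ~DEMO_GFP_BITS) == 0);
	return requested & demo_gfp_allowed_mask;
}

/* Called once the scheduler is known to work (kernel_init() in the patch). */
static void demo_allow_blocking(void)
{
	demo_gfp_allowed_mask = DEMO_GFP_BITS;
}
```

Callers keep passing their usual flags; before demo_allow_blocking() runs, the blocking bits are silently dropped, so initialization code needs no special knowledge of the boot stage.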
     

22 May, 2012

1 commit

  • commit 377485f6244af255b04d662cf19cddbbc4ae4310 upstream.

    Currently, we'll try mounting any device whose major device number is
    UNNAMED_MAJOR as NFS root. This would happen for non-NFS devices as
    well (such as 9p devices) but it wouldn't cause any issues since
    mounting the device as NFS would fail quickly and the code proceeded to
    doing the proper mount:

    [ 101.522716] VFS: Unable to mount root fs via NFS, trying floppy.
    [ 101.534499] VFS: Mounted root (9p filesystem) on device 0:18.

    Commit 6829a048102a ("NFS: Retry mounting NFSROOT") introduced retries
    when mounting NFS root, which means that now we don't immediately fail
    and instead it takes an additional 90+ seconds until we stop retrying,
    which has revealed the issue this patch fixes.

    This meant that it would take an additional 90 seconds to boot when
    we're not using a device type which gets detected in order before NFS.

    This patch modifies the NFS type check to require device type to be
    'Root_NFS' instead of requiring the device to have an UNNAMED_MAJOR
    major. This makes the boot process cleaner since we now won't go through
    the NFS mounting code at all when the device isn't an NFS root
    ("/dev/nfs").

    Signed-off-by: Sasha Levin
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sasha Levin
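
The difference between the old and new checks in the commit above can be shown with a small model of device numbers. This is a sketch under assumptions: the DEMO_* macros imitate the kernel's major/minor packing and its "/dev/nfs" root device number, and the two demo_* predicates are hypothetical names, not the functions touched by the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified device-number helpers modelled on the kernel's dev_t layout. */
#define DEMO_MINORBITS 20
#define DEMO_MKDEV(ma, mi) \
	(((unsigned int)(ma) << DEMO_MINORBITS) | (unsigned int)(mi))
#define DEMO_MAJOR(dev) ((unsigned int)(dev) >> DEMO_MINORBITS)

#define DEMO_UNNAMED_MAJOR 0u
#define DEMO_ROOT_NFS DEMO_MKDEV(DEMO_UNNAMED_MAJOR, 255) /* "/dev/nfs" */

/* Old test: anything with an unnamed major was tried as NFS root. */
static bool demo_old_try_nfs(unsigned int root_dev)
{
	return DEMO_MAJOR(root_dev) == DEMO_UNNAMED_MAJOR;
}

/* New test: only the dedicated NFS root device qualifies. */
static bool demo_new_try_nfs(unsigned int root_dev)
{
	return root_dev == DEMO_ROOT_NFS;
}
```

A 9p root on device 0:18, as in the log excerpt, passes the old test but not the new one, so it no longer waits for the NFS retries to run out.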
     

26 Jan, 2012

1 commit

  • commit 43717c7daebf10b43f12e68512484b3095bb1ba5 upstream.

    Lukas Razik reports that on his SPARC system,
    booting with an NFS root file system stopped working after commit
    56463e50 "NFS: Use super.c for NFSROOT mount option parsing."

    We found that the network switch to which Lukas' client was attached
    was delaying access to the LAN after the client's NIC driver reported
    that its link was up. The delay was longer than the timeouts used in
    the NFS client during mounting.

    NFSROOT worked for Lukas before commit 56463e50 because in those
    kernels, the client's first operation was an rpcbind request to
    determine which port the NFS server was listening on. When that
    request failed after a long timeout, the client simply selected the
    default NFS port (2049). By that time the switch was allowing access
    to the LAN, and the mount succeeded.

    Neither of these client behaviors is desirable, so reverting 56463e50
    is really not a choice. Instead, introduce a mechanism that retries
    the NFSROOT mount request several times. This is the same tactic that
    normal user space NFS mounts employ to overcome server and network
    delays.

    Signed-off-by: Lukas Razik
    [ cel: match kernel coding style, add proper patch description ]
    [ cel: add exponential back-off ]
    Signed-off-by: Chuck Lever
    Tested-by: Lukas Razik
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Chuck Lever
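
A retry loop with exponential back-off, as described in the commit above, might look roughly like the following. The constants, the callback types and the simulated network are all assumptions chosen for this sketch; the commit message does not state the real values. The wait is passed in as a callback so the example can record delays instead of sleeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning values for this sketch, in seconds and attempts. */
#define DEMO_TIMEOUT_MIN 5
#define DEMO_TIMEOUT_MAX 30
#define DEMO_RETRY_MAX   5

/* A mount attempt returns 0 on success, non-zero on failure. */
typedef int (*demo_mount_fn)(void *ctx);
typedef void (*demo_wait_fn)(unsigned int secs, void *ctx);

/*
 * Try to mount, waiting between failures and doubling the wait each time
 * up to DEMO_TIMEOUT_MAX. Returns 0 on success, or -1 once the initial
 * attempt and DEMO_RETRY_MAX retries have all failed.
 */
static int demo_mount_with_backoff(demo_mount_fn try_mount,
				   demo_wait_fn wait_fn, void *ctx)
{
	unsigned int timeout = DEMO_TIMEOUT_MIN;
	int attempt;

	assert(try_mount != NULL && wait_fn != NULL);

	for (attempt = 1; ; attempt++) {
		if (try_mount(ctx) == 0)
			return 0;
		if (attempt > DEMO_RETRY_MAX)
			return -1;

		wait_fn(timeout, ctx);
		timeout *= 2;
		if (timeout > DEMO_TIMEOUT_MAX)
			timeout = DEMO_TIMEOUT_MAX;
	}
}

/* Simulated network: the server becomes reachable after ready_after seconds. */
struct demo_net {
	unsigned int elapsed;     /* seconds waited so far */
	unsigned int ready_after; /* when mounts start to succeed */
	unsigned int attempts;    /* mount attempts made */
};

static int demo_try_mount(void *ctx)
{
	struct demo_net *net = ctx;

	net->attempts++;
	return net->elapsed >= net->ready_after ? 0 : -1;
}

/* Records the delay instead of sleeping. */
static void demo_wait(unsigned int secs, void *ctx)
{
	struct demo_net *net = ctx;

	net->elapsed += secs;
}
```

With these assumed values, a switch that opens the LAN after 12 seconds is handled on the third attempt, while a server that never answers costs 5 + 10 + 20 + 30 + 30 = 95 seconds before the caller can fall back to another boot method.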
     

23 Jun, 2011

1 commit

  • Secondary CPU bringup typically calls calibrate_delay() during its
    initialization. However, calibrate_delay() modifies a global variable
    (loops_per_jiffy) used for udelay() and __delay().

    A side effect of 71c696b1 ("calibrate: extract fall-back calculation
    into own helper") introduced in the 2.6.39 merge window means that we
    end up with a substantial period where loops_per_jiffy is zero. This
    causes the spinlock debugging code to malfunction:

    u64 loops = loops_per_jiffy * HZ;
    for (;;) {
            for (i = 0; i < loops; i++) {
                    if (arch_spin_trylock(&lock->raw_lock))
                            return;
                    __delay(1);
            }
            ...
    }

    by never calling arch_spin_trylock() - resulting in the CPU locking
    up in an infinite loop inside __spin_lock_debug().

    Work around this by writing to loops_per_jiffy only once we have
    completed all the calibration decisions.

    Tested-by: Santosh Shilimkar
    Signed-off-by: Russell King
    Cc: (2.6.39-stable)
    --
    Better solutions (such as omitting the calibration for secondary CPUs,
    or arranging for calibrate_delay() to return the LPJ value and leave
    it to the caller to decide where to store it) are a possibility, but
    would be much more invasive into each architecture.

    I think this is the best solution for -rc and stable, but it should be
    revisited for the next merge window.

    init/calibrate.c | 14 ++++++++------
    1 files changed, 8 insertions(+), 6 deletions(-)
    Signed-off-by: Linus Torvalds

    Russell King
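
Both halves of the commit above, the failure and the workaround, can be modelled in a few lines. This is a user-space sketch with made-up names (demo_loops_per_jiffy, demo_trylock_attempts, demo_calibrate); the bit-by-bit search merely stands in for a calibration routine that passes through intermediate values and is not the kernel's algorithm.

```c
#include <assert.h>

/* Stand-in for the kernel's global delay calibration value. */
static unsigned long demo_loops_per_jiffy = 4096;

/*
 * Model one pass of the debug spin loop and count how many times the lock
 * would be tried. With a zero calibration value the loop body never runs,
 * so the lock is never even attempted.
 */
static unsigned long long demo_trylock_attempts(unsigned long lpj,
						unsigned int hz)
{
	unsigned long long loops = (unsigned long long)lpj * hz;
	unsigned long long i, attempts = 0;

	for (i = 0; i < loops; i++)
		attempts++; /* arch_spin_trylock() would be called here */
	return attempts;
}

/*
 * Calibration in the style of the fix: refine a private value and publish
 * it to the shared variable once, after every decision has been made.
 */
static void demo_calibrate(unsigned long measured)
{
	unsigned long lpj = 0; /* scratch value; zero or partial while searching */
	unsigned long bit;

	assert(measured != 0);
	for (bit = 1UL << (sizeof(bit) * 8 - 1); bit != 0; bit >>= 1)
		if ((lpj | bit) <= measured)
			lpj |= bit;

	demo_loops_per_jiffy = lpj; /* single write, never an intermediate value */
}
```

Because the scratch variable is local, a concurrent reader of the shared value sees either the old calibration or the new one, never zero.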
     

17 Jun, 2011

1 commit

  • There is a problem where kdump (the 2nd kernel) sometimes hangs due
    to a pending IPI from the 1st kernel. A kernel panic occurs because the
    IPI arrives before call_single_queue is initialized.

    To fix the crash, rename init_call_single_data() to call_function_init()
    and call it in start_kernel() so that call_single_queue can be
    initialized before enabling interrupts.

    The details of the crash are:

    (1) 2nd kernel boots up

    (2) A pending IPI from 1st kernel comes when irqs are first enabled
    in start_kernel().

    (3) Kernel tries to handle the interrupt, but call_single_queue
    is not initialized yet at this point. As a result, in the
    generic_smp_call_function_single_interrupt(), NULL pointer
    dereference occurs when list_replace_init() tries to access
    &q->list.next.

    Therefore this patch changes the name of init_call_single_data()
    to call_function_init() and calls it before local_irq_enable()
    in start_kernel().

    Signed-off-by: Takao Indoh
    Reviewed-by: WANG Cong
    Acked-by: Neil Horman
    Acked-by: Vivek Goyal
    Acked-by: Peter Zijlstra
    Cc: Milton Miller
    Cc: Jens Axboe
    Cc: Paul E. McKenney
    Cc: kexec@lists.infradead.org
    Link: http://lkml.kernel.org/r/D6CBEE2F420741indou.takao@jp.fujitsu.com
    Signed-off-by: Ingo Molnar

    Takao Indoh
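
Why the initialization order matters in the commit above can be seen with a toy list head. Everything here is a simplified stand-in: the demo_* list helpers only imitate the style of the kernel's list API, and demo_handle_ipi() returns -1 where the real handler would dereference a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly linked list head in the style of the kernel's list_head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

static void demo_init_list_head(struct demo_list_head *head)
{
	assert(head != NULL);
	head->next = head;
	head->prev = head;
}

/* Move every entry from old to new_head and leave old empty. */
static void demo_list_replace_init(struct demo_list_head *old,
				   struct demo_list_head *new_head)
{
	new_head->next = old->next;
	new_head->next->prev = new_head; /* faults if old->next is NULL */
	new_head->prev = old->prev;
	new_head->prev->next = new_head;
	demo_init_list_head(old);
}

/* Zero-filled static storage mimics the queue before any init call. */
static struct demo_list_head demo_call_single_queue;

/*
 * Stand-in for the IPI handler: returns the number of queued entries, or
 * -1 if the queue was never initialised.
 */
static int demo_handle_ipi(void)
{
	struct demo_list_head pending, *pos;
	int n = 0;

	if (demo_call_single_queue.next == NULL)
		return -1;

	demo_list_replace_init(&demo_call_single_queue, &pending);
	for (pos = pending.next; pos != &pending; pos = pos->next)
		n++;
	return n;
}
```

An interrupt that arrives before demo_init_list_head() has run hits the uninitialised case; once the head is initialised first, a stray interrupt just finds an empty queue.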
     

16 Jun, 2011

3 commits

  • CONFIG_CONSTRUCTORS controls support for running constructor functions at
    kernel init time. According to commit b99b87f70c7785ab ("kernel:
    constructor support"), gcov (CONFIG_GCOV_KERNEL) needs this. However,
    CONFIG_CONSTRUCTORS currently defaults to y, with no option to disable it,
    and CONFIG_GCOV_KERNEL depends on it. Instead, default it to n and have
    CONFIG_GCOV_KERNEL select it, so that the normal case of
    CONFIG_GCOV_KERNEL=n will result in CONFIG_CONSTRUCTORS=n.

    Observed in the short list of =y values in a minimal kernel configuration.

    Signed-off-by: Josh Triplett
    Acked-by: WANG Cong
    Acked-by: Peter Oberparleiter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     
  • Remove calibrate_delay_direct()'s KERN_DEBUG printk related to bogomips
    calculation, as it appears when booting every core on setups with
    'ignore_loglevel', where people scan dmesg for possible issues. As the
    message doesn't show very useful information to the widest audience of
    kernel boot message gazers, it should be removed.

    Introduced by commit d2b463135f84 ("init/calibrate.c: fix for critical
    bogoMIPS intermittent calculation failure").

    Signed-off-by: Borislav Petkov
    Cc: Andrew Worsley
    Cc: Phil Carmody
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • The "hostname" tool falls back to setting the hostname to "localhost" if
    /etc/hostname does not exist. Distribution init scripts have the same
    fallback. However, if userspace never calls sethostname, such as when
    booting with init=/bin/sh, or otherwise booting a minimal system without
    the usual init scripts, the default hostname of "(none)" remains,
    unhelpfully appearing in various places such as prompts ("root@(none):~#")
    and logs. Furthermore, "(none)" doesn't typically resolve to anything
    useful.

    Make the default hostname configurable. This removes the need for the
    standard fallback, provides a useful default for systems that never call
    sethostname, and makes minimal systems that much more useful with less
    configuration. Distributions could choose to use "localhost" here to
    avoid the fallback, while embedded systems may wish to use a specific
    target hostname.

    Signed-off-by: Josh Triplett
    Acked-by: Linus Torvalds
    Acked-by: David Miller
    Cc: Serge Hallyn
    Cc: Kel Modderman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     

30 May, 2011

1 commit

  • Thomas Gleixner reports that we now have a boot crash triggered by
    CONFIG_CPUMASK_OFFSTACK=y:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] find_next_bit+0x55/0xb0
    Call Trace:
    [] cpumask_any_but+0x2a/0x70
    [] flush_tlb_mm+0x2b/0x80
    [] pud_populate+0x35/0x50
    [] pgd_alloc+0x9a/0xf0
    [] mm_init+0xec/0x120
    [] mm_alloc+0x53/0xd0

    which was introduced by commit de03c72cfce5 ("mm: convert
    mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
    mm_init() vs mm_init_cpumask

    Thomas wrote a patch to just fix the ordering of initialization, but I
    hate the new double allocation in the fork path, so I ended up instead
    doing some more radical surgery to clean it all up.

    Reported-by: Thomas Gleixner
    Reported-by: Ingo Molnar
    Cc: KOSAKI Motohiro
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 May, 2011

1 commit

  • The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
    leads to some problems:

    * cgroup creation is out-of-control
    * cgroup name can conflict when pids are looping
    * it is not possible to have a single process handling a lot of
    namespaces without falling into an exponential creation time
    * we may want to create a namespace without creating a cgroup

    The ns_cgroup was replaced by a compatibility flag 'clone_children',
    where a newly created cgroup will copy the parent cgroup values.
    The userspace has to manually create a cgroup and add a task to
    the 'tasks' file.

    This patch removes the ns_cgroup as suggested in the following thread:

    https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

    The 'cgroup_clone' function is removed because it is no longer used.

    This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
    ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
    printk warning users that the feature is planned for removal. Since that
    time we have heard from XXX users who were affected by this.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Serge E. Hallyn
    Cc: Eric W. Biederman
    Cc: Jamal Hadi Salim
    Reviewed-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     

25 May, 2011

4 commits

  • On larger systems, because of the numerous ACPI, Bootmem and EFI messages,
    the static log buffer overflows before the larger one specified by the
    log_buf_len param is allocated. Minimize the overflow by allocating the
    new log buffer as soon as possible.

    On kernels without memblock, a later call to setup_log_buf from
    kernel/init.c is the fallback.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_PRINTK=n build]
    Signed-off-by: Mike Travis
    Cc: Yinghai Lu
    Cc: "H. Peter Anvin"
    Cc: Jack Steiner
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Travis
     
  • A fix to the TSC (Time Stamp Counter) based bogoMIPS calculation used on
    secondary CPUs which has two faults:

    1: Not handling wrapping of the lower 32 bits of the TSC counter on
    32bit kernel - perhaps TSC is not reset by a warm reset?

    2: TSC and jiffies are not incrementing together properly. Either
    jiffies increment too quickly, or the Time Stamp Counter isn't
    incremented during an SMI but the real time clock is and jiffies are
    incremented.

    Case 1 can result in a factor of 16 too large a value which makes udelay()
    values too small and can cause mysterious driver errors. Case 2 appears
    to give smaller 10-15% errors after averaging but enough to cause
    occasional failures on my own board.

    I have tested this code on my own branch and attach a patch suitable for
    current kernel code. See below for examples of the failures and how the
    fix handles these situations now.

    I reported this issue earlier here:
    Intermittent problem with BogoMIPs calculation on Intel AP CPUs -
    http://marc.info/?l=linux-kernel&m=129947246316875&w=4

    I suspect this issue has been seen by others but as it is intermittent and
    bogoMIPS for secondary CPUs are no longer printed out it might have been
    difficult to identify this as the cause. Perhaps these unresolved issues,
    although quite old, might be relevant as possibly this fault has been
    around for a while. In particular Case 1 may only be relevant to 32bit
    kernels on newer HW (most people run 64bit kernels?). Case 2 is less
    dramatic since the earlier fix in this area and also intermittent.

    Re: bogomips discrepancy on Intel Core2 Quad CPU -
    http://marc.info/?l=linux-kernel&m=118929277524298&w=4
    slow system and bogus bogomips -
    http://marc.info/?l=linux-kernel&m=116791286716107&w=4
    Re: Re: [RFC-PATCH] clocksource: update lpj if clocksource has -
    http://marc.info/?l=linux-kernel&m=128952775819467&w=4

    This issue is masked a little by commit feae3203d711db0a ("timers, init:
    Limit the number of per cpu calibration bootup messages") which only
    prints out the first bogoMIPS value making it much harder to notice other
    values differing. Perhaps it should be changed to only suppress them when
    they are similar values?

    Here are some outputs showing faults occurring and the new code handling
    them properly. See my earlier message for examples of the original
    failure.

    Case 1: A Time Stamp Counter wrap:
    ...
    Calibrating delay loop (skipped), value calculated using timer
    frequency.. 6332.70 BogoMIPS (lpj=31663540)
    ....
    calibrate_delay_direct() timer_rate_max=31666493
    timer_rate_min=31666151 pre_start=4170369255 pre_end=4202035539
    calibrate_delay_direct() timer_rate_max=2425955274
    timer_rate_min=2425954941 pre_start=4265368533 pre_end=2396356387
    calibrate_delay_direct() ignoring timer_rate as we had a TSC wrap
    around start=4265368581 >=post_end=2396356511
    calibrate_delay_direct() timer_rate_max=31666274
    timer_rate_min=31665942 pre_start=2440373374 pre_end=2472039515
    calibrate_delay_direct() timer_rate_max=31666492
    timer_rate_min=31666160 pre_start=2535372139 pre_end=2567038422
    calibrate_delay_direct() timer_rate_max=31666455
    timer_rate_min=31666207 pre_start=2630371084 pre_end=2662037415
    Calibrating delay using timer specific routine.. 6333.28 BogoMIPS (lpj=31666428)
    Total of 2 processors activated (12665.99 BogoMIPS).
    ....

    Case 2: Something (presumably the SMM interrupt?) causing the
    very low increase in TSC counter for the DELAY_CALIBRATION_TICKS
    increase in jiffies
    ...
    Calibrating delay loop (skipped), value calculated using timer
    frequency.. 6333.25 BogoMIPS (lpj=31666270)
    ...
    calibrate_delay_direct() timer_rate_max=31666483
    timer_rate_min=31666074 pre_start=4199536526 pre_end=4231202809
    calibrate_delay_direct() timer_rate_max=864348 timer_rate_min=864016
    pre_start=2405343672 pre_end=2406207897
    calibrate_delay_direct() timer_rate_max=31666483
    timer_rate_min=31666179 pre_start=2469540464 pre_end=2501206823
    calibrate_delay_direct() timer_rate_max=31666511
    timer_rate_min=31666122 pre_start=2564539400 pre_end=2596205712
    calibrate_delay_direct() timer_rate_max=31666084
    timer_rate_min=31665685 pre_start=2659538782 pre_end=2691204657
    calibrate_delay_direct() dropping min bogoMips estimate 1 = 864348
    Calibrating delay using timer specific routine.. 6333.27 BogoMIPS (lpj=31666390)
    Total of 2 processors activated (12666.53 BogoMIPS).
    ...

    After 70 boots I saw 2 variations
    Reviewed-by: Phil Carmody
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Worsley
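
The wrap-around case from the log above can be reproduced with plain 32-bit arithmetic. This is only a sketch of the check suggested by the "ignoring timer_rate as we had a TSC wrap around start=... >=post_end=..." message; the function names are hypothetical, and the real routine also handles the outlier case, which is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Naive rate: plain 32-bit subtraction, which silently wraps. */
static uint32_t demo_naive_rate(uint32_t start, uint32_t end)
{
	return (uint32_t)(end - start);
}

/* A sample is unusable if the counter wrapped inside the window. */
static bool demo_tsc_wrapped(uint32_t start, uint32_t post_end)
{
	return start >= post_end;
}

/* Returns the ticks in the window, or 0 if the sample must be discarded. */
static uint32_t demo_timer_rate(uint32_t start, uint32_t post_end)
{
	if (demo_tsc_wrapped(start, post_end))
		return 0;
	return demo_naive_rate(start, post_end);
}
```

Fed with the values from the Case 1 log, the naive subtraction of pre_start=4265368533 from post_end=2396356511 yields 2425955274, the bogus timer_rate_max shown there, while the checked version discards that sample.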
     
  • cpumask_t is a very big struct, and cpu_vm_mask is placed in the wrong
    position. This might reduce the cache hit ratio.

    This patch makes two changes.
    1) Move the cpumask to the end of mm_struct, because usually only the
    front bits of the cpumask are accessed when the system has cpu-hotplug
    capability.
    2) Convert cpu_vm_mask into cpumask_var_t. It may help to reduce the
    memory footprint if cpumask_size() uses nr_cpumask_bits properly in the
    future.

    In addition, this patch renames cpu_vm_mask to cpu_vm_mask_var.
    It may help to detect out-of-tree cpu_vm_mask users.

    This patch has no functional change.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Howells
    Cc: Koichi Yasutake
    Cc: Hugh Dickins
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6:
    kbuild: make KBUILD_NOCMDDEP=1 handle empty built-in.o
    scripts/kallsyms.c: fix potential segfault
    scripts/gen_initramfs_list.sh: Convert to a /bin/sh script
    kbuild: Fix GNU make v3.80 compatibility
    kbuild: Fix passing -Wno-* options to gcc 4.4+
    kbuild: move scripts/basic/docproc.c to scripts/docproc.c
    kbuild: Fix Makefile.asm-generic for um
    kbuild: Allow to combine multiple W= levels
    kbuild: Disable -Wunused-but-set-variable for gcc 4.6.0
    Fix handling of backlash character in LINUX_COMPILE_BY name
    kbuild: asm-generic support
    kbuild: implement several W= levels
    kbuild: Fix build with binutils <= 2.19
    initramfs: Use KBUILD_BUILD_TIMESTAMP for generated entries
    kbuild: Allow to override LINUX_COMPILE_BY and LINUX_COMPILE_HOST macros
    kbuild: Drop unused LINUX_COMPILE_TIME and LINUX_COMPILE_DOMAIN macros
    kbuild: Use the deterministic mode of ar
    kbuild: Call gzip with -n
    kbuild: move KALLSYMS_EXTRA_PASS from Kconfig to Makefile
    Kconfig: improve KALLSYMS_ALL documentation

    Fix up trivial conflict in Makefile

    Linus Torvalds
     

23 May, 2011

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6: (28 commits)
    sparc32: fix build, fix missing cpu_relax declaration
    SCHED_TTWU_QUEUE is not longer needed since sparc32 now implements IPI
    sparc32,leon: Remove unnecessary page_address calls in LEON DMA API.
    sparc: convert old cpumask API into new one
    sparc32, sun4d: Implemented SMP IPIs support for SUN4D machines
    sparc32, sun4m: Implemented SMP IPIs support for SUN4M machines
    sparc32,leon: Implemented SMP IPIs for LEON CPU
    sparc32: implement SMP IPIs using the generic functions
    sparc32,leon: SMP power down implementation
    sparc32,leon: added some SMP comments
    sparc: add {read,write}*_be routines
    sparc32,leon: don't rely on bootloader to mask IRQs
    sparc32,leon: operate on boot-cpu IRQ controller registers
    sparc32: always define boot_cpu_id
    sparc32: removed unused code, implemented by generic code
    sparc32: avoid build warning at mm/percpu.c:1647
    sparc32: always register a PROM based early console
    sparc32: probe for cpu info only during startup
    sparc: consolidate show_cpuinfo in cpu.c
    sparc32,leon: implement genirq CPU affinity
    ...

    Linus Torvalds
     
  • I still happen to believe that I$ miss costs are a major thing, but
    sadly, -Os doesn't seem to be the solution. With or without it, gcc
    will miss some obvious code size improvements, and with it enabled gcc
    will sometimes make choices that aren't good even with high I$ miss
    ratios.

    For example, with -Os, gcc on x86 will turn a 20-byte constant memcpy
    into a "rep movsl". While I sincerely hope that x86 CPU's will some day
    do a good job at that, they certainly don't do it yet, and the cost is
    higher than a L1 I$ miss would be.

    Some day I hope we can re-enable this.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 May, 2011

1 commit


20 May, 2011

3 commits

  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (78 commits)
    Revert "rcu: Decrease memory-barrier usage based on semi-formal proof"
    net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree
    batman,rcu: convert call_rcu(softif_neigh_free_rcu) to kfree_rcu
    batman,rcu: convert call_rcu(neigh_node_free_rcu) to kfree()
    batman,rcu: convert call_rcu(gw_node_free_rcu) to kfree_rcu
    net,rcu: convert call_rcu(kfree_tid_tx) to kfree_rcu()
    net,rcu: convert call_rcu(xt_osf_finger_free_rcu) to kfree_rcu()
    net/mac80211,rcu: convert call_rcu(work_free_rcu) to kfree_rcu()
    net,rcu: convert call_rcu(wq_free_rcu) to kfree_rcu()
    net,rcu: convert call_rcu(phonet_device_rcu_free) to kfree_rcu()
    perf,rcu: convert call_rcu(swevent_hlist_release_rcu) to kfree_rcu()
    perf,rcu: convert call_rcu(free_ctx) to kfree_rcu()
    net,rcu: convert call_rcu(__nf_ct_ext_free_rcu) to kfree_rcu()
    net,rcu: convert call_rcu(net_generic_release) to kfree_rcu()
    net,rcu: convert call_rcu(netlbl_unlhsh_free_addr6) to kfree_rcu()
    net,rcu: convert call_rcu(netlbl_unlhsh_free_addr4) to kfree_rcu()
    security,rcu: convert call_rcu(sel_netif_free) to kfree_rcu()
    net,rcu: convert call_rcu(xps_dev_maps_release) to kfree_rcu()
    net,rcu: convert call_rcu(xps_map_release) to kfree_rcu()
    net,rcu: convert call_rcu(rps_map_release) to kfree_rcu()
    ...

    Linus Torvalds
     
  • …kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (60 commits)
    sched: Fix and optimise calculation of the weight-inverse
    sched: Avoid going ahead if ->cpus_allowed is not changed
    sched, rt: Update rq clock when unthrottling of an otherwise idle CPU
    sched: Remove unused parameters from sched_fork() and wake_up_new_task()
    sched: Shorten the construction of the span cpu mask of sched domain
    sched: Wrap the 'cfs_rq->nr_spread_over' field with CONFIG_SCHED_DEBUG
    sched: Remove unused 'this_best_prio arg' from balance_tasks()
    sched: Remove noop in alloc_rt_sched_group()
    sched: Get rid of lock_depth
    sched: Remove obsolete comment from scheduler_tick()
    sched: Fix sched_domain iterations vs. RCU
    sched: Next buddy hint on sleep and preempt path
    sched: Make set_*_buddy() work on non-task entities
    sched: Remove need_migrate_task()
    sched: Move the second half of ttwu() to the remote cpu
    sched: Restructure ttwu() some more
    sched: Rename ttwu_post_activation() to ttwu_do_wakeup()
    sched: Remove rq argument from ttwu_stat()
    sched: Remove rq->lock from the first half of ttwu()
    sched: Drop rq->lock from sched_exec()
    ...

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Fix rt_rq runtime leakage bug

    Linus Torvalds
     
  • Kmemleak frees objects via RCU and when CONFIG_DEBUG_OBJECTS_RCU_HEAD
    is enabled, the RCU callback triggers a call to free_object() in
    lib/debugobjects.c. Since kmemleak is initialised before debug objects
    initialisation, it may result in a kernel panic during booting. This
    patch moves the kmemleak_init() call after debug_objects_mem_init().

    Reported-by: Marcin Slusarz
    Tested-by: Tejun Heo
    Signed-off-by: Catalin Marinas
    Cc:

    Catalin Marinas
     

12 May, 2011

1 commit


11 May, 2011

1 commit

  • This reverts commit 4a5fa3590f09, which did not allow SLUB to be used
    on architectures that use DISCONTIGMEM without compiling NUMA support
    unless CONFIG_BROKEN was also set.

    The slub panic that it was intended to prevent is addressed by
    d9b41e0b54fd ("[PARISC] set memory ranges in N_NORMAL_MEMORY when
    onlined") on parisc, so there are no further slub issues with such a
    configuration.

    The revert allows SLUB to be used on such architectures now, since
    there haven't been any reports of additional errors.

    Cc: James Bottomley
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

06 May, 2011

1 commit

  • Add priority boosting for TREE_PREEMPT_RCU, similar to that for
    TINY_PREEMPT_RCU. This is enabled by the default-off RCU_BOOST
    kernel parameter. The priority to which to boost preempted
    RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter
    (defaulting to real-time priority 1) and the time to wait before
    boosting the readers who are blocking a given grace period is
    controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to
    500 milliseconds).

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

28 Apr, 2011

1 commit


27 Apr, 2011

1 commit

  • The EXPERT menu list was recently broken by the insertion of a
    kconfig symbol (EMBEDDED) at the beginning of the EXPERT list of
    kconfig items. Broken by:

    commit 6a108a14fa356ef607be308b68337939e56ea94e
    Author: David Rientjes
    Date: Thu Jan 20 14:44:16 2011 -0800
    kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT

    Restore the EXPERT menu list -- don't inject a symbol (EMBEDDED)
    that does not depend on EXPERT into the list.

    Signed-off-by: Randy Dunlap
    Cc: David Rientjes
    Cc: Peter Foley
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

23 Apr, 2011

1 commit

  • Slub makes assumptions about page_to_nid() which are violated by
    DISCONTIGMEM and !NUMA. This violation results in a panic because
    page_to_nid() can be non-zero for pages in the discontiguous ranges and
    this leads to a null return by get_node(). The assertion by the
    maintainer is that DISCONTIGMEM should only be allowed when NUMA is also
    defined. However, at least six architectures: alpha, ia64, m32r, m68k,
    mips, parisc violate this. The panic is a regression against slab, so
    just mark slub broken in the problem configuration to prevent users
    reporting these panics.

    Cc: stable@kernel.org
    Acked-by: David Rientjes
    Acked-by: Pekka Enberg
    Signed-off-by: James Bottomley

    James Bottomley
     

15 Apr, 2011

2 commits

  • At the moment we have the CONFIG_KALLSYMS_EXTRA_PASS Kconfig switch,
    which users can enable or disable while configuring the kernel. This
    option is then used by 'make' to determine whether an extra kallsyms
    pass is needed or not.

    However, this approach is awkward and confusing, and this patch moves
    CONFIG_KALLSYMS_EXTRA_PASS from Kconfig to Makefile instead. The
    rationale is below.

    1. CONFIG_KALLSYMS_EXTRA_PASS is really about the build time, not
    run-time. There is no real need for it to be in Kconfig. It is
    just an additional work-around which should be used only in rare
    cases, when someone breaks kallsyms, so Kbuild/Makefile is a much
    better place for this option.
    2. Grepping CONFIG_KALLSYMS_EXTRA_PASS shows that many defconfigs have
    it enabled, probably not because they try to work around a kallsyms
    bug, but just because the Kconfig help text is confusing and does
    not really make it clear that this option should not be used
    except when kallsyms is broken.
    3. And since many people have CONFIG_KALLSYMS_EXTRA_PASS enabled in
    their Kconfig, we might fail to notice kallsyms bugs in time. E.g.,
    many testers use "make allyesconfig" to test builds, which will enable
    CONFIG_KALLSYMS_EXTRA_PASS and kallsyms breakage will not be noticed.

    To address that, this patch:

    1. Kills CONFIG_KALLSYMS_EXTRA_PASS
    2. Changes Makefile so that people can use "make KALLSYMS_EXTRA_PASS=1"
    to enable the extra pass if needed. Additionally, they may define
    KALLSYMS_EXTRA_PASS as an environment variable.
    3. By default KALLSYMS_EXTRA_PASS is disabled and if kallsyms has issues,
    "make" should print a warning and suggest using KALLSYMS_EXTRA_PASS

    Signed-off-by: Artem Bityutskiy
    [mmarek: Removed make help text, is not necessary]
    Signed-off-by: Michal Marek

    Artem Bityutskiy
     
  • Dumb users like myself are not able to grasp from the existing KALLSYMS_ALL
    documentation that this option is not what they need. Improve the help
    message and make it clearer that KALLSYMS is enough in the majority of
    use cases, and KALLSYMS_ALL should really be used very rarely.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Michal Marek

    Artem Bityutskiy
     

14 Apr, 2011

1 commit

  • Now that we've removed the rq->lock requirement from the first part of
    ttwu() and can compute placement without holding any rq->lock, ensure
    we execute the second half of ttwu() on the actual cpu we want the
    task to run on.

    This avoids having to take rq->lock and doing the task enqueue
    remotely, saving lots on cacheline transfers.

    As measured using: http://oss.oracle.com/~mason/sembench.c

    $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done
    $ echo 4096 32000 64 128 > /proc/sys/kernel/sem
    $ ./sembench -t 2048 -w 1900 -o 0

    unpatched: run time 30 seconds 647278 worker burns per second
    patched: run time 30 seconds 816715 worker burns per second

    Reviewed-by: Frank Rowand
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110405152729.515897185@chello.nl

    Peter Zijlstra
     

31 Mar, 2011

1 commit


24 Mar, 2011

2 commits

  • The expected course of development for user namespaces targeted
    capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.

    Goals:

    - Make it safe for an unprivileged user to unshare namespaces. They
    will be privileged with respect to the new namespace, but this should
    only include resources which the unprivileged user already owns.

    - Provide separate limits and accounting for userids in different
    namespaces.

    Status:

    Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
    get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
    CAP_SETGID capabilities. What this gets you is a whole new set of
    userids, meaning that user 500 will have a different 'struct user' in
    your namespace than in other namespaces. So any accounting information
    stored in struct user will be unique to your namespace.

    However, throughout the kernel there are checks which

    - simply check for a capability. Since root in a child namespace
    has all capabilities, this means that a child namespace is not
    constrained.

    - simply compare uid1 == uid2. Since these are the integer uids,
    uid 500 in namespace 1 will be said to be equal to uid 500 in
    namespace 2.
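
    The second kind of check can be illustrated with a small model. This is
    a sketch only: the Cred class and both comparison helpers are
    hypothetical names and do not correspond to kernel structures or
    functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cred:
    """Toy credential: a namespace label plus an integer uid (both hypothetical)."""
    user_ns: str
    uid: int


def same_user_naive(a, b):
    """Compares only the integer uids, as in the 'uid1 == uid2' checks described above."""
    return a.uid == b.uid


def same_user_ns_aware(a, b):
    """Treats two users as equal only if both the namespace and the uid match."""
    return a.user_ns == b.user_ns and a.uid == b.uid


host_user = Cred(user_ns="ns1", uid=500)
container_user = Cred(user_ns="ns2", uid=500)

print(same_user_naive(host_user, container_user))     # the two users are conflated
print(same_user_ns_aware(host_user, container_user))  # the two users are kept apart
```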

    As a result, the lxc implementation at lxc.sf.net does not use user
    namespaces. This is actually helpful because it leaves us free to
    develop user namespaces in such a way that, for some time, user
    namespaces may not be useful.

    Bugs aside, this patchset should not affect systems that are not
    actively using user namespaces at all, and should only restrict what
    tasks in a child user namespace can do. The patches begin to limit privilege to a user
    namespace, so that root in a container cannot kill or ptrace tasks in the
    parent user namespace, and can only get world access rights to files.
    Since all files currently belong to the initial user namespace, that means
    that child user namespaces can only get world access rights to *all*
    files. While this temporarily makes user namespaces bad for system
    containers, it starts to get useful for some sandboxing.

    I've run the 'runltplite.sh' with and without this patchset and found no
    difference.

    This patch:

    copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
    So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
    will have the new user namespace as its owner. That is what we want,
    since we want root in that new userns to be able to have privilege over
    it.

    Changelog:
    Feb 15: don't set uts_ns->user_ns if we didn't create
    a new uts_ns.
    Feb 23: Move extern init_user_ns declaration from
    init/version.c to utsname.h.
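
    The ownership rule described above, including the Feb 15 changelog
    entry, can be modelled in a few lines. This is a toy model, not the
    kernel code: the classes and the function copy_namespaces_sketch are
    hypothetical; only the two flag values are the real clone(2) constants.

```python
CLONE_NEWUTS = 0x04000000
CLONE_NEWUSER = 0x10000000


class UserNs:
    """Toy user namespace, identified by a label."""
    def __init__(self, name):
        self.name = name


class UtsNs:
    """Toy uts namespace that records the user namespace owning it."""
    def __init__(self, user_ns):
        self.user_ns = user_ns


def copy_namespaces_sketch(flags, parent_user_ns, parent_uts_ns):
    """Handle CLONE_NEWUSER first, then CLONE_NEWUTS, mirroring the described order."""
    user_ns = UserNs("child") if flags & CLONE_NEWUSER else parent_user_ns
    if flags & CLONE_NEWUTS:
        # A freshly created uts namespace is owned by the (possibly new) user namespace.
        uts_ns = UtsNs(user_ns)
    else:
        # No new uts namespace: leave the existing one and its owner untouched.
        uts_ns = parent_uts_ns
    return user_ns, uts_ns


init_user = UserNs("init")
init_uts = UtsNs(init_user)

both_user, both_uts = copy_namespaces_sketch(
    CLONE_NEWUSER | CLONE_NEWUTS, init_user, init_uts)
only_user, shared_uts = copy_namespaces_sketch(CLONE_NEWUSER, init_user, init_uts)
```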

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This patchset is a cleanup and a preparation to unshare the pid namespace.
    These prerequisites prepare for Eric's patchset to give a file descriptor
    to a namespace and join an existing namespace.

    This patch:

    It turns out that the existing assignment of the child_reaper in
    copy_process can also handle its initial assignment; we just need to
    generalize the test in kernel/fork.c.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Daniel Lezcano
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

23 Mar, 2011

5 commits

  • In do_mounts_rd() if memory cannot be allocated, return -ENOMEM.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Systems with unmaskable interrupts such as SMIs may massively
    underestimate loops_per_jiffy, and fail to converge anywhere near the real
    value. A case seen on x86_64 was an initial estimate of 256<<<
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Tested-by: Stephen Boyd
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     
  • Binary chop with a jiffy-resync on each step to find an upper bound is
    slow, so just race in a tight-ish loop to find an underestimate.

    If done with lots of individual steps, sometimes several hundreds of
    iterations would be required, which would impose a significant overhead,
    and make the initial estimate very low. By taking slowly increasing steps
    there will be less overhead.
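
    The effect of slowly increasing steps can be sketched with a small
    counting model. The schedule below (step size b repeated 2**b times
    before moving to b + 1) is an assumption chosen for illustration, not
    code taken from the patch; the target of 644 small delays is the figure
    from the example that follows.

```python
def iterations_fixed(target_units):
    """Every iteration covers one small delay, so the count equals the target."""
    return target_units


def iterations_accelerating(target_units):
    """Count iterations when step size b is used 2**b times before growing (assumed schedule)."""
    band = 0            # current step size, in units of one small delay
    trial_in_band = 0   # iterations spent at the current step size
    covered = 0
    iterations = 0
    while covered < target_units:
        trial_in_band += 1
        if trial_in_band == (1 << band):
            band += 1
            trial_in_band = 0
        covered += band
        iterations += 1
    return iterations


TARGET = 644
print(iterations_fixed(TARGET), iterations_accelerating(TARGET))
```

    Under this assumed schedule the loop needs far fewer iterations than
    one-unit steps would, so the per-iteration overhead is paid much less
    often.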

    E.g. an x86_64 2.67GHz could have fitted in 613 individual small delays,
    but in reality should have been able to fit in a single delay 644 times
    longer, so underestimated by 31 steps. To reach the equivalent of 644
    small delays with the accelerating scheme now requires about 130
    iterations, so has
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Tested-by: Stephen Boyd
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     
  • The motivation for this patch series is that currently our OMAP calibrates
    itself using the trial-and-error binary chop fallback that some other
    architectures no longer need to perform. This is a lengthy process,
    taking 0.2s in an environment where boot time is of great interest.

    Patch 2/4 has two optimisations. Firstly, it replaces the initial
    repeated doubling to find the relevant power of 2 with a tight loop that
    just does as much as it can in a jiffy. Secondly, it doesn't binary chop
    over an entire power of 2 range; it chooses a much smaller range based on
    how much it squeezed in, and failed to squeeze in, during the first stage.
    Both are significant optimisations, and bring our calibration down from
    23 jiffies to 5, and, in the process, often arrive at a more accurate lpj
    value.
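
    The second optimisation can be sketched as a binary chop whose cost
    depends on the width of the starting bracket. This is a simulation
    under stated assumptions: fits_in_jiffy stands in for actually timing a
    delay loop, TRUE_LPJ is borrowed from the log output quoted further
    down purely as a plausible value, and the tolerance and the narrow
    bracket are arbitrary.

```python
TRUE_LPJ = 3407872      # hypothetical "real" value, used only by the simulation
TOLERANCE = 8192        # arbitrary stopping width for the chop


def fits_in_jiffy(lpj, true_lpj=TRUE_LPJ):
    """Stand-in for timing a delay: True if lpj loops finish within one jiffy."""
    return lpj <= true_lpj


def binary_chop(lo, hi, tolerance=TOLERANCE):
    """Shrink [lo, hi) around the true value; lo must fit and hi must not.

    Returns the final bracket and the number of timed probes used.
    """
    probes = 0
    while hi - lo > tolerance:
        mid = (lo + hi) // 2
        probes += 1
        if fits_in_jiffy(mid):
            lo = mid
        else:
            hi = mid
    return lo, hi, probes


# Chop over a whole power-of-2 range versus a narrower bracket from a first stage.
wide_lo, wide_hi, wide_probes = binary_chop(1 << 21, 1 << 22)
narrow_lo, narrow_hi, narrow_probes = binary_chop(3400000, 3465536)
print(wide_probes, narrow_probes)
```

    Each probe in the real routine costs at least a jiffy of waiting, so a
    tighter starting bracket translates directly into boot time saved.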

    The 'bands' and 'sub-logarithmic' growth may look over-engineered, but
    they only cost a small level of inaccuracy in the initial guess (for all
    architectures) in order to avoid the very large inaccuracies that appeared
    during testing (on x86_64 architectures, and presumably others with less
    metronomic operation). Note that due to the existence of the TSC and
    other timers, the x86_64 will not typically use this fallback routine, but
    I wanted to code defensively, able to cope with all kinds of processor
    behaviours and kernel command line options.

    Patch 3/4 is an additional trap for the nightmare scenario where the
    initial estimate is very inaccurate, possibly due to things like SMIs.
    It simply retries with a larger bound.
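
    The retry idea can be sketched as follows. This is a simulation, not
    the patch itself: all function names, the starting window of a quarter
    of the guess, and the doubling of the window on each retry are
    assumptions made for illustration.

```python
def fits_in_jiffy(lpj, true_lpj):
    """Stand-in for timing a delay: True if lpj loops finish within one jiffy."""
    return lpj <= true_lpj


def chop(lo, hi, true_lpj):
    """Largest value in [lo, hi] that still fits, assuming lo itself fits."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits_in_jiffy(mid, true_lpj):
            lo = mid
        else:
            hi = mid - 1
    return lo


def calibrate_with_retry(guess, true_lpj, max_retries=16):
    """Chop in a window around guess; widen the window if the answer hits its edge."""
    span = max(guess // 4, 1)
    for attempt in range(max_retries):
        lo, hi = max(guess - span, 0), guess + span
        if fits_in_jiffy(lo, true_lpj):
            result = chop(lo, hi, true_lpj)
            if result < hi:
                # Converged strictly inside the window, so the bound was large enough.
                return result, attempt
        span *= 2  # the initial estimate was too far off: retry with a larger bound
    raise RuntimeError("calibration did not converge")


good = calibrate_with_retry(guess=5000, true_lpj=5100)
bad = calibrate_with_retry(guess=1000, true_lpj=5000)
print(good, bad)
```

    A good initial estimate converges on the first attempt, while a badly
    underestimated one costs a few extra rounds but still reaches the right
    value.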

    Stephen said:

    : I tried this patch set out on an MSM7630.
    :
    : Before:
    :
    : Calibrating delay loop... 681.57 BogoMIPS (lpj=3407872)
    :
    : After:
    :
    : Calibrating delay loop... 680.75 BogoMIPS (lpj=3403776)
    :
    : But the really good news is calibration time dropped from ~247ms to ~56ms.
    : Sadly we won't be able to benefit from this should my udelay patches make
    : it into ARM because we would be using calibrate_delay_direct() instead (at
    : least on machines who choose to). Can we somehow reapply the logic behind
    : this to calibrate_delay_direct()? That would be even better, but this is
    : definitely a boot time improvement.
    :
    : Or maybe we could just replace calibrate_delay_direct() with this fallback
    : calculation? If __delay() is a thin wrapper around read_current_timer()
    : it should work just as well (plus patch 3 makes it handle SMIs). I'll try
    : that out.
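
    The BogoMIPS figures in the quoted output follow from the lpj values by
    integer arithmetic. A minimal sketch, assuming HZ=100 for that board
    (an assumption; the log does not state it, but it reproduces both
    printed figures); the function name is made up for illustration.

```python
HZ = 100  # assumed tick rate; not stated in the quoted log


def bogomips_string(lpj, hz=HZ):
    """Format loops-per-jiffy as BogoMIPS with two truncated decimal places."""
    whole = lpj // (500000 // hz)
    hundredths = (lpj // (5000 // hz)) % 100
    return f"{whole}.{hundredths:02d}"


before = bogomips_string(3407872)
after = bogomips_string(3403776)
print(before, after)
```

    The two estimates differ by about 0.1%, while the calibration time
    reported above fell from ~247ms to ~56ms.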

    This patch:

    ... so that it can be modified more clinically.

    This is almost entirely cosmetic. The only change to the operation
    is that the global variable is only set once after the estimation is
    completed, rather than taking on all the intermediate values. However,
    there are no readers of that variable, so this change is unimportant.

    Signed-off-by: Phil Carmody
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Tested-by: Stephen Boyd
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     
    Move setup_nr_cpu_ids(), smp_init() and some other SMP boot parameter
    setup functions from init/main.c to kernel/smp.c, which saves some
    #ifdef CONFIG_SMP blocks.

    Signed-off-by: WANG Cong
    Cc: Rakib Mullick
    Cc: David Howells
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Arnd Bergmann
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang