21 Dec, 2012

1 commit

  • Pull signal handling cleanups from Al Viro:
    "sigaltstack infrastructure + conversion for x86, alpha and um,
    COMPAT_SYSCALL_DEFINE infrastructure.

    Note that there are several conflicts between "unify
    SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
    resolution is trivial - just remove definitions of SS_ONSTACK and
    SS_DISABLE from arch/*/uapi/asm/signal.h; they are all identical and
    include/uapi/linux/signal.h contains the unified variant."

    Fixed up conflicts as per Al.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to generic sigaltstack
    new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
    generic compat_sys_sigaltstack()
    introduce generic sys_sigaltstack(), switch x86 and um to it
    new helper: compat_user_stack_pointer()
    new helper: restore_altstack()
    unify SS_ONSTACK/SS_DISABLE definitions
    new helper: current_user_stack_pointer()
    missing user_stack_pointer() instances
    Bury the conditionals from kernel_thread/kernel_execve series
    COMPAT_SYSCALL_DEFINE: infrastructure

    Linus Torvalds
     

20 Dec, 2012

1 commit

  • All architectures have
    CONFIG_GENERIC_KERNEL_THREAD
    CONFIG_GENERIC_KERNEL_EXECVE
    __ARCH_WANT_SYS_EXECVE
    None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
    of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
    Kill the conditionals and make both callers use do_execve().

    Signed-off-by: Al Viro

    Al Viro
     

19 Dec, 2012

2 commits

  • The page allocator is able to bind a page to a memcg when it is
    allocated. But for the caches, we'd like to have as many objects as
    possible in a page belonging to the same cache.

    This is done in this patch by calling memcg_kmem_get_cache in the
    beginning of every allocation function. This function is patched out by
    static branches when kernel memory controller is not being used.

    It assumes that the task allocating, which determines the memcg in the
    page allocator, belongs to the same cgroup throughout the whole process.
    Misaccounting can happen if the task calls memcg_kmem_get_cache() while
    belonging to a cgroup, and later on changes. This is considered
    acceptable, and should only happen upon task migration.

    Before the cache is created by the memcg core, there is also a possible
    imbalance: the task belongs to a memcg, but the cache being allocated from
    is the global cache, since the child cache is not yet guaranteed to be
    ready. This case is also fine, since in this case the GFP_KMEMCG will not
    be passed and the page allocator will not attempt any cgroup accounting.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Add the basic infrastructure for the accounting of kernel memory. To
    control that, the following files are created:

    * memory.kmem.usage_in_bytes
    * memory.kmem.limit_in_bytes
    * memory.kmem.failcnt
    * memory.kmem.max_usage_in_bytes

    They have the same meaning as their user memory counterparts. They
    reflect the state of the "kmem" res_counter.

    Per cgroup kmem memory accounting is not enabled until a limit is set for
    the group. Once the limit is set the accounting cannot be disabled for
    that group. This means that after the patch is applied, no behavioral
    changes exist for whoever is still using memcg to control their memory
    usage, until memory.kmem.limit_in_bytes is set for the first time.

    We always account to both user and kernel resource_counters. This
    effectively means that an independent kernel limit is in place when the
    limit is set to a lower value than the user memory. An equal or higher
    value means that the user limit will always hit first, meaning that kmem
    is effectively unlimited.

    People who want to track kernel memory but not limit it can set this
    limit to a very high number (such as RESOURCE_MAX - 1 page) that no one
    will ever hit, or to a value equal to the user memory limit.
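
    The interaction between the two limits can be illustrated with a toy
    model. The sketch below is Python, not kernel code: the names Counter,
    MemcgModel and charge_kmem are hypothetical, and the model captures only
    the rule stated above, that a kernel-memory charge is accounted to both
    counters.

```python
class Counter:
    """Toy stand-in for a res_counter: usage, limit and failure count."""

    def __init__(self, limit):
        self.limit = limit
        self.usage = 0
        self.failcnt = 0

    def can_charge(self, nbytes):
        return self.usage + nbytes <= self.limit


class MemcgModel:
    """Hypothetical model: kernel memory is charged to both counters."""

    def __init__(self, user_limit, kmem_limit):
        self.user = Counter(user_limit)
        self.kmem = Counter(kmem_limit)

    def charge_kmem(self, nbytes):
        # Return the name of the limit that was hit, or None on success.
        for name, counter in (("kmem", self.kmem), ("user", self.user)):
            if not counter.can_charge(nbytes):
                counter.failcnt += 1
                return name
        # Both counters have room, so account the charge to both.
        self.kmem.usage += nbytes
        self.user.usage += nbytes
        return None
```

    With a kmem limit below the user limit, charge_kmem() in this model
    reports "kmem" as the limit that was hit; with a kmem limit above the
    user limit, it reports "user", so the kmem limit never comes into play.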

    [akpm@linux-foundation.org: MEMCG_KMEM only works with slab and slub]
    Signed-off-by: Glauber Costa
    Acked-by: Kamezawa Hiroyuki
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: JoonSoo Kim
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     

18 Dec, 2012

2 commits

  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc/<pid>/ns/ have been converted from regular files
    to magic symlinks, which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_may_access permission
    checks are always applied.

    The files under /proc/<pid>/ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allow the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc/<pid>/ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
     
  • Pull block driver update from Jens Axboe:
    "Now that the core bits are in, here are the driver bits for 3.8. The
    branch contains:

    - A huge pile of drbd bits that were dumped from the 3.7 merge
    window. Following that, it was both made perfectly clear that
    there is going to be no more over-the-wall pulls and how the
    situation on individual pulls can be improved.

    - A few cleanups from Akinobu Mita for drbd and cciss.

    - Queue improvement for loop from Lukas. This grew into adding a
    generic interface for waiting/checking an event with a specific
    lock, allowing this to be pulled out of md and now loop and drbd are
    also using it.

    - A few fixes for xen back/front block driver from Roger Pau Monne.

    - Partition improvements from Stephen Warren, allowing partition UUID
    to be used as an identifier."

    * 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
    drbd: update Kconfig to match current dependencies
    drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
    drbd: close race between drbd_set_role and drbd_connect
    drbd: respect no-md-barriers setting also when changed online via disk-options
    drbd: Remove obsolete check
    drbd: fixup after wait_even_lock_irq() addition to generic code
    loop: Limit the number of requests in the bio list
    wait: add wait_event_lock_irq() interface
    xen-blkfront: free allocated page
    xen-blkback: move free persistent grants code
    block: partition: msdos: provide UUIDs for partitions
    init: reduce PARTUUID min length to 1 from 36
    block: store partition_meta_info.uuid as a string
    cciss: use check_signature()
    cciss: cleanup bitops usage
    drbd: use copy_highpage
    drbd: if the replication link breaks during handshake, keep retrying
    drbd: check return of kmalloc in receive_uuids
    drbd: Broadcast sync progress no more often than once per second
    drbd: don't try to clear bits once the disk has failed
    ...

    Linus Torvalds
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

16 Dec, 2012

1 commit

  • This reverts commit bd52276fa1d4 ("x86-64/efi: Use EFI to deal with
    platform wall clock (again)"), and the two supporting commits:

    da5a108d05b4: "x86/kernel: remove tboot 1:1 page table creation code"

    185034e72d59: "x86, efi: 1:1 pagetable mapping for virtual EFI calls"

    as they all depend semantically on commit 53b87cf088e2 ("x86, mm:
    Include the entire kernel memory map in trampoline_pgd") that got
    reverted earlier due to the problems it caused.

    This was pointed out by Yinghai Lu, and verified by me on my Macbook Air
    that uses EFI.

    Pointed-out-by: Yinghai Lu
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Dec, 2012

1 commit

  • Pull x86 EFI update from Peter Anvin:
    "EFI tree, from Matt Fleming. Most of the patches are the new efivarfs
    filesystem by Matt Garrett & co. The balance are support for EFI
    wallclock in the absence of a hardware-specific driver, and various
    fixes and cleanups."

    * 'core-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    efivarfs: Make efivarfs_fill_super() static
    x86, efi: Check table header length in efi_bgrt_init()
    efivarfs: Use query_variable_info() to limit kmalloc()
    efivarfs: Fix return value of efivarfs_file_write()
    efivarfs: Return a consistent error when efivarfs_get_inode() fails
    efivarfs: Make 'datasize' unsigned long
    efivarfs: Add unique magic number
    efivarfs: Replace magic number with sizeof(attributes)
    efivarfs: Return an error if we fail to read a variable
    efi: Clarify GUID length calculations
    efivarfs: Implement exclusive access for {get,set}_variable
    efivarfs: efivarfs_fill_super() ensure we clean up correctly on error
    efivarfs: efivarfs_fill_super() ensure we free our temporary name
    efivarfs: efivarfs_fill_super() fix inode reference counts
    efivarfs: efivarfs_create() ensure we drop our reference on inode on error
    efivarfs: efivarfs_file_read ensure we free data in error paths
    x86-64/efi: Use EFI to deal with platform wall clock (again)
    x86/kernel: remove tboot 1:1 page table creation code
    x86, efi: 1:1 pagetable mapping for virtual EFI calls
    x86, mm: Include the entire kernel memory map in trampoline_pgd
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

11 Dec, 2012

2 commits

  • This patch adds Kconfig options and kernel parameters to allow the
    enabling and disabling of automatic NUMA balancing. The existence
    of such a switch was and is very important when debugging problems
    related to transparent hugepages and we should have the same for
    automatic NUMA placement.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Implement pte_numa and pmd_numa.

    We must atomically set the numa bit and clear the present bit to
    define a pte_numa or pmd_numa.

    Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
    a thread touches a virtual address in the corresponding virtual range,
    a NUMA hinting page fault will trigger. The NUMA hinting page fault
    will clear the NUMA bit and set the present bit again to resolve the
    page fault.

    The expectation is that a NUMA hinting page fault is used as part
    of a placement policy that decides if a page should remain on the
    current node or be migrated to a different node.
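
    The state transitions described above can be sketched with plain
    integers. This is a Python illustration only: the bit positions and the
    function names mark_numa, resolve_hinting_fault and is_numa are
    hypothetical, and the sketch says nothing about the atomicity the real
    implementation requires.

```python
# Hypothetical bit positions, chosen only for this illustration;
# real architectures define their own page-table bit layouts.
PTE_PRESENT = 1 << 0
PTE_NUMA = 1 << 1


def mark_numa(pte):
    """Mark an entry for NUMA hinting: set the numa bit, clear present."""
    return (pte | PTE_NUMA) & ~PTE_PRESENT


def resolve_hinting_fault(pte):
    """Resolve a hinting fault: clear the numa bit, set present again."""
    return (pte & ~PTE_NUMA) | PTE_PRESENT


def is_numa(pte):
    """True if the entry is in the NUMA-hinting state (numa set, present clear)."""
    return (pte & (PTE_NUMA | PTE_PRESENT)) == PTE_NUMA
```

    In this model, marking an entry and then resolving the resulting fault
    returns the entry to its original value, leaving all other bits
    untouched.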

    Acked-by: Rik van Riel
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman

    Andrea Arcangeli
     

03 Dec, 2012

1 commit

  • …/linux-rcu into core/rcu

    Conflicts:
    arch/x86/kernel/ptrace.c

    Pull the latest RCU tree from Paul E. McKenney:

    " The major features of this series are:

    1. A first version of no-callbacks CPUs. This version prohibits
    offlining CPU 0, but only when enabled via CONFIG_RCU_NOCB_CPU=y.
    Relaxing this constraint is in progress, but not yet ready
    for prime time. These commits were posted to LKML at
    https://lkml.org/lkml/2012/10/30/724, and are at branch rcu/nocb.

    2. Changes to SRCU that allow statically initialized srcu_struct
    structures. These commits were posted to LKML at
    https://lkml.org/lkml/2012/10/30/296, and are at branch rcu/srcu.

    3. Restructuring of RCU's debugfs output. These commits were posted
    to LKML at https://lkml.org/lkml/2012/10/30/341, and are at
    branch rcu/tracing.

    4. Additional CPU-hotplug/RCU improvements, posted to LKML at
    https://lkml.org/lkml/2012/10/30/327, and are at branch rcu/hotplug.
    Note that the commit eliminating __stop_machine() was judged to
    be too high a risk, so it is deferred to 3.9.

    5. Changes to RCU's idle interface, most notably a new module
    parameter that redirects normal grace-period operations to
    their expedited equivalents. These were posted to LKML at
    https://lkml.org/lkml/2012/10/30/739, and are at branch rcu/idle.

    6. Additional diagnostics for RCU's CPU stall warning facility,
    posted to LKML at https://lkml.org/lkml/2012/10/30/315, and
    are at branch rcu/stall. The most notable change reduces the
    default RCU CPU stall-warning time from 60 seconds to 21 seconds,
    so that it once again happens sooner than the softlockup timeout.

    7. Documentation updates, which were posted to LKML at
    https://lkml.org/lkml/2012/10/30/280, and are at branch rcu/doc.
    A couple of late-breaking changes were posted at
    https://lkml.org/lkml/2012/11/16/634 and
    https://lkml.org/lkml/2012/11/16/547.

    8. Miscellaneous fixes, which were posted to LKML at
    https://lkml.org/lkml/2012/10/30/309, along with a late-breaking
    change posted at Fri, 16 Nov 2012 11:26:25 -0800 with message-ID
    <20121116192625.GA447@linux.vnet.ibm.com>, but which lkml.org
    seems to have missed. These are at branch rcu/fixes.

    9. Finally, a fix for a lockdep-RCU splat was posted to LKML
    at https://lkml.org/lkml/2012/11/7/486. This is at rcu/next. "

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

01 Dec, 2012

1 commit

  • Create a new subsystem that probes on kernel boundaries
    to keep track of the transitions between level contexts
    with two basic initial contexts: user or kernel.

    This is an abstraction of some RCU code that uses such tracking
    to implement its userspace extended quiescent state.

    We need to pull this up from RCU into this new level of indirection
    because this tracking is also going to be used to implement an "on
    demand" generic virtual cputime accounting, a necessary step to
    shut down the tick while still accounting the cputime.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Li Zhong
    Cc: Gilad Ben-Yossef
    Reviewed-by: Steven Rostedt
    [ paulmck: fix whitespace error and email address. ]
    Signed-off-by: Paul E. McKenney

    Frederic Weisbecker
     

23 Nov, 2012

3 commits

  • The MSDOS/MBR partition table includes a 32-bit unique ID, often referred
    to as the NT disk signature. When combined with a partition number within
    the table, this can form a unique ID similar in concept to EFI/GPT's
    partition UUID. Constructing and recording this value in struct
    partition_meta_info allows MSDOS partitions to be referred to on the
    kernel command-line using the following syntax:

    root=PARTUUID=0002dd75-01
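
    As an illustration, the ID can be derived from a raw MBR sector. The
    Python sketch below is not the kernel implementation: the function name
    msdos_partuuid is hypothetical, it assumes the disk signature is the
    little-endian 32-bit value at byte offset 440 of the sector, and it
    assumes the partition number is printed as two hexadecimal digits, as
    the example above suggests.

```python
import struct

MBR_DISK_SIGNATURE_OFFSET = 440  # 0x1B8: 32-bit little-endian disk signature


def msdos_partuuid(mbr_sector, partition_number):
    """Build an 'ssssssss-pp' ID from an MBR sector and a 1-based partition number."""
    if len(mbr_sector) < 512:
        raise ValueError("need a full 512-byte MBR sector")
    (signature,) = struct.unpack_from("<I", mbr_sector, MBR_DISK_SIGNATURE_OFFSET)
    # Signature as eight hex digits, partition number as two (an assumption).
    return "%08x-%02x" % (signature, partition_number)
```

    The first sector of a disk could be obtained with something like
    open("/dev/sda", "rb").read(512), where the device path is only an
    example and reading it normally requires root privileges.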

    Signed-off-by: Stephen Warren
    Cc: Tejun Heo
    Cc: Will Drewry
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Stephen Warren
     
  • Reduce the minimum length for a root=PARTUUID= parameter to be considered
    valid from 36 to 1. EFI/GPT partition UUIDs are always exactly 36
    characters long, hence the previous limit. However, the next patch will
    support DOS/MBR UUIDs too, which have a different, shorter, format.
    Instead of validating any particular length, just ensure that at least
    some non-empty value was given by the user.

    Also, consider a missing UUID value to be a parsing error, in the same
    vein as if /PARTNROFF exists and can't be parsed. As such, make both
    error cases print a message and disable rootwait. Convert to pr_err while
    we're at it.
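
    The validation rules can be sketched as follows. This is a Python
    illustration, not the kernel parser: the function name parse_partuuid
    is hypothetical, and it assumes the optional suffix takes the form
    /PARTNROFF=<integer>.

```python
def parse_partuuid(value):
    """Split 'UUID[/PARTNROFF=n]' into (uuid, offset); raise ValueError on bad input."""
    uuid, slash, rest = value.partition("/")
    if not uuid:
        # A missing UUID is a parsing error; no particular length is required.
        raise ValueError("empty PARTUUID")
    offset = 0
    if slash:
        key, eq, number = rest.partition("=")
        if key != "PARTNROFF" or not eq:
            raise ValueError("unparsable option after '/'")
        offset = int(number)  # raises ValueError if not an integer
    return uuid, offset
```

    Both error cases surface as ValueError here, mirroring the idea that a
    missing UUID and an unparsable /PARTNROFF are handled the same way.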

    Signed-off-by: Stephen Warren
    Cc: Tejun Heo
    Cc: Will Drewry
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Stephen Warren
     
  • This will allow other types of UUID to be stored here, aside from true
    UUIDs. This also simplifies code that uses this field, since it's usually
    constructed from, used as, or compared to other strings.

    Note: A simplistic approach here would be to set uuid_str[36]=0 whenever a
    /PARTNROFF option was found to be present. However, this modifies the
    input string, and causes subsequent calls to devt_from_partuuid() not to
    see the /PARTNROFF option, which causes different results. In order to
    avoid misleading future maintainers, this parameter is marked const.

    Signed-off-by: Stephen Warren
    Cc: Tejun Heo
    Cc: Will Drewry
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Stephen Warren
     

20 Nov, 2012

1 commit

  • Assign a unique proc inode to each namespace, and use that
    inode number to ensure we only allocate at most one proc
    inode for every namespace in proc.

    A single proc inode per namespace allows userspace to test
    to see if two processes are in the same namespace.
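
    From userspace, such a test might look like the Python sketch below.
    The function name same_namespace and its proc_root parameter are
    hypothetical; the sketch assumes the per-process namespace files live
    under <proc_root>/<pid>/ns/<name> and that the caller has permission to
    stat them.

```python
import os


def same_namespace(pid_a, pid_b, ns_name, proc_root="/proc"):
    """Return True if both processes' ns entries resolve to the same inode."""
    stat_a = os.stat(os.path.join(proc_root, str(pid_a), "ns", ns_name))
    stat_b = os.stat(os.path.join(proc_root, str(pid_b), "ns", ns_name))
    # Compare the device as well, since inode numbers are only unique per filesystem.
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)
```

    For example, same_namespace(os.getpid(), 1, "net") would ask whether
    the calling process shares a network namespace with pid 1.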

    This has been a long requested feature and only blocked because
    a naive implementation would put the id in a global space and
    would ultimately require having a namespace for the names of
    namespaces, making migration and certain virtualization tricks
    impossible.

    We still don't have per superblock inode numbers for proc, which
    appears necessary for application unaware checkpoint/restart and
    migrations (if the application is using namespace file descriptors)
    but that is now allowed by the design if it becomes important.

    I have preallocated the ipc and uts initial proc inode numbers so
    their structures can be statically initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

19 Nov, 2012

1 commit

  • Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
    for the system init process, and another way for pid namespace
    init processes, test pid->nr == 1 and use the same code for both.

    For the global init this results in SIGNAL_UNKILLABLE being set
    much earlier in the initialization process.

    This is a small cleanup and it paves the way for allowing unshare and
    enter of the pid namespace, as that path, like our global init, also will
    not set CLONE_NEWPID.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

17 Nov, 2012

1 commit

  • RCU callback execution can add significant OS jitter and also can
    degrade both scheduling latency and, in asymmetric multiprocessors,
    energy efficiency. This commit therefore adds the ability for selected
    CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
    to kthreads. If the "rcu_nocb_poll" boot parameter is also specified,
    these kthreads will do polling, removing the need for the offloaded
    CPUs to do wakeups. At least one CPU must be doing normal callback
    processing: currently CPU 0 cannot be selected as a no-CBs CPU.
    In addition, attempts to offline the last normal-CBs CPU will fail.

    This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
    this commit includes fixes to problems located by Fengguang Wu's
    kbuild test robot.

    [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

15 Nov, 2012

2 commits

  • Use kuid_t and kgid_t in struct fuse_conn and struct fuse_mount_data.

    The connection between a fuse filesystem and a fuse daemon is
    established when a fuse filesystem is mounted and provided with a file
    descriptor the fuse daemon created by opening /dev/fuse.

    For now restrict the communication of uids and gids between the fuse
    filesystem and the fuse daemon to the initial user namespace. Enforce
    this by verifying the file descriptor passed to the mount of fuse was
    opened in the initial user namespace. Ensuring the mount happens in
    the initial user namespace is not necessary as mounts from non-initial
    user namespaces are not yet allowed.

    In fuse_req_init_context convert the current fsuid and fsgid into the
    initial user namespace for the request that will be sent to the fuse
    daemon.

    In fuse_fill_attr convert the uid and gid passed from the fuse daemon
    from the initial user namespace into kuids and kgids.

    In iattr_to_fattr called from fuse_setattr convert kuids and kgids
    into the uids and gids in the initial user namespace before passing
    them to the fuse filesystem.

    In fuse_change_attributes_common called from fuse_dentry_revalidate,
    fuse_permission, fuse_getattr, fuse_setattr, and fuse_iget convert
    the uid and gid from the fuse daemon into a kuid and a kgid to store
    on the fuse inode.

    By default fuse mounts are restricted to tasks whose uid, suid, and
    euid match the fuse user_id and whose gid, sgid, and egid match
    the fuse group id. Convert the user_id and group_id mount options
    into kuids and kgids at mount time, and use uid_eq and gid_eq to
    compare them in fuse_allow_task.

    Cc: Miklos Szeredi
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Use kuid_t and kgid_t in struct autofs_info and struct autofs_wait_queue.

    When creating directories and symlinks default the uid and gid of
    the mount requester to the global root uid and gid. autofs4_wait
    will update these fields when a mount is requested.

    When generating autofsv5 packets report the uid and gid of the mount
    requester in the user namespace of the process that opened the pipe,
    reporting unmapped uids and gids as overflowuid and overflowgid.

    In autofs_dev_ioctl_requester return the uid and gid of the last mount
    requester converted into the calling process's user namespace. When the
    uid or gid don't map return overflowuid and overflowgid as appropriate,
    allowing failure to find a mount requester to be distinguished from
    failure to map a mount requester.
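
    The fallback to the overflow id can be modelled as below. This is a
    simplified Python sketch, not the kernel's mapping code: the function
    names to_namespace_id and report_id are hypothetical, the map is a list
    of (inside_first, outside_first, count) extents in the style of a
    uid_map file, and a single level of namespace nesting is assumed.

```python
OVERFLOW_ID = 65534  # default value of overflowuid and overflowgid on Linux


def to_namespace_id(id_map, kernel_id):
    """Translate a kernel-wide id into a namespace id, or return None if unmapped.

    id_map is a list of (inside_first, outside_first, count) extents.
    """
    for inside_first, outside_first, count in id_map:
        if outside_first <= kernel_id < outside_first + count:
            return inside_first + (kernel_id - outside_first)
    return None


def report_id(id_map, kernel_id):
    """Return the id as seen inside the namespace, or the overflow id if unmapped."""
    mapped = to_namespace_id(id_map, kernel_id)
    return OVERFLOW_ID if mapped is None else mapped
```

    Because an unmapped requester is reported as the overflow id rather
    than as an error, a caller can tell "no requester found" apart from
    "requester exists but has no mapping here".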

    The uid and gid mount options specifying the user and group of the
    root autofs inode are converted into kuid and kgid as they are parsed
    defaulting to the current uid and current gid of the process that
    mounts autofs.

    Mounting of autofs for the present remains confined to processes in
    the initial user namespace.

    Cc: Ian Kent
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

02 Nov, 2012

1 commit


30 Oct, 2012

1 commit

  • Unlike ix86, x86-64 on EFI so far didn't set the
    {g,s}et_wallclock accessors to the EFI routines, thus
    incorrectly using raw RTC accesses instead.

    Simply removing the #ifdef around the respective code isn't
    enough, however: While so far early get-time calls were done in
    physical mode, this doesn't work properly for x86-64, as virtual
    addresses would still need to be set up for all runtime regions
    (which wasn't the case on the system I have access to), so
    instead the patch moves the call to efi_enter_virtual_mode()
    ahead (which in turn allows dropping all code related to calling
    efi-get-time in physical mode).

    Additionally the earlier calling of efi_set_executable()
    requires the CPA code to cope, i.e. during early boot it must be
    avoided to call cpa_flush_array(), as the first thing this
    function does is a BUG_ON(irqs_disabled()).

    Also make the two EFI functions in question here static -
    they're not being referenced elsewhere.

    History:

    This commit was originally merged as bacef661acdb ("x86-64/efi:
    Use EFI to deal with platform wall clock") but it resulted in some
    ASUS machines no longer booting due to a firmware bug, and so was
    reverted in f026cfa82f62. A pre-emptive fix for the buggy ASUS
    firmware was merged in 03a1c254975e ("x86, efi: 1:1 pagetable
    mapping for virtual EFI calls") so now this patch can be
    reapplied.

    Signed-off-by: Jan Beulich
    Tested-by: Matt Fleming
    Acked-by: Matthew Garrett
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: H. Peter Anvin
    Signed-off-by: Matt Fleming [added commit history]

    Jan Beulich
     

25 Oct, 2012

1 commit


24 Oct, 2012

1 commit

  • The RCU_FAST_NO_HZ help text included a warning about overhead on large
    systems, but that issue has since been resolved. The main remaining
    issue with RCU_FAST_NO_HZ is increased real-time latency. This commit
    therefore updates the help text accordingly.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

15 Oct, 2012

1 commit

  • Pull module signing support from Rusty Russell:
    "module signing is the highlight, but it's an all-over David Howells frenzy..."

    Hmm "Magrathea: Glacier signing key". Somebody has been reading too much HHGTTG.

    * 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (37 commits)
    X.509: Fix indefinite length element skip error handling
    X.509: Convert some printk calls to pr_devel
    asymmetric keys: fix printk format warning
    MODSIGN: Fix 32-bit overflow in X.509 certificate validity date checking
    MODSIGN: Make mrproper should remove generated files.
    MODSIGN: Use utf8 strings in signer's name in autogenerated X.509 certs
    MODSIGN: Use the same digest for the autogen key sig as for the module sig
    MODSIGN: Sign modules during the build process
    MODSIGN: Provide a script for generating a key ID from an X.509 cert
    MODSIGN: Implement module signature checking
    MODSIGN: Provide module signing public keys to the kernel
    MODSIGN: Automatically generate module signing keys if missing
    MODSIGN: Provide Kconfig options
    MODSIGN: Provide gitignore and make clean rules for extra files
    MODSIGN: Add FIPS policy
    module: signature checking hook
    X.509: Add a crypto key parser for binary (DER) X.509 certificates
    MPILIB: Provide a function to read raw data into an MPI
    X.509: Add an ASN.1 decoder
    X.509: Add simple ASN.1 grammar compiler
    ...

    Linus Torvalds
     

13 Oct, 2012

3 commits

  • Pull third pile of kernel_execve() patches from Al Viro:
    "The last bits of infrastructure for kernel_thread() et.al., with
    alpha/arm/x86 use of those. Plus sanitizing the asm glue and
    do_notify_resume() on alpha, fixing the "disabled irq while running
    task_work stuff" breakage there.

    At that point the rest of kernel_thread/kernel_execve/sys_execve work
    can be done independently for different architectures. The only
    pending bits that do depend on having all architectures converted are
    restricted to fs/* and kernel/* - that'll obviously have to wait for
    the next cycle.

    I thought we'd have to wait for all of them done before we start
    eliminating the longjump-style insanity in kernel_execve(), but it
    turned out there's a very simple way to do that without flagday-style
    changes."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to saner kernel_execve() semantics
    arm: switch to saner kernel_execve() semantics
    x86, um: convert to saner kernel_execve() semantics
    infrastructure for saner ret_from_kernel_thread semantics
    make sure that kernel_thread() callbacks call do_exit() themselves
    make sure that we always have a return path from kernel_execve()
    ppc: eeh_event should just use kthread_run()
    don't bother with kernel_thread/kernel_execve for launching linuxrc
    alpha: get rid of switch_stack argument of do_work_pending()
    alpha: don't bother passing switch_stack separately from regs
    alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c
    alpha: simplify TIF_NEED_RESCHED handling

    Linus Torvalds
     
  • Pull third pile of VFS updates from Al Viro:
    "Stuff from Jeff Layton, mostly. Sanitizing interplay between audit
    and namei, removing a lot of insanity from audit_inode() mess and
    getting things ready for his ESTALE patchset."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    procfs: don't need a PATH_MAX allocation to hold a string representation of an int
    vfs: embed struct filename inside of names_cache allocation if possible
    audit: make audit_inode take struct filename
    vfs: make path_openat take a struct filename pointer
    vfs: turn do_path_lookup into wrapper around struct filename variant
    audit: allow audit code to satisfy getname requests from its names_list
    vfs: define struct filename and have getname() return it
    vfs: unexport getname and putname symbols
    acct: constify the name arg to acct_on
    vfs: allocate page instead of names_cache buffer in mount_block_root
    audit: overhaul __audit_inode_child to accomodate retrying
    audit: optimize audit_compare_dname_path
    audit: make audit_compare_dname_path use parent_len helper
    audit: remove dirlen argument to audit_compare_dname_path
    audit: set the name_len in audit_inode for parent lookups
    audit: add a new "type" field to audit_names struct
    audit: reverse arguments to audit_inode_child
    audit: no need to walk list in audit_inode if name is NULL
    audit: pass in dentry to audit_copy_inode wherever possible
    audit: remove unnecessary NULL ptr checks from do_path_lookup

    Linus Torvalds
     
  • * allow kernel_execve() to leave the actual return to userland to
      the caller (selected by CONFIG_GENERIC_KERNEL_EXECVE). Callers
      updated accordingly.
    * an architecture that does select GENERIC_KERNEL_EXECVE in its
      Kconfig should have its ret_from_kernel_thread() do this:
          call schedule_tail
          call the callback left for it by copy_thread(); if it ever
            returns, that's because it has just done successful kernel_execve()
          jump to return from syscall
      IOW, its only difference from ret_from_fork() is that it does call the
      callback.
    * such an architecture should also get rid of ret_from_kernel_execve()
      and __ARCH_WANT_KERNEL_EXECVE
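
    The ordering described in the list above can be modelled in plain
    userspace C. This is only a sketch of the control flow, not kernel
    code: the names model_ret_from_fork(), model_ret_from_kernel_thread(),
    fake_exec_callback() and the trace structure are hypothetical, and the
    real routines are architecture-specific assembly glue.

```c
#include <stddef.h>

/* Steps a newly created thread goes through, recorded for inspection. */
enum step { STEP_SCHEDULE_TAIL, STEP_CALLBACK, STEP_RETURN_TO_USER };

#define MAX_STEPS 4

struct trace {
    enum step steps[MAX_STEPS];
    size_t count;
};

typedef int (*thread_fn)(void *arg);

static void record(struct trace *t, enum step s)
{
    if (t->count < MAX_STEPS)
        t->steps[t->count++] = s;
}

/* Model of ret_from_fork(): finish the switch, then return from syscall. */
static void model_ret_from_fork(struct trace *t)
{
    record(t, STEP_SCHEDULE_TAIL);
    record(t, STEP_RETURN_TO_USER);
}

/* Model of ret_from_kernel_thread(): identical, except that the callback
 * left by copy_thread() runs in between. If the callback returns at all,
 * the assumption is that it has just done a successful kernel_execve(). */
static void model_ret_from_kernel_thread(struct trace *t, thread_fn fn, void *arg)
{
    record(t, STEP_SCHEDULE_TAIL);
    record(t, STEP_CALLBACK);
    fn(arg);
    record(t, STEP_RETURN_TO_USER);
}

/* Hypothetical callback standing in for one that ends in kernel_execve(). */
static int fake_exec_callback(void *arg)
{
    *(int *)arg = 1; /* note that the callback ran */
    return 0;
}
```

    The two model functions differ only in the callback step, which is the
    point the commit message makes with "its only difference from
    ret_from_fork() is that it does call the callback".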

    This is the last part of infrastructure patches in that area - from
    that point on work on different architectures can live independently.

    Signed-off-by: Al Viro

    Al Viro
     

12 Oct, 2012

4 commits

  • Pull RCU fixes from Ingo Molnar:
    "This tree includes a shutdown/cpu-hotplug deadlock fix and a
    documentation fix."

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rcu: Advise most users not to enable RCU user mode
    rcu: Grace-period initialization excludes only RCU notifier

    Linus Torvalds
     
  • First, it's incorrect to call putname() after __getname_gfp() since the
    bare __getname_gfp() call skips the auditing code, while putname()
    doesn't.

    mount_block_root allocates a PATH_MAX buffer via __getname_gfp, and then
    calls get_fs_names to fill the buffer. That function can call
    get_filesystem_list which assumes that that buffer is a full page in
    size. On arches where PAGE_SIZE != 4k, this could potentially
    overrun the buffer.

    In practice, it's hard to imagine the list of filesystem names even
    approaching 4k, but it's best to be safe. Just allocate a page for this
    purpose instead.

    With this, we can also remove the __getname_gfp() definition since there
    are no more callers.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • The only place where kernel_execve() is called without a way to
    return to the caller of kernel_thread() callback is kernel_post().
    Reorganize kernel_init()/kernel_post() - instead of the former
    calling the latter in the end (and getting freed by it), have the
    latter *begin* with calling the former (and turn the latter into
    kernel_thread() callback, of course).

    Signed-off-by: Al Viro

    Al Viro
     
  • exec_usermodehelper_fns() will do just fine...

    Signed-off-by: Al Viro

    Al Viro
     

11 Oct, 2012

1 commit


10 Oct, 2012

3 commits

  • Check the signature on the module against the keys compiled into the kernel or
    available in a hardware key store.

    Currently, only RSA keys are supported - though that's easy enough to change,
    and the signature is expected to contain raw components (so not a PGP or
    PKCS#7 formatted blob).

    The signature blob is expected to consist of the following pieces in order:

    (1) The binary identifier for the key. This is expected to match the
    SubjectKeyIdentifier from an X.509 certificate. Only X.509 type
    identifiers are currently supported.

    (2) The signature data, consisting of a series of MPIs, each of which is
    in the format of a 2-byte BE size word followed by the content data.

    (3) A 12 byte information block of the form:

    struct module_signature {
        enum pkey_algo algo : 8;
        enum pkey_hash_algo hash : 8;
        enum pkey_id_type id_type : 8;
        u8 __pad;
        __be32 id_length;
        __be32 sig_length;
    };

    The three enums are defined in crypto/public_key.h.

    'algo' contains the public-key algorithm identifier (0->DSA, 1->RSA).

    'hash' contains the digest algorithm identifier (0->MD4, 1->MD5, 2->SHA1,
    etc.).

    'id_type' contains the public-key identifier type (0->PGP, 1->X.509).

    '__pad' should be 0.

    'id_length' should contain the binary identifier length in BE form.

    'sig_length' should contain the signature data length in BE form.

    The lengths are in BE order rather than CPU order to make dealing with
    cross-compilation easier.
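
    As an illustration of the layout described above, the following
    userspace sketch decodes the trailing 12-byte information block and
    locates the two pieces in front of it. It is not the kernel's parser:
    struct sig_info, read_be32() and parse_sig_blob() are hypothetical
    names, plain bytes stand in for the enum bitfields, and the blob is
    assumed to contain exactly the three pieces listed, in that order.

```c
#include <stddef.h>
#include <stdint.h>

#define SIG_INFO_SIZE 12u

/* Decoded copy of the information block, with lengths in CPU order. */
struct sig_info {
    uint8_t  algo;       /* 0 -> DSA, 1 -> RSA */
    uint8_t  hash;       /* 0 -> MD4, 1 -> MD5, 2 -> SHA1, ... */
    uint8_t  id_type;    /* 0 -> PGP, 1 -> X.509 */
    uint8_t  pad;        /* expected to be 0 */
    uint32_t id_length;  /* length of the binary key identifier */
    uint32_t sig_length; /* length of the signature data */
};

/* Read a 32-bit big-endian value byte by byte, so the result is the same
 * on little-endian and big-endian hosts. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode a blob laid out as [key id][signature data][info block].
 * On success returns 0 and stores the offsets of the key identifier and
 * the signature data. Returns -1 if the blob is too short, the pad byte
 * is non-zero, or the two lengths do not account for the payload. */
static int parse_sig_blob(const unsigned char *blob, size_t len,
                          struct sig_info *out,
                          size_t *id_off, size_t *sig_off)
{
    const unsigned char *info;
    size_t payload;

    if (len < SIG_INFO_SIZE)
        return -1;

    payload = len - SIG_INFO_SIZE;
    info = blob + payload;

    out->algo = info[0];
    out->hash = info[1];
    out->id_type = info[2];
    out->pad = info[3];
    out->id_length = read_be32(info + 4);
    out->sig_length = read_be32(info + 8);

    if (out->pad != 0)
        return -1;
    if (out->id_length > payload)
        return -1;
    if (out->sig_length != payload - out->id_length)
        return -1;

    *id_off = 0;
    *sig_off = out->id_length;
    return 0;
}
```

    Reading the lengths byte by byte is one way to honour the BE
    requirement without depending on the byte order of the machine doing
    the signing or the checking.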

    Signed-off-by: David Howells
    Signed-off-by: Rusty Russell (minor Kconfig fix)

    David Howells
     
  • Provide kernel configuration options for module signing.

    The following configuration options are added:

    CONFIG_MODULE_SIG_SHA1
    CONFIG_MODULE_SIG_SHA224
    CONFIG_MODULE_SIG_SHA256
    CONFIG_MODULE_SIG_SHA384
    CONFIG_MODULE_SIG_SHA512

    These select the cryptographic hash used to digest the data prior to signing.
    Additionally, the crypto module selected will be built into the kernel as it
    won't be possible to load it as a module without incurring a circular
    dependency when the kernel tries to check its signature.

    Signed-off-by: David Howells
    Signed-off-by: Rusty Russell

    David Howells
     
  • We do a very simple search for a particular string appended to the module
    (which is cache-hot and about to be SHA'd anyway). There's both a config
    option and a boot parameter which control whether we accept or fail with
    unsigned modules and modules that are signed with an unknown key.

    If module signing is enabled, the kernel will be tainted if a module is
    loaded that is unsigned or has a signature for which we don't have the
    key.
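
    A rough userspace sketch of the check described above follows. The
    marker text, the assumption that it sits at the very end of the image,
    and the names strip_sig_marker(), check_module() and enum sig_policy
    are all illustrative assumptions; actual signature verification is not
    modelled, only the accept/fail/taint decision for unsigned modules.

```c
#include <stddef.h>
#include <string.h>

/* Assumed marker text, used here for illustration only. */
#define SIG_MARKER "~Module signature appended~\n"

/* Stand-in for the config option / boot parameter choice. */
enum sig_policy { SIG_PERMISSIVE, SIG_ENFORCE };

/* If the image ends with the marker, shrink *len to exclude it and
 * return 1. Otherwise leave *len untouched and return 0. */
static int strip_sig_marker(const unsigned char *image, size_t *len)
{
    const size_t mlen = sizeof(SIG_MARKER) - 1;

    if (*len < mlen)
        return 0;
    if (memcmp(image + *len - mlen, SIG_MARKER, mlen) != 0)
        return 0;

    *len -= mlen;
    return 1;
}

/* Returns 0 to accept the module and -1 to reject it. An unsigned module
 * is rejected under SIG_ENFORCE and accepted with *taint set to 1 under
 * SIG_PERMISSIVE. A module carrying the marker is accepted here without
 * further checks, because key lookup and verification are out of scope. */
static int check_module(const unsigned char *image, size_t *len,
                        enum sig_policy policy, int *taint)
{
    if (strip_sig_marker(image, len))
        return 0;

    if (policy == SIG_ENFORCE)
        return -1;

    *taint = 1;
    return 0;
}
```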

    (Useful feedback and tweaks by David Howells)

    Signed-off-by: Rusty Russell
    Signed-off-by: David Howells
    Signed-off-by: Rusty Russell

    Rusty Russell
     

09 Oct, 2012

2 commits

  • After both prio_tree users have been converted to use red-black trees,
    there is no need to keep around the prio tree library anymore.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Introduce SYSCTL_EXCEPTION_TRACE config option and select it in the
    architectures requiring support for the "exception-trace" debug_table
    entry in kernel/sysctl.c.

    Signed-off-by: Catalin Marinas
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas