19 Dec, 2012

2 commits

  • The page allocator is able to bind a page to a memcg when it is
    allocated. But for the caches, we'd like to have as many objects as
    possible in a page belonging to the same cache.

    This is done in this patch by calling memcg_kmem_get_cache in the
    beginning of every allocation function. This function is patched out by
    static branches when kernel memory controller is not being used.

    It assumes that the task allocating, which determines the memcg in the
    page allocator, belongs to the same cgroup throughout the whole process.
    Misaccounting can happen if the task calls memcg_kmem_get_cache() while
    belonging to a cgroup, and later on changes. This is considered
    acceptable, and should only happen upon task migration.

    Before the cache is created by the memcg core, there is also a possible
    imbalance: the task belongs to a memcg, but the cache being allocated from
    is the global cache, since the child cache is not yet guaranteed to be
    ready. This case is also fine, since in this case the GFP_KMEMCG will not
    be passed and the page allocator will not attempt any cgroup accounting.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Add the basic infrastructure for the accounting of kernel memory. To
    control that, the following files are created:

    * memory.kmem.usage_in_bytes
    * memory.kmem.limit_in_bytes
    * memory.kmem.failcnt
    * memory.kmem.max_usage_in_bytes

    They have the same meaning as their user memory counterparts. They
    reflect the state of the "kmem" res_counter.

    Per cgroup kmem memory accounting is not enabled until a limit is set for
    the group. Once the limit is set the accounting cannot be disabled for
    that group. This means that after the patch is applied, no behavioral
    changes exist for whoever is still using memcg to control their memory
    usage, until memory.kmem.limit_in_bytes is set for the first time.

    We always account to both user and kernel resource_counters. This
    effectively means that an independent kernel limit is in place when the
    limit is set to a lower value than the user memory. An equal or higher
    value means that the user limit will always hit first, meaning that kmem
    is effectively unlimited.

    People who want to track kernel memory but not limit it can set this
    limit to a very high number (like RESOURCE_MAX - 1 page, which no one
    will ever hit, or equal to the user memory).

    [akpm@linux-foundation.org: MEMCG_KMEM only works with slab and slub]
    Signed-off-by: Glauber Costa
    Acked-by: Kamezawa Hiroyuki
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: JoonSoo Kim
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
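
    The "account to both counters" rule described above can be sketched as a
    small user-space model. This is not kernel code: struct counter,
    counter_charge(), struct memcg_model and memcg_charge_kmem() are
    hypothetical names chosen for illustration, and the rollback order is an
    assumption. The sketch only shows why a kmem limit below the user limit
    acts as an independent limit, while an equal or higher one never trips
    first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a res_counter: usage, limit, peak, failures. */
struct counter {
    uint64_t usage;
    uint64_t limit;
    uint64_t max_usage;
    uint64_t failcnt;
};

/* Try to charge 'bytes'; on failure bump failcnt and leave usage untouched. */
static bool counter_charge(struct counter *c, uint64_t bytes)
{
    assert(c != NULL);
    if (bytes > c->limit || c->usage > c->limit - bytes) {
        c->failcnt++;
        return false;
    }
    c->usage += bytes;
    if (c->usage > c->max_usage)
        c->max_usage = c->usage;
    return true;
}

static void counter_uncharge(struct counter *c, uint64_t bytes)
{
    assert(c != NULL && bytes <= c->usage);
    c->usage -= bytes;
}

/* Model of one cgroup: kernel memory is charged to both counters. */
struct memcg_model {
    struct counter user; /* models the user memory counter */
    struct counter kmem; /* models the "kmem" counter */
};

/* Charge kernel memory; roll back the kmem charge if the user charge fails. */
static bool memcg_charge_kmem(struct memcg_model *m, uint64_t bytes)
{
    assert(m != NULL);
    if (!counter_charge(&m->kmem, bytes))
        return false;
    if (!counter_charge(&m->user, bytes)) {
        counter_uncharge(&m->kmem, bytes);
        return false;
    }
    return true;
}
```

    With user.limit = 100 and kmem.limit = 40, a second 20-byte charge after
    a 30-byte one fails on the kmem counter alone, which is the "independent
    kernel limit" case from the commit message.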
     

18 Dec, 2012

1 commit

  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc/<pid>/ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_may_access permission
    checks are always applied.

    The files under /proc/<pid>/ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allow the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc/<pid>/ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
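
    The stable inode numbers mentioned in the pull message suggest a simple
    userspace check for shared namespaces: stat() two namespace files and
    compare device and inode numbers. The helper below is a minimal sketch
    under that assumption; same_inode() is a hypothetical name, and the
    caller needs permission to stat the other process's /proc entry.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Returns 1 if both paths resolve to the same inode on the same device,
 * 0 if they differ, and -1 if either stat() call fails.
 */
static int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

    For example, same_inode("/proc/self/ns/pid", "/proc/1/ns/pid") would
    report whether the caller shares a pid namespace with process 1 on a
    kernel that provides these files.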
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

11 Dec, 2012

2 commits

  • This patch adds Kconfig options and kernel parameters to allow the
    enabling and disabling of automatic NUMA balancing. The existence
    of such a switch was and is very important when debugging problems
    related to transparent hugepages and we should have the same for
    automatic NUMA placement.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Implement pte_numa and pmd_numa.

    We must atomically set the numa bit and clear the present bit to
    define a pte_numa or pmd_numa.

    Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
    a thread touches a virtual address in the corresponding virtual range,
    a NUMA hinting page fault will trigger. The NUMA hinting page fault
    will clear the NUMA bit and set the present bit again to resolve the
    page fault.

    The expectation is that a NUMA hinting page fault is used as part
    of a placement policy that decides if a page should remain on the
    current node or be migrated to a different node.

    Acked-by: Rik van Riel
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman

    Andrea Arcangeli
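
    The pte_numa state transition described above can be modelled in user
    space. The sketch below is an illustration only: the bit positions, the
    model_* names and the compare-and-swap loop are assumptions, not the
    kernel's actual page table layout or helpers. It shows the one property
    the commit insists on, namely that the NUMA bit is set and the present
    bit cleared in a single atomic update, and the reverse on a hinting
    fault.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PTE_PRESENT ((uint64_t)1 << 0) /* hypothetical bit position */
#define MODEL_PTE_NUMA    ((uint64_t)1 << 8) /* hypothetical bit position */

typedef _Atomic uint64_t model_pte_t;

/* A value is "pte_numa" when the NUMA bit is set and present is clear. */
static bool model_pte_numa(uint64_t v)
{
    return (v & (MODEL_PTE_NUMA | MODEL_PTE_PRESENT)) == MODEL_PTE_NUMA;
}

/* Set the NUMA bit and clear the present bit in one atomic update. */
static void model_pte_mknuma(model_pte_t *pte)
{
    uint64_t old, next;

    assert(pte != NULL);
    old = atomic_load(pte);
    do {
        next = (old | MODEL_PTE_NUMA) & ~MODEL_PTE_PRESENT;
    } while (!atomic_compare_exchange_weak(pte, &old, next));
}

/* What the hinting fault does: clear the NUMA bit, set present again. */
static void model_pte_mknonnuma(model_pte_t *pte)
{
    uint64_t old, next;

    assert(pte != NULL);
    old = atomic_load(pte);
    do {
        next = (old & ~MODEL_PTE_NUMA) | MODEL_PTE_PRESENT;
    } while (!atomic_compare_exchange_weak(pte, &old, next));
}
```

    All other bits of the entry are preserved across both transitions, so
    the mapping itself is unchanged once the hinting fault has been handled.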
     

01 Dec, 2012

1 commit

  • Create a new subsystem that probes on kernel boundaries
    to keep track of the transitions between level contexts
    with two basic initial contexts: user or kernel.

    This is an abstraction of some RCU code that uses such tracking
    to implement its userspace extended quiescent state.

    We need to pull this up from RCU into this new level of indirection
    because this tracking is also going to be used to implement an "on
    demand" generic virtual cputime accounting. A necessary step to
    shut down the tick while still accounting the cputime.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Li Zhong
    Cc: Gilad Ben-Yossef
    Reviewed-by: Steven Rostedt
    [ paulmck: fix whitespace error and email address. ]
    Signed-off-by: Paul E. McKenney

    Frederic Weisbecker
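
    To illustrate why tracking user/kernel transitions is enough for "on
    demand" cputime accounting, here is a minimal user-space model. The
    names (struct model_tracker, model_user_enter(), model_user_exit()) and
    the explicit timestamp argument are assumptions made for the sketch; the
    real subsystem hooks into kernel entry and exit paths rather than taking
    timestamps from a caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum model_ctx { MODEL_IN_KERNEL, MODEL_IN_USER };

struct model_tracker {
    enum model_ctx state; /* context the CPU is currently in */
    uint64_t last_stamp;  /* time of the last transition */
    uint64_t user_ns;     /* accumulated time spent in user context */
    uint64_t kernel_ns;   /* accumulated time spent in kernel context */
};

/* Credit the time since the last transition to the current context. */
static void model_account(struct model_tracker *t, uint64_t now)
{
    assert(t != NULL && now >= t->last_stamp);
    if (t->state == MODEL_IN_USER)
        t->user_ns += now - t->last_stamp;
    else
        t->kernel_ns += now - t->last_stamp;
    t->last_stamp = now;
}

/* Probe at the kernel -> user boundary. */
static void model_user_enter(struct model_tracker *t, uint64_t now)
{
    assert(t != NULL);
    if (t->state == MODEL_IN_USER)
        return; /* already in user context, nothing to switch */
    model_account(t, now);
    t->state = MODEL_IN_USER;
}

/* Probe at the user -> kernel boundary. */
static void model_user_exit(struct model_tracker *t, uint64_t now)
{
    assert(t != NULL);
    if (t->state == MODEL_IN_KERNEL)
        return; /* already in kernel context, nothing to switch */
    model_account(t, now);
    t->state = MODEL_IN_KERNEL;
}
```

    No periodic tick is needed in this model: time is attributed only when a
    boundary is crossed.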
     

17 Nov, 2012

1 commit

  • RCU callback execution can add significant OS jitter and also can
    degrade both scheduling latency and, in asymmetric multiprocessors,
    energy efficiency. This commit therefore adds the ability for selected
    CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
    to kthreads. If the "rcu_nocb_poll" boot parameter is also specified,
    these kthreads will do polling, removing the need for the offloaded
    CPUs to do wakeups. At least one CPU must be doing normal callback
    processing: currently CPU 0 cannot be selected as a no-CBs CPU.
    In addition, attempts to offline the last normal-CBs CPU will fail.

    This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
    this commit includes fixes to problems located by Fengguang Wu's
    kbuild test robot.

    [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

15 Nov, 2012

2 commits

  • Use kuid_t and kgid_t in struct fuse_conn and struct fuse_mount_data.

    The connection between a fuse filesystem and a fuse daemon is
    established when a fuse filesystem is mounted and provided with a file
    descriptor the fuse daemon created by opening /dev/fuse.

    For now restrict the communication of uids and gids between the fuse
    filesystem and the fuse daemon to the initial user namespace. Enforce
    this by verifying the file descriptor passed to the mount of fuse was
    opened in the initial user namespace. Ensuring the mount happens in
    the initial user namespace is not necessary as mounts from non-initial
    user namespaces are not yet allowed.

    In fuse_req_init_context convert the current fsuid and fsgid into the
    initial user namespace for the request that will be sent to the fuse
    daemon.

    In fuse_fill_attr convert the uid and gid passed from the fuse daemon
    from the initial user namespace into kuids and kgids.

    In iattr_to_fattr called from fuse_setattr convert kuids and kgids
    into the uids and gids in the initial user namespace before passing
    them to the fuse filesystem.

    In fuse_change_attributes_common called from fuse_dentry_revalidate,
    fuse_permission, fuse_getattr, fuse_setattr, and fuse_iget convert
    the uid and gid from the fuse daemon into a kuid and a kgid to store
    on the fuse inode.

    By default fuse mounts are restricted to tasks whose uid, suid, and
    euid match the fuse user_id and whose gid, sgid, and egid match
    the fuse group_id. Convert the user_id and group_id mount options
    into kuids and kgids at mount time, and use uid_eq and gid_eq to
    compare them in fuse_allow_task.

    Cc: Miklos Szeredi
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Use kuid_t and kgid_t in struct autofs_info and struct autofs_wait_queue.

    When creating directories and symlinks default the uid and gid of
    the mount requester to the global root uid and gid. autofs4_wait
    will update these fields when a mount is requested.

    When generating autofsv5 packets report the uid and gid of the mount
    requester in the user namespace of the process that opened the pipe,
    reporting unmapped uids and gids as overflowuid and overflowgid.

    In autofs_dev_ioctl_requester return the uid and gid of the last mount
    requester converted into the calling process's user namespace. When the
    uid or gid don't map return overflowuid and overflowgid as appropriate,
    allowing failure to find a mount requester to be distinguished from
    failure to map a mount requester.

    The uid and gid mount options specifying the user and group of the
    root autofs inode are converted into kuid and kgid as they are parsed
    defaulting to the current uid and current gid of the process that
    mounts autofs.

    Mounting of autofs for the present remains confined to processes in
    the initial user namespace.

    Cc: Ian Kent
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
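
    The kuid conversions described in the two commits above follow one
    pattern: translate ids at the boundary, keep a distinct wrapped type
    inside, and report an overflow id when no mapping exists. The sketch
    below models that pattern in user space. Everything here is an
    assumption for illustration: the model_* names, the single-extent
    mapping in struct model_userns, and the overflow value 65534. It is not
    the kernel's implementation of make_kuid or uid_eq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_UID  ((uint32_t)-1)
#define MODEL_OVERFLOW_UID 65534u /* assumed overflow id for unmapped uids */

/* Wrapping the id in a struct makes accidental mixing with raw ints fail
 * to compile, which is the point of a distinct kernel-internal type. */
typedef struct { uint32_t val; } model_kuid_t;

/* One mapping extent: ids [first, first + count) inside the namespace
 * map to [lower_first, lower_first + count) in the global id space. */
struct model_userns {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

static bool model_uid_eq(model_kuid_t a, model_kuid_t b)
{
    return a.val == b.val;
}

static bool model_uid_valid(model_kuid_t k)
{
    return k.val != MODEL_INVALID_UID;
}

/* Namespace-relative uid -> internal id; invalid if unmapped. */
static model_kuid_t model_make_kuid(const struct model_userns *ns, uint32_t uid)
{
    model_kuid_t k = { MODEL_INVALID_UID };

    assert(ns != NULL);
    if (uid >= ns->first && uid - ns->first < ns->count)
        k.val = ns->lower_first + (uid - ns->first);
    return k;
}

/* Internal id -> namespace-relative uid; overflow id if unmapped. */
static uint32_t model_from_kuid_munged(const struct model_userns *ns,
                                       model_kuid_t k)
{
    assert(ns != NULL);
    if (model_uid_valid(k) && k.val >= ns->lower_first &&
        k.val - ns->lower_first < ns->count)
        return ns->first + (k.val - ns->lower_first);
    return MODEL_OVERFLOW_UID;
}
```

    Comparisons go through model_uid_eq() rather than ==, so every place
    that compares ids is forced to hold values of the wrapped type.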
     

25 Oct, 2012

1 commit


24 Oct, 2012

1 commit

  • The RCU_FAST_NO_HZ help text included a warning about overhead on large
    systems, but that issue has since been resolved. The main remaining
    issue with RCU_FAST_NO_HZ is increased real-time latency. This commit
    therefore updates the help text accordingly.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

15 Oct, 2012

1 commit

  • Pull module signing support from Rusty Russell:
    "module signing is the highlight, but it's an all-over David Howells frenzy..."

    Hmm "Magrathea: Glacier signing key". Somebody has been reading too much HHGTTG.

    * 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (37 commits)
    X.509: Fix indefinite length element skip error handling
    X.509: Convert some printk calls to pr_devel
    asymmetric keys: fix printk format warning
    MODSIGN: Fix 32-bit overflow in X.509 certificate validity date checking
    MODSIGN: Make mrproper should remove generated files.
    MODSIGN: Use utf8 strings in signer's name in autogenerated X.509 certs
    MODSIGN: Use the same digest for the autogen key sig as for the module sig
    MODSIGN: Sign modules during the build process
    MODSIGN: Provide a script for generating a key ID from an X.509 cert
    MODSIGN: Implement module signature checking
    MODSIGN: Provide module signing public keys to the kernel
    MODSIGN: Automatically generate module signing keys if missing
    MODSIGN: Provide Kconfig options
    MODSIGN: Provide gitignore and make clean rules for extra files
    MODSIGN: Add FIPS policy
    module: signature checking hook
    X.509: Add a crypto key parser for binary (DER) X.509 certificates
    MPILIB: Provide a function to read raw data into an MPI
    X.509: Add an ASN.1 decoder
    X.509: Add simple ASN.1 grammar compiler
    ...

    Linus Torvalds
     

12 Oct, 2012

1 commit


11 Oct, 2012

1 commit


10 Oct, 2012

3 commits

  • Check the signature on the module against the keys compiled into the kernel or
    available in a hardware key store.

    Currently, only RSA keys are supported - though that's easy enough to change,
    and the signature is expected to contain raw components (so not a PGP or
    PKCS#7 formatted blob).

    The signature blob is expected to consist of the following pieces in order:

    (1) The binary identifier for the key. This is expected to match the
    SubjectKeyIdentifier from an X.509 certificate. Only X.509 type
    identifiers are currently supported.

    (2) The signature data, consisting of a series of MPIs in which each is in
    the format of a 2-byte BE word size followed by the content data.

    (3) A 12 byte information block of the form:

    struct module_signature {
            enum pkey_algo algo : 8;
            enum pkey_hash_algo hash : 8;
            enum pkey_id_type id_type : 8;
            u8 __pad;
            __be32 id_length;
            __be32 sig_length;
    };

    The three enums are defined in crypto/public_key.h.

    'algo' contains the public-key algorithm identifier (0->DSA, 1->RSA).

    'hash' contains the digest algorithm identifier (0->MD4, 1->MD5, 2->SHA1,
    etc.).

    'id_type' contains the public-key identifier type (0->PGP, 1->X.509).

    '__pad' should be 0.

    'id_length' should contain the binary identifier length in BE form.

    'sig_length' should contain the signature data length in BE form.

    The lengths are in BE order rather than CPU order to make dealing with
    cross-compilation easier.

    Signed-off-by: David Howells
    Signed-off-by: Rusty Russell (minor Kconfig fix)

    David Howells
     
  • Provide kernel configuration options for module signing.

    The following configuration options are added:

    CONFIG_MODULE_SIG_SHA1
    CONFIG_MODULE_SIG_SHA224
    CONFIG_MODULE_SIG_SHA256
    CONFIG_MODULE_SIG_SHA384
    CONFIG_MODULE_SIG_SHA512

    These select the cryptographic hash used to digest the data prior to signing.
    Additionally, the crypto module selected will be built into the kernel as it
    won't be possible to load it as a module without incurring a circular
    dependency when the kernel tries to check its signature.

    Signed-off-by: David Howells
    Signed-off-by: Rusty Russell

    David Howells
     
  • We do a very simple search for a particular string appended to the module
    (which is cache-hot and about to be SHA'd anyway). There's both a config
    option and a boot parameter which control whether we accept or fail with
    unsigned modules and modules that are signed with an unknown key.

    If module signing is enabled, the kernel will be tainted if a module is
    loaded that is unsigned or has a signature for which we don't have the
    key.

    (Useful feedback and tweaks by David Howells)

    Signed-off-by: Rusty Russell
    Signed-off-by: David Howells
    Signed-off-by: Rusty Russell

    Rusty Russell
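
    The 12-byte information block from the signature-checking commit in this
    group can be decoded with a few lines of portable C. The sketch below is
    a user-space illustration, not the kernel's parser: struct sig_info and
    parse_sig_info() are hypothetical names, and it assumes the buffer ends
    exactly at the information block (any trailing marker string already
    stripped). It reads the two length fields as big-endian, as the commit
    message specifies, and checks that the key identifier and signature data
    fit in front of the block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIG_INFO_LEN 12 /* size of the trailing information block */

struct sig_info {
    uint8_t algo;        /* public-key algorithm identifier */
    uint8_t hash;        /* digest algorithm identifier */
    uint8_t id_type;     /* public-key identifier type */
    uint32_t id_length;  /* length of the binary key identifier */
    uint32_t sig_length; /* length of the signature data */
};

/* Read a 32-bit big-endian value regardless of host byte order. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode the information block at the end of buf. Returns 0 on success and
 * -1 if the block is malformed. On success *payload_len is the number of
 * bytes that precede the key identifier.
 */
static int parse_sig_info(const uint8_t *buf, size_t len,
                          struct sig_info *out, size_t *payload_len)
{
    const uint8_t *info;
    size_t remaining;

    assert(buf != NULL && out != NULL && payload_len != NULL);
    if (len < SIG_INFO_LEN)
        return -1;
    info = buf + len - SIG_INFO_LEN;
    if (info[3] != 0) /* '__pad' should be 0 */
        return -1;

    out->algo = info[0];
    out->hash = info[1];
    out->id_type = info[2];
    out->id_length = read_be32(info + 4);
    out->sig_length = read_be32(info + 8);

    /* The identifier and signature must fit before the info block. */
    remaining = len - SIG_INFO_LEN;
    if (out->id_length > remaining ||
        out->sig_length > remaining - out->id_length)
        return -1;

    *payload_len = remaining - out->id_length - out->sig_length;
    return 0;
}
```

    Decoding byte by byte avoids depending on the host's endianness or on
    how a compiler lays out the bit-fields, which fits the stated goal of
    making cross-compilation easier.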
     

09 Oct, 2012

2 commits

  • Introduce SYSCTL_EXCEPTION_TRACE config option and select it in the
    architectures requiring support for the "exception-trace" debug_table
    entry in kernel/sysctl.c.

    Signed-off-by: Catalin Marinas
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • Introduce HAVE_UID16 config option and select it in corresponding
    architecture Kconfig files. UID16 now only depends on HAVE_UID16.

    Signed-off-by: Catalin Marinas
    Acked-by: Geert Uytterhoeven
    Cc: Russell King
    Cc: Mike Frysinger
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

08 Oct, 2012

1 commit

  • Add a simple ASN.1 grammar compiler. This produces a bytecode output that can
    be fed to a decoder to inform the decoder how to interpret the ASN.1 stream it
    is trying to parse.

    Action functions can be specified in the grammar by interpolating:

    ({ foo })

    after a type, for example:

    SubjectPublicKeyInfo ::= SEQUENCE {
            algorithm AlgorithmIdentifier,
            subjectPublicKey BIT STRING ({ do_key_data })
    }

    The decoder is expected to call these after matching this type and parsing the
    contents if it is a constructed type.

    The grammar compiler does not currently support the SET type (though it does
    support SET OF) as I can't see a good way of tracking which members have been
    encountered yet without using up extra stack space.

    Currently, the grammar compiler will fail if more than 256 bytes of bytecode
    would be produced or more than 256 actions have been specified as it uses
    8-bit jump values and action indices to keep space usage down.

    Signed-off-by: David Howells
    Signed-off-by: Rusty Russell

    David Howells
     

06 Oct, 2012

2 commits

  • Adds an expert Kconfig option, CONFIG_COREDUMP, which allows disabling of
    core dump. This saves approximately 2.6k in the compiled kernel, and
    complements CONFIG_ELF_CORE, which now depends on it.

    CONFIG_COREDUMP also disables coredump-related sysctls, except for
    suid_dumpable and related functions, which are necessary for ptrace.

    [akpm@linux-foundation.org: fix binfmt_aout.c build]
    Signed-off-by: Alex Kelly
    Reviewed-by: Josh Triplett
    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Kelly
     
  • The PA-RISC tool chain seems to have some problem with correct
    read/write attributes on sections. This causes problems when the const
    sections are fixed up for other architecture to only contain truly
    read-only data.

    Disable const sections for PA-RISC

    This can cause a bit of noise with modpost.

    Signed-off-by: Andi Kleen
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Acked-by: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

03 Oct, 2012

1 commit

  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values as far down into
    subsystems and filesystems as reasonable. Leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    Letting compile-time type incompatibility errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privilege checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
    root in a user namespace to do those things that today we only forbid to
    non-root users because it will confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/gid logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid wherwe appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
     

02 Oct, 2012

3 commits

  • Pull driver core merge from Greg Kroah-Hartman:
    "Here is the big driver core update for 3.7-rc1.

    A number of firmware_class.c updates (as you saw a month or so ago),
    and some hyper-v updates and some printk fixes as well. All patches
    that are outside of the drivers/base area have been acked by the
    respective maintainers, and have all been in the linux-next tree for a
    while.

    Signed-off-by: Greg Kroah-Hartman "

    * tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
    memory: tegra{20,30}-mc: Fix reading incorrect register in mc_readl()
    device.h: Add missing inline to #ifndef CONFIG_PRINTK dev_vprintk_emit
    memory: emif: Add ifdef CONFIG_DEBUG_FS guard for emif_debugfs_[init|exit]
    Documentation: Fixes some translation error in Documentation/zh_CN/gpio.txt
    Documentation: Remove 3 byte redundant code at the head of the Documentation/zh_CN/arm/booting
    Documentation: Chinese translation of Documentation/video4linux/omap3isp.txt
    device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
    dev: Add dev_vprintk_emit and dev_printk_emit
    netdev_printk/netif_printk: Remove a superfluous logging colon
    netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
    dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack
    driver-core: Shut up dev_dbg_reatelimited() without DEBUG
    tools/hv: Parse /etc/os-release
    tools/hv: Check for read/write errors
    tools/hv: Fix exit() error code
    tools/hv: Fix file handle leak
    Tools: hv: Implement the KVP verb - KVP_OP_GET_IP_INFO
    Tools: hv: Rename the function kvp_get_ip_address()
    Tools: hv: Implement the KVP verb - KVP_OP_SET_IP_INFO
    Tools: hv: Add an example script to configure an interface
    ...

    Linus Torvalds
     
  • Pull arm64 support from Catalin Marinas:
    "Linux support for the 64-bit ARM architecture (AArch64)

    Features currently supported:
    - 39-bit address space for user and kernel (each)
    - 4KB and 64KB page configurations
    - Compat (32-bit) user applications (ARMv7, EABI only)
    - Flattened Device Tree (mandated for all AArch64 platforms)
    - ARM generic timers"

    * tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (35 commits)
    arm64: ptrace: remove obsolete ptrace request numbers from user headers
    arm64: Do not set the SMP/nAMP processor bit
    arm64: MAINTAINERS update
    arm64: Build infrastructure
    arm64: Miscellaneous header files
    arm64: Generic timers support
    arm64: Loadable modules
    arm64: Miscellaneous library functions
    arm64: Performance counters support
    arm64: Add support for /proc/sys/debug/exception-trace
    arm64: Debugging support
    arm64: Floating point and SIMD
    arm64: 32-bit (compat) applications support
    arm64: User access library functions
    arm64: Signal handling support
    arm64: VDSO support
    arm64: System calls handling
    arm64: ELF definitions
    arm64: SMP support
    arm64: DMA mapping API
    ...

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "Continued quest to clean up and enhance the cputime code by Frederic
    Weisbecker, in preparation for future tickless kernel features.

    Other than that, smallish changes."

    Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    cputime: Make finegrained irqtime accounting generally available
    cputime: Gather time/stats accounting config options into a single menu
    ia64: Reuse system and user vtime accounting functions on task switch
    ia64: Consolidate user vtime accounting
    vtime: Consolidate system/idle context detection
    cputime: Use a proper subsystem naming for vtime related APIs
    sched: cpu_power: enable ARCH_POWER
    sched/nohz: Clean up select_nohz_load_balancer()
    sched: Fix load avg vs. cpu-hotplug
    sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: Fix nohz_idle_balance()
    sched: Remove useless code in yield_to()
    sched: Add time unit suffix to sched sysctl knobs
    sched/debug: Limit sd->*_idx range on sysctl
    sched: Remove AFFINE_WAKEUPS feature flag
    s390: Remove leftover account_tick_vtime() header
    cputime: Consolidate vtime handling on context switch
    sched: Move cputime code to its own file
    cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
    tile: Remove SD_PREFER_LOCAL leftover
    ...

    Linus Torvalds
     

26 Sep, 2012

2 commits

  • Provide a config option that enables the userspace
    RCU extended quiescent state on every CPU by default.

    This is for testing purposes.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • Create a new config option under the RCU menu that puts
    CPUs in the RCU extended quiescent state (as in dynticks
    idle mode) when they run in userspace. This requires
    some contribution from architectures to hook into the kernel
    and userspace boundaries.
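
    As a rough illustration of the architecture hooks mentioned above, here
    is a minimal userspace sketch, not the actual kernel code. The names
    rcu_user_enter() and rcu_user_exit() are assumed here and do not appear
    in this commit message; their bodies are stubs. arch_enter_kernel(),
    arch_return_to_user() and in_extended_qs are hypothetical.

```c
#include <stdbool.h>

/* Toy stand-in for per-CPU RCU state: true while RCU may ignore this CPU. */
static bool in_extended_qs;

/* Assumed helper names; the bodies are illustrative stubs, not kernel code. */
static void rcu_user_enter(void) { in_extended_qs = true; }
static void rcu_user_exit(void)  { in_extended_qs = false; }

/* Hypothetical arch hook: called on every entry from userspace
 * (syscall, exception, interrupt), before the kernel uses RCU. */
static void arch_enter_kernel(void)
{
    rcu_user_exit();
}

/* Hypothetical arch hook: called just before resuming userspace,
 * after which the CPU no longer holds up RCU grace periods. */
static void arch_return_to_user(void)
{
    rcu_user_enter();
}
```

    The point of the sketch is only the pairing: every path into the kernel
    must leave the extended quiescent state, and every path back to
    userspace must re-enter it.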

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     

25 Sep, 2012

2 commits

  • There is no known reason for this option to be unavailable on
    archs other than x86. They just need to call enable_sched_clock_irqtime()
    if they have a sufficiently fine-grained clock to make it work.

    Move it to the general options and let the user choose between
    it and pure tick-based or virtual cputime accounting.

    Note that virtual cputime accounting already performs fine-grained
    irqtime accounting. CONFIG_IRQ_TIME_ACCOUNTING is a kind of middle ground
    between tick-based and virtual accounting. So CONFIG_IRQ_TIME_ACCOUNTING
    and CONFIG_VIRT_CPU_ACCOUNTING are mutually exclusive choices.
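
    A minimal sketch of the opt-in described above, under stated assumptions:
    only enable_sched_clock_irqtime() is named in this commit message, and it
    is stubbed here so the example builds in userspace. The flag
    sched_clock_irqtime, the helper arch_maybe_enable_irqtime() and the
    resolution threshold are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the scheduler's "irqtime enabled" state. */
static bool sched_clock_irqtime;

/* Stub for the helper named in the commit message. */
static void enable_sched_clock_irqtime(void)
{
    sched_clock_irqtime = true;
}

/* Hypothetical arch init step: opt in to fine-grained irqtime accounting
 * only when the clock resolution is at or below the given threshold.
 * Returns true if accounting was enabled. */
static bool arch_maybe_enable_irqtime(unsigned long resolution_ns,
                                      unsigned long max_resolution_ns)
{
    if (resolution_ns > max_resolution_ns)
        return false;   /* clock too coarse: keep tick-based accounting */

    enable_sched_clock_irqtime();
    return true;
}
```

    An architecture with a coarse clock simply never makes the call and
    stays on tick-based accounting.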

    Signed-off-by: Frederic Weisbecker
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
     
  • This debloats the general config menu a bit and makes these
    config options easier to find.

    Signed-off-by: Frederic Weisbecker
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

21 Sep, 2012

9 commits