16 May, 2012

3 commits


03 May, 2012

7 commits


26 Apr, 2012

2 commits

  • - Convert the old uid mapping functions into compatibility wrappers
    - Add a uid/gid mapping layer from user space uids and gids to kernel
    internal uids and gids that is extent based for simplicity and speed.
    * Working with number space after mapping uids/gids into their kernel
    internal version adds only mapping complexity over what we have today,
    leaving the kernel code easy to understand and test.
    - Add proc files /proc/self/uid_map /proc/self/gid_map
    These files display the mapping and allow a mapping to be added
    if a mapping does not exist.
    - Allow entering the user namespace without a uid or gid mapping.
    Since we are starting with an existing user, our uids and gids
    still have global mappings, so they are still valid and useful; they just
    don't have local mappings. The requirement for things to work is a global
    uid and gid, so it is odd but perfectly fine not to have a local uid
    and gid mapping.
    Not requiring global uid and gid mappings greatly simplifies
    the logic of setting up the uid and gid mappings by allowing
    the mappings to be set after the namespace is created which makes the
    slight weirdness worth it.
    - Make the mappings in the initial user namespace to the global
    uid/gid space explicit. Today it is an identity mapping
    but in the future we may want to twist this for debugging, similar
    to what we do with jiffies.
    - Document the memory ordering requirements of setting the uid and
    gid mappings. We only allow the mappings to be set once
    and there are no pointers involved so the requirements are
    trivial but a little atypical.
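
    The extent-based mapping described above can be sketched in user space
    as follows. This is a minimal model, not the kernel code: the struct
    layout, the names map_id_down/map_id_up, the INVALID_ID sentinel and the
    extent limit are assumptions made for illustration. Each extent covers
    `count` consecutive ids starting at `first` inside the namespace and at
    `lower_first` in the kernel-internal space.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; names and limits are illustrative only. */
#define INVALID_ID  ((uint32_t)-1)
#define MAX_EXTENTS 5              /* arbitrary small limit for the sketch */

struct id_extent {
    uint32_t first;        /* first id as seen inside the namespace */
    uint32_t lower_first;  /* first id in the kernel-internal space */
    uint32_t count;        /* number of consecutive ids covered */
};

struct id_map {
    size_t nr_extents;
    struct id_extent extent[MAX_EXTENTS];
};

/* Translate a namespace-local id to a kernel-internal id. */
uint32_t map_id_down(const struct id_map *map, uint32_t id)
{
    for (size_t i = 0; i < map->nr_extents; i++) {
        const struct id_extent *e = &map->extent[i];
        if (id >= e->first && id - e->first < e->count)
            return e->lower_first + (id - e->first);
    }
    return INVALID_ID;     /* no local mapping for this id */
}

/* Translate a kernel-internal id back to a namespace-local id. */
uint32_t map_id_up(const struct id_map *map, uint32_t kid)
{
    for (size_t i = 0; i < map->nr_extents; i++) {
        const struct id_extent *e = &map->extent[i];
        if (kid >= e->lower_first && kid - e->lower_first < e->count)
            return e->first + (kid - e->lower_first);
    }
    return INVALID_ID;     /* kernel id is not visible in this namespace */
}
```

    With a single extent { 0, 100000, 65536 }, local uid 1000 maps down to
    kernel id 101000, and kernel id 99999 has no local equivalent. A linear
    scan over a handful of extents keeps the lookup cheap, which is in the
    spirit of the "simplicity and speed" goal stated above.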

    Performance:

    In this scheme for the permission checks the performance is expected to
    stay the same as the actual machine instructions should remain the same.

    The worst case I could think of is ls -l on a large directory where
    all of the stat results need to be translated from kuids and
    kgids to uids and gids. So I benchmarked that case on my laptop
    with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache.

    My benchmark consisted of going to single user mode where nothing else
    was running, opening 1,000,000 files on an ext4 filesystem, looping
    through all of the files 1000 times, and calling fstat on the
    individual files. This was to ensure I was benchmarking stat times
    where the inodes were in the kernel's cache, but the inode values were
    not in the processor's cache. My results:

    v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
    v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
    v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)

    All of the configurations ran in roughly 120ns when I performed tests
    that ran in the cpu cache.

    So in summary the performance impact is:
    1ns improvement in the worst case with user namespace support compiled out.
    8ns aka 5% slowdown in the worst case with user namespace support compiled in.
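
    The measurement loop described above might look roughly like the sketch
    below. It is not the benchmark program that produced these numbers; the
    function names avg_fstat_ns and slowdown_percent are hypothetical, and
    the caller is assumed to have opened the files already.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#endif
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Average wall-clock nanoseconds per fstat() call over `passes` sweeps of
 * the descriptors in fds[]. Returns 0.0 when there is nothing to measure.
 */
double avg_fstat_ns(const int *fds, size_t nfds, unsigned passes)
{
    struct timespec start, end;
    struct stat st;
    double calls = (double)nfds * (double)passes;

    if (calls == 0.0)
        return 0.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned p = 0; p < passes; p++)
        for (size_t i = 0; i < nfds; i++)
            (void)fstat(fds[i], &st);   /* result ignored: timing only */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (double)(end.tv_sec - start.tv_sec) * 1e9
              + (double)(end.tv_nsec - start.tv_nsec);
    return ns / calls;
}

/* Relative slowdown of `measured` against `baseline`, in percent. */
double slowdown_percent(double baseline, double measured)
{
    return (measured - baseline) / baseline * 100.0;
}
```

    Plugging in the figures above, slowdown_percent(156, 164) is 8/156, or
    about 5.1%, which matches the quoted 5%. Holding 1,000,000 descriptors
    open at once would normally also require raising the process's open
    file limit first.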

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • - Transform userns->creator from a user_struct reference to a simple
    kuid_t, kgid_t pair.

    In cap_capable this allows the check to see if we are the creator of
    a namespace to become the classic suser style euid permission check.

    This allows us to remove the need for a struct cred in the mapping
    functions and still be able to display the user namespace creator's
    uid and gid as 0.

    - Remove the now unnecessary delayed_work in free_user_ns.

    All that is left for free_user_ns to do is to call kmem_cache_free
    and put_user_ns. Those functions can be called in any context
    so call them directly from free_user_ns removing the need for delayed work.
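
    To illustrate how storing the creator as a plain kuid_t, kgid_t pair
    turns the "are we the creator?" test into a simple euid comparison, here
    is a user-space model. The struct and field names (user_ns_model, owner,
    group, cred_model) and the function is_ns_creator are assumptions for
    this sketch, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t val; } kuid_t;   /* kernel-internal uid (model) */
typedef struct { uint32_t val; } kgid_t;   /* kernel-internal gid (model) */

struct user_ns_model {
    struct user_ns_model *parent;  /* NULL for the initial namespace */
    kuid_t owner;                  /* creator's kuid */
    kgid_t group;                  /* creator's kgid */
};

struct cred_model {
    struct user_ns_model *user_ns; /* namespace the task lives in */
    kuid_t euid;
};

/*
 * A task counts as the creator of `target` when it lives in the parent
 * namespace and its euid equals the recorded owner: a classic
 * suser-style euid comparison, with no user_struct lookup involved.
 */
bool is_ns_creator(const struct cred_model *cred,
                   const struct user_ns_model *target)
{
    return target->parent == cred->user_ns &&
           target->owner.val == cred->euid.val;
}
```

    Because the comparison only touches two integers and one pointer, it
    needs neither a reference-counted user_struct nor a struct cred inside
    the mapping functions.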

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

08 Apr, 2012

9 commits

  • Modify alloc_uid to take a kuid and make the user hash table global.
    Stop holding a reference to the user namespace in struct user_struct.

    This simplifies the code and makes the per user accounting not
    care about which user namespace a uid happens to appear in.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Start distinguishing between internal kernel uids and gids and
    values that userspace can use. This is done by introducing two
    new types: kuid_t and kgid_t. These types and their associated
    functions and infrastructure are declared in the new header
    uidgid.h.

    Ultimately there will be a different implementation of the mapping
    functions for use with user namespaces. But to keep it simple
    we introduce the mapping functions first to separate the meat
    from the mechanical code conversions.

    Export overflowuid and overflowgid so we can use from_kuid_munged
    and from_kgid_munged in modular code.
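
    The idea of a distinct kernel-internal id type can be modelled in user
    space as below. Wrapping the integer in a struct makes accidental mixing
    of kernel ids and user-visible ids a compile error. The helper names
    carry a _model suffix to mark them as illustrative stand-ins, the
    conversion is the identity (as it is before user namespace mappings
    exist), and the overflow value 65534 is assumed as the conventional
    default.

```c
#include <stdbool.h>
#include <stdint.h>

/* A struct wrapper: a kuid_t cannot be silently compared with a raw uid. */
typedef struct { uint32_t val; } kuid_t;

#define INVALID_UID_VAL ((uint32_t)-1)
#define OVERFLOW_UID    65534u   /* assumed default overflow uid */

/* Build a kernel-internal id from a user-visible one (identity here). */
kuid_t make_kuid_model(uint32_t uid)
{
    kuid_t k = { uid };
    return k;
}

bool kuid_eq_model(kuid_t a, kuid_t b)
{
    return a.val == b.val;
}

bool kuid_valid_model(kuid_t k)
{
    return k.val != INVALID_UID_VAL;
}

/* Convert back to a user-visible id; may yield the invalid value. */
uint32_t from_kuid_model(kuid_t k)
{
    return k.val;
}

/* "Munged" variant: never reports an invalid id, substitutes overflow. */
uint32_t from_kuid_munged_model(kuid_t k)
{
    return kuid_valid_model(k) ? k.val : OVERFLOW_UID;
}
```

    The munged variant is the one suited to interfaces such as stat that
    must always report some id to user space, which is why the overflow
    values need to be reachable from modular code.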

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • This represents a change in strategy of how to handle user namespaces.
    Instead of tagging everything explicitly with a user namespace and bulking
    up all of the comparisons of uids and gids in the kernel, all uids and gids
    in use will have a mapping to flat kuid and kgid spaces respectively. This
    allows much more of the existing logic to be preserved and in general
    allows for faster code.

    In this new and improved world we allow someone to utilize capabilities
    over an inode if the inode's owner maps into the capability holder's user
    namespace and the user has capabilities in their user namespace. This
    is simple and efficient.

    Moving the fs uid comparisons to be comparisons in a flat kuid space
    follows in later patches, something that is only significant if you
    are using user namespaces.
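
    The capability rule stated above (the inode's owner must map into the
    holder's namespace, and the holder must have the capability there) can
    be sketched as a two-part test. This is a simplified model: the
    namespace is reduced to a single contiguous range of kernel ids, and
    the names ns_model, kuid_has_mapping_model and inode_capable_model are
    hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified namespace: one contiguous range of kernel-internal uids. */
struct ns_model {
    uint32_t lower_first;  /* first kuid visible in this namespace */
    uint32_t count;        /* number of kuids visible */
};

/* Does the flat kernel uid appear anywhere in this namespace? */
bool kuid_has_mapping_model(const struct ns_model *ns, uint32_t kuid)
{
    return kuid >= ns->lower_first && kuid - ns->lower_first < ns->count;
}

/*
 * The holder may use a capability over an inode only if it has that
 * capability in its own namespace AND the inode's owner is mapped there.
 */
bool inode_capable_model(const struct ns_model *holder_ns,
                         bool has_cap_in_ns, uint32_t inode_owner_kuid)
{
    return has_cap_in_ns &&
           kuid_has_mapping_model(holder_ns, inode_owner_kuid);
}
```

    For a namespace covering kuids 100000 to 165535, a file owned by kuid
    100500 is within reach of a capable task, while a file owned by kuid 0
    is not, however many capabilities the task holds inside its namespace.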

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • With a user_ns reference in struct cred the only user of the user namespace
    reference in struct user_struct is to keep the uid hash table alive.

    The user_namespace reference in struct user_struct will be going away soon, and
    I have removed all of the references. Rename the field from user_ns to _user_ns
    so that the compiler can verify nothing follows the user struct to the user
    namespace anymore.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • I am about to remove the struct user_namespace reference from struct user_struct.
    So keep explicit track of the parent user namespace.

    Take advantage of this new reference and replace instances of user_ns->creator->user_ns
    with user_ns->parent.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • struct user_struct will shortly lose its user_ns reference
    so make the cred user_ns reference a proper reference complete
    with reference counting.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Optimize performance and prepare for the removal of the user_ns reference
    from user_struct. Remove the slow long walk through cred->user->user_ns and
    instead go straight to cred->user_ns.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • In struct cred the user member is and has always been declared struct user_struct *user.
    At most a constant struct cred will have a constant pointer to non-constant user_struct
    so remove this unnecessary cast.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

01 Apr, 2012

2 commits

  • Pull scheduler fixes from Ingo Molnar.

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix incorrect usage of for_each_cpu_mask() in select_fallback_rq()
    sched: Fix __schedule_bug() output when called from an interrupt
    sched/arch: Introduce the finish_arch_post_lock_switch() scheduler callback

    Linus Torvalds
     
  • Pull perf updates and fixes from Ingo Molnar:
    "It's mostly fixes, but there's also two late items:

    - preliminary GTK GUI support for perf report
    - PMU raw event format descriptors in sysfs, to be parsed by tooling

    The raw event format in sysfs is a new ABI. For example for the 'CPU'
    PMU we have:

    aldebaran:~> ll /sys/bus/event_source/devices/cpu/format/*
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/any
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/cmask
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/edge
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/event
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/inv
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/offcore_rsp
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/pc
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/umask

    those lists of fields contain a specific format:

    aldebaran:~> cat /sys/bus/event_source/devices/cpu/format/offcore_rsp
    config1:0-63

    So, those who wish to specify raw events can now use the following
    event format:

    -e cpu/cmask=1,event=2,umask=3

    Most people will not want to specify any events (let alone raw
    events), they'll just use whatever default event the tools use.

    But for more obscure PMU events that have no cross-architecture
    generic events the above syntax is more usable and a bit more
    structured than specifying hex numbers."
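
    As an illustration of how tooling might consume such a format
    descriptor, the sketch below parses a string like "config1:0-63" and
    places a value into the described bit range. It is not perf's parser;
    the names pmu_field, parse_format and place_field are hypothetical, and
    only a single bit or a single bit range per descriptor is handled.

```c
#include <stdint.h>
#include <stdio.h>

/* One parsed format descriptor, e.g. "config1:0-63". */
struct pmu_field {
    char config[16];   /* which config word the field lives in */
    unsigned lo;       /* lowest bit of the field */
    unsigned hi;       /* highest bit of the field (inclusive) */
};

/* Parse "name:lo-hi" or "name:bit". Returns 0 on success, -1 on error. */
int parse_format(const char *spec, struct pmu_field *f)
{
    int n = sscanf(spec, "%15[^:]:%u-%u", f->config, &f->lo, &f->hi);

    if (n == 2)
        f->hi = f->lo;             /* single-bit field */
    else if (n != 3)
        return -1;
    if (f->lo > f->hi || f->hi > 63)
        return -1;
    return 0;
}

/* Shift a field value into position, truncating it to the field width. */
uint64_t place_field(const struct pmu_field *f, uint64_t value)
{
    unsigned width = f->hi - f->lo + 1;
    uint64_t mask = (width == 64) ? UINT64_MAX
                                  : ((UINT64_C(1) << width) - 1);
    return (value & mask) << f->lo;
}
```

    Assuming, purely for illustration, that umask were described as
    "config:8-15", then umask=3 from the example event string would
    contribute 0x300 to the config word. The actual bit layout is whatever
    the PMU's format files report.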

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
    perf tools: Remove auto-generated bison/flex files
    perf annotate: Fix off by one symbol hist size allocation and hit accounting
    perf tools: Add missing ref-cycles event back to event parser
    perf annotate: addr2line wants addresses in same format as objdump
    perf probe: Finder fails to resolve function name to address
    tracing: Fix ent_size in trace output
    perf symbols: Handle NULL dso in dso__name_len
    perf symbols: Do not include libgen.h
    perf tools: Fix bug in raw sample parsing
    perf tools: Fix display of first level of callchains
    perf tools: Switch module.h into export.h
    perf: Move mmap page data_head offset assertion out of header
    perf: Fix mmap_page capabilities and docs
    perf diff: Fix to work with new hists design
    perf tools: Fix modifier to be applied on correct events
    perf tools: Fix various casting issues for 32 bits
    perf tools: Simplify event_read_id exit path
    tracing: Fix ftrace stack trace entries
    tracing: Move the tracing_on/off() declarations into CONFIG_TRACING
    perf report: Add a simple GTK2-based 'perf report' browser
    ...

    Linus Torvalds
     

31 Mar, 2012

4 commits

  • The function for_each_cpu_mask() expects a *pointer* to struct
    cpumask as its second argument, whereas select_fallback_rq()
    passes the value itself.

    And moreover, for_each_cpu_mask() has been marked as obsolete
    in include/linux/cpumask.h. So move to the more appropriate
    for_each_cpu() variant.

    Reported-by: Sasha Levin
    Signed-off-by: Srivatsa S. Bhat
    Acked-by: Peter Zijlstra
    Cc: Dave Jones
    Cc: Liu Chuansheng
    Cc: vapier@gentoo.org
    Cc: rusty@rustcorp.com.au
    Link: http://lkml.kernel.org/r/4F75BED4.9050005@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • Pull genirq updates from Thomas Gleixner.

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Adjust irq thread affinity on IRQ_SET_MASK_OK_NOCOPY return value
    genirq: Respect NUMA node affinity in setup_irq_irq affinity()
    genirq: Get rid of unneeded force parameter in irq_finalize_oneshot()
    genirq: Minor readablity improvement in irq_wake_thread()

    Linus Torvalds
     
  • Pull core locking updates from Thomas Gleixner.

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    futex: Mark get_robust_list as deprecated
    futex: Do not leak robust list to unprivileged process

    Linus Torvalds
     
  • irq_move_masked_irq() checks the return code of
    chip->irq_set_affinity() only for 0, but IRQ_SET_MASK_OK_NOCOPY is
    also a valid return code, which is there to avoid a redundant copy of
    the cpumask. But in case of IRQ_SET_MASK_OK_NOCOPY we not only avoid
    the redundant copy, we also fail to adjust the thread affinity of a
    possibly threaded interrupt handler.

    Handle IRQ_SET_MASK_OK (==0) and IRQ_SET_MASK_OK_NOCOPY (==1) return
    values correctly by checking the valid return values separately.
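
    The intended handling of the two success codes can be modelled as
    follows. This is a user-space sketch, not the patch itself: cpumasks
    are reduced to 64-bit words, and irq_model, finish_affinity_move and
    the thread_affinity_updated flag are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    IRQ_SET_MASK_OK        = 0,  /* core must copy the mask itself */
    IRQ_SET_MASK_OK_NOCOPY = 1   /* chip already updated the mask */
};

struct irq_model {
    uint64_t affinity;              /* current affinity mask */
    uint64_t pending_mask;          /* mask requested by the move */
    bool thread_affinity_updated;   /* handler thread was told to follow */
};

/* Act on the return value of the chip's set-affinity callback. */
void finish_affinity_move(struct irq_model *irq, int ret)
{
    /* Only the plain OK code asks the core to copy the mask. */
    if (ret == IRQ_SET_MASK_OK)
        irq->affinity = irq->pending_mask;

    /* Both success codes must adjust the handler thread's affinity. */
    if (ret == IRQ_SET_MASK_OK || ret == IRQ_SET_MASK_OK_NOCOPY)
        irq->thread_affinity_updated = true;
}
```

    Checking only for 0, as the old code did, would skip the second step
    whenever the callback returned IRQ_SET_MASK_OK_NOCOPY.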

    Signed-off-by: Jiang Liu
    Cc: Jiang Liu
    Cc: Keping Chen
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1333120296-13563-2-git-send-email-jiang.liu@huawei.com
    Signed-off-by: Thomas Gleixner

    Jiang Liu
     

30 Mar, 2012

8 commits

  • Pull urgent cgroup fix from Tejun Heo:
    "Commit 61d1d219c4c0 ('cgroup: remove extra calls to
    find_existing_css_set') which was part of the rc1 cgroup pull request
    made writes to the cgroup "tasks" file return an uninitialized retval
    on success which can cause boot failures with systemd.

    The change stayed in linux-next for quite some time but gcc
    interestingly failed to emit a warning about using an uninitialized
    variable and the problem seems to materialize only for certain build
    combinations (probably depends on register allocation).

    It's just missing local variable initialization and the fix is trivial
    & safe. As the problem is critical when it materializes, I'm
    fast-tracking it. Also included is Li's email address change in
    MAINTAINERS."

    * 'for-3.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: cgroup_attach_task() could return -errno after success
    cgroup: update MAINTAINERS entry

    Linus Torvalds
     
  • 61d1d219c4 "cgroup: remove extra calls to find_existing_css_set" made
    cgroup_task_migrate() return void. An unfortunate side effect was
    that cgroup_attach_task() was depending on that function's return
    value to clear its @retval on the success path. On cgroup mounts
    without any subsystem with ->can_attach() callback,
    cgroup_attach_task() ended up returning @retval without initializing
    it on success.

    For some reason, gcc failed to warn about it and it didn't cause
    cgroup_attach_task() to return non-zero value in many cases, probably
    due to difference in register allocation. When the problem
    materializes, systemd fails to populate /systemd cgroup mount and
    fails to boot.

    Fix it by initializing @retval to zero on declaration.
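
    The shape of the bug and of the fix can be shown with a small
    user-space model. The function attach_task_model and the callback
    helpers are hypothetical; the point is only that retval must have a
    defined value when no ->can_attach() style callback runs.

```c
#include <errno.h>
#include <stddef.h>

typedef int (*can_attach_fn)(void);

/* Example callbacks for the sketch. */
int always_allow(void) { return 0; }
int always_deny(void)  { return -EPERM; }

/*
 * Model of an attach path: each subsystem may veto the move through an
 * optional callback. With no callbacks at all, nothing ever assigns
 * retval, so it has to start out as 0.
 */
int attach_task_model(const can_attach_fn *callbacks, size_t n)
{
    int retval = 0;   /* the fix: initialized at declaration */

    for (size_t i = 0; i < n; i++) {
        if (!callbacks[i])
            continue;             /* subsystem has no callback */
        retval = callbacks[i]();
        if (retval)
            return retval;        /* a subsystem refused the move */
    }

    /* The migration step returns void, so it cannot reset retval. */
    return retval;
}
```

    Without the initializer, the call with an empty callback list would
    return whatever happened to be in the variable, which is undefined
    behaviour in C and explains why the failure depended on the build.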

    Signed-off-by: Tejun Heo
    Reported-by: Jiri Kosina
    LKML-Reference:
    Reviewed-by: Mandeep Singh Baines
    Acked-by: Li Zefan

    Tejun Heo
     
  • Pull x32 support for x86-64 from Ingo Molnar:
    "This tree introduces the X32 binary format and execution mode for x86:
    32-bit data space binaries using 64-bit instructions and 64-bit kernel
    syscalls.

    This allows applications whose working set fits into a 32-bit address
    space to make use of 64-bit instructions while using a 32-bit address
    space with shorter pointers, more compressed data structures, etc."

    Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}

    * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    x32: Fix alignment fail in struct compat_siginfo
    x32: Fix stupid ia32/x32 inversion in the siginfo format
    x32: Add ptrace for x32
    x32: Switch to a 64-bit clock_t
    x32: Provide separate is_ia32_task() and is_x32_task() predicates
    x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
    x86/x32: Fix the binutils auto-detect
    x32: Warn and disable rather than error if binutils too old
    x32: Only clear TIF_X32 flag once
    x32: Make sure TS_COMPAT is cleared for x32 tasks
    fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
    fs: Fix close_on_exec pointer in alloc_fdtable
    x32: Drop non-__vdso weak symbols from the x32 VDSO
    x32: Fix coding style violations in the x32 VDSO code
    x32: Add x32 VDSO support
    x32: Allow x32 to be configured
    x32: If configured, add x32 system calls to system call tables
    x32: Handle process creation
    x32: Signal-related system calls
    x86: Add #ifdef CONFIG_COMPAT to
    ...

    Linus Torvalds
     
  • Pull more ARM updates from Russell King.

    This got a fair number of conflicts with the split, but
    also with some other sparse-irq and header file include cleanups. They
    all looked pretty trivial, though.

    * 'for-linus' of git://git.linaro.org/people/rmk/linux-arm: (59 commits)
    ARM: fix Kconfig warning for HAVE_BPF_JIT
    ARM: 7361/1: provide XIP_VIRT_ADDR for no-MMU builds
    ARM: 7349/1: integrator: convert to sparse irqs
    ARM: 7259/3: net: JIT compiler for packet filters
    ARM: 7334/1: add jump label support
    ARM: 7333/2: jump label: detect %c support for ARM
    ARM: 7338/1: add support for early console output via semihosting
    ARM: use set_current_blocked() and block_sigmask()
    ARM: exec: remove redundant set_fs(USER_DS)
    ARM: 7332/1: extract out code patch function from kprobes
    ARM: 7331/1: extract out insn generation code from ftrace
    ARM: 7330/1: ftrace: use canonical Thumb-2 wide instruction format
    ARM: 7351/1: ftrace: remove useless memory checks
    ARM: 7316/1: kexec: EOI active and mask all interrupts in kexec crash path
    ARM: Versatile Express: add NO_IOPORT
    ARM: get rid of asm/irq.h in asm/prom.h
    ARM: 7319/1: Print debug info for SIGBUS in user faults
    ARM: 7318/1: gic: refactor irq_start assignment
    ARM: 7317/1: irq: avoid NULL check in for_each_irq_desc loop
    ARM: 7315/1: perf: add support for the Cortex-A7 PMU
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar.

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    cpusets: Remove an unused variable
    sched/rt: Improve pick_next_highest_task_rt()
    sched: Fix select_fallback_rq() vs cpu_active/cpu_online
    sched/x86/smp: Do not enable IRQs over calibrate_delay()
    sched: Fix compiler warning about declared inline after use
    MAINTAINERS: Update email address for SCHEDULER and PERF EVENTS

    Linus Torvalds
     
  • Pull x86 updates from Ingo Molnar.

    This touches some non-x86 files due to the sanitized INLINE_SPIN_UNLOCK
    config usage.

    Fixed up trivial conflicts due to just header include changes (removing
    headers due to cpu_idle() merge clashing with the split).

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/apic/amd: Be more verbose about LVT offset assignments
    x86, tls: Off by one limit check
    x86/ioapic: Add io_apic_ops driver layer to allow interception
    x86/olpc: Add debugfs interface for EC commands
    x86: Merge the x86_32 and x86_64 cpu_idle() functions
    x86/kconfig: Remove CONFIG_TR=y from the defconfigs
    x86: Stop recursive fault in print_context_stack after stack overflow
    x86/io_apic: Move and reenable irq only when CONFIG_GENERIC_PENDING_IRQ=y
    x86/apic: Add separate apic_id_valid() functions for selected apic drivers
    locking/kconfig: Simplify INLINE_SPIN_UNLOCK usage
    x86/kconfig: Update defconfigs
    x86: Fix excessive MSR print out when show_msr is not specified

    Linus Torvalds
     
  • Pull timer core updates from Thomas Gleixner.

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ia64: vsyscall: Add missing paranthesis
    alarmtimer: Don't call rtc_timer_init() when CONFIG_RTC_CLASS=n
    x86: vdso: Put declaration before code
    x86-64: Inline vdso clock_gettime helpers
    x86-64: Simplify and optimize vdso clock_gettime monotonic variants
    kernel-time: fix s/then/than/ spelling errors
    time: remove no_sync_cmos_clock
    time: Avoid scary backtraces when warning of > 11% adj
    alarmtimer: Make sure we initialize the rtctimer
    ntp: Fix leap-second hrtimer livelock
    x86, tsc: Skip refined tsc calibration on systems with reliable TSC
    rtc: Provide flag for rtc devices that don't support UIE
    ia64: vsyscall: Use seqcount instead of seqlock
    x86: vdso: Use seqcount instead of seqlock
    x86: vdso: Remove bogus locking in update_vsyscall_tz()
    time: Remove bogus comments
    time: Fix change_clocksource locking
    time: x86: Fix race switching from vsyscall to non-vsyscall clock

    Linus Torvalds
     
  • The debugfs code is really generic for all platforms. This patch removes the
    powerpc-specific directory reference and makes it available to all
    architectures.

    Signed-off-by: Grant Likely

    Grant Likely
     

29 Mar, 2012

5 commits

  • Merge reason: It has not gone upstream via the ARM tree, merge it via
    the scheduler tree.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Notify get_robust_list users that the syscall is going away.

    Suggested-by: Thomas Gleixner
    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Jiri Kosina
    Cc: Eric W. Biederman
    Cc: David Howells
    Cc: Serge E. Hallyn
    Cc: kernel-hardening@lists.openwall.com
    Cc: spender@grsecurity.net
    Link: http://lkml.kernel.org/r/20120323190855.GA27213@www.outflux.net
    Signed-off-by: Thomas Gleixner

    Kees Cook
     
  • It was possible to extract the robust list head address from a setuid
    process if it had used set_robust_list(), allowing an ASLR info leak. This
    changes the permission checks to be the same as those used for similar
    info that comes out of /proc.

    Running a setuid program that uses robust futexes would have had:
    cred->euid != pcred->euid
    cred->euid == pcred->uid
    so the old permissions check would allow it. I'm not aware of any setuid
    programs that use robust futexes, so this is just a preventative measure.
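
    The credential situation above can be checked with a small model. The
    first function mirrors the old check only as far as this message
    describes it (access is allowed when the caller's euid matches either
    the target's euid or its uid). The second function is an assumption
    standing in for a stricter /proc-style check; it is not the kernel's
    actual logic. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct cred_model {
    uint32_t uid;    /* real uid */
    uint32_t euid;   /* effective uid */
};

/* Old behaviour as described: either match is enough. */
bool old_check_allows(const struct cred_model *caller,
                      const struct cred_model *target)
{
    return caller->euid == target->euid || caller->euid == target->uid;
}

/*
 * Assumed stricter rule for illustration: the caller's uid must match
 * both the target's real and effective uid.
 */
bool stricter_check_allows(const struct cred_model *caller,
                           const struct cred_model *target)
{
    return caller->uid == target->uid && caller->uid == target->euid;
}
```

    For a caller with uid and euid 1000 and a setuid-root target with uid
    1000 and euid 0, the old check allows access because the caller's euid
    equals the target's real uid, while the stricter model refuses it.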

    (This patch is based on changes from grsecurity.)

    Signed-off-by: Kees Cook
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Jiri Kosina
    Cc: Eric W. Biederman
    Cc: David Howells
    Cc: Serge E. Hallyn
    Cc: kernel-hardening@lists.openwall.com
    Cc: spender@grsecurity.net
    Link: http://lkml.kernel.org/r/20120319231253.GA20893@www.outflux.net
    Signed-off-by: Thomas Gleixner

    Kees Cook
     
  • We respect node affinity of devices already in the irq descriptor
    allocation, but we ignore it for the initial interrupt affinity
    setup, so the interrupt might be routed to a different node.

    Restrict the default affinity mask to the node on which the irq
    descriptor is allocated.
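
    Restricting the default mask to the descriptor's node amounts to a mask
    intersection, sketched below with 64-bit words standing in for
    cpumasks. The names NO_NODE and default_irq_affinity are hypothetical,
    and the fallback to the unrestricted mask when the node has no usable
    CPU is an assumption of this sketch.

```c
#include <stdint.h>

#define NO_NODE (-1)   /* descriptor has no node preference */

/*
 * Compute the initial affinity for an interrupt: start from the online
 * CPUs allowed by the default mask, then narrow to the descriptor's node
 * when that still leaves at least one CPU.
 */
uint64_t default_irq_affinity(uint64_t online, uint64_t default_mask,
                              uint64_t node_mask, int node)
{
    uint64_t mask = online & default_mask;

    if (node != NO_NODE && (mask & node_mask) != 0)
        mask &= node_mask;   /* keep the interrupt on its own node */

    return mask;
}
```

    With eight online CPUs (0xff) and a node covering CPUs 4 to 7 (0xf0),
    the initial affinity becomes 0xf0 instead of 0xff, so the interrupt is
    routed to the node where its descriptor was allocated.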

    [ tglx: Massaged changelog ]

    Signed-off-by: Prarit Bhargava
    Acked-by: Neil Horman
    Cc: Yinghai Lu
    Cc: David Rientjes
    Link: http://lkml.kernel.org/r/1332788538-17425-1-git-send-email-prarit@redhat.com
    Signed-off-by: Thomas Gleixner

    Prarit Bhargava
     
  • The only place irq_finalize_oneshot() is called with the force parameter set
    is the threaded handler error exit path. But IRQTF_RUNTHREAD is dropped
    at this point and irq_wake_thread() is not going to set it again,
    since PF_EXITING is set for this thread already. So irq_finalize_oneshot()
    will drop the thread's bit in threads_oneshot anyway and hence the force
    parameter is superfluous.

    Signed-off-by: Alexander Gordeev
    Link: http://lkml.kernel.org/r/20120321162234.GP24806@dhcp-26-207.brq.redhat.com
    Signed-off-by: Thomas Gleixner

    Alexander Gordeev