18 Jan, 2012

27 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit: (29 commits)
    audit: no leading space in audit_log_d_path prefix
    audit: treat s_id as an untrusted string
    audit: fix signedness bug in audit_log_execve_info()
    audit: comparison on interprocess fields
    audit: implement all object interfield comparisons
    audit: allow interfield comparison between gid and ogid
    audit: complex interfield comparison helper
    audit: allow interfield comparison in audit rules
    Kernel: Audit Support For The ARM Platform
    audit: do not call audit_getname on error
    audit: only allow tasks to set their loginuid if it is -1
    audit: remove task argument to audit_set_loginuid
    audit: allow audit matching on inode gid
    audit: allow matching on obj_uid
    audit: remove audit_finish_fork as it can't be called
    audit: reject entry,always rules
    audit: inline audit_free to simplify the look of generic code
    audit: drop audit_set_macxattr as it doesn't do anything
    audit: inline checks for not needing to collect aux records
    audit: drop some potentially inadvisable likely notations
    ...

    Use evil merge to fix up grammar mistakes in Kconfig file.

    Bad speling and horrible grammar (and copious swearing) is to be
    expected, but let's keep it to commit messages and comments, rather than
    expose it to users in config help texts or printouts.

    Linus Torvalds
     
  • audit_log_d_path() injects an additional space before the prefix,
    which serves no purpose and doesn't mix well with other audit_log*()
    functions that do not sneak extra characters into the log.

    Signed-off-by: Kees Cook
    Signed-off-by: Eric Paris

    Kees Cook
     
  • In the loop, a size_t "len" is used to hold the return value of
    audit_log_single_execve_arg(), which returns -1 on error. In that
    case the error handling (len
    Signed-off-by: Eric Paris

    Xi Wang
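
The commit message above is cut off in this listing, so the exact check it refers to is not visible here. As a generic illustration of this class of bug, the following minimal sketch assumes a `<= 0` style error test; `copy_one_arg()` and both wrapper functions are hypothetical names, not kernel code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a helper that returns a length, or -1 on error. */
static int copy_one_arg(int fail)
{
    return fail ? -1 : 8;
}

/* Problem pattern: the result lands in an unsigned size_t, so -1 turns into
 * a huge positive value and a "<= 0" test only ever catches exactly 0. */
static int error_detected_unsigned(int fail)
{
    size_t len = copy_one_arg(fail);
    return len <= 0;
}

/* Fixed pattern: a signed int keeps -1 negative, so the test fires. */
static int error_detected_signed(int fail)
{
    int len = copy_one_arg(fail);
    return len <= 0;
}
```

With the unsigned variable the failure goes unnoticed; with the signed one it is reported.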
     
  • This allows audit rules to compare two fields of a process, for example
    whether the running process uid differs from the running process euid.

    Signed-off-by: Peter Moody
    Signed-off-by: Eric Paris

    Peter Moody
     
  • This completes the matrix of interfield comparisons between uid/gid
    information for the current task and the uid/gid information for inodes.
    For example, one can now audit based on differences between the euid of
    the process and the uid of fs objects.

    Signed-off-by: Peter Moody
    Signed-off-by: Eric Paris

    Peter Moody
     
  • Allow audit rules to compare the gid of the running task to the gid of the
    inode in question.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • Rather than code the same loop over and over, implement a helper function
    which uses some pointer magic to make it generic enough to be used in
    numerous places as we implement more audit interfield comparisons.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • We wish to be able to audit when a uid=500 task accesses a file which is
    uid=0, or vice versa. This patch introduces a new audit filter type,
    AUDIT_FIELD_COMPARE, whose value is an 'enum' indicating which fields
    should be compared. At this point we only define the task->uid vs
    inode->uid comparison, but other comparisons can be added.

    Signed-off-by: Eric Paris

    Eric Paris
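
To make the idea of an enum-selected field comparison concrete, here is a minimal user-space sketch. All type and constant names in it (`field_compare`, `compare_op`, `task_ids`, `inode_ids`) are hypothetical and chosen for illustration; the kernel's real data structures and constants differ.

```c
#include <stdbool.h>

/* Hypothetical selector naming which pair of fields to compare. */
enum field_compare {
    COMPARE_UID_TO_OBJ_UID   /* task uid vs. uid of the inode owner */
};

/* Hypothetical comparison operators a rule might request. */
enum compare_op { OP_EQUAL, OP_NOT_EQUAL };

struct task_ids  { unsigned int uid; };
struct inode_ids { unsigned int uid; };

static bool compare_ids(unsigned int left, unsigned int right,
                        enum compare_op op)
{
    switch (op) {
    case OP_EQUAL:     return left == right;
    case OP_NOT_EQUAL: return left != right;
    }
    return false;
}

/* Dispatch on the selector; further comparisons would add more cases. */
static bool field_compare_matches(enum field_compare which,
                                  enum compare_op op,
                                  const struct task_ids *task,
                                  const struct inode_ids *inode)
{
    switch (which) {
    case COMPARE_UID_TO_OBJ_UID:
        return compare_ids(task->uid, inode->uid, op);
    }
    return false;
}
```

A rule asking for "task uid differs from file owner uid" would then match a uid=500 task touching a uid=0 file.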
     
  • Just a code cleanup really. We don't need to make a function call just for
    it to return on error. This also makes the VFS function even easier to follow
    and removes a conditional on a hot path.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • At the moment we allow tasks to set their loginuid if they have
    CAP_AUDIT_CONTROL. In reality we want tasks to set the loginuid when they
    log in and for it to be impossible to ever reset. We had to make it mutable
    even after it was once set (with the CAP) because on update an admin might
    have to restart sshd. Now sshd would get his loginuid and the next user who
    logged in using ssh would not be able to set his loginuid.

    Systemd has changed how userspace works and allowed us to make the kernel
    work the way it should. With systemd users (even admins) are not supposed
    to restart services directly. The system will restart the service for
    them. Thus since systemd is going to have loginuid==-1, sshd would get -1, and
    sshd would be allowed to set a new loginuid without special permissions.

    If an admin in this system were to manually start an sshd he is inserting
    himself into the system chain of trust and thus, logically, it's his
    loginuid that should be used! Since we have old systems I make this a
    Kconfig option.

    Signed-off-by: Eric Paris

    Eric Paris
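
The "set once, only while still -1" rule described above can be sketched as follows. This is a simplified user-space model: `struct task`, `LOGINUID_UNSET` and `set_loginuid_once()` are hypothetical names, and returning -EPERM for a second attempt is an assumption of this sketch.

```c
#include <errno.h>

/* Sentinel meaning "no loginuid has been assigned yet" (-1 as unsigned). */
#define LOGINUID_UNSET ((unsigned int)-1)

struct task {
    unsigned int loginuid;
};

/* Returns 0 on success, or -EPERM once a loginuid is already in place. */
static int set_loginuid_once(struct task *t, unsigned int new_uid)
{
    if (t->loginuid != LOGINUID_UNSET)
        return -EPERM;
    t->loginuid = new_uid;
    return 0;
}
```

A freshly started service inheriting the unset value can assign a loginuid once; any later attempt is refused and the stored value is left alone.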
     
  • The function always deals with current. Don't expose an option
    pretending one can use it for something. You can't.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • Much like the ability to filter audit on the uid of an inode collected, we
    should be able to filter on the gid of the inode.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • Allow syscall exit filter matching based on the uid of the owner of an
    inode used in a syscall. aka:

    auditctl -a always,exit -S open -F obj_uid=0 -F perm=wa

    Signed-off-by: Eric Paris

    Eric Paris
     
  • Audit entry,always rules are not allowed and are automatically changed into
    exit,always rules in userspace. The kernel refuses to load such rules.

    Thus a task in the middle of a syscall (and thus in audit_finish_fork())
    can only be in one of two states: AUDIT_BUILD_CONTEXT or AUDIT_DISABLED.
    Since the current task cannot be in AUDIT_RECORD_CONTEXT we aren't ever
    going to actually use the code in audit_finish_fork() since it will
    return without doing anything. Thus drop the code.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • We deprecated entry,always rules a long time ago. Reject those rules as
    invalid.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • Make the conditional a static inline instead of doing it in generic code.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • A number of audit hooks make function calls before they determine that
    auxiliary records do not need to be collected. Do those checks as static
    inlines since the most common case is going to be that records are not
    needed and we can skip the function call overhead.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • The audit code makes heavy use of likely() and unlikely() macros, but they
    don't always make sense. Drop any that seem questionable and let the
    computer do its thing.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • Audit contexts have 3 states. Disabled, which doesn't collect anything,
    build, which collects info but might not emit it, and record, which
    collects and emits. There is a 4th state, setup, which isn't used. Get
    rid of it.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • Every arch calls:

    if (unlikely(current->audit_context))
    audit_syscall_entry()

    which requires knowledge about audit (the existence of audit_context) in
    the arch code. Just do it all in a static inline in audit.h so that arches
    can remain blissfully ignorant.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • The audit system previously expected arches calling to audit_syscall_exit to
    supply as arguments if the syscall was a success and what the return code was.
    Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things
    by converting from negative retcodes to an audit internal magic value stating
    success or failure. This helper was wrong and could indicate that a valid
    pointer returned to userspace was a failed syscall. The fix is to fix the
    layering foolishness. We now pass audit_syscall_exit a struct pt_regs and it
    in turn calls back into arch code to collect the return value and to
    determine if the syscall was a success or failure. We also define a generic
    is_syscall_success() macro which determines success/failure based on if the
    value is < -MAX_ERRNO. This works for arches like x86 which do not use a
    separate mechanism to indicate syscall failure.

    We make both the is_syscall_success() and regs_return_value() static inlines
    instead of macros. The reason is because the audit function must take a void*
    for the regs. (uml calls theirs struct uml_pt_regs instead of just struct
    pt_regs so audit_syscall_exit can't take a struct pt_regs). Since the audit
    function takes a void* we need to use static inlines to cast it back to the
    arch correct structure to dereference it.

    The other major change is that on some arches, like ia64, MIPS and ppc, we
    change regs_return_value() to give us the negative value on syscall failure.
    The only other user of this macro, kretprobe_example.c, won't notice and it
    makes the value signed consistently for the audit functions across all archs.

    In arch/sh/kernel/ptrace_64.c I see that we were using regs[9] in the old
    audit code as the return value. But the ptrace_64.h code defined the macro
    regs_return_value() as regs[3]. I have no idea which one is correct, but this
    patch now uses the regs_return_value() function, so it now uses regs[3].

    For powerpc we previously used regs->result but now use the
    regs_return_value() function which uses regs->gprs[3]. regs->gprs[3] is
    always positive so regs_return_value(), much like on ia64, makes it negative
    before calling the audit code when appropriate.

    Signed-off-by: Eric Paris
    Acked-by: H. Peter Anvin [for x86 portion]
    Acked-by: Tony Luck [for ia64]
    Acked-by: Richard Weinberger [for uml]
    Acked-by: David S. Miller [for sparc]
    Acked-by: Ralf Baechle [for mips]
    Acked-by: Benjamin Herrenschmidt [for ppc]

    Eric Paris
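
The difference between a plain sign test and a range test against -MAX_ERRNO can be sketched in user space as below. `MAX_ERRNO` is 4095 in the kernel's err.h; the two function names are hypothetical and this is not the kernel's macro, only an illustration of the comparison the message describes.

```c
#include <stdbool.h>

#define MAX_ERRNO 4095   /* same value the kernel's err.h uses */

/* Naive check: anything negative is treated as a failure. A return value
 * that is really a high address looks negative and is misreported. */
static bool naive_success(long ret)
{
    return ret >= 0;
}

/* Range check: only the top MAX_ERRNO values of the unsigned range are
 * error codes; everything below that counts as success. */
static bool range_success(long ret)
{
    return (unsigned long)ret < (unsigned long)-MAX_ERRNO;
}
```

A value such as -65536, read as the bit pattern of a high address, fails the naive check but passes the range check, while a genuine error code such as -14 fails both.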
     
  • The audit system likes to collect information about processes that end
    abnormally (SIGSEGV) as this may be useful intrusion detection information.
    This patch adds audit support to collect information when seccomp forces a
    task to exit because of misbehavior in a similar way.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • The audit system has the ability to filter on the major and minor number of
    the device containing the inode being operated upon. Let's say that
    /dev/sda1 has major,minor 8,1 and that we mount /dev/sda1 on /boot. Now let's
    say we add a watch with a filter on 8,1. If we proceed to open an inode
    inside /boot, such as /boot/vmlinuz, we will match the major,minor filter.

    Let's instead assume that one were to use a tool like debugfs and were to
    open /dev/sda1 directly and to modify its contents. We might hope that
    this would also be logged, but it isn't. The rules will check the
    major,minor of the device containing /dev/sda1. In other words the rule
    would match on the major/minor of the tmpfs mounted at /dev.

    I believe these rules should trigger on either device. The man page is
    devoid of useful information about the intended semantics. It only seems
    logical that if you want to know everything that happened on a major,minor
    that would include things that happened to the device itself...

    Signed-off-by: Eric Paris

    Eric Paris
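
The "trigger on either device" semantics argued for above can be sketched like this. It is a simplification under stated assumptions: the struct and function names are hypothetical, major and minor are checked together in one call, and the 0,5 numbers used for the tmpfs in the usage note are made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical record of what was collected for one inode. */
struct collected_inode {
    unsigned int dev_major, dev_minor;    /* device containing the inode */
    unsigned int rdev_major, rdev_minor;  /* device the inode represents,
                                             if it is a device node */
};

/* Match when either the containing device or the device node itself
 * carries the requested major,minor pair. */
static bool matches_device(const struct collected_inode *n,
                           unsigned int major, unsigned int minor)
{
    bool on_containing_dev = n->dev_major == major && n->dev_minor == minor;
    bool is_the_device = n->rdev_major == major && n->rdev_minor == minor;

    return on_containing_dev || is_the_device;
}
```

With this rule a filter on 8,1 matches both a file stored on that device and the /dev/sda1 node itself, even though the node lives on a different filesystem.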
     
  • userspace audit messages look like so:

    type=USER msg=audit(1271170549.415:24710): user pid=14722 uid=0 auid=500 ses=1 subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 msg=''

    That third field just says 'user'. That's useless and doesn't follow the
    key=value pair we are trying to enforce. We already know it came from the
    user based on the record type. Kill that word. Die.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • This patch does 2 things. First it reduces the number of audit_names
    allocated in every audit context from 20 to 5. 5 should be enough for all
    'normal' syscalls (rename being the worst). Some syscalls can still touch
    more than 5 inodes, such as mount. When an rpc filesystem is mounted it will
    create inodes and those can exceed 5. To handle that problem this patch will
    dynamically allocate audit_names if it needs more than 5. This should
    decrease the typical memory usage while still supporting all the possible
    kernel operations.

    Signed-off-by: Eric Paris

    Eric Paris
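
The "small fixed pool first, heap only on overflow" scheme described above can be sketched in user space as follows. The names (`name_entry`, `name_list`, `name_list_add`, `name_list_free`) are hypothetical and the layout is an assumption of this sketch rather than the kernel's actual structures.

```c
#include <stdbool.h>
#include <stdlib.h>

#define PREALLOCATED_NAMES 5

struct name_entry {
    unsigned long ino;
    struct name_entry *next;
    bool should_free;          /* true only for heap-allocated entries */
};

struct name_list {
    struct name_entry preallocated[PREALLOCATED_NAMES];
    struct name_entry *head;   /* most recently added entry */
    int count;
};

/* Use a preallocated slot while one is free, otherwise fall back to malloc. */
static struct name_entry *name_list_add(struct name_list *l, unsigned long ino)
{
    struct name_entry *e;

    if (l->count < PREALLOCATED_NAMES) {
        e = &l->preallocated[l->count];
        e->should_free = false;
    } else {
        e = malloc(sizeof(*e));
        if (!e)
            return NULL;
        e->should_free = true;
    }
    e->ino = ino;
    e->next = l->head;
    l->head = e;
    l->count++;
    return e;
}

/* Release only the entries that came from the heap. */
static void name_list_free(struct name_list *l)
{
    struct name_entry *e = l->head;

    while (e) {
        struct name_entry *next = e->next;
        if (e->should_free)
            free(e);
        e = next;
    }
    l->head = NULL;
    l->count = 0;
}
```

The common case never touches the allocator; only an operation that collects a sixth name pays for a dynamic allocation.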
     
  • Every other filter that matches part of the inodes list collected by audit
    will match against any of the inodes on that list. The filetype matching
    however had a strange way of doing things. It allowed userspace to
    indicate whether it should match on the first or the second name collected by
    the kernel. Name collection ordering seems like a kernel internal and
    making userspace rules get that right just seems like a bad idea. As it
    turns out the userspace audit writers had no idea it was doing this and
    thus never overloaded the value field. The kernel always checked the first
    name collected which for the tested rules was always correct.

    This patch just makes the filetype matching like the major, minor, inode,
    and LSM rules in that it will match against any of the names collected. It
    also changes the rule validation to reject the old unused rule types.

    No one knew it was there. No one used it. Why keep around the extra code?

    Signed-off-by: Eric Paris

    Eric Paris
     
  • This reverts commit d2a7009f0bb03fa22ad08dd25472efa0568126b9.

    J. R. Okajima explains:

    "After this commit, I am afraid access(2) on NFS may not work
    correctly. The scenario based upon my guess.
    - access(2) overrides the credentials.
    - calls inode_permission() -- ... -- generic_permission() --
    ns_capable().
    - while the old ns_capable() calls security_capable(current_cred()),
    the new ns_capable() calls has_ns_capability(current) --
    security_capable(__task_cred(t)).

    current_cred() returns current->cred which is effective (overridden)
    credentials, but __task_cred(current) returns current->real_cred (the
    NFSD's credential). And the overridden credentials by access(2) lost."

    Requested-by: J. R. Okajima
    Acked-by: Eric Paris
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Jan, 2012

1 commit


16 Jan, 2012

3 commits

  • Recent changes to kernel/module.c caused the following compile
    error:

    kernel/module.c: In function ‘show_taint’:
    kernel/module.c:1024:2: error: implicit declaration of function ‘module_flags_taint’ [-Werror=implicit-function-declaration]
    cc1: some warnings being treated as errors

    Correct this error by moving the definition of module_flags_taint
    outside of the #ifdef CONFIG_MODULE_UNLOAD section.

    Signed-off-by: Kevin Winchester
    Signed-off-by: Linus Torvalds

    Kevin Winchester
     
  • * 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
    Revert "block: recursive merge requests"
    block: Stop using macro stubs for the bio data integrity calls
    blockdev: convert some macros to static inlines
    fs: remove unneeded plug in mpage_readpages()
    block: Add BLKROTATIONAL ioctl
    block: Introduce blk_set_stacking_limits function
    block: remove WARN_ON_ONCE() in exit_io_context()
    block: an exiting task should be allowed to create io_context
    block: ioc_cgroup_changed() needs to be exported
    block: recursive merge requests
    block, cfq: fix empty queue crash caused by request merge
    block, cfq: move icq creation and rq->elv.icq association to block core
    block, cfq: restructure io_cq creation path for io_context interface cleanup
    block, cfq: move io_cq exit/release to blk-ioc.c
    block, cfq: move icq cache management to block core
    block, cfq: move io_cq lookup to blk-ioc.c
    block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
    block, cfq: reorganize cfq_io_context into generic and cfq specific parts
    block: remove elevator_queue->ops
    block: reorder elevator switch sequence
    ...

    Fix up conflicts in:
    - block/blk-cgroup.c
    Switch from can_attach_task to can_attach
    - block/cfq-iosched.c
    conflict with now removed cic index changes (we now use q->id instead)

    Linus Torvalds
     
  • * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    perf tools: Fix compile error on x86_64 Ubuntu
    perf report: Fix --stdio output alignment when --showcpuutilization used
    perf annotate: Get rid of field_sep check
    perf annotate: Fix usage string
    perf kmem: Fix a memory leak
    perf kmem: Add missing closedir() calls
    perf top: Add error message for EMFILE
    perf test: Change type of '-v' option to INCR
    perf script: Add missing closedir() calls
    tracing: Fix compile error when static ftrace is enabled
    recordmcount: Fix handling of elf64 big-endian objects.
    perf tools: Add const.h to MANIFEST to make perf-tar-src-pkg work again
    perf tools: Add support for guest/host-only profiling
    perf kvm: Do guest-only counting by default
    perf top: Don't update total_period on process_sample
    perf hists: Stop using 'self' for struct hist_entry
    perf hists: Rename total_session to total_period
    x86: Add counter when debug stack is used with interrupts enabled
    x86: Allow NMIs to hit breakpoints in i386
    x86: Keep current stack in NMI breakpoints
    ...

    Linus Torvalds
     

15 Jan, 2012

2 commits

  • * 'for-linus' of git://selinuxproject.org/~jmorris/linux-security:
    capabilities: remove __cap_full_set definition
    security: remove the security_netlink_recv hook as it is equivalent to capable()
    ptrace: do not audit capability check when outputing /proc/pid/stat
    capabilities: remove task_ns_* functions
    capabitlies: ns_capable can use the cap helpers rather than lsm call
    capabilities: style only - move capable below ns_capable
    capabilites: introduce new has_ns_capabilities_noaudit
    capabilities: call has_ns_capability from has_capability
    capabilities: remove all _real_ interfaces
    capabilities: introduce security_capable_noaudit
    capabilities: reverse arguments to security_capable
    capabilities: remove the task from capable LSM hook entirely
    selinux: sparse fix: fix several warnings in the security server cod
    selinux: sparse fix: fix warnings in netlink code
    selinux: sparse fix: eliminate warnings for selinuxfs
    selinux: sparse fix: declare selinux_disable() in security.h
    selinux: sparse fix: move selinux_complete_init
    selinux: sparse fix: make selinux_secmark_refcount static
    SELinux: Fix RCU deref check warning in sel_netport_insert()

    Manually fix up a semantic mis-merge wrt security_netlink_recv():

    - the interface was removed in commit fd7784615248 ("security: remove
    the security_netlink_recv hook as it is equivalent to capable()")

    - a new user of it appeared in commit a38f7907b926 ("crypto: Add
    userspace configuration API")

    causing no automatic merge conflict, but Eric Paris pointed out the
    issue.

    Linus Torvalds
     
  • Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1

    * tag 'for-linus' of git://github.com/rustyrussell/linux:
    module_param: check that bool parameters really are bool.
    intelfbdrv.c: bailearly is an int module_param
    paride/pcd: fix bool verbose module parameter.
    module_param: make bool parameters really bool (drivers & misc)
    module_param: make bool parameters really bool (arch)
    module_param: make bool parameters really bool (core code)
    kernel/async: remove redundant declaration.
    printk: fix unnecessary module_param_name.
    lirc_parallel: fix module parameter description.
    module_param: avoid bool abuse, add bint for special cases.
    module_param: check type correctness for module_param_array
    modpost: use linker section to generate table.
    modpost: use a table rather than a giant if/else statement.
    modules: sysfs - export: taint, coresize, initsize
    kernel/params: replace DEBUGP with pr_debug
    module: replace DEBUGP with pr_debug
    module: struct module_ref should contains long fields
    module: Fix performance regression on modules with large symbol tables
    module: Add comments describing how the "strmap" logic works

    Fix up conflicts in scripts/mod/file2alias.c due to the new linker-
    generated table approach to adding __mod_*_device_table entries. The
    ARM sa11x0 mcp bus needed to be converted to that too.

    Linus Torvalds
     

14 Jan, 2012

2 commits

  • For a compressed image, the space required is not known until
    we finish compressing and writing all pages.
    This patch drops the check. If swap space turns out to be insufficient,
    the system can still return to normal after writing
    swap fails for compressed images.

    Signed-off-by: Barry Song
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Barry Song
     
  • After commit 1eb208aea3179dd2fc0cdeea45ef869d75b4fe70, "PM: Make
    CONFIG_PM depend on (CONFIG_PM_SLEEP || CONFIG_PM_RUNTIME)", the
    files under kernel/power are not built unless CONFIG_PM_SLEEP or
    CONFIG_PM_RUNTIME is set. In particular, this causes
    kernel/power/poweroff.c to be omitted, even though it should be
    compiled, because CONFIG_MAGIC_SYSRQ is set.

    Fix the problem by causing kernel/power/Makefile to be processed
    for CONFIG_PM unset too.

    Reported-and-tested-by: Phil Oester
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

13 Jan, 2012

5 commits

  • When we restore a task we need to set up text, data and data heap sizes
    from userspace to the values a task had at checkpoint time. This patch
    adds auxiliary prctl codes for that.

    While most of them have a statistical nature (their values are involved
    in the calculation of /proc/<pid>/statm output) the start_brk and brk values
    are used to compute an allowed size of program data segment expansion.
    This means arbitrary changes to these values might be a dangerous
    operation. So to restrict access the following requirements are applied to
    prctl calls:

    - The process has to have CAP_SYS_ADMIN capability granted.
    - For all opcodes except start_brk/brk members an appropriate
    VMA area must exist and should fit certain VMA flags,
    such as:
    - code segment must be executable but not writable;
    - data segment must not be executable.

    start_brk/brk values must not intersect with data segment and must not
    exceed RLIMIT_DATA resource limit.

    Still the main guard is CAP_SYS_ADMIN capability check.

    Note the kernel should be compiled with CONFIG_CHECKPOINT_RESTORE support
    otherwise these prctl calls will return -EINVAL.

    [akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • When an oops causes a panic and panic prints another backtrace it's pretty
    common to have the original oops data be scrolled away on an 80x50 screen.

    The second backtrace is quite redundant and not needed anyway.

    So don't print the panic backtrace when oops_in_progress is true.

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Andi Kleen
    Cc: Michael Holzheu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • The sysctl works on the current task's pid namespace, getting and setting
    its last_pid field.

    Writing is allowed for CAP_SYS_ADMIN-capable tasks, thus making it possible
    to create a task with a desired pid value. This ability is badly needed
    for checkpoint/restore in userspace.

    This approach suits all the parties for now.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • When two CPUs call panic at the same time there is a possible race
    condition that can stop kdump. The first CPU calls crash_kexec() and the
    second CPU calls smp_send_stop() in panic() before crash_kexec() finished
    on the first CPU. So the second CPU stops the first CPU and therefore
    kdump fails:

    1st CPU:
    panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump

    2nd CPU:
    panic()->crash_kexec()->kexec_mutex already held by 1st CPU
    ->smp_send_stop()-> stop 1st CPU (stop kdump)

    This patch fixes the problem by introducing a spinlock in panic that
    allows only one CPU to process crash_kexec() and the subsequent panic
    code.

    All other CPUs call the weak function panic_smp_self_stop() that stops the
    CPU itself. This function can be overloaded by architecture code. For
    example "tile" can use their lower-power "nap" instruction for that.

    Signed-off-by: Michael Holzheu
    Acked-by: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
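
The "only one CPU gets to run the panic path" idea can be modeled in user space with a C11 atomic flag. This is an analogy under stated assumptions, not the kernel's spinlock code: `panic_claimed` and `claim_panic_path()` are hypothetical names, and a caller that loses the race would be expected to stop itself instead of continuing.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Set by the first caller to enter the panic path; never cleared. */
static atomic_flag panic_claimed = ATOMIC_FLAG_INIT;

/* Returns true for exactly one caller. Every later caller gets false and
 * should park itself rather than interfere with the winner. */
static bool claim_panic_path(void)
{
    return !atomic_flag_test_and_set(&panic_claimed);
}
```

Because the test-and-set is atomic, two callers arriving at the same moment cannot both see true, which is the property that keeps the second CPU from stopping the first one mid-dump.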
     
  • Currently it is possible to set the crash_size via the sysfs
    /sys/kernel/kexec_crash_size even if no crash kernel memory has been
    defined with the "crashkernel" parameter. In this case "crashk_res" is
    not initialized and crashk_res.start = crashk_res.end = 0. Unfortunately
    resource_size(&crashk_res) returns 1 in this case. This breaks the s390
    implementation of crash_(un)map_reserved_pages().

    To fix the problem the correct "old_size" is now calculated in
    crash_shrink_memory(). "old_size" is set to "0" if crashk_res is not
    initialized. With this change crash_shrink_memory() will do nothing, when
    "crashk_res" is not initialized. It will return "0" for "echo 0 >
    /sys/kernel/kexec_crash_size" and -EINVAL for "echo [not zero] >
    /sys/kernel/kexec_crash_size".

    In addition to that this patch also simplifies the "ret = -EINVAL" vs.
    "ret = 0" logic as suggested by Simon Horman.

    Signed-off-by: Michael Holzheu
    Reviewed-by: Dave Young
    Reviewed-by: WANG Cong
    Reviewed-by: Simon Horman
    Cc: Vivek Goyal
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
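
The off-by-one described above comes from the inclusive size formula, which yields 1 for an all-zero resource. The following user-space sketch models that and the corrected "old_size" calculation; the struct is a cut-down stand-in, `crash_old_size()` and `check_new_size()` are hypothetical names, and treating `end == 0` as "not initialized" is an assumption of this sketch.

```c
#include <errno.h>

/* Cut-down stand-in for a resource with inclusive start and end. */
struct resource {
    unsigned long start;
    unsigned long end;
};

/* Inclusive range size: an all-zero resource still reports 1. */
static unsigned long resource_size(const struct resource *r)
{
    return r->end - r->start + 1;
}

/* Treat a resource whose end is 0 as "not initialized", i.e. size 0. */
static unsigned long crash_old_size(const struct resource *r)
{
    return (r->end == 0) ? 0 : resource_size(r);
}

/* Models only the early decisions: same size is a no-op (0), growing is
 * rejected (-EINVAL). The actual shrinking work is not modeled here. */
static int check_new_size(const struct resource *r, unsigned long new_size)
{
    unsigned long old_size = crash_old_size(r);

    if (new_size >= old_size)
        return (new_size == old_size) ? 0 : -EINVAL;
    return 0;
}
```

For an uninitialized resource this gives 0 for a requested size of 0 and -EINVAL for any non-zero size, matching the behavior the message describes.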