13 Jun, 2014

9 commits

  • Pull more security layer updates from Serge Hallyn:
    "A few more commits had previously failed to make it through
    security-next into linux-next but this week made it into linux-next.
    At least commit "ima: introduce ima_kernel_read()" was deemed critical
    by Mimi to make this merge window.

    This is a temporary tree just for this request. Mimi has pointed me
    to some previous threads about keeping maintainer trees at the
    previous release, which I'll certainly do for anything long-term,
    after talking with James"

    * 'serge-next-2' of git://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux-security:
    ima: introduce ima_kernel_read()
    evm: prohibit userspace writing 'security.evm' HMAC value
    ima: check inode integrity cache in violation check
    ima: prevent unnecessary policy checking
    evm: provide option to protect additional SMACK xattrs
    evm: replace HMAC version with attribute mask
    ima: prevent new digsig xattr from being replaced

    Linus Torvalds
     
  • Commit 8aac62706 "move exit_task_namespaces() outside of exit_notify"
    introduced a kernel oops in v3.10, which occurs when AppArmor and
    IMA-appraisal are enabled at the same time.

    ----------------------------------------------------------------------
    [ 106.750167] BUG: unable to handle kernel NULL pointer dereference at
    0000000000000018
    [ 106.750221] IP: [] our_mnt+0x1a/0x30
    [ 106.750241] PGD 0
    [ 106.750254] Oops: 0000 [#1] SMP
    [ 106.750272] Modules linked in: cuse parport_pc ppdev bnep rfcomm
    bluetooth rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl nfs lockd sunrpc
    fscache dm_crypt intel_rapl x86_pkg_temp_thermal intel_powerclamp
    kvm_intel snd_hda_codec_hdmi kvm crct10dif_pclmul crc32_pclmul
    ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul
    ablk_helper cryptd snd_hda_codec_realtek dcdbas snd_hda_intel
    snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_seq_midi
    snd_seq_midi_event snd_rawmidi psmouse snd_seq microcode serio_raw
    snd_timer snd_seq_device snd soundcore video lpc_ich coretemp mac_hid lp
    parport mei_me mei nbd hid_generic e1000e usbhid ahci ptp hid libahci
    pps_core
    [ 106.750658] CPU: 6 PID: 1394 Comm: mysqld Not tainted 3.13.0-rc7-kds+ #15
    [ 106.750673] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A08
    09/19/2012
    [ 106.750689] task: ffff8800de804920 ti: ffff880400fca000 task.ti:
    ffff880400fca000
    [ 106.750704] RIP: 0010:[] []
    our_mnt+0x1a/0x30
    [ 106.750725] RSP: 0018:ffff880400fcba60 EFLAGS: 00010286
    [ 106.750738] RAX: 0000000000000000 RBX: 0000000000000100 RCX:
    ffff8800d51523e7
    [ 106.750764] RDX: ffffffffffffffea RSI: ffff880400fcba34 RDI:
    ffff880402d20020
    [ 106.750791] RBP: ffff880400fcbae0 R08: 0000000000000000 R09:
    0000000000000001
    [ 106.750817] R10: 0000000000000000 R11: 0000000000000001 R12:
    ffff8800d5152300
    [ 106.750844] R13: ffff8803eb8df510 R14: ffff880400fcbb28 R15:
    ffff8800d51523e7
    [ 106.750871] FS: 0000000000000000(0000) GS:ffff88040d200000(0000)
    knlGS:0000000000000000
    [ 106.750910] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 106.750935] CR2: 0000000000000018 CR3: 0000000001c0e000 CR4:
    00000000001407e0
    [ 106.750962] Stack:
    [ 106.750981] ffffffff813434eb ffff880400fcbb20 ffff880400fcbb18
    0000000000000000
    [ 106.751037] ffff8800de804920 ffffffff8101b9b9 0001800000000000
    0000000000000100
    [ 106.751093] 0000010000000000 0000000000000002 000000000000000e
    ffff8803eb8df500
    [ 106.751149] Call Trace:
    [ 106.751172] [] ? aa_path_name+0x2ab/0x430
    [ 106.751199] [] ? sched_clock+0x9/0x10
    [ 106.751225] [] aa_path_perm+0x7d/0x170
    [ 106.751250] [] ? native_sched_clock+0x15/0x80
    [ 106.751276] [] aa_file_perm+0x33/0x40
    [ 106.751301] [] common_file_perm+0x8e/0xb0
    [ 106.751327] [] apparmor_file_permission+0x18/0x20
    [ 106.751355] [] security_file_permission+0x23/0xa0
    [ 106.751382] [] rw_verify_area+0x52/0xe0
    [ 106.751407] [] vfs_read+0x6d/0x170
    [ 106.751432] [] kernel_read+0x41/0x60
    [ 106.751457] [] ima_calc_file_hash+0x225/0x280
    [ 106.751483] [] ? ima_calc_file_hash+0x32/0x280
    [ 106.751509] [] ima_collect_measurement+0x9d/0x160
    [ 106.751536] [] ? trace_hardirqs_on+0xd/0x10
    [ 106.751562] [] ? ima_file_free+0x6c/0xd0
    [ 106.751587] [] ima_update_xattr+0x34/0x60
    [ 106.751612] [] ima_file_free+0xc0/0xd0
    [ 106.751637] [] __fput+0xd5/0x300
    [ 106.751662] [] ____fput+0xe/0x10
    [ 106.751687] [] task_work_run+0xc4/0xe0
    [ 106.751712] [] do_exit+0x2bd/0xa90
    [ 106.751738] [] ? retint_swapgs+0x13/0x1b
    [ 106.751763] [] do_group_exit+0x4c/0xc0
    [ 106.751788] [] SyS_exit_group+0x14/0x20
    [ 106.751814] [] system_call_fastpath+0x1a/0x1f
    [ 106.751839] Code: c3 0f 1f 44 00 00 55 48 89 e5 e8 22 fe ff ff 5d c3
    0f 1f 44 00 00 55 65 48 8b 04 25 c0 c9 00 00 48 8b 80 28 06 00 00 48 89
    e5 5d 8b 40 18 48 39 87 c0 00 00 00 0f 94 c0 c3 0f 1f 80 00 00 00
    [ 106.752185] RIP [] our_mnt+0x1a/0x30
    [ 106.752214] RSP
    [ 106.752236] CR2: 0000000000000018
    [ 106.752258] ---[ end trace 3c520748b4732721 ]---
    ----------------------------------------------------------------------

    The reason for the oops is that IMA-appraisal uses kernel_read() when a
    file is closed. kernel_read() honors the LSM security hook, which calls
    the AppArmor handler, which in turn uses current->nsproxy->mnt_ns. The
    'guilty' commit changed the order of the cleanup code so that
    nsproxy->mnt_ns is no longer available to AppArmor at that point.

    Discussion of the issue with Al Viro and Eric W. Biederman suggested
    that kernel_read() is too high-level for IMA. Besides security
    checking, another issue that was identified is mandatory locking:
    kernel_read() honors it as well, and it might prevent IMA from
    calculating the necessary hash. It was suggested to use a simplified
    version of the function without security and locking checks.

    This patch introduces a special version, ima_kernel_read(), which skips
    the security and mandatory locking checks. It prevents the kernel oops
    from happening.
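
    The failure mode can be illustrated with a small userspace model. This
    is only a sketch: every toy_* name is hypothetical, and the functions
    imitate the shape of the two read paths rather than the kernel's actual
    code. The checked path needs task->nsproxy, which is already gone while
    the task exits; the unchecked path never looks at it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for kernel objects; not the real definitions. */
struct toy_nsproxy { int mnt_ns_id; };
struct toy_task { struct toy_nsproxy *nsproxy; };

/* Models an LSM hook that needs the caller's mount namespace.
 * Returns 0 if the check can run, -1 where the real kernel oopsed. */
int toy_security_check(const struct toy_task *task)
{
    assert(task != NULL);
    if (task->nsproxy == NULL)
        return -1;
    return 0;
}

/* kernel_read()-like path: the security hook runs before the copy. */
long toy_checked_read(const struct toy_task *task, const char *src,
                      char *dst, size_t len)
{
    if (toy_security_check(task) != 0)
        return -1;
    memcpy(dst, src, len);
    return (long)len;
}

/* ima_kernel_read()-like path: copies without consulting the hook. */
long toy_unchecked_read(const char *src, char *dst, size_t len)
{
    memcpy(dst, src, len);
    return (long)len;
}
```

    With a task whose nsproxy is NULL, toy_checked_read() fails while
    toy_unchecked_read() still copies the data.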

    Signed-off-by: Dmitry Kasatkin
    Suggested-by: Eric W. Biederman
    Signed-off-by: Mimi Zohar
    Cc:

    Dmitry Kasatkin
     
  • Calculating the 'security.evm' HMAC value requires access to the
    EVM encrypted key. Only the kernel should have access to it. This
    patch prevents userspace tools (e.g. setfattr, cp --preserve=xattr)
    from setting/modifying the 'security.evm' HMAC value directly.

    Signed-off-by: Mimi Zohar
    Cc:

    Mimi Zohar
     
  • When IMA did not support ima-appraisal, the existence of the S_IMA flag
    clearly indicated that the file was measured. With IMA appraisal, the
    S_IMA flag indicates that the file was measured and/or appraised.
    Because of this, when measurement is not enabled by the policy,
    violations are still reported.

    To differentiate between measurement and appraisal policies this
    patch checks the inode integrity cache flags. The IMA_MEASURED
    flag indicates whether the file was actually measured, while the
    IMA_MEASURE flag indicates whether the file should be measured.
    Unfortunately, the IMA_MEASURED flag is reset to indicate the file
    needs to be re-measured. Thus, this patch checks the IMA_MEASURE
    flag.
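
    A minimal sketch of why the check uses IMA_MEASURE rather than
    IMA_MEASURED. The flag names come from the description above; the bit
    values, struct toy_iint, and both function names are assumptions made
    for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit values are illustrative assumptions, not the kernel's definitions. */
#define IMA_MEASURE  0x1u  /* policy says this file should be measured */
#define IMA_MEASURED 0x2u  /* a measurement has actually been taken */

struct toy_iint { unsigned int flags; };  /* hypothetical cache entry */

/* A violation is only worth reporting for files subject to measurement. */
bool should_report_violation(const struct toy_iint *iint)
{
    assert(iint != NULL);
    return (iint->flags & IMA_MEASURE) != 0;
}

/* Models IMA_MEASURED being cleared when the file must be re-measured. */
void mark_for_remeasure(struct toy_iint *iint)
{
    assert(iint != NULL);
    iint->flags &= ~IMA_MEASURED;
}
```

    After mark_for_remeasure() the IMA_MEASURED bit is gone, but
    should_report_violation() still returns true because IMA_MEASURE
    survives the reset.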

    This patch limits the false positive violation reports, but does
    not eliminate them entirely. The IMA_MEASURE/IMA_MEASURED flags are
    indications that, at some point in time, the file opened for read
    was in policy, but might not be in policy now (e.g. different uid).
    Other changes would be needed to further limit false positive
    violation reports.

    Changelog:
    - expanded patch description based on conversation with Roberto (Mimi)

    Signed-off-by: Dmitry Kasatkin
    Signed-off-by: Mimi Zohar

    Dmitry Kasatkin
     
  • ima_rdwr_violation_check() is called on every file open. The
    function checks the policy even when the violation condition is
    not met, which causes unnecessary policy checking.

    This patch performs the policy check only if the violation condition is met.
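
    The reordering can be sketched as follows. This is a simplified model
    that only covers a file opened for read while writers exist; struct
    toy_inode, the counter, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
    int writecount;   /* writers currently holding the file open */
    bool in_policy;   /* answer the (expensive) policy lookup would give */
};

int policy_lookups;   /* counts how often the policy was consulted */

/* Stand-in for the policy walk the patch wants to avoid. */
bool toy_policy_lookup(const struct toy_inode *inode)
{
    policy_lookups++;
    return inode->in_policy;
}

/* Cheap condition first; consult the policy only if it holds. */
bool toy_read_open_violation(const struct toy_inode *inode)
{
    assert(inode != NULL);
    if (inode->writecount <= 0)
        return false;
    return toy_policy_lookup(inode);
}
```

    For an inode without writers the counter stays at zero, so the common
    case never pays for the policy lookup.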

    Changelog:
    - check writecount is greater than zero (Mimi)

    Signed-off-by: Dmitry Kasatkin
    Signed-off-by: Mimi Zohar

    Dmitry Kasatkin
     
  • Newer versions of SMACK introduced the following security xattrs:
    SMACK64EXEC, SMACK64TRANSMUTE and SMACK64MMAP.

    To protect these xattrs, this patch includes them in the HMAC
    calculation. However, for backwards compatibility with existing
    labeled filesystems, including these xattrs needs to be
    configurable.

    Changelog:
    - Add SMACK dependency on new option (Mimi)

    Signed-off-by: Dmitry Kasatkin
    Signed-off-by: Mimi Zohar

    Dmitry Kasatkin
     
  • Using an HMAC version limits the possibility of arbitrarily adding new
    attributes such as SMACK64EXEC to the HMAC calculation.

    This patch replaces the HMAC version with an attribute mask.
    Desired attributes can be enabled with a configuration parameter.
    This allows building kernels that work with previously labeled
    filesystems.

    The only attribute currently supported is 'fsuuid', which is
    equivalent to the former version 2.
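
    A minimal sketch of a mask-driven calculation. It is not EVM's code:
    FNV-1a stands in for the real HMAC, and TOY_ATTR_FSUUID, its bit value,
    and toy_digest() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ATTR_FSUUID 0x0001u  /* assumed bit; enables mixing in the UUID */

/* FNV-1a, used here only as a stand-in for the real HMAC. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Each optional attribute is included only if its mask bit is set. */
uint64_t toy_digest(unsigned int attr_mask, uint32_t ino,
                    const unsigned char fsuuid[16])
{
    uint64_t h = 14695981039346656037ULL;

    assert(fsuuid != NULL);
    h = fnv1a(h, &ino, sizeof(ino));
    if (attr_mask & TOY_ATTR_FSUUID)
        h = fnv1a(h, fsuuid, 16);
    return h;
}
```

    With a mask of 0 the result ignores the filesystem UUID, which is what
    keeps previously labeled filesystems verifiable. A new attribute would
    take another bit instead of another version number.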

    Signed-off-by: Dmitry Kasatkin
    Signed-off-by: Mimi Zohar

    Dmitry Kasatkin
     
  • Even though a new xattr will only be appraised on the next access,
    set the DIGSIG flag to prevent a signature from being replaced with
    a hash on file close.

    Signed-off-by: Mimi Zohar

    Mimi Zohar
     
  • Pull networking updates from David Miller:

    1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

    2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

    3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

    4) BPF now has a "random" opcode, from Chema Gonzalez.

    5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

    6) Support TCP fastopen over ipv6, from Daniel Lee.

    7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

    8) Support software TSO in fec driver too, from Nimrod Andy.

    9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

    10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

    11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

    12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

    13) Support busy polling in SCTP, from Neal Horman.

    14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

    15) Bridge promisc mode handling improvements from Vlad Yasevich.

    16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
    rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
    tcp: fixing TLP's FIN recovery
    net: fec: Add software TSO support
    net: fec: Add Scatter/gather support
    net: fec: Increase buffer descriptor entry number
    net: fec: Factorize feature setting
    net: fec: Enable IP header hardware checksum
    net: fec: Factorize the .xmit transmit function
    bridge: fix compile error when compiling without IPv6 support
    bridge: fix smatch warning / potential null pointer dereference
    via-rhine: fix full-duplex with autoneg disable
    bnx2x: Enlarge the dorq threshold for VFs
    bnx2x: Check for UNDI in uncommon branch
    bnx2x: Fix 1G-baseT link
    bnx2x: Fix link for KR with swapped polarity lane
    sctp: Fix sk_ack_backlog wrap-around problem
    net/core: Add VF link state control policy
    net/fsl: xgmac_mdio is dependent on OF_MDIO
    net/fsl: Make xgmac_mdio read error message useful
    net_sched: drr: warn when qdisc is not work conserving
    ...

    Linus Torvalds
     

11 Jun, 2014

1 commit

  • Pull security layer updates from Serge Hallyn:
    "This is a merge of James Morris' security-next tree from 3.14 to
    yesterday's master, plus four patches from Paul Moore which are in
    linux-next, plus one patch from Mimi"

    * 'serge-next-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux-security:
    ima: audit log files opened with O_DIRECT flag
    selinux: conditionally reschedule in hashtab_insert while loading selinux policy
    selinux: conditionally reschedule in mls_convert_context while loading selinux policy
    selinux: reject setexeccon() on MNT_NOSUID applications with -EACCES
    selinux: Report permissive mode in avc: denied messages.
    Warning in scanf string typing
    Smack: Label cgroup files for systemd
    Smack: Verify read access on file open - v3
    security: Convert use of typedef ctl_table to struct ctl_table
    Smack: bidirectional UDS connect check
    Smack: Correctly remove SMACK64TRANSMUTE attribute
    SMACK: Fix handling value==NULL in post setxattr
    bugfix patch for SMACK
    Smack: adds smackfs/ptrace interface
    Smack: unify all ptrace accesses in the smack
    Smack: fix the subject/object order in smack_ptrace_traceme()
    Minor improvement of 'smack_sb_kern_mount'
    smack: fix key permission verification
    KEYS: Move the flags representing required permission to linux/key.h

    Linus Torvalds
     

10 Jun, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     

04 Jun, 2014

5 commits

  • Files are measured or appraised based on the IMA policy. When a
    file, in policy, is opened with the O_DIRECT flag, a deadlock
    occurs.

    The first attempt at resolving this lockdep temporarily removed the
    O_DIRECT flag and restored it, after calculating the hash. The
    second attempt introduced the O_DIRECT_HAVELOCK flag. Based on this
    flag, do_blockdev_direct_IO() would skip taking the i_mutex a second
    time. The third attempt, by Dmitry Kasatkin, resolves the i_mutex
    locking issue, by re-introducing the IMA mutex, but uncovered
    another problem. Reading a file with the O_DIRECT flag set writes
    directly to userspace pages. A second patch allocates user-space-like
    memory. This works for all IMA hooks except ima_file_free(),
    which is called on __fput() to recalculate the file hash.

    Until this last issue is addressed, do not 'collect' the
    measurement for measuring, appraising, or auditing files opened
    with the O_DIRECT flag set. Based on policy, permit or deny file
    access. This patch defines a new IMA policy rule option named
    'permit_directio'. Policy rules could be defined, based on LSM
    or other criteria, to permit specific applications to open files
    with the O_DIRECT flag set.

    Changelog v1:
    - permit or deny file access based on IMA policy rules

    Signed-off-by: Mimi Zohar
    Acked-by: Dmitry Kasatkin
    Cc:

    Mimi Zohar
     
  • After silencing the sleeping warning in mls_convert_context() I started
    seeing similar traces from hashtab_insert. Do a cond_resched there too.

    Signed-off-by: Dave Jones
    Acked-by: Stephen Smalley
    Signed-off-by: Paul Moore

    Dave Jones
     
  • On a slow machine (with debugging enabled), upgrading selinux policy may take
    a considerable amount of time. Long enough that the softlockup detector
    gets triggered.

    The backtrace looks like this..

    > BUG: soft lockup - CPU#2 stuck for 23s! [load_policy:19045]
    > Call Trace:
    > [] symcmp+0xf/0x20
    > [] hashtab_search+0x47/0x80
    > [] mls_convert_context+0xdc/0x1c0
    > [] convert_context+0x378/0x460
    > [] ? security_context_to_sid_core+0x240/0x240
    > [] sidtab_map+0x45/0x80
    > [] security_load_policy+0x3ff/0x580
    > [] ? sched_clock_cpu+0xa8/0x100
    > [] ? sched_clock_local+0x1d/0x80
    > [] ? sched_clock_cpu+0xa8/0x100
    > [] ? __change_page_attr_set_clr+0x82a/0xa50
    > [] ? sched_clock_local+0x1d/0x80
    > [] ? sched_clock_cpu+0xa8/0x100
    > [] ? __change_page_attr_set_clr+0x82a/0xa50
    > [] ? sched_clock_cpu+0xa8/0x100
    > [] ? retint_restore_args+0xe/0xe
    > [] ? trace_hardirqs_on_caller+0xfd/0x1c0
    > [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    > [] ? rcu_irq_exit+0x68/0xb0
    > [] ? retint_restore_args+0xe/0xe
    > [] sel_write_load+0xa7/0x770
    > [] ? vfs_write+0x1c3/0x200
    > [] ? security_file_permission+0x1e/0xa0
    > [] vfs_write+0xbb/0x200
    > [] ? fget_light+0x397/0x4b0
    > [] SyS_write+0x47/0xa0
    > [] tracesys+0xdd/0xe2

    Stephen Smalley suggested:

    > Maybe put a cond_resched() within the ebitmap_for_each_positive_bit()
    > loop in mls_convert_context()?

    That seems to do the trick. Tested by downgrading and re-upgrading selinux-policy-targeted.

    Signed-off-by: Dave Jones
    Acked-by: Stephen Smalley
    Signed-off-by: Paul Moore

    Dave Jones
     
  • We presently prevent processes from using setexeccon() to set the
    security label of exec()'d processes when NO_NEW_PRIVS is enabled by
    returning an error; however, we silently ignore setexeccon() when
    exec()'ing from a nosuid mounted filesystem. This patch makes things
    a bit more consistent by returning an error in the setexeccon()/nosuid
    case.

    Signed-off-by: Paul Moore
    Acked-by: Andy Lutomirski
    Acked-by: Stephen Smalley

    Paul Moore
     
  • We cannot presently tell from an avc: denied message whether access was in
    fact denied or was allowed due to global or per-domain permissive mode.
    Add a permissive= field to the avc message to reflect this information.

    Signed-off-by: Stephen Smalley
    Acked-by: Eric Paris
    Signed-off-by: Paul Moore

    Stephen Smalley
     

24 May, 2014

1 commit

  • Conflicts:
    drivers/net/bonding/bond_alb.c
    drivers/net/ethernet/altera/altera_msgdma.c
    drivers/net/ethernet/altera/altera_sgdma.c
    net/ipv6/xfrm6_output.c

    Several cases of overlapping changes.

    The xfrm6_output.c has a bug fix which overlaps the renaming
    of skb->local_df to skb->ignore_df.

    In the Altera TSE driver cases, the register access cleanups
    in net-next overlapped with bug fixes done in net.

    Similarly a bug fix to send ALB packets in the bonding driver using
    the right source address overlaps with cleanups in net-next.

    Signed-off-by: David S. Miller

    David S. Miller
     

20 May, 2014

1 commit


17 May, 2014

3 commits

  • devcgroup_update_access() wants to know whether there are child
    cgroups which are online and visible to userland, and has_children()
    may return a false positive. Replace it with css_has_online_children().

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Serge Hallyn
    Acked-by: Li Zefan

    Tejun Heo
     
  • Currently, devcg::has_children() directly tests cgroup->children for
    list emptiness. The field is not a published field and scheduled to
    go away. In addition, the test isn't strictly correct as devcg should
    only care about children which are visible to userland.

    This patch converts has_children() to use css_next_child() instead.
    The subtle incorrectness is noted and will be dealt with later.

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Serge Hallyn
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup in general is moving towards using cgroup_subsys_state as the
    fundamental structural component and css_parent() was introduced to
    convert from using cgroup->parent to css->parent. It was quite some
    time ago and we're moving forward with making css more prominent.

    This patch drops the trivial wrapper css_parent() and lets the users
    dereference css->parent. While at it, explicitly mark fields of css
    which are public and immutable.

    v2: New usage from device_cgroup.c converted.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Johannes Weiner

    Tejun Heo
     

14 May, 2014

1 commit

  • Convert all cftype->write_string() users to the new cftype->write()
    which maps directly to kernfs write operation and has full access to
    kernfs and cgroup contexts. The conversions are mostly mechanical.

    * @css and @cft are accessed using of_css() and of_cft() accessors
    respectively instead of being specified as arguments.

    * Should return @nbytes on success instead of 0.

    * @buf is not trimmed automatically. Trim if necessary. Note that
    blkcg and netprio don't need this as the parsers already handle
    whitespaces.
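
    The conversion rules listed above can be sketched with a userspace
    model of a converted handler. All toy_* names are hypothetical and the
    signature is simplified; toy_strstrip() only imitates what the kernel's
    strstrip() does.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Trim leading and trailing whitespace in place. */
char *toy_strstrip(char *s)
{
    size_t len;

    assert(s != NULL);
    while (isspace((unsigned char)*s))
        s++;
    len = strlen(s);
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        s[--len] = '\0';
    return s;
}

char toy_state[32];   /* value stored by the handler */

/* Converted-style handler: trims @buf itself, returns @nbytes on success. */
ssize_t toy_write(char *buf, size_t nbytes)
{
    char *value = toy_strstrip(buf);

    if (strlen(value) >= sizeof(toy_state))
        return -EINVAL;
    strcpy(toy_state, value);
    return (ssize_t)nbytes;   /* not 0, unlike the old write_string() */
}
```

    Writing "FROZEN\n" stores "FROZEN" and reports all 7 bytes as consumed.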

    cftype->write_string() has no users left after the conversions and
    is removed.

    While at it, remove unnecessary local variable @p in
    cgroup_subtree_control_write() and stale comment about
    CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

    This patch doesn't introduce any visible behavior changes.

    v2: netprio was missing from conversion. Converted.

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Neil Horman
    Cc: "David S. Miller"

    Tejun Heo
     

13 May, 2014

2 commits

  • Pull cgroup fixes from Tejun Heo:
    "During recent restructuring, device_cgroup unified config input check
    and enforcement logic; unfortunately, it turned out to share too much.
    Aristeu's patches fix the breakage and are marked for -stable backport.

    The other two patches are fallouts from kernfs conversion. The blkcg
    change is temporary and will go away once kernfs internal locking gets
    simplified (patches pending)"

    * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()
    device_cgroup: check if exception removal is allowed
    device_cgroup: fix the comment format for recently added functions
    device_cgroup: rework device access check and exception checking
    cgroup: fix the retry path of cgroup_mount()

    Linus Torvalds
     
  • Conflicts:
    drivers/net/ethernet/altera/altera_sgdma.c
    net/netlink/af_netlink.c
    net/sched/cls_api.c
    net/sched/sch_api.c

    The netlink conflict dealt with moving to netlink_capable() and
    netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations
    in non-init namespaces. These were simple transformations from
    netlink_capable to netlink_ns_capable.

    The Altera driver conflict was simply code removal overlapping some
    void pointer cast cleanups in net-next.

    Signed-off-by: David S. Miller

    David S. Miller
     

07 May, 2014

3 commits

  • Pull vfs fixes from Al Viro:
    "dcache fixes + kvfree() (uninlined, exported by mm/util.c) + posix_acl
    bugfix from hch"

    The dcache fixes are for a subtle LRU list corruption bug reported by
    Miklos Szeredi, where people inside IBM saw list corruptions with the
    LTP/host01 test.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    nick kvfree() from apparmor
    posix_acl: handle NULL ACL in posix_acl_equiv_mode
    dcache: don't need rcu in shrink_dentry_list()
    more graceful recovery in umount_collect()
    don't remove from shrink list in select_collect()
    dentry_kill(): don't try to remove from shrink list
    expand the call of dentry_lru_del() in dentry_kill()
    new helper: dentry_free()
    fold try_prune_one_dentry()
    fold d_kill() and d_free()
    fix races between __d_instantiate() and checks of dentry flags

    Linus Torvalds
     
  • This fixes a warning about the mismatch of types between
    the declared unsigned and integer.

    Signed-off-by: Toralf Förster

    Toralf Förster
     
  • too many places open-code it

    Signed-off-by: Al Viro

    Al Viro
     

05 May, 2014

2 commits

  • [PATCH v3 1/2] device_cgroup: check if exception removal is allowed

    When the device cgroup hierarchy was introduced in
    bd2953ebbb53 - devcg: propagate local changes down the hierarchy

    a specific case was overlooked. Consider the hierarchy below:

    A   default policy: ALLOW, exceptions will deny access
     \
      B default policy: ALLOW, exceptions will deny access

    There's no need to verify when a new exception is added to B because
    in this case exceptions will deny access to further devices, which is
    always fine. Hierarchy in device cgroup only makes sure B won't have
    more access than A.

    But when an exception is removed (by writing devices.allow), it isn't
    checked if the user is in fact removing an inherited exception from A,
    thus giving more access to B.

    Example:

    # echo 'a' >A/devices.allow
    # echo 'c 1:3 rw' >A/devices.deny
    # echo $$ >A/B/tasks
    # echo >/dev/null
    -bash: /dev/null: Operation not permitted
    # echo 'c 1:3 w' >A/B/devices.allow
    # echo >/dev/null
    #

    This shouldn't be allowed, and this patch fixes it by never allowing an
    exception to be removed in this case if it is partially or fully
    present on the parent.
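
    A minimal sketch of such a check, assuming a simplified exception
    record. The toy_* names and the numeric constants are hypothetical, and
    the overlap test is an illustration of "partially or fully present",
    not the function added by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DEV_BLOCK 1
#define TOY_DEV_CHAR  2
#define TOY_ACC_READ  2
#define TOY_ACC_WRITE 4
#define TOY_ANY (~0u)   /* wildcard major or minor number */

struct toy_exception {
    unsigned int major, minor;
    short type;     /* TOY_DEV_BLOCK or TOY_DEV_CHAR */
    short access;   /* bitmask of TOY_ACC_* */
};

/* True if the two exceptions share a device and at least one access bit. */
bool toy_overlaps(const struct toy_exception *a, const struct toy_exception *b)
{
    if (!(a->type & b->type))
        return false;
    if (a->major != TOY_ANY && b->major != TOY_ANY && a->major != b->major)
        return false;
    if (a->minor != TOY_ANY && b->minor != TOY_ANY && a->minor != b->minor)
        return false;
    return (a->access & b->access) != 0;
}

/* Removal is refused if any parent exception overlaps the removed one. */
bool toy_removal_allowed(const struct toy_exception *parent, size_t count,
                         const struct toy_exception *removed)
{
    assert(removed != NULL);
    for (size_t i = 0; i < count; i++)
        if (toy_overlaps(&parent[i], removed))
            return false;
    return true;
}
```

    With A holding 'c 1:3 rw', removing 'c 1:3 w' in B is refused, while
    removing an exception for an unrelated device is still allowed.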

    v3: missing '*' in function description
    v2: improved log message and formatting fixes

    Cc: cgroups@vger.kernel.org
    Cc: Li Zefan
    Cc: stable@vger.kernel.org
    Signed-off-by: Aristeu Rozanski
    Acked-by: Serge Hallyn
    Signed-off-by: Tejun Heo

    Aristeu Rozanski
     
  • Moving more extensive explanations to the end of the comment.

    Cc: Li Zefan
    Signed-off-by: Aristeu Rozanski
    Acked-by: Serge Hallyn
    Signed-off-by: Tejun Heo

    Aristeu Rozanski
     

01 May, 2014

1 commit

  • The cgroup filesystem isn't ready for an LSM to
    properly use extended attributes. This patch makes
    files created in the cgroup filesystem usable by
    a system running Smack and systemd.

    Targeted for git://git.gitorious.org/smack-next/kernel.git

    Signed-off-by: Casey Schaufler

    Casey Schaufler
     

23 Apr, 2014

2 commits


22 Apr, 2014

2 commits

  • File-private locks have been merged into Linux for v3.15, and *now*
    people are commenting that the name and macro definitions for the new
    file-private locks suck.

    ...and I can't even disagree. The names and command macros do suck.

    We're going to have to live with these for a long time, so it's
    important that we be happy with the names before we're stuck with them.
    The consensus on the lists so far is that they should be rechristened as
    "open file description locks".

    The name isn't a big deal for the kernel, but the command macros are not
    visually distinct enough from the traditional POSIX lock macros. The
    glibc and documentation folks are recommending that we change them to
    look like F_OFD_{GETLK|SETLK|SETLKW}. That lessens the chance that a
    programmer will typo one of the commands wrong, and also makes it easier
    to spot this difference when reading code.
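
    For illustration, a userspace sketch using one of the renamed
    commands. It assumes Linux 3.15 or later and a libc whose <fcntl.h>
    exposes F_OFD_SETLK under _GNU_SOURCE; ofd_write_lock() is a
    hypothetical helper name.

```c
#define _GNU_SOURCE   /* assumed to be required for the F_OFD_* macros */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Take a non-blocking whole-file write lock owned by the open file
 * description behind @fd. Returns 0 on success, -1 with errno set. */
int ofd_write_lock(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof(fl));   /* l_pid must be 0 for OFD commands */
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                 /* 0 means "to end of file" */
    return fcntl(fd, F_OFD_SETLK, &fl);
}
```

    Because the lock belongs to the open file description, a second
    descriptor obtained by a separate open() of the same file conflicts
    with it even within one process, which traditional POSIX locks
    (F_SETLK) do not do.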

    This patch makes the following changes that I think are necessary before
    v3.15 ships:

    1) rename the command macros to their new names. These end up in the uapi
    headers and so are part of the external-facing API. It turns out that
    glibc doesn't actually use the fcntl.h uapi header, but it's hard to
    be sure that something else won't. Changing it now is safest.

    2) make the /proc/locks output display these as type "OFDLCK"

    Cc: Michael Kerrisk
    Cc: Christoph Hellwig
    Cc: Carlos O'Donell
    Cc: Stefan Metzmacher
    Cc: Andy Lutomirski
    Cc: Frank Filz
    Cc: Theodore Ts'o
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Whenever a device file is opened and checked against the current device
    cgroup rules, it uses the same function (may_access()) as when a new
    exception rule is added by writing devices.{allow,deny}. In both
    cases the algorithm is the same, regardless of the behavior.

    The first problem is that device access is treated the same as rule
    checking. Consider the following structure:

    A   (default behavior: allow, exceptions disallow access)
     \
      B (default behavior: allow, exceptions disallow access)

    A new exception is added to B by writing devices.deny:

    c 12:34 rw

    When checking if that exception is allowed in may_access():

    if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) {
            if (behavior == DEVCG_DEFAULT_ALLOW) {
                    /* the exception will deny access to certain devices */
                    return true;

    Which is ok: since B is not getting more privileges than A, the rule
    is accepted.

    Now, consider it's a device file open check and the process belongs to
    cgroup B. The access will be generated as:

    behavior: allow
    exception: c 12:34 rw

    The very same chunk of code will allow it, even if there's an explicit
    exception telling to do otherwise.

    A simple test case:

    # mkdir new_group
    # cd new_group
    # echo $$ >tasks
    # echo "c 1:3 w" >devices.deny
    # echo >/dev/null
    # echo $?
    0

    This is a serious bug and was introduced in

    c39a2a3018f8 devcg: prepare may_access() for hierarchy support

    To solve this problem, the device file open function was split from the
    new exception check.

    The second problem is how exceptions are processed by may_access(). The
    first part of that function tries to match fully with an existing
    exception:

    list_for_each_entry_rcu(ex, &dev_cgroup->exceptions, list) {
            if ((refex->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
                    continue;
            if ((refex->type & DEV_CHAR) && !(ex->type & DEV_CHAR))
                    continue;
            if (ex->major != ~0 && ex->major != refex->major)
                    continue;
            if (ex->minor != ~0 && ex->minor != refex->minor)
                    continue;
            if (refex->access & (~ex->access))
                    continue;
            match = true;
            break;
    }

    That means the new exception should be contained in an existing one to
    be considered a match:

    New exception    Existing       match?  notes
    b 12:34 rwm      b 12:34 rwm    yes
    b 12:34 r        b *:34 rw      yes
    b 12:34 rw       b 12:34 w      no      extra "r"
    b *:34 rw        b 12:34 rw     no      too broad "*"
    b *:34 rw        b *:34 rwm     yes

    Which is fine in some cases. Consider:

    A (default behavior: deny, exceptions allow access)
    \
    B (default behavior: deny, exceptions allow access)

    In this case the full match makes sense: the new exception cannot add
    more access than the parent allows.

    But this doesn't always work. Consider:

    A (default behavior: allow, exceptions disallow access)
    \
    B (default behavior: deny, exceptions allow access)

    In this case, a new exception in B shouldn't match any of the exceptions
    in A; after all, you can't allow something that was forbidden by A. But
    consider this scenario:

    New exception    Existing in A   match?  outcome
    b 12:34 rw       b 12:34 r       no      exception is accepted

    Because the new exception has "w" as extra, it doesn't match, so it'll
    be added to B's exception list.

    The same problem can happen during a file access check. Consider a
    cgroup with allow as default behavior:

    Access        Exception    match?
    b 12:34 rw    b 12:34 r    no

    In this case, the access didn't match any of the exceptions in the
    cgroup, and not matching is exactly what is required for access to be
    granted, since exceptions disallow access.

    To solve this problem, two new functions were created to match an
    exception either fully or partially. In the example above, a partial
    check will be performed and it'll produce a match since at least
    "b 12:34 r" from "b 12:34 rw" access matches.
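
    The two kinds of match can be sketched in user space as follows. This is
    an illustration under stated assumptions rather than the patch itself:
    the names match_full() and match_partial(), the DEV_ANY wildcard macro,
    the flag values and the array-based exception list are all hypothetical,
    and the wildcard handling in the partial check is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; not taken from the kernel headers. */
#define DEV_BLOCK 1
#define DEV_CHAR  2
#define ACC_MKNOD 1
#define ACC_READ  2
#define ACC_WRITE 4
#define DEV_ANY   (~0u)   /* wildcard major or minor, "*" in the tables */

struct dev_exception {
    unsigned int major, minor;
    short type;
    short access;
};

static bool type_matches(const struct dev_exception *ex, short type)
{
    if ((type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
        return false;
    if ((type & DEV_CHAR) && !(ex->type & DEV_CHAR))
        return false;
    return true;
}

/* Full match: ref must be entirely contained in one existing exception. */
static bool match_full(const struct dev_exception *list, size_t n,
                       const struct dev_exception *ref)
{
    for (size_t i = 0; i < n; i++) {
        const struct dev_exception *ex = &list[i];

        if (!type_matches(ex, ref->type))
            continue;
        if (ex->major != DEV_ANY && ex->major != ref->major)
            continue;
        if (ex->minor != DEV_ANY && ex->minor != ref->minor)
            continue;
        if (ref->access & ~ex->access)   /* ref asks for extra bits */
            continue;
        return true;
    }
    return false;
}

/* Partial match: ref overlaps an existing exception in at least one bit. */
static bool match_partial(const struct dev_exception *list, size_t n,
                          const struct dev_exception *ref)
{
    for (size_t i = 0; i < n; i++) {
        const struct dev_exception *ex = &list[i];

        if (!type_matches(ex, ref->type))
            continue;
        /* Assumption: a wildcard on either side overlaps any number. */
        if (ex->major != DEV_ANY && ref->major != DEV_ANY &&
            ex->major != ref->major)
            continue;
        if (ex->minor != DEV_ANY && ref->minor != DEV_ANY &&
            ex->minor != ref->minor)
            continue;
        if (!(ref->access & ex->access)) /* no access bit in common */
            continue;
        return true;
    }
    return false;
}
```

    For the access "b 12:34 rw" against the single exception "b 12:34 r",
    match_full() reports no match because of the extra "w", while
    match_partial() reports a match because "r" is shared.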

    Cc: cgroups@vger.kernel.org
    Cc: Tejun Heo
    Cc: Serge Hallyn
    Cc: Li Zefan
    Cc: stable@vger.kernel.org
    Signed-off-by: Aristeu Rozanski
    Signed-off-by: Tejun Heo

    Aristeu Rozanski
     

15 Apr, 2014

1 commit


14 Apr, 2014

2 commits


13 Apr, 2014

2 commits

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     
  • Pull audit updates from Eric Paris.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
    audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
    audit: do not cast audit_rule_data pointers pointlesly
    AUDIT: Allow login in non-init namespaces
    audit: define audit_is_compat in kernel internal header
    kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
    sched: declare pid_alive as inline
    audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
    syscall_get_arch: remove useless function arguments
    audit: remove stray newline from audit_log_execve_info() audit_panic() call
    audit: remove stray newlines from audit_log_lost messages
    audit: include subject in login records
    audit: remove superfluous new- prefix in AUDIT_LOGIN messages
    audit: allow user processes to log from another PID namespace
    audit: anchor all pid references in the initial pid namespace
    audit: convert PPIDs to the inital PID namespace.
    pid: get pid_t ppid of task in init_pid_ns
    audit: rename the misleading audit_get_context() to audit_take_context()
    audit: Add generic compat syscall support
    audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
    ...

    Linus Torvalds
     

12 Apr, 2014

1 commit

  • Smack IPC policy requires that the sender have write access
    to the receiver. UDS streams don't do per-packet checks. The
    only check is done at connect time. The existing code checks
    if the connecting process can write to the other, but not the
    other way around. This change adds a check that the other end
    can write to the connecting process.
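
    The two-way check can be illustrated with a toy model. This is only a
    sketch: the rule table, may_write() and stream_connect_allowed() are
    hypothetical names, not Smack kernel interfaces, and the rule that equal
    labels always have access is a simplifying assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule: tasks labeled `subject` may write to `object`. */
struct write_rule {
    const char *subject;
    const char *object;
};

static bool may_write(const struct write_rule *rules, size_t n,
                      const char *subject, const char *object)
{
    if (strcmp(subject, object) == 0)
        return true;   /* toy assumption: equal labels always have access */
    for (size_t i = 0; i < n; i++)
        if (strcmp(rules[i].subject, subject) == 0 &&
            strcmp(rules[i].object, object) == 0)
            return true;
    return false;
}

/*
 * Connect-time check for a stream socket: since no per-packet checks
 * follow, both directions have to be verified here.
 */
static bool stream_connect_allowed(const struct write_rule *rules, size_t n,
                                   const char *client, const char *server)
{
    return may_write(rules, n, client, server) &&
           may_write(rules, n, server, client);
}
```

    With only a "Client may write to Server" rule loaded, the one-way check
    passes but stream_connect_allowed() refuses the connection in this
    sketch; it succeeds once the reverse rule exists as well.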

    Targeted for git://git.gitorious.org/smack-next/kernel.git

    Signed-off-by: Casey Schaufler

    Casey Schaufler