01 Oct, 2013

1 commit

  • Recently commit bab55417b10c ("block: support embedded device command
    line partition") introduced CONFIG_CMDLINE_PARSER. However, that name
    is too generic and sounds like it enables/disables generic kernel boot
    arg processing, when it really is block specific.

    Before this option becomes a part of a full/final release, add the BLK_
    prefix to it so that it is clear in absence of any other context that it
    is block specific.

    In addition, fix up the following less critical items:
    - help text was not really at all helpful.
    - index file for Documentation was not updated
    - add the new arg to Documentation/kernel-parameters.txt
    - clarify wording in source comments

    Signed-off-by: Paul Gortmaker
    Cc: Jens Axboe
    Cc: Cai Zhiyong
    Cc: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     

23 Sep, 2013

2 commits

  • Pull block IO fixes from Jens Axboe:
    "After merge window, no new stuff this time only a collection of neatly
    confined and simple fixes"

    * 'for-3.12/core' of git://git.kernel.dk/linux-block:
    cfq: explicitly use 64bit divide operation for 64bit arguments
    block: Add nr_bios to block_rq_remap tracepoint
    If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up.
    bio-integrity: Fix use of bs->bio_integrity_pool after free
    blkcg: relocate root_blkg setting and clearing
    block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
    block: trace all devices plug operation

    Linus Torvalds
     
  • 'samples' is 64bit operant, but do_div() second parameter is 32.
    do_div silently truncates high 32 bits and calculated result
    is invalid.

    In case if low 32bit of 'samples' are zeros then do_div() produces
    kernel crash.

    Signed-off-by: Anatol Pomozov
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Anatol Pomozov
     

18 Sep, 2013

1 commit


15 Sep, 2013

1 commit

  • Matt found that commit 27a7c642174e ("partitions/efi: account for pmbr
    size in lba") caused his GPT formatted eMMC device not to boot. The
    reason is that this commit enforced Linux to always check the lesser of
    the whole disk or 2Tib for the pMBR size in LBA. While most disk
    partitioning tools out there create a pMBR with these characteristics,
    Microsoft does not, as it always sets the entry to the maximum 32-bit
    limitation - even though a drive may be smaller than that[1].

    Loosen this check and only verify that the size is either the whole disk
    or 0xFFFFFFFF. No tool in its right mind would set it to any value
    other than these.

    [1] http://thestarman.pcministry.com/asm/mbr/GPT.htm#GPTPT

    Reported-and-tested-by: Matt Porter
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

12 Sep, 2013

16 commits

  • With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
    one such possible user), the following race can happen:

    radix_tree_preload()
    ...
    radix_tree_insert()
    radix_tree_node_alloc()
    if (rtp->nr) {
    ret = rtp->nodes[rtp->nr - 1];

    ...
    radix_tree_preload()
    ...
    radix_tree_insert()
    radix_tree_node_alloc()
    if (rtp->nr) {
    ret = rtp->nodes[rtp->nr - 1];

    And we give out one radix tree node twice. That clearly results in radix
    tree corruption with different results (usually OOPS) depending on which
    two users of radix tree race.

    We fix the problem by making radix_tree_node_alloc() always allocate fresh
    radix tree nodes when in interrupt. Using preloading when in interrupt
    doesn't make sense since all the allocations have to be atomic anyway and
    we cannot steal nodes from process-context users because some users rely
    on radix_tree_insert() succeeding after radix_tree_preload().
    in_interrupt() check is somewhat ugly but we cannot simply key off passed
    gfp_mask as that is acquired from root_gfp_mask() and thus the same for
    all preload users.

    Another part of the fix is to avoid node preallocation in
    radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
    preallocation in such case doesn't make sense and when preallocation would
    happen in interrupt we could possibly leak some allocated nodes. However,
    some users of radix_tree_preload() require following radix_tree_insert()
    to succeed. To avoid unexpected effects for these users,
    radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
    and we provide a new function radix_tree_maybe_preload() for those users
    which get different gfp mask from different call sites and which are
    prepared to handle radix_tree_insert() failure.

    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Cc: Davidlohr Bueso
    Cc: Karel Zak
    Cc: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Trivial coding style cleanups - still plenty left.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • I love emacs, but these settings for coding style are annoying when trying
    to open the efi.h file. More important, we already have checkpatch for
    that.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • When verifying GPT header integrity, make sure that first usable LBA is
    smaller than last usable LBA.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The partition that has the 0xEE (GPT protective), must have the size in
    lba field set to the lesser of the size of the disk minus one or
    0xFFFFFFFF for larger disks.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • One of the biggest problems with GPT is compatibility with older, non-GPT
    systems. The problem is addressed by creating hybrid mbrs, an extension,
    or variant, of the traditional protective mbr. This contains, apart from
    the 0xEE partition, up three additional primary partitions that point to
    the same space marked by up to three GPT partitions. The result is that
    legacy OSs can see the three required MBR partitions and at the same time
    ignore the GPT-aware partitions that protect the GPT structures.

    While hybrid MBRs are hacks, workarounds and simply not part of the GPT
    standard, they do exist and we have no way around them. For instance, by
    default, OSX creates a hybrid scheme when using multi-OS booting.

    In order for Linux to properly discover protective MBRs, it must be made
    aware of devices that have hybrid MBRs. No functionality is changed by
    this patch, just a debug message informing the user of the MBR scheme that
    is being used.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • When detecting a valid protective MBR, the Linux kernel isn't picky about
    the partition (1-4) the 0xEE is at, but, unlike other operating systems,
    it does require it to begin at the second sector (sector 1). This check,
    apart from it not being enforced by UEFI, and causing Linux to potentially
    fail to detect any *valid* partitions on the disk, can present problems
    when dealing with hybrid MBRs[1].

    For compatibility reasons, if the first partition is hybridized, the 0xEE
    partition must be small enough to ensure that it only protects the GPT
    data structures - as opposed to the the whole disk in a protective MBR.
    This problem is very well described by Rod Smith[1]: where MBR-only
    partitioning programs (such as older versions of fdisk) can see some of
    the disk space as unallocated, thus loosing the purpose of the 0xEE
    partition's protection of GPT data structures.

    By dropping this check, this patch enables Linux to be more flexible when
    probing for GPT disklabels.

    [1] http://www.rodsbooks.com/gdisk/hybrid.html#reactions

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Per the UEFI Specs 2.4, June 2013, the starting lba of the partition that
    has the EFI GPT (0xEE) must be set to 0x00000001 - this is obviously the
    LBA of the GPT Partition Header.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The kernel's GPT implementation currently uses the generic 'struct
    partition' type for dealing with legacy MBR partition records. While this
    is is useful for disklabels that we designed for CHS addressing, such as
    msdos, it doesn't adapt well to newer standards that use LBA instead, such
    as GUID partition tables. Furthermore, these generic partition structures
    do not have all the required fields to properly follow the UEFI specs.

    While a CHS address can be translated to LBA, it's much simpler and
    cleaner to just replace the partition type. This patch adds a new
    'gpt_record' type that is fully compliant with EFI and will allow, in the
    next patches, to add more checks to properly verify a protective MBR,
    which is paramount to probing a device that makes use of GPT.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • I found the following pattern that leads in to interesting findings:

    grep -r "ret.*|=.*__put_user" *
    grep -r "ret.*|=.*__get_user" *
    grep -r "ret.*|=.*__copy" *

    The __put_user() calls in compat_ioctl.c, ptrace compat, signal compat,
    since those appear in compat code, we could probably expect the kernel
    addresses not to be reachable in the lower 32-bit range, so I think they
    might not be exploitable.

    For the "__get_user" cases, I don't think those are exploitable: the worse
    that can happen is that the kernel will copy kernel memory into in-kernel
    buffers, and will fail immediately afterward.

    The alpha csum_partial_copy_from_user() seems to be missing the
    access_ok() check entirely. The fix is inspired from x86. This could
    lead to information leak on alpha. I also noticed that many architectures
    map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I
    wonder if the latter is performing the access checks on every
    architectures.

    Signed-off-by: Mathieu Desnoyers
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Jens Axboe
    Cc: Oleg Nesterov
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     
  • Read block device partition table from command line. The partition used
    for fixed block device (eMMC) embedded device. It is no MBR, save
    storage space. Bootloader can be easily accessed by absolute address of
    data on the block device. Users can easily change the partition.

    This code reference MTD partition, source "drivers/mtd/cmdlinepart.c"
    About the partition verbose reference
    "Documentation/block/cmdline-partition.txt"

    [akpm@linux-foundation.org: fix printk text]
    [yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()]
    Signed-off-by: Cai Zhiyong
    Cc: Karel Zak
    Cc: "Wanglin (Albert)"
    Cc: Marius Groeger
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Brian Norris
    Cc: Artem Bityutskiy
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cai Zhiyong
     
  • The usage of strict_strtoul() is not preferred, because strict_strtoul()
    is obsolete. Thus, kstrtoul() should be used.

    Signed-off-by: Jingoo Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jingoo Han
     
  • Hello, Jens.

    The original thread can be read from

    http://thread.gmane.org/gmane.linux.kernel.cgroups/8937

    While it leads to oops, given that it only triggers under specific
    configurations which aren't common. I don't think it's necessary to
    backport it through -stable and merging it during the coming merge
    window should be enough.

    Thanks!

    ----- 8< -----
    Currently, q->root_blkg and q->root_rl.blkg are set from
    blkcg_activate_policy() and cleared from blkg_destroy_all(). This
    doesn't necessarily coincide with the lifetime of the root blkcg_gq
    leading to the following oops when blkcg is enabled but no policy is
    activated because __blk_queue_next_rl() malfunctions expecting the
    root_blkg pointers to be set.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] __wake_up_common+0x2b/0x90
    PGD 60f7a9067 PUD 60f4c9067 PMD 0
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    gsmi: Log Shutdown Reason 0x03
    Modules linked in: act_mirred cls_tcindex cls_prioshift sch_dsmark xt_multiport iptable_mangle sata_mv elephant elephant_dev_num cdc_acm uhci_hcd ehci_hcd i2c_d
    CPU: 9 PID: 41382 Comm: iSCSI-write- Not tainted 3.11.0-dbg-DEV #19
    Hardware name: Intel XXX
    task: ffff88060d16eec0 ti: ffff88060d170000 task.ti: ffff88060d170000
    RIP: 0010:[] [] __wake_up_common+0x2b/0x90
    RSP: 0000:ffff88060d171818 EFLAGS: 00010096
    RAX: 0000000000000082 RBX: ffff880baa3dee60 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff880baa3dee60
    RBP: ffff88060d171858 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000002 R12: ffff880baa3dee98
    R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
    FS: 00007f977cba6700(0000) GS:ffff880c79c60000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000000 CR3: 000000060f7a5000 CR4: 00000000000007e0
    Stack:
    0000000000000082 0000000000000000 ffff88060d171858 ffff880baa3dee60
    0000000000000082 0000000000000003 0000000000000000 0000000000000000
    ffff88060d171898 ffffffff810c7848 ffff88060d171888 ffff880bde4bc4b8
    Call Trace:
    [] __wake_up+0x48/0x70
    [] __blk_drain_queue+0x123/0x190
    [] blk_cleanup_queue+0xf5/0x210
    [] __scsi_remove_device+0x5a/0xd0
    [] scsi_remove_device+0x34/0x50
    [] scsi_remove_target+0x16b/0x220
    [] __iscsi_unbind_session+0xd1/0x1b0
    [] iscsi_remove_session+0xe2/0x1c0
    [] iscsi_destroy_session+0x16/0x60
    [] iscsi_session_teardown+0xd9/0x100
    [] iscsi_sw_tcp_session_destroy+0x5a/0xb0
    [] iscsi_if_rx+0x10e8/0x1560
    [] netlink_unicast+0x145/0x200
    [] netlink_sendmsg+0x303/0x410
    [] sock_sendmsg+0xa6/0xd0
    [] ___sys_sendmsg+0x38c/0x3a0
    [] ? fget_light+0x40/0x160
    [] ? fget_light+0x99/0x160
    [] ? fget_light+0x40/0x160
    [] __sys_sendmsg+0x49/0x90
    [] SyS_sendmsg+0x12/0x20
    [] system_call_fastpath+0x16/0x1b
    Code: 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 4c 8d 67 38 53 48 83 ec 18 89 55 c4 48 8b 57 38 4c 89 45 c8 8b 2a 48 8d 42 e8 49

    Fix it by moving r->root_blkg and q->root_rl.blkg setting to
    blkg_create() and clearing to blkg_destroy() so that they area
    initialized when a root blkg is created and cleared when destroyed.

    Reported-and-tested-by: Anatol Pomozov
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Use the helper function instead of __GFP_ZERO.

    Signed-off-by: Joe Perches
    Signed-off-by: Jens Axboe

    Joe Perches
     
  • In func blk_queue_bio, if list of plug is empty,it will call
    blk_trace_plug.
    If process deal with a single device,it't ok.But if process deal with
    multi devices,it only trace the first device.
    Using request_count to judge, it can soleve this problem.

    In addition, i modify the comment.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     

04 Sep, 2013

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on the cgroup front. Most changes aren't visible
    to userland at all at this point and are laying foundation for the
    planned unified hierarchy.

    - The biggest change is decoupling the lifetime management of css
    (cgroup_subsys_state) from that of cgroup's. Because controllers
    (cpu, memory, block and so on) will need to be dynamically enabled
    and disabled, css which is the association point between a cgroup
    and a controller may come and go dynamically across the lifetime of
    a cgroup. Till now, css's were created when the associated cgroup
    was created and stayed till the cgroup got destroyed.

    Assumptions around this tight coupling permeated through cgroup
    core and controllers. These assumptions are gradually removed,
    which consists bulk of patches, and css destruction path is
    completely decoupled from cgroup destruction path. Note that
    decoupling of creation path is relatively easy on top of these
    changes and the patchset is pending for the next window.

    - cgroup has its own event mechanism cgroup.event_control, which is
    only used by memcg. It is overly complex trying to achieve high
    flexibility whose benefits seem dubious at best. Going forward,
    new events will simply generate file modified event and the
    existing mechanism is being made specific to memcg. This pull
    request contains prepatory patches for such change.

    - Various fixes and cleanups"

    Fixed up conflict in kernel/cgroup.c as per Tejun.

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
    cgroup: fix cgroup_css() invocation in css_from_id()
    cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
    cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
    cgroup: implement CFTYPE_NO_PREFIX
    cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
    cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
    cgroup: fix cgroup_write_event_control()
    cgroup: fix subsystem file accesses on the root cgroup
    cgroup: change cgroup_from_id() to css_from_id()
    cgroup: use css_get() in cgroup_create() to check CSS_ROOT
    cpuset: remove an unncessary forward declaration
    cgroup: RCU protect each cgroup_subsys_state release
    cgroup: move subsys file removal to kill_css()
    cgroup: factor out kill_css()
    cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
    cgroup: replace cgroup->css_kill_cnt with ->nr_css
    cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
    cgroup: move cgroup->subsys[] assignment to online_css()
    cgroup: reorganize css init / exit paths
    cgroup: add __rcu modifier to cgroup->subsys[]
    ...

    Linus Torvalds
     

24 Aug, 2013

2 commits


09 Aug, 2013

9 commits

  • Previously, all css descendant iterators didn't include the origin
    (root of subtree) css in the iteration. The reasons were maintaining
    consistency with css_for_each_child() and that at the time of
    introduction more use cases needed skipping the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become more confusing and
    looking at the accumulated use cases rather clearly indicates that
    including origin would result in simpler code overall.

    While this is a change which can easily lead to subtle bugs, cgroup
    API including the iterators has recently gone through major
    restructuring and no out-of-tree changes will be applicable without
    adjustments making this a relatively acceptable opportunity for this
    type of change.

    The conversions are mostly straight-forward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, if (pos == origin) continue; is added. Some
    conversions add extra reference get/put around origin handling by
    consolidating origin handling and the rest. While the extra ref
    operations aren't strictly necessary, this shouldn't cause any
    noticeable difference.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cgroup_taskset which is used by the subsystem attach methods is the
    last cgroup subsystem API which isn't using css as the handle. Update
    cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
    cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

    The conversions are pretty mechanical. One exception is
    cpuset::cgroup_cs(), which lost its last user and got removed.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Daniel Wagner
    Cc: Ingo Molnar
    Cc: Matt Helsley
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using css
    (cgroup_subsys_state) as the primary handle instead of cgroup in
    subsystem API. For hierarchy iterators, this is beneficial because

    * In most cases, css is the only thing subsystems care about anyway.

    * On the planned unified hierarchy, iterations for different
    subsystems will need to skip over different subtrees of the
    hierarchy depending on which subsystems are enabled on each cgroup.
    Passing around css makes it unnecessary to explicitly specify the
    subsystem in question as css is intersection between cgroup and
    subsystem

    * For the planned unified hierarchy, css's would need to be created
    and destroyed dynamically independent from cgroup hierarchy. Having
    cgroup core manage css iteration makes enforcing deref rules a lot
    easier.

    Most subsystem conversions are straight-forward. Noteworthy changes
    are

    * blkio: cgroup_to_blkcg() is no longer used. Removed.

    * freezer: cgroup_freezer() is no longer used. Removed.

    * devices: cgroup_to_devcgroup() is no longer used. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsytem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straight forwards but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is transitioning to using css (cgroup_subsys_state) instead of
    cgroup as the primary subsystem handle. The cgroupfs file interface
    will be converted to use css's which requires finding out the
    subsystem from cftype so that the matching css can be determined from
    the cgroup.

    This patch adds cftype->ss which points to the subsystem the file
    belongs to. The field is initialized while a cftype is being
    registered. This makes it unnecessary to explicitly specify the
    subsystem for other cftype handling functions. @ss argument dropped
    from various cftype handling functions.

    This patch shouldn't introduce any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In majority of the cases, subsystems only care about its part in the
    cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • Currently, controllers have to explicitly follow the cgroup hierarchy
    to find the parent of a given css. cgroup is moving towards using
    cgroup_subsys_state as the main controller interface construct, so
    let's provide a way to climb the hierarchy using just csses.

    This patch implements css_parent() which, given a css, returns its
    parent. The function is guarnateed to valid non-NULL parent css as
    long as the target css is not at the top of the hierarchy.

    freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
    are converted to use css_parent() instead of accessing cgroup->parent
    directly.

    * __parent_ca() is dropped from cpuacct and its usage is replaced with
    parent_ca(). The only difference between the two was NULL test on
    cgroup->parent which is now embedded in css_parent() making the
    distinction moot. Note that eventually a css->parent field will be
    added to css and the NULL check in css_parent() will go away.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • css (cgroup_subsys_state) is usually embedded in a subsys specific
    data structure. Subsystems either use container_of() directly to cast
    from css to such data structure or has an accessor function wrapping
    such cast. As cgroup as whole is moving towards using css as the main
    interface handle, add and update such accessors to ease dealing with
    css's.

    All accessors explicitly handle NULL input and return NULL in those
    cases. While this looks like an extra branch in the code, as all
    controllers specific data structures have css as the first field, the
    casting doesn't involve any offsetting and the compiler can trivially
    optimize out the branch.

    * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
    accessor. Added.

    * memory, hugetlb and devices already had one but didn't explicitly
    handle NULL input. Updated.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp large portion of cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the drivers/block uses of the __cpuinit macros
    from all C files.

    [1] https://lkml.org/lkml/2013/5/20/589

    Cc: Jens Axboe
    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

12 Jul, 2013

1 commit

  • Pull core block IO updates from Jens Axboe:
    "Here are the core IO block bits for 3.11. It contains:

    - A tweak to the reserved tag logic from Jan, for weirdo devices with
    just 3 free tags. But for those it improves things substantially
    for random writes.

    - Periodic writeback fix from Jan. Marked for stable as well.

    - Fix for a race condition in IO scheduler switching from Jianpeng.

    - The hierarchical blk-cgroup support from Tejun. This is the grunt
    of the series.

    - blk-throttle fix from Vivek.

    Just a note that I'm in the middle of a relocation, whole family is
    flying out tomorrow. Hence I will be awal the remainder of this week,
    but back at work again on Monday the 15th. CC'ing Tejun, since any
    potential "surprises" will most likely be from the blk-cgroup work.
    But it's been brewing for a while and sitting in my tree and
    linux-next for a long time, so should be solid."

    * 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits)
    elevator: Fix a race in elevator switching
    block: Reserve only one queue tag for sync IO if only 3 tags are available
    writeback: Fix periodic writeback after fs mount
    blk-throttle: implement proper hierarchy support
    blk-throttle: implement throtl_grp->has_rules[]
    blk-throttle: Account for child group's start time in parent while bio climbs up
    blk-throttle: add throtl_qnode for dispatch fairness
    blk-throttle: make throtl_pending_timer_fn() ready for hierarchy
    blk-throttle: make tg_dispatch_one_bio() ready for hierarchy
    blk-throttle: make blk_throtl_bio() ready for hierarchy
    blk-throttle: make blk_throtl_drain() ready for hierarchy
    blk-throttle: dispatch from throtl_pending_timer_fn()
    blk-throttle: implement dispatch looping
    blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work
    blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it
    blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
    blk-throttle: add throtl_service_queue->parent_sq
    blk-throttle: generalize update_disptime optimization in blk_throtl_bio()
    blk-throttle: dispatch to throtl_data->service_queue.bio_lists[]
    blk-throttle: move bio_lists[] and friends to throtl_service_queue
    ...

    Linus Torvalds
     

10 Jul, 2013

3 commits

  • Graft AIX partitions enumeration into partitions/msdos.c

    There is already a AIX disks detection logic in msdos.c. When an AIX disk
    has been found, and if configured to, call the aix partitions recognizer.
    This avoids removal of AIX disks protection from msdos.c, avoids code
    duplication, and ensures that AIX partitions enumeration is called before
    plain msdos partitions enumeration.

    Signed-off-by: Philippe De Muyter
    Cc: Karel Zak
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Philippe De Muyter
     
  • Add partitions/aix.h and partitions/aix.c.

    AIX LVM permits to make "logical volumes" which are made of multiple
    slices of multiple disks. The new code allows only access to the
    "logical volumes" which are made of one slice on the probed disk, a
    slice being a contiguous disk area. The code also detects "logical
    volumes" made of multiple slices on the probed disk, but can not
    describe them to the partition layer, because the partition layer
    generic code does not support that. When such non-contiguous "logical
    volumes" are detected, a diagnostic message is printed.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Philippe De Muyter
    Cc: Karel Zak
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Philippe De Muyter
     
  • Signed-off-by: Philippe De Muyter
    Cc: Karel Zak
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Philippe De Muyter
     

04 Jul, 2013

2 commits

  • Merge first patch-bomb from Andrew Morton:
    - various misc bits
    - I'm been patchmonkeying ocfs2 for a while, as Joel and Mark have been
    distracted. There has been quite a bit of activity.
    - About half the MM queue
    - Some backlight bits
    - Various lib/ updates
    - checkpatch updates
    - zillions more little rtc patches
    - ptrace
    - signals
    - exec
    - procfs
    - rapidio
    - nbd
    - aoe
    - pps
    - memstick
    - tools/testing/selftests updates

    * emailed patches from Andrew Morton : (445 commits)
    tools/testing/selftests: don't assume the x bit is set on scripts
    selftests: add .gitignore for kcmp
    selftests: fix clean target in kcmp Makefile
    selftests: add .gitignore for vm
    selftests: add hugetlbfstest
    self-test: fix make clean
    selftests: exit 1 on failure
    kernel/resource.c: remove the unneeded assignment in function __find_resource
    aio: fix wrong comment in aio_complete()
    drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
    drivers/memstick/host/r592.c: convert to module_pci_driver
    drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
    pps-gpio: add device-tree binding and support
    drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
    drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
    drivers/parport/share.c: use kzalloc
    Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
    aoe: update internal version number to v83
    aoe: update copyright date
    aoe: perform I/O completions in parallel
    ...

    Linus Torvalds
     
  • Disk names may contain arbitrary strings, so they must not be
    interpreted as format strings. It seems that only md allows arbitrary
    strings to be used for disk names, but this could allow for a local
    memory corruption from uid 0 into ring 0.

    CVE-2013-2851

    Signed-off-by: Kees Cook
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook