16 May, 2011

1 commit

  • Currently we first map the task to cgroup and then cgroup to
    blkio_cgroup. There is a more direct way to get to blkio_cgroup
    from task using task_subsys_state(). Use that.

    The real reason for the fix is that it also avoids a race in generic
    cgroup code. During remount/umount rebind_subsystems() is called and
    it can do the following without any rcu protection.

    cgrp->subsys[i] = NULL;

    That means if somebody got hold of cgroup under rcu and then it tried
    to do cgroup->subsys[] to get to blkio_cgroup, it would get NULL which
    is wrong. I was running into this race condition with ltp running on an
    upstream derived kernel and that led to a crash.

    So ideally we should also fix cgroup generic code to wait for rcu
    grace period before setting pointer to NULL. Li Zefan is not very keen
    on introducing synchronize_wait() as he thinks it will slow
    down mount/remount/umount operations.

    So for the time being at least fix the kernel crash by taking a more
    direct route to blkio_cgroup.

    One tester had reported a crash while running LTP on a derived kernel,
    and with this fix the crash is no longer seen even though the test has
    been running for over 6 days.

    Signed-off-by: Vivek Goyal
    Reviewed-by: Li Zefan
    Signed-off-by: Jens Axboe

    Vivek Goyal
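
    The difference between the two lookup routes described in the commit
    above can be modeled in userspace. This is only a sketch: the struct
    and function names below are hypothetical stand-ins, not the kernel's
    real cgroup structures or the real task_subsys_state() implementation.

```c
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's structures. */
#define SUBSYS_COUNT 2
#define BLKIO_SUBSYS 1

struct subsys_state { int id; };

struct cgroup_model {
    /* Slot that the rebind step may set to NULL during remount/umount. */
    struct subsys_state *subsys[SUBSYS_COUNT];
};

struct task_model {
    struct cgroup_model *cgrp;                       /* task -> cgroup */
    struct subsys_state *subsys_state[SUBSYS_COUNT]; /* task -> state  */
};

/* Old route: task -> cgroup -> subsys[]. Returns NULL once the slot is cleared. */
static struct subsys_state *lookup_via_cgroup(const struct task_model *t)
{
    return t->cgrp->subsys[BLKIO_SUBSYS];
}

/* Direct route: read the state reachable from the task itself. */
static struct subsys_state *lookup_direct(const struct task_model *t)
{
    return t->subsys_state[BLKIO_SUBSYS];
}

/* Stand-in for the rebind step that clears the cgroup's slot. */
static void model_rebind(struct cgroup_model *c)
{
    c->subsys[BLKIO_SUBSYS] = NULL;
}
```

    In this model, a caller that still holds the cgroup pointer after
    model_rebind() gets NULL from lookup_via_cgroup(), while
    lookup_direct() is unaffected.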
     

29 Apr, 2011

2 commits

  • __blkdev_get() doesn't rescan partitions if disk->fops->open() fails,
    which leads to ghost partition devices lingering after medium removal
    is known to both the kernel and userland. The behavior also creates a
    subtle inconsistency where O_NONBLOCK open, which doesn't fail even if
    there's no medium, clears the ghost partitions, which is exploited to
    work around the problem from userland.

    Fix it by updating __blkdev_get() to issue partition rescan after
    -ENOMEDIUM too.

    This was reported in the following bz.

    https://bugzilla.kernel.org/show_bug.cgi?id=13029

    Stable: 2.6.38

    Signed-off-by: Tejun Heo
    Reported-by: David Zeuthen
    Reported-by: Martin Pitt
    Reported-by: Kay Sievers
    Tested-by: Kay Sievers
    Cc: Alan Cox
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
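
    The rescan decision described in the commit above can be sketched as
    a small predicate. The helper name and its parameters are
    hypothetical; the actual condition in __blkdev_get() may be written
    differently. The "no medium" errno is spelled ENOMEDIUM in the C
    library headers on Linux.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether partitions should be rescanned
 * after the driver's open() returned open_ret. "invalidated" models the
 * block device having been marked as needing a rescan.
 */
static bool should_rescan_partitions(int open_ret, bool invalidated)
{
    if (!invalidated)
        return false;
    /* Rescan on success, and also when open failed for lack of a medium. */
    return open_ret == 0 || open_ret == -ENOMEDIUM;
}
```

    With this rule, a failed blocking open on an empty drive clears the
    stale partitions just as an O_NONBLOCK open would, while unrelated
    failures such as -EIO leave them alone.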
     
  • cdrom_open() called check_disk_change() after the rest of the open path
    succeeded, which leads to the following bizarre behavior.

    * After media change, if the device is opened without O_NONBLOCK,
    open_for_data() naturally fails with -ENOMEDIUM and
    check_disk_change() is never called. The media is known to be gone
    and the open failure makes it obvious to the userland but device
    invalidation never happens.

    * But if the device is opened with O_NONBLOCK, all the checks are
    bypassed and cdrom_open() doesn't notice that the media is not there
    and check_disk_change() is called and invalidation happens.

    There's nothing to be gained by avoiding calling check_disk_change()
    on open failure. Common cases end up calling check_disk_change()
    anyway. All we get is inconsistent behavior.

    Fix it by moving check_disk_change() invocation to the top of
    cdrom_open() so that it always gets called regardless of how the rest
    of open proceeds.

    Stable: 2.6.38

    Signed-off-by: Tejun Heo
    Reported-by: Amit Shah
    Tested-by: Amit Shah
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
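
    The reordering described in the commit above can be illustrated with
    a userspace model. All names here are hypothetical; the counter only
    stands in for check_disk_change() being invoked, and the model omits
    everything else cdrom_open() does.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical drive state for the model. */
struct drive_model {
    bool medium_present;
    int  disk_change_checks; /* how often the change check ran */
};

/* Stand-in for the data-open path: fails when no medium is present. */
static int open_for_data_model(const struct drive_model *d)
{
    return d->medium_present ? 0 : -ENOMEDIUM;
}

/* Model of the fixed ordering: the change check runs first, always. */
static int cdrom_open_model(struct drive_model *d, bool nonblock)
{
    d->disk_change_checks++;   /* runs regardless of the outcome below */
    if (nonblock)
        return 0;              /* O_NONBLOCK bypasses the medium checks */
    return open_for_data_model(d);
}
```

    In this model both the blocking open (which fails on an empty drive)
    and the O_NONBLOCK open (which succeeds) bump the counter, which is
    the consistency the commit aims for.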
     

22 Apr, 2011

12 commits


21 Apr, 2011

16 commits

  • For some reason generic_setxattr() did not pass flags (XATTR_CREATE,
    XATTR_REPLACE) to the filesystem specific helper. As a result, the
    setxattr(2) syscall simply ignored these flags.

    Fix the bug by passing flags correctly.

    Signed-off-by: Jan Kara
    Acked-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Jan Kara
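
    For context, the semantics that setxattr(2) documents for these two
    flags can be sketched as follows. The function and the MODEL_*
    constants are hypothetical illustrations, not the kernel helper the
    commit touches.

```c
#include <errno.h>
#include <stdbool.h>

/* Local stand-ins for XATTR_CREATE and XATTR_REPLACE. */
#define MODEL_XATTR_CREATE  0x1
#define MODEL_XATTR_REPLACE 0x2

/*
 * Hypothetical check a filesystem helper can only perform if the flags
 * actually reach it: "exists" says whether the attribute is already set.
 */
static int check_xattr_flags(bool exists, int flags)
{
    if ((flags & MODEL_XATTR_CREATE) && exists)
        return -EEXIST;   /* pure create must not overwrite */
    if ((flags & MODEL_XATTR_REPLACE) && !exists)
        return -ENODATA;  /* pure replace needs an existing attribute */
    return 0;
}
```

    If the flags are dropped before the helper runs (equivalent to
    passing 0 here), both error cases silently succeed, which matches the
    symptom described above.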
     
  • This call was disabled as hot-unplugging one virtconsole port led to
    another virtconsole port freezing.

    Upon testing it again, this now works, so enable it.

    In addition, a bug was found in qemu wherein removing a port of one type
    caused the guest output from another port to stop working. I doubt it
    was just this bug that caused it (since disabling the hvc_remove() call
    did allow other ports to continue working), but since it's all solved
    now, we're fine with hot-unplugging of virtconsole ports.

    Signed-off-by: Amit Shah
    Signed-off-by: Rusty Russell

    Amit Shah
     
  • In the case where a virtio-console port is in use (opened by a program)
    and a virtio-console device is removed, the port is kept around but all
    the virtio-related state is assumed to be gone.

    When the port is finally released (close() called), we call
    device_destroy() on the port's device. This results in the parent
    device's structures being freed as well. This includes the PCI regions
    for the virtio-console PCI device.

    Once this is done, however, virtio_pci_release_dev() kicks in, as the
    last ref to the virtio device is now gone, and attempts to do

    pci_iounmap(pci_dev, vp_dev->ioaddr);
    pci_release_regions(pci_dev);
    pci_disable_device(pci_dev);

    which results in a double-free warning.

    Move the code that releases regions, etc., to the virtio_pci_remove()
    function, and all that's now left in release_dev is the final freeing of
    the vp_dev.

    Signed-off-by: Amit Shah
    Signed-off-by: Rusty Russell

    Amit Shah
     
  • When detaching a buffer from a vq, the avail.idx value should be
    decremented as well.

    This was noticed by hot-unplugging a virtio console port and then
    plugging in a new one on the same number (re-using the vqs which were
    just 'disowned'). qemu reported

    'Guest moved used index from 0 to 256'

    when any IO was attempted on the new port.

    CC: stable@kernel.org
    Reported-by: juzhang
    Signed-off-by: Amit Shah
    Signed-off-by: Rusty Russell

    Amit Shah
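
    The bookkeeping described in the commit above can be sketched with a
    toy queue. The struct and function names are hypothetical, and the
    model tracks only the two counters relevant here, not a real vring.

```c
#include <stdint.h>

/* Hypothetical model of the driver-side counters of a virtqueue. */
struct vq_model {
    uint16_t     avail_idx; /* buffers made available to the host */
    unsigned int num_free;  /* free descriptors left              */
};

/* Exposing a buffer consumes a descriptor and advances avail_idx. */
static void vq_add_buf(struct vq_model *vq)
{
    vq->avail_idx++;
    vq->num_free--;
}

/* Detaching an unused buffer must undo both steps of vq_add_buf(). */
static void vq_detach_unused_buf(struct vq_model *vq)
{
    vq->avail_idx--;
    vq->num_free++;
}
```

    If the detach step only returned the descriptor and left avail_idx
    untouched, a reused queue would start from a stale index, which is
    the kind of mismatch the qemu message above reports.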
     
  • Intel VT-d Protected Memory Regions (PMRs) are supposed to be disabled,
    on each VT-d engine, after DMA remapping is enabled on the engines.
    This is because the behavior of having both enabled is not deterministic
    and because, if TXT has been used to launch the kernel, the PMRs may be
    programmed to cover memory regions that will be used for DMA.

    Under some circumstances (certain quirks detected, lack of multiple
    devices, etc.), the current code does not set up DMA remapping on some
    VT-d engines. In such cases it also skips disabling the PMRs. This
    causes failures when the kernel is launched with TXT (most often this
    occurs on the graphics engine and results in colored vertical bars on
    the display).

    This patch detects when the kernel has been launched with TXT and then
    disables the PMRs on all VT-d engines. In some cases where the reason
    that remapping is not being enabled is due to possible ACPI DMAR table
    errors, the VT-d engine addresses may not be correct and thus not able
    to be safely programmed even to disable PMRs. Because part of the TXT
    launch process is the verification of these addresses, it will always be
    safe to disable PMRs if the TXT launch has succeeded; hence this is
    only done in such cases.

    Signed-off-by: Joseph Cihula
    Signed-off-by: David Woodhouse

    Joseph Cihula
     
  • The cpu<->node mappings under CONFIG_DEBUG_PER_CPU_MAPS=y
    when NUMA emulation is enabled are currently broken because the code
    does not iterate through every emulated node and bind cpus that have
    affinity to it.

    NUMA emulation should bind each cpu to every local node to
    accurately represent the true NUMA topology of the underlying
    machine.

    debug_cpumask_set_cpu() needs to be fixed at the same time so
    that the debugging information that it emits shows the new
    cpumask of the node being assigned when the cpu is being added
    or removed.

    It can now take responsibility for setting or clearing the cpu
    itself to remove the need for duplicate code.

    Also change its last parameter, "enable", to have the correct bool
    type since it can only be true or false.

    -v2: Fix the return statements, by Kosaki Motohiro

    Acked-and-Tested-by: KOSAKI Motohiro
    Signed-off-by: David Rientjes
    Cc: Andreas Herrmann
    Cc: Tejun Heo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1104201918470.12634@chino.kir.corp.google.com
    Signed-off-by: Ingo Molnar

    David Rientjes
     
  • Andreas Herrmann reported that 7d6b46707f24 ("x86, NUMA: Fix fakenuma
    boot failure") causes certain physical NUMA topologies (for example
    AMD Magny-Cours) to move sibling cpus to a single node when in reality
    they are in separate domains.

    This may result in some nodes being completely void of cpus, which
    doesn't accurately represent the correct topology. The system will
    boot, but will have suboptimal NUMA performance.

    This commit was intended as a fix for NUMA emulation, but should
    not cause a regression for real NUMA machines as a side effect.

    ( There will be a separate fix for the numa-debug code, which
    will not affect physical topologies. )

    Reported-by: Andreas Herrmann
    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: Tejun Heo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1104201918110.12634@chino.kir.corp.google.com
    Signed-off-by: Ingo Molnar

    David Rientjes
     
  • pg_start is copied from userspace on AGPIOC_BIND and AGPIOC_UNBIND ioctl
    cmds of agp_ioctl() and passed to agpioc_bind_wrap(). As said in the
    comment, (pg_start + mem->page_count) may wrap in case of AGPIOC_BIND,
    and it is not checked at all in case of AGPIOC_UNBIND. As a result, a
    user with sufficient privileges (usually "video" group) may cause
    either a local DoS or privilege escalation.

    Signed-off-by: Vasiliy Kulikov
    Signed-off-by: Dave Airlie

    Vasiliy Kulikov
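
    A range check that accounts for the wrap described in the commit
    above might look like the sketch below. The function name and the
    num_entries parameter (the number of valid table entries) are
    hypothetical and not taken from the AGP code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical validation of a user-supplied page range
 * [pg_start, pg_start + page_count) against a table of num_entries.
 */
static bool agp_range_ok(size_t pg_start, size_t page_count,
                         size_t num_entries)
{
    /* Unsigned addition wraps on overflow; a wrapped sum is smaller. */
    if (pg_start + page_count < pg_start)
        return false;
    return pg_start + page_count <= num_entries;
}
```

    Without the first test, a huge pg_start plus a small page_count wraps
    to a small sum and passes the bounds comparison.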
     
  • page_count is copied from userspace. agp_allocate_memory() tries to
    check whether this number is too big, but doesn't take into account the
    wrap case. Also agp_create_user_memory() doesn't check whether
    alloc_size is calculated from num_agp_pages variable without overflow.
    This may lead to allocation of a too-small buffer followed by a buffer
    overflow.

    Another problem in agp code is not addressed in the patch - kernel memory
    exhaustion (AGPIOC_RESERVE and AGPIOC_ALLOCATE ioctls). It is not checked
    whether requested pid is a pid of the caller (no check in agpioc_reserve_wrap()).
    Each allocation is limited to 16KB, but there is no per-process limit.
    This might lead to an OOM situation, which is not resolved even if the
    caller is killed by the OOM killer - the memory is allocated for another
    (faked) process.

    Signed-off-by: Vasiliy Kulikov
    Signed-off-by: Dave Airlie

    Vasiliy Kulikov
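
    The size-calculation problem described in the commit above is the
    usual multiplication overflow before an allocation. A minimal
    userspace sketch, assuming one unsigned long per page entry (an
    assumption made for illustration only) and a hypothetical function
    name:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical allocator for a per-page array. Rejects counts whose
 * byte size would overflow size_t instead of allocating a short buffer.
 */
static void *alloc_page_array(size_t num_pages)
{
    if (num_pages == 0 || num_pages > SIZE_MAX / sizeof(unsigned long))
        return NULL;
    return malloc(num_pages * sizeof(unsigned long));
}
```

    Dividing the maximum size by the element size avoids performing the
    multiplication that could wrap in the first place.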
     
  • * 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging:
    hwmon: (max34440) Add driver documentation
    hwmon: (max16064) Add driver documentation
    hwmon: (max8688) Add driver documentation
    hwmon: (pmbus) Documentation updates
    hwmon: (smm665) Fix spelling error in driver documentation
    hwmon: (pmbus) Removed unused variable from struct pmbus_data
    hwmon: Add submitting-patches checklist to documentation

    Linus Torvalds
     
  • * 'for-2.6.39' of git://linux-nfs.org/~bfields/linux:
    Open with O_CREAT flag set fails to open existing files on non writable directories
    nfsd4: Fix filp leak
    nfsd4: fix struct file leak on delegation

    Linus Torvalds
     
  • * 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm:
    ARM: 6881/1: cputype.h uses __attribute_const__ which requires including kernel.h
    ARM: Add new syscalls

    Linus Torvalds
     
  • * 'stable/bug-fixes-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen: mask_rw_pte: do not apply the early_ioremap checks on x86_32
    xen: do not create the extra e820 region at an addr lower than 4G

    Linus Torvalds
     
  • * 'for-linus' of git://neil.brown.name/md:
    md: Update documentation for sync_min and sync_max entries
    md: Cleanup after raid45->raid0 takeover
    md: Fix dev_sectors on takeover from raid0 to raid4/5
    md/raid5: remove setting of ->queue_lock

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: Remove the extra check in queue_requests_store
    block, blk-sysfs: Fix an err return path in blk_register_queue()
    block: remove stale kerneldoc member from __blk_run_queue()
    block: get rid of QUEUE_FLAG_REENTER
    cfq-iosched: read_lock() does not always imply rcu_read_lock()
    block: kill blk_flush_plug_list() export

    Linus Torvalds
     
  • Commit 957935dc ("xfs: fix xfs_debug warnings") broke the logic in
    __xfs_printk(). Instead of only printing one of two possible output
    strings based on whether the fs has a name or not, it outputs both.
    Fix it to only output one message again.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
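
    The intended either/or behavior described in the commit above can be
    sketched in userspace. The function name and the exact prefix strings
    are assumptions for illustration; the kernel's __xfs_printk() uses
    its own printk-based formatting.

```c
#include <stdio.h>

/*
 * Hypothetical formatter: emit exactly one message, with the filesystem
 * name in the prefix when one is known and a bare prefix otherwise.
 */
static int format_fs_message(char *buf, size_t len,
                             const char *fsname, const char *msg)
{
    if (fsname && fsname[0] != '\0')
        return snprintf(buf, len, "XFS (%s): %s", fsname, msg);
    return snprintf(buf, len, "XFS: %s", msg);
}
```

    The early return is what keeps the second form from being printed in
    addition to the first.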
     

20 Apr, 2011

9 commits

  • An open on a NFS4 share using the O_CREAT flag on an existing file for
    which we have permissions to open but contained in a directory with no
    write permissions will fail with EACCES.

    A tcpdump shows that the client had set the open mode to UNCHECKED which
    indicates that the file should be created if it doesn't exist and
    encountering an existing file is not an error. Since in this case the
    file exists and can be opened by the user, the NFS server is wrong in
    attempting to check create permissions on the parent directory.

    The patch adds a conditional statement to check for create permissions
    only if the file doesn't exist.

    Signed-off-by: Sachin S. Prabhu
    Signed-off-by: J. Bruce Fields

    Sachin Prabhu
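
    The conditional described in the commit above can be modeled as a
    small decision function. The name and the boolean parameters are
    hypothetical; the real nfsd code works on dentries and permission
    masks rather than flags like these.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of an UNCHECKED open with O_CREAT: directory write
 * permission matters only when the file actually has to be created.
 */
static int open_create_model(bool file_exists, bool dir_writable,
                             bool file_openable)
{
    if (!file_exists)
        return dir_writable ? 0 : -EACCES; /* creating needs dir access */
    return file_openable ? 0 : -EACCES;    /* existing file: check it   */
}
```

    An existing, openable file in a read-only directory now succeeds in
    this model, while creating a new file there still fails with EACCES.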
     
  • The two "is_early_ioremap_ptep" checks in mask_rw_pte are only used on
    x86_64; in fact early_ioremap is not used at all to set up the initial
    pagetable on x86_32.
    Moreover on x86_32 the two checks are wrong because the range
    pgt_buf_start..pgt_buf_end initially should be mapped RW because
    the pages in the range are not pagetable pages yet and haven't been
    cleared yet. Afterwards, since pgt_buf_start..pgt_buf_end is
    part of the initial mapping, xen_alloc_pte is capable of turning
    the ptes RO when they become pagetable pages.

    Fix the issue and improve the readability of the code by providing two
    different implementations of mask_rw_pte for x86_32 and x86_64.

    Signed-off-by: Stefano Stabellini
    Signed-off-by: Konrad Rzeszutek Wilk

    Stefano Stabellini
     
  • Do not add the extra e820 region at a physical address lower than 4G
    because it breaks e820_end_of_low_ram_pfn().

    It is OK for us to move the xen_extra_mem_start up and down because this
    is the index of the memory that can be ballooned in/out - it is memory
    not available to the kernel during bootup.

    Signed-off-by: Stefano Stabellini
    Signed-off-by: Konrad Rzeszutek Wilk

    Stefano Stabellini
     
  • linux/Documentation/md.txt is missing descriptions for the sync_min and
    sync_max entries.
    This patch adds descriptions for the sync_min and sync_max entries.

    Signed-off-by: Roman Ovchinnikov
    Signed-off-by: NeilBrown

    CoolCold
     
  • Problem:
    After a raid4->raid0 takeover operation, another takeover operation
    (e.g. raid0->raid10) results in a kernel oops.
    Root cause:
    The variable 'degraded' in the mddev structure is not cleared
    on raid45->raid0 takeover.

    This patch resets this variable.

    Signed-off-by: Krzysztof Wojcik
    Signed-off-by: NeilBrown

    Krzysztof Wojcik
     
  • A raid0 array doesn't set 'dev_sectors' as each device might
    contribute a different number of sectors.
    So when converting to a RAID4 or RAID5 we need to set dev_sectors
    as they need the number.
    We have already verified that in fact all devices do contribute
    the same number of sectors, so use that number.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • We previously needed to set ->queue_lock to match the raid5
    device_lock so we could safely use queue_flag_* operations (e.g. for
    plugging), which test that the ->queue_lock is in fact locked.

    However that need has completely gone away and is unlikely to come
    back, so remove this now-pointless setting.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • * 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
    drm/radeon/kms: pll tweaks for r7xx
    drm/nouveau: fix allocation of notifier object
    drm/nouveau: fix notifier memory corruption bug
    drm/nouveau: fix pinning of notifier block
    drm/nouveau: populate ttm_alloced with false, when it's not
    drm/nouveau: fix nv30 pcie boards
    drm/nouveau: split ramin_lock into two locks, one hardirq safe
    drm/radeon/kms: adjust evergreen display watermark setup
    drm/radeon/kms: add connectors even if i2c fails
    drm/radeon/kms: fix bad shift in atom iio table parser

    Linus Torvalds
     
  • agd5f: fix commit message.

    Signed-off-by: Cedric Cano
    Reviewed-by: Michel Dänzer
    Signed-off-by: Alex Deucher
    Signed-off-by: Dave Airlie

    Cédric Cano