19 Oct, 2007

20 commits

  • The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
    can change the capabilities of another process, p2. This is not the
    meaning that was intended for this capability at all, and this
    implementation came about purely because, without filesystem capabilities,
    there was no way to use capabilities without one process bestowing them on
    another.

    Since we now have filesystem support for capabilities we can fix the
    implementation of CAP_SETPCAP.

    The most significant thing about this change is that, with it in effect, no
    process can set the capabilities of another process.

    The capabilities of a program are set via the capability convolution
    rules:

    pI(post-exec) = pI(pre-exec)
    pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
    pE(post-exec) = fE ? pP(post-exec) : 0

    at exec() time. As such, the only influence the pre-exec() program can
    have on the post-exec() program's capabilities is through the pI
    capability set.

    The correct implementation for CAP_SETPCAP (and that enabled by this patch)
    is that it can be used to add extra pI capabilities to the current process
    - to be picked up by subsequent exec()s when the above convolution rules
    are applied.

    Here is how it works:

    Let's say we have a process, p. It has capability sets, pE, pP and pI.
    Generally, p, can change the value of its own pI to pI' where

    (pI' & ~pI) & ~pP = 0.

    That is, the only new things in pI' that were not present in pI need to
    be present in pP.

    The role of CAP_SETPCAP is basically to permit changes to pI beyond
    the above:

    if (pE & CAP_SETPCAP) {
        pI' = anything; /* i.e., even (pI' & ~pI) & ~pP != 0 */
    }

    This capability is useful for things like login, which (say, via
    pam_cap) might want to raise certain inheritable capabilities for use
    by the children of the logged-in user's shell, but those capabilities
    are not useful to or needed by the login program itself.

    One such use might be to limit who can run ping. You set the
    capabilities of the 'ping' program to be "= cap_net_raw+i", and then
    only shells that have (pI & CAP_NET_RAW) will be able to run
    it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
    would have to also have (pP & CAP_NET_RAW) in order to raise this
    capability and pass it on through the inheritable set.
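
    The rules above can be sketched in userspace C. This is an illustration
    only, not the kernel code: the type and function names (cap_mask_t,
    proc_caps, file_caps, caps_after_exec, may_set_inheritable, ping_example)
    are hypothetical, and capability sets are modelled as plain 64-bit masks.
    The bit positions for CAP_SETPCAP (8) and CAP_NET_RAW (13) follow the
    capability header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cap_mask_t;            /* one bit per capability (illustrative) */

#define CAP_BIT(n)       ((cap_mask_t)1 << (n))
#define CAP_SETPCAP_BIT  CAP_BIT(8)     /* CAP_SETPCAP */
#define CAP_NET_RAW_BIT  CAP_BIT(13)    /* CAP_NET_RAW */

struct proc_caps { cap_mask_t pE, pP, pI; };          /* process sets */
struct file_caps { cap_mask_t fP, fI; bool fE; };     /* file sets */

/* Apply the exec() rules quoted above; X is the bounding set (cap_bset). */
static struct proc_caps caps_after_exec(struct proc_caps pre,
                                        struct file_caps f, cap_mask_t X)
{
    struct proc_caps post;

    post.pI = pre.pI;
    post.pP = (X & f.fP) | (post.pI & f.fI);
    post.pE = f.fE ? post.pP : 0;
    return post;
}

/* May a process holding `cur` change its inheritable set to new_pI? */
static bool may_set_inheritable(struct proc_caps cur, cap_mask_t new_pI)
{
    if (cur.pE & CAP_SETPCAP_BIT)
        return true;                                  /* anything goes */
    /* Otherwise every newly added bit must already be in pP. */
    return ((new_pI & ~cur.pI) & ~cur.pP) == 0;
}

/* The ping scenario: login holds only CAP_SETPCAP, ping is "cap_net_raw+i". */
void ping_example(void)
{
    struct proc_caps login = { CAP_SETPCAP_BIT, CAP_SETPCAP_BIT, 0 };
    struct file_caps ping  = { 0, CAP_NET_RAW_BIT, false };
    struct proc_caps shell = login;

    /* login may raise CAP_NET_RAW in pI without having it in pP. */
    assert(may_set_inheritable(login, CAP_NET_RAW_BIT));
    shell.pI = CAP_NET_RAW_BIT;

    /* After exec() of ping, the capability shows up in the permitted set. */
    assert(caps_after_exec(shell, ping, ~(cap_mask_t)0).pP == CAP_NET_RAW_BIT);
}
```

    A process without CAP_SETPCAP in pE and without CAP_NET_RAW in pP fails
    the may_set_inheritable() check, which is the restriction the commit
    message describes.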

    Signed-off-by: Andrew Morgan
    Signed-off-by: Serge E. Hallyn
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morgan
     
  • After going through the kernel's sysctl tables several times it has become
    clear that code review and testing are just not effective in preventing
    problematic sysctl tables from being used in the stable kernel. I certainly
    can't seem to fix the problems as fast as they are introduced.

    Therefore this patch adds sysctl_check_table which is called when a sysctl
    table is registered and checks to see if we have a problematic sysctl table.

    The biggest part of the code is the table of valid binary sysctl entries, but
    since we have frozen our set of binary sysctls this table should not need to
    change, and it makes it much easier to detect when someone unintentionally
    adds a new binary sysctl value.

    As best as I can determine all of the several hundred errors spewed on boot up
    now are legitimate.

    [bunk@kernel.org: kernel/sysctl_check.c must #include ]
    Signed-off-by: Eric W. Biederman
    Cc: Alexey Dobriyan
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Grumble. These numbers should have been in sysctl.h from the beginning if we
    ever expected anyone to use them. Oh well, put them there now so we can find
    them and make maintenance easier.

    Signed-off-by: Eric W. Biederman
    Acked-by: Samuel Ortiz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The sysctl binary paths don't look as if they could even work: .data is not
    filled in, all of the proc_handlers look at extra1, and there is no
    strategy routine.

    So just kill the binary paths.

    In addition this patch removes the setting of extra1 on directories. It
    doesn't look like the parport code ever examines it, and it's bad sysctl form.

    [bunk@kernel.org: remove parport_device_num()]
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • There has been no easy way to wrap the default sysctl strategy routine except
    for returning 0, which is not always what we want. The few instances I have
    seen that want different behaviour have written their own version of
    sysctl_data. While not too hard it is unnecessary code and has the potential
    for extra bugs.

    So to make these situations easier and make that part of sysctl more symmetric
    I have factored sysctl_data out of do_sysctl_strategy and exported it as a
    function everyone can use.

    Further having sysctl_data be an explicit function makes checking for badly
    formed sysctl tables much easier.

    Signed-off-by: Eric W. Biederman
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • In sysctl.h the typedef struct ctl_table ctl_table violates coding style, isn't
    needed, and is a bit of a nuisance because it makes it harder to recognize
    that ctl_table is a type name.

    So this patch removes it from the generic sysctl code. Hopefully I will have
    enough energy to send the rest of my patches and to remove it from
    the rest of the kernel.

    Signed-off-by: Eric W. Biederman
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Now that we have DMA_BIT_MASK(), these macros are pointless.

    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Remove redundant DMA_..BIT_MASK definitions across two drivers. The
    computation of the majority of the bitmasks is done by the compiler. The
    initial split of the patch, with each part touching a different file, was
    dropped because of possible git bisect breakage.
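
    As an illustration of why the hand-written masks are redundant, here is a
    small userspace sketch. DMA_BIT_MASK() is written in the form later
    kernels use (the exact definition at the time of this commit is an
    assumption); the OLD_* constants and dma_addr_fits() are hypothetical
    names made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* n == 64 is special-cased because shifting a 64-bit value by 64 is undefined. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hand-written constants of the kind such cleanups remove (illustrative). */
#define OLD_DMA_32BIT_MASK 0x00000000ffffffffULL
#define OLD_DMA_24BIT_MASK 0x0000000000ffffffULL

/* Hypothetical helper: does addr fit under a mask of the given width? */
static int dma_addr_fits(uint64_t addr, unsigned int bits)
{
    assert(bits >= 1 && bits <= 64);
    return (addr & ~DMA_BIT_MASK(bits)) == 0;
}
```

    With constant arguments the compiler folds DMA_BIT_MASK(32) to the same
    value as the hand-written constant, so nothing is computed at run time.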

    Signed-off-by: Borislav Petkov
    Cc: Jeremy Fitzhardinge
    Cc: Muli Ben-Yehuda
    Cc: Jeff Garzik
    Cc: James Bottomley
    Reviewed-by: Satyam Sharma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • On platforms that copy sys_tz into the vdso (currently only x86_64, soon to
    include powerpc), it is possible for the vdso to get out of sync if a user
    calls (admittedly unusual) settimeofday(NULL, ptr).

    This patch adds a hook for architectures that set
    CONFIG_GENERIC_TIME_VSYSCALL to ensure when sys_tz is updated they can also
    update their copy in the vdso.

    Signed-off-by: Tony Breeds
    Cc: Andi Kleen
    Cc: Tony Luck
    Acked-by: John Stultz
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Breeds
     
  • Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b
    during 2.6.9 development. That commit introduced the io_wait field, which was
    write-only then and still remains write-only.

    Also garbage collect macros which "use" io_wait.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The following scenario leads to total confusion of the platform firmware on
    some boxes (eg. HPC nx6325):
    * Hibernate with ACPI enabled
    * Resume passing "acpi=off" to the boot kernel

    To prevent this from happening it's necessary to check if ACPI is enabled (and
    enable it if that's not the case) _right_ _after_ control has been transferred
    from the boot kernel to the image kernel, before device_power_up() is called
    (ie. with interrupts disabled).  Enabling ACPI after calling
    device_power_up() turns out to be insufficient.

    For this reason, introduce new hibernation callback ->leave() that will be
    executed before device_power_up() by the restored image kernel.  To make it
    work, it also is necessary to move swsusp_suspend() from swsusp.c to disk.c
    (its name is changed to "create_image", which is more to the point).

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Make it possible to restore a hibernation image on x86_64 with the help of a
    kernel different from the one in the image.

    The idea is to split the core restoration code into two separate parts and to
    place each of them in a different page.  The first part belongs to the boot
    kernel and is executed as the last step of the image kernel's memory
    restoration procedure.  Before being executed, it is relocated to a safe page
    that won't be overwritten while copying the image kernel pages.

    The final operation performed by it is a jump to the second part of the core
    restoration code that belongs to the image kernel and has just been restored.
    This code makes the CPU switch to the image kernel's page tables and restores
    the state of general purpose registers (including the stack pointer) from
    before the hibernation.

    The main issue with this idea is that in order to jump to the second part of
    the core restoration code the boot kernel needs to know its address.
    However, this address may be passed to it in the image header.  Namely, the
    part of the image header previously used for checking if the version of the
    image kernel is correct can be replaced with some architecture specific data
    that will allow the boot kernel to jump to the right address within the image
    kernel.  These data should also be used for checking if the image kernel is
    compatible with the boot kernel (as far as the memory restoration procedure
    is concerned). It can be done, for example, with the help of a "magic" value
    that has to be equal in both kernels, so that they can be regarded as
    compatible.
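
    The header idea can be sketched as follows. This is a minimal userspace
    illustration, not the patch itself: the structure layout, the magic value
    and the function names (restore_record, header_save, header_restore) are
    all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RESTORE_MAGIC 0x0123456789ABCDEFULL   /* made-up value for the example */

/* Hypothetical arch-specific data stored in the image header. */
struct restore_record {
    uint64_t jump_address;   /* entry point of the image kernel's restore code */
    uint64_t magic;          /* must match between boot and image kernel */
};

/* Image kernel side: fill in the record while creating the image. */
static void header_save(struct restore_record *rec, uint64_t jump_address)
{
    assert(rec != NULL);
    rec->jump_address = jump_address;
    rec->magic = RESTORE_MAGIC;
}

/*
 * Boot kernel side: accept the image only if the magic matches.
 * Returns 0 and stores the jump target on success, -1 on mismatch.
 */
static int header_restore(const struct restore_record *rec, uint64_t *jump_out)
{
    assert(rec != NULL && jump_out != NULL);
    if (rec->magic != RESTORE_MAGIC)
        return -1;
    *jump_out = rec->jump_address;
    return 0;
}
```

    The boot kernel never needs to know the image kernel's version, only that
    both sides agree on the magic value and on the meaning of the record.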

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Currently, there's a CONFIG_DISABLE_CONSOLE_SUSPEND that allows one to stop
    the serial console from being suspended when the rest of the machine goes
    to sleep. This is incredibly useful for debugging power management-related
    things; however, having it as a compile-time option has proved to be
    incredibly inconvenient for us (OLPC). There are plenty of times that we
    want serial console to not suspend, but for the most part we'd like serial
    console to be suspended.

    This drops CONFIG_DISABLE_CONSOLE_SUSPEND, and replaces it with a kernel
    boot parameter (no_console_suspend). By default, the serial console will
    be suspended along with the rest of the system; by passing
    'no_console_suspend' to the kernel during boot, serial console will remain
    alive during suspend.

    For now, this is pretty serial console specific; further fixes could be
    applied to make this work for things like netconsole.

    Signed-off-by: Andres Salomon
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Pavel Machek
    Cc: Nigel Cunningham
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andres Salomon
     
  • Introduce freezer-friendly wrappers around wait_event_interruptible() and
    wait_event_interruptible_timeout(), originally defined in , to
    be used in freezable kernel threads. Make some of the freezable kernel
    threads use them.

    This is necessary for the freezer to stop sending signals to kernel threads,
    which is implemented in the next patch.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Cc: Nigel Cunningham
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Rename 'struct hibernation_ops' to 'struct platform_hibernation_ops' in
    analogy with 'struct platform_suspend_ops'.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Cc: Len Brown
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • During hibernation we also need to tell the ACPI core that we're going to put
    the system into the S4 sleep state. For this reason, an additional method in
    'struct hibernation_ops' is needed, playing the role of set_target() in
    'struct platform_suspend_operations'. Moreover, the role of the .prepare()
    method is now different, so it's better to introduce another method, that in
    general may be different from .prepare(), that will be used to prepare the
    platform for creating the hibernation image (.prepare() is used anyway to
    notify the platform that we're going to enter the low power state after the
    image has been saved).

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The variable suspend_ops representing the set of global platform-specific
    suspend-related operations, used by the PM core, need not be exported outside
    of kernel/power/main.c .  Make it static.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • There is no reason why the .prepare() and .finish() methods in 'struct
    platform_suspend_ops' should take any arguments, since architectures don't use
    these methods' argument in any practically meaningful way (ie. either the
    target system sleep state is conveyed to the platform by .set_target(), or
    there is only one suspend state supported and it is indicated to the PM core
    by .valid(), or .prepare() and .finish() aren't defined at all).  There also
    is no reason why .finish() should return any result.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Cc: Len Brown
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The name of 'struct pm_ops' suggests that it is related to the power
    management in general, but in fact it is only related to suspend.  Moreover,
    its name should indicate what this structure is used for, so it seems
    reasonable to change it to 'struct platform_suspend_ops'.  In that case, the
    name of the global variable of this type used by the PM core and the names of
    related functions should be changed accordingly.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Cc: Len Brown
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Move the definition of 'struct pm_ops' and related functions from
    to .

    There are, at least, the following reasons to do that:
    * 'struct pm_ops' is specifically related to suspend and not to the power
    management in general.
    * As long as 'struct pm_ops' is defined in , any modification of it
    causes the entire kernel to be recompiled, which is unnecessary and annoying.
    * Some suspend-related features are already defined in , so it
    is logical to move the definition of 'struct pm_ops' into there.
    * 'struct hibernation_ops', being the hibernation-related counterpart of
    'struct pm_ops', is defined in .

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Cc: Len Brown
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

18 Oct, 2007

20 commits

  • Convert ext4_extent_idx.ei_leaf to ext4_extent_idx.ei_leaf_lo
    This helps in finding BUGs due to direct partial access of
    these split 48 bit values.

    Signed-off-by: Aneesh Kumar K.V

    Aneesh Kumar K.V
     
  • Convert ext4_extent.ee_start to ext4_extent.ee_start_lo
    This helps in finding BUGs due to direct partial access of
    these split 48 bit values

    Also fix direct partial access in ext4 code
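
    To show what "direct partial access" means, here is a userspace sketch of
    a split 48 bit value. The structure and helper names (extent_on_disk,
    ext_pblock, ext_store_pblock) are hypothetical, and the fields are kept in
    host byte order for brevity, whereas the on-disk fields are little-endian.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of an extent's start block fields. */
struct extent_on_disk {
    uint32_t ee_start_lo;   /* low 32 bits of the physical block */
    uint16_t ee_start_hi;   /* high 16 bits of the physical block */
};

/* Combine both halves; reading ee_start_lo alone would truncate the value. */
static uint64_t ext_pblock(const struct extent_on_disk *ex)
{
    return (uint64_t)ex->ee_start_lo | ((uint64_t)ex->ee_start_hi << 32);
}

/* Split a block number into both halves; it must fit in 48 bits. */
static void ext_store_pblock(struct extent_on_disk *ex, uint64_t pb)
{
    assert(pb < (1ULL << 48));
    ex->ee_start_lo = (uint32_t)(pb & 0xffffffffULL);
    ex->ee_start_hi = (uint16_t)(pb >> 32);
}
```

    Once the field carries the _lo suffix, any code that still reads it
    directly instead of going through a helper stands out in review.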

    Signed-off-by: Aneesh Kumar K.V

    Aneesh Kumar K.V
     
  • Convert s_r_blocks_count and s_free_blocks_count to
    s_r_blocks_count_lo and s_free_blocks_count_lo

    This helps in finding BUGs due to direct partial access of
    these split 64 bit values

    Also fix direct partial access in ext4 code

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     
  • Convert s_blocks_count to s_blocks_count_lo
    This helps in finding BUGs due to direct partial access of
    these split 64 bit values

    Also fix direct partial access in ext4 code

    Signed-off-by: Aneesh Kumar K.V

    Aneesh Kumar K.V
     
  • Convert bg_inode_bitmap and bg_inode_table to bg_inode_bitmap_lo
    and bg_inode_table_lo. This helps in finding BUGs due to
    direct partial access of these split 64 bit values

    Also fix one direct partial access

    Signed-off-by: Aneesh Kumar K.V

    Aneesh Kumar K.V
     
  • Convert bg_block_bitmap to bg_block_bitmap_lo
    This helps in catching some BUGS due to direct
    partial access of these split fields.

    Signed-off-by: Aneesh Kumar K.V

    Aneesh Kumar K.V
     
  • This feature relaxes check restrictions on where each block group's
    metadata is located within the storage media. This allows for the allocation
    of bitmaps or inode tables outside the block group boundaries in cases
    where bad blocks force us to look for new blocks which the owning block
    group cannot satisfy. This will also allow for new metadata allocation
    schemes to improve performance and scalability.

    Signed-off-by: Jose R. Santos
    Cc:
    Signed-off-by: Andrew Morton

    Jose R. Santos
     
  • Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton

    Aneesh Kumar K.V
     
  • In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
    regardless of whether it is in use. This is the most time consuming part
    of the filesystem check. The uninitialized block group feature can greatly
    reduce e2fsck time by eliminating checking of uninitialized inodes.

    With this feature, there is a high water mark of used inodes for each block
    group. Block and inode bitmaps can be uninitialized on disk via a flag in the
    group descriptor to avoid reading or scanning them at e2fsck time. A checksum
    of each group descriptor is used to ensure that corruption in the group
    descriptor's bit flags does not cause incorrect operation.

    The feature is enabled through a mkfs option

    mke2fs /dev/ -O uninit_groups

    A patch adding support for uninitialized block groups to e2fsprogs tools has
    been posted to the linux-ext4 mailing list.

    The patches have been stress tested with fsstress and fsx. In performance
    tests of e2fsck time, we have seen that e2fsck time on ext3 grows
    linearly with the total number of inodes in the filesystem. In ext4 with the
    uninitialized block groups feature, the e2fsck time no longer grows with the
    total inode count and depends solely on the number of used inodes.
    Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
    greatly reduce e2fsck time for users, with a performance improvement of 2-20
    times, depending on how full the filesystem is.

    The attached graph shows the major improvements in e2fsck times in filesystems
    with a large total inode count, but few inodes in use.

    In each group descriptor if we have

    EXT4_BG_INODE_UNINIT set in bg_flags:
    Inode table is not initialized/used in this group. So we can skip
    the consistency check during fsck.
    EXT4_BG_BLOCK_UNINIT set in bg_flags:
    No block in the group is used. So we can skip the block bitmap
    verification for this group.

    We also add two new fields to group descriptor as a part of
    uninitialized group patch.

    __le16 bg_itable_unused; /* Unused inodes count */
    __le16 bg_checksum; /* crc16(sb_uuid+group+desc) */

    bg_itable_unused:

    If we have EXT4_BG_INODE_UNINIT not set in bg_flags
    then bg_itable_unused will give the offset within
    the inode table till the inodes are used. This can be
    used by fsck to skip list of inodes that are marked unused.

    bg_checksum:
    Now that we depend on bg_flags and bg_itable_unused to determine
    the block and inode usage, we need to make sure group descriptor
    is not corrupt. We add checksum to group descriptor to
    detect corruption. If the descriptor is found to be corrupt, we
    mark all the blocks and inodes in the group used.
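
    The decisions described above can be sketched from the point of view of a
    checker. This is an illustration under assumptions, not e2fsck code: the
    group_desc_view structure and both helpers are hypothetical, the checksum
    result is passed in as a boolean rather than computed, bg_itable_unused is
    read as a count of unused inodes at the end of the inode table (as its
    field comment suggests), and the flag values are assumed to match the ext4
    headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXT4_BG_INODE_UNINIT 0x0001   /* inode table/bitmap not in use */
#define EXT4_BG_BLOCK_UNINIT 0x0002   /* block bitmap not in use */

/* Hypothetical in-memory view of the fields discussed above. */
struct group_desc_view {
    uint16_t bg_flags;
    uint16_t bg_itable_unused;   /* unused inodes count */
    bool     checksum_ok;        /* result of verifying bg_checksum */
};

/* How many inodes of this group does the checker need to scan? */
static uint32_t inodes_to_scan(const struct group_desc_view *gd,
                               uint32_t inodes_per_group)
{
    assert(gd != NULL);
    if (!gd->checksum_ok)
        return inodes_per_group;          /* corrupt: treat everything as used */
    if (gd->bg_flags & EXT4_BG_INODE_UNINIT)
        return 0;                         /* inode table never initialized */
    if (gd->bg_itable_unused > inodes_per_group)
        return inodes_per_group;          /* implausible count: scan all */
    return inodes_per_group - gd->bg_itable_unused;
}

/* Does the block bitmap of this group need to be verified? */
static bool must_check_block_bitmap(const struct group_desc_view *gd)
{
    assert(gd != NULL);
    return !gd->checksum_ok || !(gd->bg_flags & EXT4_BG_BLOCK_UNINIT);
}
```

    The checksum is consulted first in both helpers, so a corrupted flag can
    only make the checker do more work, never less.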

    Signed-off-by: Avantika Mathur
    Signed-off-by: Andreas Dilger
    Signed-off-by: Mingming Cao
    Signed-off-by: Aneesh Kumar K.V

    Andreas Dilger
     
  • CONFIG_EXT4_INDEX is not an exposed config option in the kernel, and it is
    unconditionally defined in ext4_fs.h. tune2fs is already able to turn off
    dir indexing, so at this point it's just cluttering up the code. Remove
    it.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton

    Eric Sandeen
     
  • Fragment support in ext2/3/4 was never implemented, and it probably will
    never be implemented. So remove it from ext4.

    Signed-off-by: Coly Li
    Acked-by: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: "Theodore Ts'o"

    Coly Li
     
  • change JBD_XXX macros to JBD2_XXX in JBD2/Ext4

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Mingming Cao
     
  • This patch cleans up jbd_kmalloc and replaces it with kmalloc directly

    Signed-off-by: Mingming Cao

    Mingming Cao
     
  • This patch cleans up jbd_kmalloc and replaces it with kmalloc directly

    Signed-off-by: Mingming Cao

    Mingming Cao
     
  • JBD2: Replace slab allocations with page allocations

    JBD2 allocates memory for committed_data and frozen_data from slab. However,
    JBD2 should not pass slab pages down to the block layer. Use page allocator
    pages instead. This will also prepare JBD2 for the large blocksize patchset.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Mingming Cao

    Mingming Cao
     
  • JBD: Replace slab allocations with page allocations

    JBD allocates memory for committed_data and frozen_data from slab. However,
    JBD should not pass slab pages down to the block layer. Use page allocator
    pages instead. This will also prepare JBD for the large blocksize patchset.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Mingming Cao

    Mingming Cao
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    9p: remove sysctl
    9p: fix bad kconfig cross-dependency
    9p: soften invalidation in loose_mode
    9p: attach-per-user
    9p: rename uid and gid parameters
    9p: define session flags
    9p: Make transports dynamic

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc:
    net: libertas sdio driver
    mmc: at91_mci: cleanup: use MCI_ERRORS
    mmc: possible leak in mmc_read_ext_csd

    Linus Torvalds
     
  • Add driver for Marvell's Libertas 8385 and 8686 wifi chips.

    Signed-off-by: Pierre Ossman
    Acked-by: Dan Williams

    Pierre Ossman
     
  • * ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86: (114 commits)
    x86: delete vsyscall files during make clean
    kbuild: fix typo SRCARCH in find_sources
    x86: fix kernel rebuild due to vsyscall fallout
    .gitignore update for x86 arch
    x86: unify include/asm/debugreg_32/64.h
    x86: unify include/asm/unwind_32/64.h
    x86: unify include/asm/types_32/64.h
    x86: unify include/asm/tlb_32/64.h
    x86: unify include/asm/siginfo_32/64.h
    x86: unify include/asm/bug_32/64.h
    x86: unify include/asm/mman_32/64.h
    x86: unify include/asm/agp_32/64.h
    x86: unify include/asm/kdebug_32/64.h
    x86: unify include/asm/ioctls_32/64.h
    x86: unify include/asm/floppy_32/64.h
    x86: apply missing DMA/OOM prevention to floppy_32.h
    x86: unify include/asm/cache_32/64.h
    x86: unify include/asm/cache_32/64.h
    x86: unify include/asm/dmi_32/64.h
    x86: unify include/asm/delay_32/64.h
    ...

    Linus Torvalds