23 Mar, 2006

11 commits

  • Set the family field in the xt_[matches|targets] being registered.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     
  • Currently the first conntrack ID assigned is 2; use 1 instead.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     
  • Fix an oversized message: use NLMSG_SPACE just once, since it reserves
    space for the netlink header, and NFA_SPACE for every attribute.

    Thanks to Harald Welte for the feedback.
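
    A hedged illustration of the sizing rule (the attribute payloads below
    are made up for the example):

        /* NLMSG_SPACE is counted once for the whole message: it covers
         * the netlink header plus the aligned payload. Each attribute
         * then adds its own NFA_SPACE. */
        size_t len = NLMSG_SPACE(sizeof(struct nfgenmsg))
                   + NFA_SPACE(sizeof(u_int32_t))    /* e.g. CTA_STATUS  */
                   + NFA_SPACE(sizeof(u_int32_t));   /* e.g. CTA_TIMEOUT */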

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     
  • The expectation mask has some particularities that require different
    handling. The protocol number fields can be set to invalid protocols,
    i.e. l3num is set to 0xFFFF. Since that protocol does not exist, the
    mask tuple will not be dumped. Moreover, this results in a kernel panic
    when nf_conntrack accesses the array of protocol handlers, which is
    PF_MAX (0x1F) entries long.

    This patch introduces the function ctnetlink_exp_dump_mask(), which
    correctly dumps the expectation mask. This function uses the l3num value
    from the expectation tuple, which is a valid layer 3 protocol number.
    The value of the l3num mask isn't dumped, since it is meaningless from
    the userspace side.

    Thanks to Yasuyuki Kozakai and Patrick McHardy for the feedback.
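
    A hedged sketch of the idea (helper names as in the nf_conntrack code
    of that era; error handling elided):

        static int ctnetlink_exp_dump_mask(struct sk_buff *skb,
                        const struct nf_conntrack_tuple *tuple,
                        const struct nf_conntrack_tuple *mask)
        {
                /* Look up the l3 handler via the tuple's valid l3num,
                 * never via the mask's bogus one (e.g. 0xFFFF). */
                struct nf_conntrack_l3proto *l3proto;

                l3proto = nf_ct_l3proto_find_get(tuple->src.l3num);
                /* ... dump mask->src/mask->dst through l3proto ... */
                nf_ct_l3proto_put(l3proto);
                return 0;
        }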

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     
  • Signed-off-by: Thomas Vögtle
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Thomas Vögtle
     
  • do_ipv6_getsockopt returns -EINVAL for unknown options, not
    -ENOPROTOOPT as do_ipv6_setsockopt does.
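
    A minimal sketch of the fix, assuming the usual switch over optname:

        switch (optname) {
        /* ... known options ... */
        default:
                return -ENOPROTOOPT;    /* was -EINVAL */
        }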

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/perex/alsa: (124 commits)
    [ALSA] version 1.0.11rc4
    [PATCH] Intruduce DMA_28BIT_MASK
    [ALSA] hda-codec - Add support for ASUS P4GPL-X
    [ALSA] hda-codec - Add support for HP nx9420 laptop
    [ALSA] Fix memory leaks in error path of control.c
    [ALSA] AMD Au1x00: AC'97 controller is memory mapped
    [ALSA] AMD Au1x00: fix DMA init/cleanup
    [ALSA] hda-codec - Fix generic auto-configurator
    [ALSA] hda-codec - Fix BIOS auto-configuration
    [ALSA] Fixes typos in Audiophile-USB.txt
    [ALSA] ice1712 - typo fixes for dxr_enable module option
    [ALSA] AMD Au1x00: make driver build after cleanup
    [ALSA] ice1712 - Fix wrong value types for enum items
    [ALSA] fix resource leak in usbmixer
    [ALSA] Fix gus_pcm dereference before NULL
    [ALSA] Fix seq_clientmgr dereferences before NULL check
    [ALSA] hda-codec - Fix for Samsung R65 and ASUS A6J
    [ALSA] hda-codec - Add support for VAIO FE550G and SZ110
    [ALSA] usb-audio: add Maya44 mixer control names
    [ALSA] usb-audio: add Casio PL-40R support
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    fixed path to moved file in include/linux/device.h
    Fix spelling in E1000_DISABLE_PACKET_SPLIT Kconfig description
    Documentation/dvb/get_dvb_firmware: fix firmware URL
    Documentation: Update to BUG-HUNTING
    Remove superfluous NOTIFY_COOKIE_LEN define
    add "tags" to .gitignore
    Fix "frist", "fisrt", typos
    fix rwlock usage example
    It's UTF-8

    Linus Torvalds
     
  • * master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
    [SPARC64]: Add a secondary TSB for hugepage mappings.
    [SPARC]: Respect vm_page_prot in io_remap_page_range().

    Linus Torvalds
     
  • * master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
    [TG3]: Bump driver version and reldate.
    [TG3]: Skip phy power down on some devices
    [TG3]: Fix SRAM access during tg3_init_one()
    [X25]: dte facilities 32 64 ioctl conversion
    [X25]: allow ITU-T DTE facilities for x25
    [X25]: fix kernel error message 64 bit kernel
    [X25]: ioctl conversion 32 bit user to 64 bit kernel
    [NET]: socket timestamp 32 bit handler for 64 bit kernel
    [NET]: allow 32 bit socket ioctl in 64 bit kernel
    [BLUETOOTH]: Return negative error constant

    Linus Torvalds
     
  • * master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (138 commits)
    [SCSI] libata: implement minimal transport template for ->eh_timed_out
    [SCSI] eliminate rphy allocation in favour of expander/end device allocation
    [SCSI] convert mptsas over to end_device/expander allocations
    [SCSI] allow displaying and setting of cache type via sysfs
    [SCSI] add scsi_mode_select to scsi_lib.c
    [SCSI] 3ware 9000 add big endian support
    [SCSI] qla2xxx: update MAINTAINERS
    [SCSI] scsi: move target_destroy call
    [SCSI] fusion - bump version
    [SCSI] fusion - expander hotplug suport in mptsas module
    [SCSI] fusion - exposing raid components in mptsas
    [SCSI] fusion - memory leak, and initializing fields
    [SCSI] fusion - exclosure misspelled
    [SCSI] fusion - cleanup mptsas event handling functions
    [SCSI] fusion - removing target_id/bus_id from the VirtDevice structure
    [SCSI] fusion - static fix's
    [SCSI] fusion - move some debug firmware event debug msgs to verbose level
    [SCSI] fusion - loginfo header update
    [SCSI] add scsi_reprobe_device
    [SCSI] megaraid_sas: fix extended timeout handling
    ...

    Linus Torvalds
     

22 Mar, 2006

29 commits

  • Add a slab cache for the SELinux inode security struct, one of which is
    allocated for every inode instantiated by the system.

    The memory savings are considerable.

    On 64-bit, instead of the size-128 cache, we have a slab object of 96
    bytes, saving 32 bytes per object. After booting, I see about 4000 of
    these and then about 17,000 after a kernel compile. With this patch, we
    save around 530KB of kernel memory in the latter case. On 32-bit, the
    savings are about half of this.
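
    A hedged sketch of the mechanism (2.6.16-era slab API; the init
    function name is hypothetical):

        static kmem_cache_t *sel_inode_cache;

        static void __init init_sel_inode_cache(void)
        {
                /* A dedicated cache sized to the struct replaces
                 * allocations from the generic size-128 cache. */
                sel_inode_cache = kmem_cache_create("selinux_inode_security",
                                sizeof(struct inode_security_struct),
                                0, SLAB_PANIC, NULL, NULL);
        }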

    Signed-off-by: James Morris
    Acked-by: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morris
     
  • Remove an unneeded pointer variable in selinux_inode_init_security().

    Signed-off-by: James Morris
    Acked-by: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morris
     
  • A further fix is needed for selinuxfs link count management, to ensure that
    the count is correct for the parent directory when a subdirectory is
    created. This is only required for the root directory currently, but the
    code has been updated for the general case.
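
    A minimal sketch of the accounting rule being enforced:

        /* A new subdirectory contributes a ".." entry, so the parent's
         * link count must grow along with it (2.6.16-era direct
         * manipulation of i_nlink). */
        inode->i_nlink++;       /* the new directory's own "." */
        dir->i_nlink++;         /* the parent gains a link from ".." */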

    Signed-off-by: James Morris
    Acked-by: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morris
     
  • Fix a copy & paste error in sel_make_avc_files(), removing a spurious
    call to d_genocide() in the error path. All of this will be cleaned up
    by kill_litter_super().

    Signed-off-by: James Morris
    Acked-by: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morris
     
  • Remove the call to sel_make_bools() from sel_fill_super(), as policy needs to
    be loaded before the boolean files can be created. Policy will never be
    loaded during sel_fill_super() as selinuxfs is kernel mounted during init and
    the only means to load policy is via selinuxfs.

    Also, the call to d_genocide() on the error path of sel_make_bools() is
    incorrect and replaced with sel_remove_bools().

    Signed-off-by: James Morris
    Acked-by: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morris
     
  • Unify the error path of sel_fill_super() so that all errors pass through
    the same point and generate an error message. Also remove a spurious
    dput() in the error path which breaks the refcounting for the filesystem
    (kill_litter_super() will correctly clean things up itself on error).

    Signed-off-by: James Morris
    Acked-by: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morris
     
  • Use existing sel_make_dir() helper to create booleans directory rather than
    duplicating the logic.

    Signed-off-by: James Morris
    Acked-by: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morris
     
  • Fix the hard link count for selinuxfs directories, which are currently one
    short.

    Signed-off-by: James Morris
    Acked-by: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morris
     
  • Simplify sel_read_bool to use the simple_read_from_buffer helper, like the
    other selinuxfs functions.
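
    A hedged sketch of the resulting shape ('page' and 'length' stand for
    the formatted value, as in the surrounding code):

        static ssize_t sel_read_bool(struct file *filep, char __user *buf,
                                     size_t count, loff_t *ppos)
        {
                /* ... format the boolean's value into 'page' ... */
                return simple_read_from_buffer(buf, count, ppos,
                                               page, length);
        }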

    Signed-off-by: Stephen Smalley
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.
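
    A hedged sketch of the mechanical pattern the scripts apply:

        struct semaphore sem;                   /* before */
        init_MUTEX(&sem);
        down(&sem);
        /* ... critical section ... */
        up(&sem);

        struct mutex lock;                      /* after */
        mutex_init(&lock);
        mutex_lock(&lock);
        /* ... critical section ... */
        mutex_unlock(&lock);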

    Signed-off-by: Ingo Molnar
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • This patch disables the automatic labeling of new inodes on disk
    when no policy is loaded.

    Discussion is here:
    https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=180296

    In short, we're changing the behavior so that when no policy is loaded,
    SELinux does not label files at all. Currently it does add an 'unlabeled'
    label in this case, which we've found causes problems later.

    SELinux always maintains a safe internal label if there is none, so with this
    patch, we just stick with that and wait until a policy is loaded before adding
    a persistent label on disk.

    The effect is simply that if you boot with SELinux enabled but no policy
    loaded and create a file in that state, SELinux won't try to set a security
    extended attribute on the new inode on the disk. This is the only sane
    behavior for SELinux in that state, as it cannot determine the right label to
    assign in the absence of a policy. That state usually doesn't occur, but the
    rawhide installer seemed to be misbehaving temporarily so it happened to show
    up on a test install.
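
    A minimal sketch of the behavior change, assuming the usual
    ss_initialized flag:

        /* No policy loaded yet: skip on-disk labeling entirely; the
         * caller treats -EOPNOTSUPP as "do not set the xattr". */
        if (!ss_initialized)
                return -EOPNOTSUPP;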

    Signed-off-by: Stephen Smalley
    Acked-by: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • Centralize the page migration functions in anticipation of additional
    tinkering. Creates a new file, mm/migrate.c:

    1. Extract buffer_migrate_page() from fs/buffer.c

    2. Extract central migration code from vmscan.c

    3. Extract some components from mempolicy.c

    4. Export pageout() and remove_from_swap() from vmscan.c

    5. Make it possible to configure NUMA systems without page migration
    and non-NUMA systems with page migration.

    I had to do some #ifdeffing in mempolicy.c that may need a cleanup.
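
    A hedged sketch of point 5 (the signature matches the era but is shown
    here only for illustration):

        #ifdef CONFIG_MIGRATION
        extern int migrate_pages(struct list_head *from, struct list_head *to,
                        struct list_head *moved, struct list_head *failed);
        #else
        /* Non-NUMA (or migration-less) builds get a stub. */
        static inline int migrate_pages(struct list_head *from,
                        struct list_head *to, struct list_head *moved,
                        struct list_head *failed)
        {
                return -ENOSYS;
        }
        #endif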

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The alien cache rotor in mm/slab.c assumes that the first online node is
    node 0. Eventually for some archs, especially with hotplug, this will no
    longer be true.

    Fix the interleave rotor to handle the general case of node numbering.
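
    A hedged sketch of the corrected rotor, using the nodemask helpers
    (the wrapper name is hypothetical):

        static int next_rotor_node(int node)
        {
                /* Step to the next online node, wrapping around;
                 * nothing here assumes nodes are numbered 0..n-1. */
                node = next_node(node, node_online_map);
                if (node == MAX_NUMNODES)
                        node = first_node(node_online_map);
                return node;
        }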

    Signed-off-by: Paul Jackson
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Fix bogus node loop in hugetlb.c alloc_fresh_huge_page(), which was
    assuming that nodes are numbered contiguously from 0 to num_online_nodes().
    Once the hotplug folks get this far, that will be false.

    Signed-off-by: Paul Jackson
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • When we've allocated SWAPFILE_CLUSTER pages, ->cluster_next should be
    the first index of the swap cluster. But the current code sets it to
    the wrong offset.
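
    A minimal sketch of the corrected assignment, with 'offset' being the
    last page handed out from the cluster:

        /* Point cluster_next back at the first slot of the cluster. */
        si->cluster_next = offset - SWAPFILE_CLUSTER + 1;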

    Signed-off-by: Akinobu Mita
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • 1. Only disable interrupts if there is actually something to free

    2. Only dirty the pcp cacheline if we actually freed something.

    3. Disable interrupts for each single pcp and not for cleaning
    all the pcps in all zones of a node.

    drain_node_pages is called every 2 seconds from cache_reap. This
    fix should avoid most disabling of interrupts.
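
    A hedged sketch of the reworked loop ('this_cpu_pcp' is a hypothetical
    accessor for this cpu's list):

        struct zone *zone;
        unsigned long flags;

        for_each_zone(zone) {
                struct per_cpu_pages *pcp = this_cpu_pcp(zone);

                if (!pcp->count)
                        continue;               /* points 1 and 2 */
                local_irq_save(flags);          /* point 3: per pcp */
                free_pages_bulk(zone, pcp->count, &pcp->list, 0);
                pcp->count = 0;
                local_irq_restore(flags);
        }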

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The list_lock also protects the shared array, and we call drain_array()
    with the shared array. Therefore we cannot go as far as I wanted to,
    but have to take the lock in a way that also protects the array_cache
    in drain_pages.

    (Note: maybe we should make the array_cache locking more consistent? I.e.
    always take the array cache lock for shared arrays and disable interrupts
    for the per cpu arrays?)

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove drain_array_locked() and use that opportunity to further limit
    the time the l3 lock is held.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add a parameter to drain_array() to control the freeing of all objects,
    then use it to replace instances of drain_array_locked(). Doing so
    avoids taking locks in those locations if the arrays are empty.
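
    A hedged sketch of the new entry point (signature as described above;
    the body is abridged):

        static void drain_array(struct kmem_cache *cachep,
                        struct kmem_list3 *l3, struct array_cache *ac,
                        int force, int node)
        {
                if (!ac || !ac->avail)
                        return;         /* empty: no lock taken at all */
                /* ... take l3->list_lock and free some (or, with
                 * 'force', all) of the cached objects ... */
        }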

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • cache_reap takes the l3->list_lock (disabling interrupts) unconditionally
    and then does a few checks and maybe does some cleanup. This patch makes
    cache_reap() only take the lock if there is work to do and then the lock is
    taken and released for each cleaning action.

    The check of when to do the next reaping is done without any locking
    and becomes racy. This should not matter, since reaping can also be
    skipped if the slab mutex cannot be acquired.

    The same is true for the "touched" processing. If we get this wrong
    once in a while, we will mistakenly clean or not clean the shared
    cache. This will impact performance slightly.

    Note that the additional drain_array() function introduced here will fall
    out in a subsequent patch since array cleaning will now be very similar
    from all callers.
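
    A hedged sketch of the resulting shape (fields as in the 2.6.16 slab
    allocator; the cleanup step is abridged):

        /* The check runs without the lock and is knowingly racy. */
        if (l3->free_touched) {
                l3->free_touched = 0;
        } else {
                spin_lock_irq(&l3->list_lock);
                /* ... free a batch of unused slabs ... */
                spin_unlock_irq(&l3->list_lock);
        }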

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make shrink_all_memory() repeat the attempts to free more memory if
    there seem to be no pages left to free.
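
    A hedged sketch of the retry loop (the inner pass is hypothetical
    shorthand for the existing shrinking code):

        unsigned long total = 0, freed;

        do {
                freed = shrink_pass(nr_pages - total);  /* hypothetical */
                total += freed;
        } while (freed && total < nr_pages);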

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • follow_hugetlb_page() walks a range of user virtual addresses and fills
    a list of struct page pointers into an array passed in the argument
    list. It also takes a reference count via get_page(). For a compound
    page, get_page() actually traverses back to the head page via the
    page_private() macro and then adds a reference count to the head page.
    Since we are doing a virt-to-pte lookup, the kernel already has a
    struct page pointer to the head page. So instead of descending into the
    small unit page struct and then following a link back to the head page,
    optimize this by incrementing the reference count directly on the head
    page, as sketched below.
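
    A hedged sketch ('pte', 'pages', 'i' and 'pfn_offset' as in the
    surrounding loop):

        struct page *page = pte_page(*pte);     /* already the head page */

        get_page(page);                         /* no page_private() chase */
        pages[i] = page + pfn_offset;           /* still return the subpage */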

    The benefit is that we don't take a cache miss on accessing the page
    struct for the corresponding user address and, more importantly, we
    don't pollute the cache with a "not very useful" round trip of pointer
    chasing. This yields a moderate performance gain on an I/O-intensive
    database transaction workload.

    Signed-off-by: Ken Chen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • Implementation of hugetlbfs_counter() is functionally equivalent to
    atomic_inc_return(). Use the simpler atomic form.
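
    A minimal sketch of the simplification (the counter name is
    hypothetical):

        static atomic_t ino_counter;

        static unsigned long hugetlbfs_counter(void)
        {
                /* One atomic op replaces the lock-protected increment. */
                return atomic_inc_return(&ino_counter);
        }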

    Signed-off-by: Ken Chen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • Quite a long time back, prepare_hugepage_range() replaced
    is_aligned_hugepage_range() as the callback from mm/mmap.c to arch code to
    verify if an address range is suitable for a hugepage mapping.
    is_aligned_hugepage_range() stuck around, but only to implement
    prepare_hugepage_range() on archs which didn't implement their own.

    Most archs (everything except ia64 and powerpc) used the same
    implementation of is_aligned_hugepage_range(). On powerpc, which
    implements its own prepare_hugepage_range(), the custom version was never
    used.

    In addition, "is_aligned_hugepage_range()" was a bad name, because it
    suggests it returns true iff the given range is a good hugepage range,
    whereas in fact it returns 0-or-error (so the sense is reversed).

    This patch cleans up by abolishing is_aligned_hugepage_range(). Instead
    prepare_hugepage_range() is defined directly. Most archs use the default
    version, which simply checks the given region is aligned to the size of a
    hugepage. ia64 and powerpc define custom versions. The ia64 one simply
    checks that the range is in the correct address space region in addition to
    being suitably aligned. The powerpc version (just as previously) checks
    for suitable addresses, and if necessary performs low-level MMU frobbing to
    set up new areas for use by hugepages.
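
    A hedged sketch of the generic default described above:

        static inline int prepare_hugepage_range(unsigned long addr,
                                                 unsigned long len)
        {
                /* The region must be aligned to the hugepage size. */
                if (len & ~HPAGE_MASK)
                        return -EINVAL;
                if (addr & ~HPAGE_MASK)
                        return -EINVAL;
                return 0;
        }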

    No libhugetlbfs testsuite regressions on ppc64 (POWER5 LPAR).

    Signed-off-by: David Gibson
    Signed-off-by: Zhang Yanmin
    Cc: "David S. Miller"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • The optional hugepage callback, hugetlb_free_pgd_range(), is presently
    implemented non-trivially only on ia64 (but I plan to add one for
    powerpc shortly). Its prototype currently lives in asm-ia64/pgtable.h.
    However, since the function is called from generic code, it makes sense
    for its prototype to be in the generic hugetlb.h header file, as the
    prototypes of the other arch callbacks already are
    (prepare_hugepage_range(), set_huge_pte_at(), etc.). This patch makes
    it so.
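
    A hedged sketch of the prototype as it would sit in the generic header
    (signature matching the era's free_pgd_range()):

        /* include/linux/hugetlb.h */
        void hugetlb_free_pgd_range(struct mmu_gather **tlb,
                        unsigned long addr, unsigned long end,
                        unsigned long floor, unsigned long ceiling);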

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • Turns out the hugepage logic in free_pgtables() was doubly broken. The
    loop coalescing multiple normal page VMAs into one call to free_pgd_range()
    had an off by one error, which could mean it would coalesce one hugepage
    VMA into the same bundle (checking 'vma' not 'next' in the loop). I
    transferred this bug into the new is_vm_hugetlb_page() based version.
    Here's the fix.
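
    A hedged sketch of the one-line change in the free_pgtables() loop:

        while (next && next->vm_start <= vma->vm_end + PMD_SIZE
                        && !is_vm_hugetlb_page(next)) {  /* was: vma */
                /* ... coalesce 'next' into this free_pgd_range() ... */
        }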

    This one didn't bite on powerpc previously for the same reason the
    is_hugepage_only_range() problem didn't: powerpc's hugetlb_free_pgd_range()
    is identical to free_pgd_range(). It didn't bite on ia64 because the
    hugepage region is distant enough from any other region that the separated
    PMD_SIZE distance test would always prevent coalescing the two together.

    No libhugetlbfs testsuite regressions (ppc64, POWER5).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • free_pgtables() has special logic to call hugetlb_free_pgd_range()
    instead of the normal free_pgd_range() on hugepage VMAs. However, the
    test it uses to do so is incorrect: it calls is_hugepage_only_range()
    on a hugepage-sized range at the start of the vma.
    is_hugepage_only_range() will return true if the given range has any
    intersection with a hugepage address region, and in this case the given
    region need not be hugepage aligned. So, for example, this test can
    return true if called on, say, a 4k VMA immediately preceding a (nicely
    aligned) hugepage VMA.

    At present we get away with this because the powerpc version of
    hugetlb_free_pgd_range() is just a call to free_pgd_range(). On ia64 (the
    only other arch with a non-trivial is_hugepage_only_range()) we get away
    with it for a different reason; the hugepage area is not contiguous with
    the rest of the user address space, and VMAs are not permitted in between,
    so the test can't return a false positive there.

    Nonetheless this should be fixed. We do that in the patch below by
    replacing the is_hugepage_only_range() test with an explicit test of the
    VMA using is_vm_hugetlb_page().

    This in turn changes behaviour for platforms where is_hugepage_only_range()
    returns false always (everything except powerpc and ia64). We address this
    by ensuring that hugetlb_free_pgd_range() is defined to be identical to
    free_pgd_range() (instead of a no-op) on everything except ia64. Even so,
    it will prevent some otherwise possible coalescing of calls down to
    free_pgd_range(). Since this only happens for hugepage VMAs, removing this
    small optimization seems unlikely to cause any trouble.
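
    A minimal sketch of the arrangement for everything except ia64:

        /* The hugepage variant is simply the normal path. */
        #define hugetlb_free_pgd_range  free_pgd_range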

    This patch causes no regressions on the libhugetlbfs testsuite - ppc64
    POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • Originally, mm/hugetlb.c just handled the hugepage physical allocation
    path, and its {alloc,free}_huge_page() functions were used from the
    arch-specific hugepage code. These days those functions are only used
    within mm/hugetlb.c itself. Therefore, this patch makes them static and
    removes their prototypes from hugetlb.h. This requires a small
    rearrangement of code in mm/hugetlb.c to avoid a forward declaration.

    This patch causes no regressions on the libhugetlbfs testsuite (ppc64,
    POWER5).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • These days, hugepages are demand-allocated at first fault time. There's a
    somewhat dubious (and racy) heuristic when making a new mmap() to check if
    there are enough available hugepages to fully satisfy that mapping.

    A particularly obvious case where the heuristic breaks down is where a
    process maps its hugepages not as a single chunk, but as a bunch of
    individually mmap()ed (or shmat()ed) blocks without touching and
    instantiating the pages in between allocations. In this case the size of
    each block is compared against the total number of available hugepages.
    It's thus easy for the process to become overcommitted, because each block
    mapping will succeed, although the total number of hugepages required by
    all blocks exceeds the number available. In particular, this defeats such
    a program which will detect a mapping failure and adjust its hugepage usage
    downward accordingly.

    The patch below addresses this problem by strictly reserving a number
    of physical hugepages for hugepage inodes which have been mapped, but
    not instantiated, as sketched below. MAP_SHARED mappings are thus
    "safe" - they will fail at mmap() time, not later with an OOM SIGKILL.
    MAP_PRIVATE mappings can still trigger an OOM. (Actually SHARED
    mappings can technically still OOM, but only if the sysadmin explicitly
    reduces the hugepage pool between mapping and instantiation.)
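
    A hedged sketch of the accounting (the reservation counter name is
    illustrative):

        spin_lock(&hugetlb_lock);
        if (needed > free_huge_pages - reserved_huge_pages)
                ret = -ENOMEM;          /* fail at mmap(), not at fault */
        else
                reserved_huge_pages += needed;
        spin_unlock(&hugetlb_lock);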

    This patch appears to address the problem at hand - it allows DB2 to start
    correctly, for instance, which previously suffered the failure described
    above.

    This patch causes no regressions on the libhugetlbfs testsuite, and
    makes a test (designed to catch this problem) pass which previously
    failed (ppc64, POWER5).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson