06 Dec, 2007

1 commit

  • Creating PDEs with refcount 0 and "deleted" flag has problems (see below).
    Switch to usual scheme:
    * PDE is created with refcount 1
    * every de_get does +1
    * every de_put() and remove_proc_entry() do -1
    * once refcount reaches 0, PDE is freed.
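
The four rules above can be sketched in userspace C. This is a minimal illustration, not the kernel code: `struct entry`, `entry_create()`, `entry_get()`, `entry_put()` and the `entries_freed` counter are hypothetical names, and C11 atomics stand in for the kernel's atomic_t helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a proc entry; only the counter matters here. */
struct entry {
    atomic_int count;
};

/* Counts how many entries have been freed, so the behaviour is observable. */
static int entries_freed;

/* A new entry starts with one reference, owned by whoever created it. */
static struct entry *entry_create(void)
{
    struct entry *e = malloc(sizeof(*e));
    if (e)
        atomic_init(&e->count, 1);
    return e;
}

/* Every user that gets hold of the entry takes one more reference. */
static void entry_get(struct entry *e)
{
    assert(atomic_load(&e->count) > 0); /* caller must already hold a reference */
    atomic_fetch_add(&e->count, 1);
}

/*
 * Drop one reference. Both the "put" path and the "remove" path call this.
 * atomic_fetch_sub() returns the previous value, so exactly one caller sees
 * the transition 1 -> 0 and frees the entry, whatever the interleaving.
 * Returns true if this call freed the entry.
 */
static bool entry_put(struct entry *e)
{
    if (atomic_fetch_sub(&e->count, 1) == 1) {
        free(e);
        entries_freed++;
        return true;
    }
    return false;
}
```

Because the decrement and the zero test are a single atomic operation, there is no window between "read the count" and "act on it", which is the window both races below depend on.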

    This elegantly fixes at least the following two races (both observed)
    without introducing new locks, without abusing old locks, and without
    spreading lock_kernel():

    1) PDE leak

    remove_proc_entry                        de_put
    -----------------                        ------
                        [refcnt = 1]
    if (atomic_read(&de->count) == 0)
                                             if (atomic_dec_and_test(&de->count))
                                                     if (de->deleted)
                                                             /* also not taken! */
                                                             free_proc_entry(de);
    else
            de->deleted = 1;
                        [refcount=0, deleted=1]

    2) use after free

    remove_proc_entry                        de_put
    -----------------                        ------
                        [refcnt = 1]

                                             if (atomic_dec_and_test(&de->count))
    if (atomic_read(&de->count) == 0)
            free_proc_entry(de);
                                                     /* boom! */
                                                     if (de->deleted)
                                                             free_proc_entry(de);

    BUG: unable to handle kernel paging request at virtual address 6b6b6b6b
    printing eip: c10acdda *pdpt = 00000000338f8001 *pde = 0000000000000000
    Oops: 0000 [#1] PREEMPT SMP
    Modules linked in: af_packet ipv6 cpufreq_ondemand loop serio_raw psmouse k8temp hwmon sr_mod cdrom
    Pid: 23161, comm: cat Not tainted (2.6.24-rc2-8c0863403f109a43d7000b4646da4818220d501f #4)
    EIP: 0060:[] EFLAGS: 00210097 CPU: 1
    EIP is at strnlen+0x6/0x18
    EAX: 6b6b6b6b EBX: 6b6b6b6b ECX: 6b6b6b6b EDX: fffffffe
    ESI: c128fa3b EDI: f380bf34 EBP: ffffffff ESP: f380be44
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process cat (pid: 23161, ti=f380b000 task=f38f2570 task.ti=f380b000)
    Stack: c10ac4f0 00000278 c12ce000 f43cd2a8 00000163 00000000 7da86067 00000400
    c128fa20 00896b18 f38325a8 c128fe20 ffffffff 00000000 c11f291e 00000400
    f75be300 c128fa20 f769c9a0 c10ac779 f380bf34 f7bfee70 c1018e6b f380bf34
    Call Trace:
    [] vsnprintf+0x2ad/0x49b
    [] vscnprintf+0x14/0x1f
    [] vprintk+0xc5/0x2f9
    [] handle_fasteoi_irq+0x0/0xab
    [] do_IRQ+0x9f/0xb7
    [] preempt_schedule_irq+0x3f/0x5b
    [] need_resched+0x1f/0x21
    [] printk+0x1b/0x1f
    [] de_put+0x3d/0x50
    [] proc_delete_inode+0x38/0x41
    [] proc_delete_inode+0x0/0x41
    [] generic_delete_inode+0x5e/0xc6
    [] iput+0x60/0x62
    [] d_kill+0x2d/0x46
    [] dput+0xdc/0xe4
    [] __fput+0xb0/0xcd
    [] filp_close+0x48/0x4f
    [] sys_close+0x67/0xa5
    [] sysenter_past_esp+0x5f/0x85
    =======================
    Code: c9 74 0c f2 ae 74 05 bf 01 00 00 00 4f 89 fa 5f 89 d0 c3 85 c9 57 89 c7 89 d0 74 05 f2 ae 75 01 4f 89 f8 5f c3 89 c1 89 c8 eb 06 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 c3 90 90 90 57 83 c9
    EIP: [] strnlen+0x6/0x18 SS:ESP 0068:f380be44

    Also, remove broken usage of ->deleted from reiserfs: if sget() succeeds,
    module is already pinned and remove_proc_entry() can't happen => nobody
    can mark PDE deleted.

    Dummy proc root in netns code is not marked with refcount 1. AFAICS, we
    never get it, it's just for proper /proc/net removal. I double checked
    CLONE_NETNS continues to work.

    Patch survives many hours of modprobe/rmmod/cat loops without new bugs
    which can be attributed to refcounting.

    Signed-off-by: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

15 Nov, 2007

2 commits

  • This is not a new problem in 2.6.23-git17. 2.6.22/2.6.23 is buggy in the
    same way.

    Reiserfs could accumulate dirty sub-page-size files until umount time.
    They cannot be synced to disk by pdflush routines or explicit `sync'
    commands. Only `umount' can do the trick.

    The direct cause is: the dirty page's PG_dirty is wrongly _cleared_.
    Call trace:
    [] cancel_dirty_page+0xd0/0xf0
    [] :reiserfs:reiserfs_cut_from_item+0x660/0x710
    [] :reiserfs:reiserfs_do_truncate+0x271/0x530
    [] :reiserfs:reiserfs_truncate_file+0xfd/0x3b0
    [] :reiserfs:reiserfs_file_release+0x1e0/0x340
    [] __fput+0xcc/0x1b0
    [] fput+0x16/0x20
    [] filp_close+0x56/0x90
    [] sys_close+0xad/0x110
    [] system_call+0x7e/0x83

    Fix the bug by removing the cancel_dirty_page() call. Tests show that
    it causes no bad behaviors on various write sizes.

    === for the patient ===
    Here are more detailed demonstrations of the problem.

    1) the page has both PG_dirty(D)/PAGECACHE_TAG_DIRTY(d) after being written to;
    and then only PAGECACHE_TAG_DIRTY(d) remains after the file is closed.

    ------------------------------ screen 0 ------------------------------
    [T0] root /home/wfg# cat > /test/tiny
    [T1] hi
    [T2] root /home/wfg#

    ------------------------------ screen 1 ------------------------------
    [T1] root /home/wfg# echo /test/tiny > /proc/filecache
    [T1] root /home/wfg# cat /proc/filecache
    # file /test/tiny
    # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
    # idx len state refcnt
    0 1 ___UD__Bd_ 2
    [T2] root /home/wfg# cat /proc/filecache
    # file /test/tiny
    # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
    # idx len state refcnt
    0 1 ___U___Bd_ 2

    2) note the non-zero 'cancelled_write_bytes' after /tmp/hi is copied.

    ------------------------------ screen 0 ------------------------------
    [T0] root /home/wfg# echo hi > /tmp/hi
    [T1] root /home/wfg# cp /tmp/hi /dev/stdin /test
    [T2] hi
    [T3] root /home/wfg#

    ------------------------------ screen 1 ------------------------------
    [T1] root /proc/4397# cd /proc/`pidof cp`
    [T1] root /proc/4713# cat io
    rchar: 8396
    wchar: 3
    syscr: 20
    syscw: 1
    read_bytes: 0
    write_bytes: 20480
    cancelled_write_bytes: 4096
    [T2] root /proc/4713# cat io
    rchar: 8399
    wchar: 6
    syscr: 21
    syscw: 2
    read_bytes: 0
    write_bytes: 24576
    cancelled_write_bytes: 4096

    //Question: the 'write_bytes' is a bit more than expected ;-)

    Tested-by: Maxim Levitsky
    Cc: Peter Zijlstra
    Cc: Jeff Mahoney
    Signed-off-by: Fengguang Wu
    Reviewed-by: Chris Mason
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Forbid users from changing file flags on quota files. Users have no
    business playing with these flags when quota is on. Furthermore, there is
    a remote possibility of deadlock due to a lock inversion between the quota
    file's i_mutex and transaction start (i_mutex for the quota file is locked
    only when a transaction is started in quota operations) in ext3 and ext4.

    Signed-off-by: Jan Kara
    Cc: LIOU Payphone
    Cc:
    Acked-by: Dave Kleikamp
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

22 Oct, 2007

2 commits

  • Now that nfsd has stopped writing to the find_exported_dentry member we can
    mark the export_operations const.

    Signed-off-by: Christoph Hellwig
    Cc: Neil Brown
    Cc: "J. Bruce Fields"
    Cc:
    Cc: Dave Kleikamp
    Cc: Anton Altaparmakov
    Cc: David Chinner
    Cc: Timothy Shimmin
    Cc: OGAWA Hirofumi
    Cc: Hugh Dickins
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Cc: "Vladimir V. Saveliev"
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Another nice little cleanup by using the new methods.

    Signed-off-by: Christoph Hellwig
    Cc: Neil Brown
    Cc: "J. Bruce Fields"
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Cc: "Vladimir V. Saveliev"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

20 Oct, 2007

8 commits

  • Fix the various misspellings of "system", "controller", "interrupt" and
    "[un]necessary".

    Signed-off-by: Robert P. J. Day
    Signed-off-by: Adrian Bunk

    Robert P. J. Day
     
  • Implement support for file systems larger than 8 TiB.

    The reiserfs superblock contains a 16 bit value for counting the number of
    bitmap blocks. The rest of the disk format supports file systems up to 2^32
    blocks, but the bitmap block limitation artificially limits this to 8 TiB with
    a 4KiB block size.

    Rather than trust the superblock's 16-bit bitmap block count, we calculate it
    dynamically based on the number of blocks in the file system. When an
    incorrect value is observed in the superblock, it is zeroed out, ensuring that
    older kernels will not be able to mount the file system.
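
The 8 TiB figure and the dynamic calculation can be checked with a little arithmetic. The sketch below assumes that one bitmap block tracks `block_size * 8` blocks (one bit per block); the function names are hypothetical and are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of bitmap blocks needed to cover block_count blocks, assuming one
 * bit per block, so a bitmap block of block_size bytes covers
 * block_size * 8 blocks. This is a rounded-up division.
 */
static uint32_t bitmap_block_count(uint32_t block_count, uint32_t block_size)
{
    uint32_t blocks_per_bitmap = block_size * 8;

    assert(block_count > 0 && blocks_per_bitmap > 0);
    return (block_count - 1) / blocks_per_bitmap + 1;
}

/*
 * Largest file system, in bytes, that a 16-bit bitmap block counter can
 * describe: 65535 bitmap blocks, each covering block_size * 8 blocks of
 * block_size bytes.
 */
static uint64_t max_bytes_with_16bit_count(uint32_t block_size)
{
    return (uint64_t)UINT16_MAX * block_size * 8 * block_size;
}
```

With 4 KiB blocks, `max_bytes_with_16bit_count(4096)` is 2^43 - 2^27 bytes, just under 8 TiB, while a file system of 2^32 - 1 blocks needs 131072 bitmap blocks, which does not fit in 16 bits.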

    Userspace support has already been implemented and shipped in reiserfsprogs
    3.6.20.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     
  • The first_zero_hint metadata caching was never actually used, and it's of
    dubious optimization quality. This patch removes it.

    It doesn't actually shrink the size of the reiserfs_bitmap_info struct, since
    that doesn't work with block sizes larger than 8K. There was a big fixme in
    there, and with all the work lately in allowing block size > page size, I
    might as well kill the fixme as well.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     
  • Do a quick signedness check for block numbers. There are a number of places
    where signed integers are used for block numbers, which limits the usable file
    system size to 8 TiB. The disk format, excepting a problem which will be
    fixed in the following patch, supports file systems up to 16 TiB in size.
    This patch cleans up those sites so that we can enable the full usable size.
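
The 8 TiB and 16 TiB limits follow directly from the width of the block number. A small sketch, assuming 32-bit block numbers and a 4 KiB block size as in the surrounding messages (the helper names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes addressable when block numbers are signed 32-bit: 2^31 blocks. */
static uint64_t max_fs_bytes_signed(uint32_t block_size)
{
    return ((uint64_t)INT32_MAX + 1) * block_size;
}

/* Bytes addressable when block numbers are unsigned 32-bit: 2^32 blocks. */
static uint64_t max_fs_bytes_unsigned(uint32_t block_size)
{
    return ((uint64_t)UINT32_MAX + 1) * block_size;
}

/*
 * Shows the failure mode: a block number at or above 2^31 stored in a
 * signed 32-bit variable reads back as negative. The conversion is
 * implementation-defined in C; common compilers wrap around.
 */
static int looks_negative_when_signed(uint32_t block)
{
    int32_t as_signed = (int32_t)block;

    return as_signed < 0;
}
```

For a block size of 4096 the two helpers return 2^43 (8 TiB) and 2^44 (16 TiB) bytes respectively.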

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     
  • Correct the memset in reiserfs_resize to clear the memory allocated for the
    new bitmap info structs. Previously, it would clear the memory used by the
    old size. Depending on the contents of memory, this could cause incorrect
    caching behavior for bitmap blocks in the newly allocated area.
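
The class of bug is easy to reproduce outside the kernel. The sketch below is not the reiserfs_resize code; `struct bitmap_info` and `grow_bitmap_info()` are hypothetical, and the point is only that the clear must be sized by the new element count.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-bitmap bookkeeping; the real struct has other fields. */
struct bitmap_info {
    int free_count;
};

/*
 * Allocate a larger array and carry the old entries over.
 * The memset is sized by new_nr. Sizing it by old_nr would leave the
 * entries in [old_nr, new_nr) holding whatever malloc() returned.
 */
static struct bitmap_info *grow_bitmap_info(const struct bitmap_info *old,
                                            size_t old_nr, size_t new_nr)
{
    struct bitmap_info *info;

    assert(new_nr >= old_nr);
    info = malloc(new_nr * sizeof(*info));
    if (!info)
        return NULL;

    memset(info, 0, new_nr * sizeof(*info));      /* whole new array */
    if (old_nr)
        memcpy(info, old, old_nr * sizeof(*info)); /* keep old entries */
    return info;
}
```

The caller owns the returned array and releases it with `free()`.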

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     
  • Build in is_reusable() unconditionally and use it to catch corruption before
    it reaches the block freeing paths.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     
  • Change reiserfs_panic() to use panic() initially instead of BUG(). Using
    BUG() ignores the configurable panic behavior, so systems that should be
    failing and rebooting are left hanging. This causes problems in
    active/standby HA scenarios.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     
  • Add I_MUTEX_XATTR annotations to the inode locking in the reiserfs xattr code.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     

19 Oct, 2007

1 commit

  • reiserfs_setattr can call notify_change recursively using the same
    iattr struct. This could cause it to trip the BUG() in notify_change.
    Fix reiserfs to clear the ATTR_KILL_SUID and ATTR_KILL_SGID bits near
    the beginning of the function.

    Signed-off-by: Jeff Layton
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Cc: "Vladimir V. Saveliev"
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     

17 Oct, 2007

9 commits

  • When mounting a file system with wrong journal params do not try to repair
    them, suggest fsck instead.

    Signed-off-by: Edward Shishkin
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Edward Shishkin
     
  • There is a possible dead loop in the finish_unfinished function.

    In most situations, the call chain iput -> ... -> reiserfs_delete_inode ->
    remove_save_link will succeed. But for some reasons, such as data
    corruption, reiserfs_delete_inode fails on reiserfs_do_truncate ->
    search_for_position_by_key.

    Then remove_save_link won't be called, and we always get the same
    "save_link_key" in the while loop in finish_unfinished. This patch adds
    a check for the possible dead loop and simply removes the save link when
    one is detected.
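
The idea of the check can be shown with a toy work list. Everything here is hypothetical (`struct worklist`, `handle_head()`, `drain()`), keys are assumed to be unique, and a negative key stands in for an entry whose processing fails and leaves it in place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical work list: pending[0..count) are keys still to be handled. */
struct worklist {
    int pending[8];
    size_t count;
};

/* Remove the first pending key by shifting the rest down. */
static void remove_head(struct worklist *w)
{
    assert(w->count > 0);
    for (size_t i = 1; i < w->count; i++)
        w->pending[i - 1] = w->pending[i];
    w->count--;
}

/*
 * Hypothetical handler. On success it removes the head itself; for a
 * "corrupt" key (negative here) it fails and leaves the head in place.
 */
static void handle_head(struct worklist *w)
{
    if (w->pending[0] >= 0)
        remove_head(w);
}

/*
 * Drain the list. If the same key comes back twice in a row, the handler
 * made no progress, so drop the entry by force instead of looping forever.
 * Returns the number of entries dropped that way.
 */
static size_t drain(struct worklist *w)
{
    size_t forced = 0;
    bool have_last = false;
    int last = 0;

    while (w->count > 0) {
        int key = w->pending[0];

        if (have_last && key == last) {
            remove_head(w);
            forced++;
            have_last = false;
            continue;
        }
        last = key;
        have_last = true;
        handle_head(w);
    }
    return forced;
}
```

Without the comparison against the previous key, a single stuck entry would keep the loop spinning on the same head indefinitely.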

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Lepton Wu
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Cc: "Vladimir V. Saveliev"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lepton Wu
     
  • When reading corrupted reiserfs directory data, d_reclen could be a
    negative number or a large positive number. This can lead to a kernel
    panic or oops. This patch adds a sanity check.
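
A sanity check of this kind usually boils down to a range test before the length is used. The sketch below is generic and is not the check added by the patch; `reclen_is_sane()` and its parameters are hypothetical, and treating a zero length as invalid is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide whether a record length read from disk can be trusted.
 * reclen  : length field as read (signed, so corruption may make it negative)
 * offset  : where the record starts inside the buffer
 * buf_len : total size of the buffer holding the directory data
 */
static bool reclen_is_sane(long reclen, size_t offset, size_t buf_len)
{
    if (reclen <= 0)
        return false;               /* negative or empty record */
    if (offset > buf_len)
        return false;               /* start is already out of bounds */
    /* Compare against the remaining space to avoid overflow in offset + reclen. */
    return (size_t)reclen <= buf_len - offset;
}
```

Callers would skip or reject the entry when the function returns false instead of copying `reclen` bytes.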

    Signed-off-by: Lepton Wu
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Cc: "Vladimir V. Saveliev"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lepton Wu
     
  • reiserfs_invalidatepage will refuse to free pages if they have been logged
    in data=journal mode, or were pinned down by a data=ordered operation. For
    data=journal, this is fairly easy to trigger just with fsx-linux, and it
    results in a large number of pages hanging around on the LRUs with
    page->mapping == NULL.

    Calling try_to_free_buffers when reiserfs decides it is done with the page
    allows it to be freed earlier, and with much less VM thrashing. Lock
    ordering rules mean that reiserfs can't call lock_page when it is releasing
    the buffers, so TestSetPageLocked is used instead. Contention on these
    pages should be rare, so it should be sufficient most of the time.

    Signed-off-by: Chris Mason
    Cc: "Vladimir V. Saveliev"
    Cc: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Mason
     
  • - remove the following no longer used functions:
      - bitmap.c: reiserfs_claim_blocks_to_be_allocated()
      - bitmap.c: reiserfs_release_claimed_blocks()
      - bitmap.c: reiserfs_can_fit_pages()

    - make the following functions static:
      - inode.c: restart_transaction()
      - journal.c: reiserfs_async_progress_wait()

    Signed-off-by: Adrian Bunk
    Acked-by: Vladimir V. Saveliev
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Slab constructors currently have a flags parameter that is never used. And
    the order of the arguments is opposite to other slab functions. The object
    pointer is placed before the kmem_cache pointer.

    Convert

    ctor(void *object, struct kmem_cache *s, unsigned long flags)

    to

    ctor(struct kmem_cache *s, void *object)

    throughout the kernel

    [akpm@linux-foundation.org: coupla fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch makes reiserfs use AOP_FLAG_CONT_EXPAND
    in order to get rid of the special generic_cont_expand routine.

    Signed-off-by: Vladimir Saveliev
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Saveliev
     
  • Convert reiserfs to new aops

    Signed-off-by: Vladimir Saveliev
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Saveliev
     
  • Make reiserfs write via generic routines.
    The original reiserfs write path, optimized for big writes, is deadlock prone.

    Signed-off-by: Vladimir Saveliev
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Saveliev
     

12 Sep, 2007

1 commit

  • If we fail to start a transaction when releasing dquot, we have to call
    dquot_release() anyway to mark dquot structure as inactive. Otherwise we
    end in an infinite loop inside dqput().

    Signed-off-by: Jan Kara
    Cc: xb
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

27 Jul, 2007

1 commit


20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

18 Jul, 2007

2 commits

  • Introduce is_owner_or_cap() macro in fs.h, and convert over relevant
    users to it. This is done because we want to avoid bugs in the future
    where we check for only effective fsuid of the current task against a
    file's owning uid, without simultaneously checking for CAP_FOWNER as
    well, thus violating its semantics.
    [ XFS uses special macros and structures, and in general looked ...
    untouchable, so we leave it alone -- but it has been looked over. ]

    The (current->fsuid != inode->i_uid) check in generic_permission() and
    exec_permission_lite() is left alone, because those operations are
    covered by CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH. Similarly operations
    falling under the purview of CAP_CHOWN and CAP_LEASE are also left alone.
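
The point of the macro is that the ownership test and the capability test are always made together. A userspace sketch of that combined check, with hypothetical stand-ins (`struct task`, `struct inode_stub`, `owner_or_cap()`) for the kernel's current task, inode and capable(CAP_FOWNER):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the calling task's credentials. */
struct task {
    unsigned int fsuid;   /* filesystem uid used for permission checks */
    bool cap_fowner;      /* models whether CAP_FOWNER is held */
};

/* Hypothetical stand-in for an inode; only the owner matters here. */
struct inode_stub {
    unsigned int i_uid;
};

/*
 * Allowed if the task owns the file OR holds the owner-override
 * capability. Checking only the first half is the bug the macro is
 * meant to prevent.
 */
static bool owner_or_cap(const struct task *t, const struct inode_stub *inode)
{
    return t->fsuid == inode->i_uid || t->cap_fowner;
}
```

Keeping both halves in one helper means a call site cannot forget the capability check.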

    Signed-off-by: Satyam Sharma
    Cc: Al Viro
    Acked-by: Serge E. Hallyn
    Signed-off-by: Linus Torvalds

    Satyam Sharma
     
  • Currently the export_operation structure and helpers related to it are in
    fs.h. fs.h is already far too large and there are very few places needing the
    export bits, so split them off into a separate header.

    [akpm@linux-foundation.org: fix cifs build]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Neil Brown
    Cc: Steven French
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

17 Jul, 2007

1 commit

  • Some users have been having problems with utilities like cp or dd dumping
    core when they try to copy a file that's too large for the destination
    filesystem (typically, > 4gb). Apparently, some defunct standards required
    SIGXFSZ to be sent in such circumstances, but SUS only requires/allows it
    for when a written file exceeds the process's resource limits. I'd like to
    limit SIGXFSZs to the bare minimum required by SUS.

    Patch sent per http://lkml.org/lkml/2007/4/10/302

    Signed-off-by: Micah Cowan
    Acked-by: Alan Cox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Micah Cowan
     

10 Jul, 2007

1 commit


24 May, 2007

1 commit


17 May, 2007

1 commit

  • SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

    Signed-off-by: Christoph Lameter
    Cc: David Howells
    Cc: Jens Axboe
    Cc: Steven French
    Cc: Michael Halcrow
    Cc: OGAWA Hirofumi
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Roman Zippel
    Cc: David Woodhouse
    Cc: Dave Kleikamp
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: Paul Mackerras
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: David Chinner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

10 May, 2007

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
    sound: convert "sound" subdirectory to UTF-8
    MAINTAINERS: Add cxacru website/mailing list
    include files: convert "include" subdirectory to UTF-8
    general: convert "kernel" subdirectory to UTF-8
    documentation: convert the Documentation directory to UTF-8
    Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
    remove broken URLs from net drivers' output
    Magic number prefix consistency change to Documentation/magic-number.txt
    trivial: s/i_sem /i_mutex/
    fix file specification in comments
    drivers/base/platform.c: fix small typo in doc
    misc doc and kconfig typos
    Remove obsolete fat_cvf help text
    Fix occurrences of "the the "
    Fix minor typoes in kernel/module.c
    Kconfig: Remove reference to external mqueue library
    Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
    Correct comments in genrtc.c to refer to correct /proc file.
    Fix more "deprecated" spellos.
    Fix "deprecated" typoes.
    ...

    Fix trivial comment conflict in kernel/relay.c.

    Linus Torvalds
     
  • Use zero_user_page() instead of open-coding it.

    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     

09 May, 2007

6 commits