03 Oct, 2007

1 commit


02 Oct, 2007

1 commit

  • Nick Piggin points out that splice isn't being good about the mmap
    semaphore: while two readers can nest inside each other, it does leave
    a possible deadlock if a writer (i.e. a new mmap()) comes in during that
    nesting.

    Original "just move the locking" patch by Nick, replaced by one by me
    based on an optimistic pagefault_disable(). And then Jens tested and
    updated that patch.

    Reported-by: Nick Piggin
    Tested-by: Jens Axboe
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Oct, 2007

1 commit

  • This reverts commit b394e43e995d08821588a22561c6a71a63b4ff27.

    Lachlan McIlroy says:
    It tried to fix an issue where log replay is replaying an inode cluster
    initialisation transaction that should not be replayed because the inode
    cluster on disk is more up to date. Since we don't log file sizes (we
    rely on inode flushing to get them to disk) then we can't just replay
    all the transactions in the log and expect the inode to be completely
    restored. We lose file size updates. Unfortunately this fix is causing
    more (serious) problems than it is fixing.

    SGI-PV: 969656
    SGI-Modid: xfs-linux-melb:xfs-kern:29804a

    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Tim Shimmin

    Tim Shimmin
     

29 Sep, 2007

1 commit

  • It doesn't look as if the NFS file name limit is being initialised correctly
    in the struct nfs_server. Make sure that we limit whatever is being set in
    nfs_probe_fsinfo() and nfs_init_server().

    Also ensure that readdirplus and nfs4_path_walk respect our file name
    limits.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     

27 Sep, 2007

1 commit

  • The problem is that the garbage collector for the 'host' structures,
    nlm_gc_hosts(), holds nlm_host_mutex while calling down to
    nlmsvc_mark_resources, which eventually takes the file->f_mutex.

    We cannot therefore call nlmsvc_lookup_host() from within
    nlmsvc_create_block, since the caller will already hold file->f_mutex, so
    the attempt to grab nlm_host_mutex may deadlock.

    Fix the problem by calling nlmsvc_lookup_host() outside the file->f_mutex.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     

26 Sep, 2007

2 commits


25 Sep, 2007

1 commit

  • Different types of ufs hold their state in different places. To hide this
    complexity there is ufs_get_fs_state, which returns the state according to
    "UFS_SB(sb)->s_flags". During mount, however, ufs_get_fs_state is called
    before s_flags is set. For ufs types like sun ufs this causes the message
    "fs need fsck" and a remount in read-only state.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     

23 Sep, 2007

1 commit


22 Sep, 2007

1 commit


21 Sep, 2007

6 commits

  • Johannes just found that we are missing a compat-ioctl
    declaration. The fix is trivial. As with previous patches for compat-ioctl,
    this should also go to stable.

    More info:
    http://marc.info/?l=linux-wireless&m=119029667902588&w=2

    Signed-off-by: Jean Tourrilhes
    Signed-off-by: John W. Linville

    Jean Tourrilhes
     
  • The ocfs2_vote_msg and ocfs2_response_msg structs needed to be
    packed to ensure identical sizeofs on 32-bit and 64-bit arches. Without this,
    we had inadvertently broken 32/64 bit cross mounts.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
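
To illustrate why packing matters for cross mounts, here is a minimal sketch. The struct names and members are hypothetical, not the real ocfs2_vote_msg or ocfs2_response_msg layouts, and `__attribute__((packed))` is assumed to be available (GCC or Clang). Without packing, a struct ending in a 32-bit member after a 64-bit one may get tail padding on some ABIs and not on others, so the two sides disagree about the message size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message layout, for illustration only. */
struct demo_msg_natural {
    uint64_t cookie;  /* 8-byte member; its alignment differs between ABIs */
    uint32_t node;    /* tail padding may follow, depending on the ABI */
};

/* Same members, but packed: the compiler inserts no padding at all. */
struct demo_msg_packed {
    uint64_t cookie;
    uint32_t node;
} __attribute__((packed));

/* Size of the packed message; the same on every architecture. */
size_t demo_wire_size(void)
{
    return sizeof(struct demo_msg_packed);
}

/* Size of the unpacked message; depends on the ABI. */
size_t demo_natural_size(void)
{
    return sizeof(struct demo_msg_natural);
}
```

The packed size is the sum of the member sizes, while the natural size is only guaranteed to be at least that large.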
     
  • The target page offsets were being incorrectly set a second time in
    ocfs2_prepare_page_for_write(), which was causing problems on a 16k page
    size kernel. Additionally, ocfs2_write_failure() was incorrectly using those
    parameters instead of the parameters for the individual page being cleaned
    up.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • This was broken for file systems whose cluster size is greater than page
    size. Pos needs to be incremented as we loop through the descriptors, and
    len needs to be capped to the size of a single cluster.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
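
A minimal sketch of the loop shape described above, assuming hypothetical helper names and a plain byte-offset model rather than the real ocfs2 descriptors: each step is capped so it never crosses a cluster boundary, and the position advances by the amount handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the next step: never more than what is left in the current cluster. */
uint64_t demo_chunk_len(uint64_t pos, uint64_t len, uint64_t cluster_size)
{
    uint64_t room;

    assert(cluster_size > 0);
    room = cluster_size - (pos % cluster_size);
    return len < room ? len : room;
}

/* Walk the whole range, advancing pos each iteration, and count the steps. */
size_t demo_count_chunks(uint64_t pos, uint64_t len, uint64_t cluster_size)
{
    size_t steps = 0;

    while (len > 0) {
        uint64_t chunk = demo_chunk_len(pos, len, cluster_size);

        pos += chunk;   /* pos must move forward as we loop */
        len -= chunk;
        steps++;
    }
    return steps;
}
```

With 4096-byte clusters, a 5000-byte write starting at offset 4000 is handled as 96 + 4096 + 808 bytes.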
     
  • The ocfs2 write code loops through a page much like the block code, except
    that ocfs2 allocation units can be any size, including larger than page
    size. Typically it's equal to or larger than page size - most kernels run 4k
    pages, the minimum ocfs2 allocation (cluster) size.

    Some changes introduced during 2.6.23 changed the way writes to pages are
    handled, and inadvertently broke support for > 4k page size. Instead of just
    writing one cluster at a time, we now handle the whole page in one pass.

    This means that multiple (small) separate allocations might happen in the
    same pass. The allocation code however typically optimizes by getting the
    maximum which was reserved. This triggered a BUG_ON in the extend code where
    it'd ask for a single bit (for one part of a > 4k page) and get back more
    than it asked for.

    Fix this by providing a variant of the high level allocation function which
    allows the caller to specify a maximum. The traditional function remains and
    just calls the new one with a maximum determined from the initial
    reservation.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
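
The shape of this fix can be sketched as follows. The function names and the by-value "reserved" counter are assumptions made for illustration; they are not the ocfs2 allocator API. The point is only that the capped variant does the work and the traditional entry point becomes a thin wrapper.

```c
#include <assert.h>

/*
 * Grant between min_bits and max_bits out of `reserved`.
 * Returns the number of bits granted, or 0 if min_bits cannot be met.
 */
unsigned demo_claim_bits_max(unsigned reserved, unsigned min_bits,
                             unsigned max_bits)
{
    assert(min_bits <= max_bits);
    if (reserved < min_bits)
        return 0;
    return reserved < max_bits ? reserved : max_bits;
}

/* Traditional behaviour: take as much as the reservation allows. */
unsigned demo_claim_bits(unsigned reserved, unsigned min_bits)
{
    unsigned max_bits = reserved > min_bits ? reserved : min_bits;

    return demo_claim_bits_max(reserved, min_bits, max_bits);
}
```

A caller that needs exactly one bit passes a maximum of 1 and never gets back more than it asked for.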
     
  • This simplifies the signalfd code by no longer keeping it attached to the
    sighand during its lifetime.

    In this way, the signalfd remains attached to the sighand only during
    poll(2) (and select and epoll) and read(2). This also allows us to remove
    all the custom "tsk == current" checks in kernel/signal.c, since
    dequeue_signal() will only be called by "current".

    I think this is also what Ben was suggesting some time ago.

    The external effect of this is that a thread can extract only its own
    private signals and the group ones. I think this is acceptable
    behaviour, in that those are the signals the thread would be able to
    fetch without signalfd.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

20 Sep, 2007

7 commits

  • The new xlog_recover_do_reg_buffer checks call be16_to_cpu on di_gen, which
    is a 32-bit value, so sparse rightly complains. Fortunately the warning is
    harmless because we don't care about the value, only whether it's
    non-zero. Because of that we can simply kill the endian swaps on this and
    the previous di_mode check entirely.

    SGI-PV: 969656
    SGI-Modid: xfs-linux-melb:xfs-kern:29709a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Tim Shimmin

    Christoph Hellwig
     
  • xfs_filestream_mount() sets up an mru cache with:

        err = xfs_mru_cache_create(&mp->m_filestream, lifetime, grp_count,
                (xfs_mru_cache_free_func_t)xfs_fstrm_free_func);

    but that cast is causing problems...

        typedef void (*xfs_mru_cache_free_func_t)(unsigned long, void*);

    but:

        void xfs_fstrm_free_func( xfs_ino_t ino, fstrm_item_t *item)

    so on a 32-bit box, it's casting (32, 32) args into (64, 32) and I assume
    it's getting garbage for *item, which subsequently causes an explosion.
    With this change the filestreams xfsqa tests don't oops on my 32-bit box.

    SGI-PV: 967795
    SGI-Modid: xfs-linux-melb:xfs-kern:29510a

    Signed-off-by: Eric Sandeen
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Eric Sandeen
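
One way to avoid this class of bug, sketched with hypothetical names (this is not the XFS code, and not necessarily how the patch resolves it): give the callback exactly the signature the typedef declares and do any widening inside it, so no function pointer cast is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Callback type expected by the (hypothetical) cache. */
typedef void (*demo_free_func_t)(unsigned long key, void *value);

typedef uint64_t demo_ino_t;  /* stays 64 bits wide even on a 32-bit build */

struct demo_item {
    int freed;
    demo_ino_t ino;
};

/* Typed worker: its first argument is wider than unsigned long on 32-bit. */
static void demo_item_free(demo_ino_t ino, struct demo_item *item)
{
    item->ino = ino;
    item->freed = 1;
}

/* Adapter with the exact callback signature; conversions happen here. */
void demo_free_cb(unsigned long key, void *value)
{
    assert(value != NULL);
    demo_item_free((demo_ino_t)key, (struct demo_item *)value);
}

/* Cache side: invokes the callback through the typedef, no cast involved. */
void demo_cache_evict(demo_free_func_t fn, unsigned long key, void *value)
{
    fn(key, value);
}

/* Returns 1 if the item was freed with the expected key. */
int demo_run(void)
{
    struct demo_item item = { 0, 0 };

    demo_cache_evict(demo_free_cb, 42UL, &item);
    return item.freed == 1 && item.ino == 42;
}
```

Because demo_free_cb matches demo_free_func_t exactly, the compiler would reject a mismatched callback instead of silently passing misaligned arguments.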
     
  • Paul Mackerras
     
  • * 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
    [XFS] Avoid replaying inode buffer initialisation log items if on-disk version is newer.
    [XFS] Ensure file size updates have been completed before writing inode to disk.
    [XFS] On-demand reaping of the MRU cache

    Linus Torvalds
     
  • The do_split() function for htree dir blocks is intended to split a leaf
    block to make room for a new entry. It sorts the entries in the original
    block by hash value, then moves the last half of the entries to the new
    block - without accounting for how much space this actually moves. (IOW,
    it moves half of the entry *count* not half of the entry *space*). If by
    chance we have both large & small entries, and we move only the smallest
    entries, and we have a large new entry to insert, we may not have created
    enough space for it.

    The patch below stores each record size when calculating the dx_map, and
    then walks the hash-sorted dx_map, calculating how many entries must be
    moved to more evenly split the existing entries between the old block and
    the new block, guaranteeing enough space for the new entry.

    The dx_map "offs" member is reduced to u16 so that the overall map size
    does not change - it is temporarily stored at the end of the new block, and
    if it grows too large it may be overwritten. By making offs and size both
    u16, we won't grow the map size.

    Also add a few comments to the functions involved.

    This fixes the testcase reported by hooanon05@yahoo.co.jp on the
    linux-ext4 list, "ext3 dir_index causes an error"

    Thanks to Andreas Dilger for discussing the problem & solution with me.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Andreas Dilger
    Tested-by: Junjiro Okajima
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
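
The size-aware split can be sketched roughly as below. This is a simplified model with hypothetical names, not the ext3 source: it walks the hash-sorted map from the end and stops moving entries once about half a block's worth of bytes has been collected.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified map entry: offset and record size, both 16 bits wide. */
struct demo_map_entry {
    unsigned short offs;
    unsigned short size;
};

/*
 * Count how many entries at the tail of the hash-sorted map to move so that
 * roughly half of the block's bytes, not half of its entries, are moved.
 */
size_t demo_entries_to_move(const struct demo_map_entry *map, size_t count,
                            size_t blocksize)
{
    size_t moved_bytes = 0;
    size_t move = 0;

    assert(map != NULL || count == 0);
    for (size_t i = count; i-- > 0; ) {
        /* Stop once more than half of this entry would cross the midpoint. */
        if (moved_bytes + map[i].size / 2 > blocksize / 2)
            break;
        moved_bytes += map[i].size;
        move++;
    }
    return move;
}

/* Three small entries followed by two large ones in a 448-byte block. */
size_t demo_mixed_example(void)
{
    static const struct demo_map_entry map[] = {
        { 0, 16 }, { 16, 16 }, { 32, 16 }, { 48, 200 }, { 248, 200 },
    };

    return demo_entries_to_move(map, 5, 448);
}
```

In the mixed example only the last 200-byte entry is moved, because moving the second large entry as well would shift far more than half of the block.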
     
  • NFS unregisters sysctls only if V4 support is compiled in. However, the
    sysctl table is not V4 specific, so always unregister it.

    Steps to reproduce:

    [build nfs.ko with CONFIG_NFS_V4=n]
    modprobe nfs
    rmmod nfs
    ls /proc/sys

    Unable to handle kernel paging request at ffffffff880661c0 RIP:
    [] proc_sys_readdir+0xd3/0x350
    PGD 203067 PUD 207063 PMD 7e216067 PTE 0
    Oops: 0000 [1] SMP
    CPU 1
    Modules linked in: lockd nfs_acl sunrpc
    Pid: 3335, comm: ls Not tainted 2.6.23-rc3-bloat #2
    RIP: 0010:[] [] proc_sys_readdir+0xd3/0x350
    RSP: 0018:ffff81007fd93e78 EFLAGS: 00010286
    RAX: ffffffff880661c0 RBX: ffffffff80466370 RCX: ffffffff880661c0
    RDX: 00000000000014c0 RSI: ffff81007f3ad020 RDI: ffff81007efd8b40
    RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: ffffffff802a8570 R12: ffffffff880661c0
    R13: ffff81007e219640 R14: ffff81007efd8b40 R15: ffff81007ded7280
    FS: 00002ba25ef03060(0000) GS:ffff81007ff81258(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffffffff880661c0 CR3: 000000007dfaf000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process ls (pid: 3335, threadinfo ffff81007fd92000, task ffff81007d8a0000)
    Stack: ffff81007f3ad150 ffffffff80283f30 ffff81007fd93f48 ffff81007efd8b40
    ffff81007ee00440 0000000422222222 0000000200035593 ffffffff88037e9a
    2222222222222222 ffffffff80466500 ffff81007e416400 ffff81007e219640
    Call Trace:
    [] filldir+0x0/0xf0
    [] filldir+0x0/0xf0
    [] vfs_readdir+0xa7/0xc0
    [] sys_getdents+0x96/0xe0
    [] system_call+0x7e/0x83

    Code: 41 8b 14 24 85 d2 74 dc 49 8b 44 24 08 48 85 c0 74 e7 49 3b
    RIP [] proc_sys_readdir+0xd3/0x350
    RSP
    CR2: ffffffff880661c0
    Kernel panic - not syncing: Fatal exception

    Signed-off-by: Alexey Dobriyan
    Acked-by: Trond Myklebust
    Cc: "J. Bruce Fields"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Convert asserts (BUGs) in dx_probe that trigger on bad on-disk data into
    recoverable errors with helpful warnings. With help catching other asserts
    from Duane Griffin.

    Signed-off-by: Eric Sandeen
    Acked-by: Duane Griffin
    Acked-by: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     

19 Sep, 2007

1 commit

  • To start with, arch_notes_size() etc. is a little too ambiguous a name for
    my liking, so change the function names to be more explicit.

    Calling through macros is ugly, especially with hidden parameters, so don't
    do that, call the routines directly.

    Use ARCH_HAVE_EXTRA_ELF_NOTES as the only flag, and based on it decide
    whether we want the extern declarations or the empty versions.

    Since we have empty routines, actually use them in the coredump code to
    save a few #ifdefs.

    We want to change the handling of foffset so that the write routine updates
    foffset as it goes, instead of using file->f_pos (so that writing to a pipe
    works). So pass foffset to the write routine, and for now just set it to
    file->f_pos at the end of writing.

    It should also be possible for the write routine to fail, so change it to
    return int and treat a non-zero return as failure.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Jeremy Kerr
    Signed-off-by: Paul Mackerras

    Michael Ellerman
     

18 Sep, 2007

2 commits


17 Sep, 2007

1 commit

  • Instead of running the mru cache reaper all the time based on a timeout,
    we should only run it when the cache has active objects. This allows CPUs
    to sleep when there is no activity rather than be woken repeatedly just to
    check if there is anything to do.

    SGI-PV: 968554
    SGI-Modid: xfs-linux-melb:xfs-kern:29305a

    Signed-off-by: David Chinner
    Signed-off-by: Donald Douwsma
    Signed-off-by: Tim Shimmin

    David Chinner
     

16 Sep, 2007

1 commit


15 Sep, 2007

1 commit


12 Sep, 2007

9 commits

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
    ocfs2: Fix calculation of i_blocks during truncate
    [PATCH] ocfs2: Fix a wrong cluster calculation.
    [PATCH] ocfs2: fix mount option parsing
    ocfs2: update docs for new features

    Linus Torvalds
     
  • The inode->i_flock list contains the leases, flocks and posix
    locks in the specified order. However, the flocks are added at
    the head of this list, thus hiding the leases from the F_GETLEASE
    command, from time_out_leases() and from other code that expects
    the leases to come first.

    The following example will demonstrate this:

    #define _GNU_SOURCE

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/file.h>

    /* Print the type of lease currently reported for fd. */
    static void show_lease(int fd)
    {
        int res;

        res = fcntl(fd, F_GETLEASE);
        switch (res) {
        case F_RDLCK:
            printf("Read lease\n");
            break;
        case F_WRLCK:
            printf("Write lease\n");
            break;
        case F_UNLCK:
            printf("No leases\n");
            break;
        default:
            printf("Some shit\n");
            break;
        }
    }

    int main(int argc, char **argv)
    {
        int fd, res;

        fd = open(argv[1], O_RDONLY);
        if (fd == -1) {
            perror("Can't open file");
            return 1;
        }

        /* Take a write lease first ... */
        res = fcntl(fd, F_SETLEASE, F_WRLCK);
        if (res == -1) {
            perror("Can't set lease");
            return 1;
        }

        show_lease(fd);

        /* ... then add a flock, which lands in front of the lease. */
        if (flock(fd, LOCK_SH) == -1) {
            perror("Can't flock shared");
            return 1;
        }

        show_lease(fd);

        return 0;
    }

    The first call to show_lease() will show the write lease set, but
    the second will show no leases.

    Fix flock insertion so that the leases always stay at the head
    of this list.

    Found while making the flocks pid-namespace aware.

    Signed-off-by: Pavel Emelyanov
    Acked-by: "J. Bruce Fields"
    Cc: Trond Myklebust
    Cc: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
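
The ordering rule from this commit can be modelled on a plain singly linked list. The types and function names below are hypothetical and much simpler than the kernel's struct file_lock handling; the sketch only shows the insertion point moving past any leading leases.

```c
#include <assert.h>
#include <stddef.h>

enum demo_lock_kind { DEMO_LEASE, DEMO_FLOCK, DEMO_POSIX };

struct demo_lock {
    enum demo_lock_kind kind;
    struct demo_lock *next;
};

/* Insert a flock after all leading leases instead of at the list head. */
void demo_insert_flock(struct demo_lock **head, struct demo_lock *new_lock)
{
    struct demo_lock **before = head;

    assert(new_lock->kind == DEMO_FLOCK);
    while (*before != NULL && (*before)->kind == DEMO_LEASE)
        before = &(*before)->next;

    new_lock->next = *before;
    *before = new_lock;
}

/* Build lease -> posix, insert a flock, and check the resulting order. */
int demo_order_is_lease_flock_posix(void)
{
    struct demo_lock posix = { DEMO_POSIX, NULL };
    struct demo_lock lease = { DEMO_LEASE, &posix };
    struct demo_lock flock_entry = { DEMO_FLOCK, NULL };
    struct demo_lock *head = &lease;

    demo_insert_flock(&head, &flock_entry);
    return head == &lease && lease.next == &flock_entry &&
           flock_entry.next == &posix;
}
```

Code that only inspects the head of the list, as a lease query would, still finds the lease after the flock is added.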
     
  • Taneli Vähäkangas reported that commit
    786d7e1612f0b0adb6046f19b906609e4fe8b1ba aka "Fix rmmod/read/write races
    in /proc entries" broke SBCL + SLIME combo.

    The old code in do_select() used DEFAULT_POLLMASK if it couldn't find a
    ->poll handler. The new code makes ->poll always present and returns 0 by
    default, which is not correct. Return DEFAULT_POLLMASK instead.

    Steps to reproduce:

    install emacs, SBCL, SLIME
    emacs
    M-x slime in *inferior-lisp* buffer
    [watch it doing "Connecting to Swank on port X.."]

    Please, apply before 2.6.23.

    P.S.: why SBCL can't just read(2) /proc/cpuinfo is a mystery.

    Signed-off-by: Alexey Dobriyan
    Cc: T Taneli Vahakangas
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
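
A userspace sketch of the fallback behaviour described above. The struct, the function names and the bit values are made up for illustration; only the idea that a missing handler means "readable and writable" rather than "no events" comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative event bits; the real values come from the kernel headers. */
enum {
    DEMO_POLLIN     = 0x001,
    DEMO_POLLOUT    = 0x004,
    DEMO_POLLRDNORM = 0x040,
    DEMO_POLLWRNORM = 0x100
};

#define DEMO_DEFAULT_POLLMASK \
    (DEMO_POLLIN | DEMO_POLLOUT | DEMO_POLLRDNORM | DEMO_POLLWRNORM)

/* Hypothetical operations table with an optional poll handler. */
struct demo_file_ops {
    unsigned int (*poll)(void *ctx);
};

/* Files without a poll handler are always reported ready. */
unsigned int demo_do_poll(const struct demo_file_ops *ops, void *ctx)
{
    assert(ops != NULL);
    if (ops->poll == NULL)
        return DEMO_DEFAULT_POLLMASK;
    return ops->poll(ctx);
}

/* Poll an entry that has no handler installed. */
unsigned int demo_poll_without_handler(void)
{
    struct demo_file_ops ops = { NULL };

    return demo_do_poll(&ops, NULL);
}
```

Returning 0 here instead would make a select() caller wait forever on a file that can in fact be read, which matches the hang described in the report.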
     
  • dput must be called before mntput here.

    Signed-off-by: Andreas Gruenbacher
    Acked-By: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Gruenbacher
     
  • If we fail to start a transaction when releasing a dquot, we have to call
    dquot_release() anyway to mark the dquot structure as inactive. Otherwise
    we end up in an infinite loop inside dqput().

    Signed-off-by: Jan Kara
    Cc: xb
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We were setting i_blocks too early - before truncating any allocation.
    Correct things to set i_blocks after the allocation change.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • In ocfs2_alloc_write_ctxt, the number of clusters to write is calculated
    from the byte length only. This causes problems if a write starts near
    the end of one cluster and extends into a second cluster while "len" is
    smaller than the cluster size. In that case we actually have to write
    2 clusters, so we have to take the start position into consideration
    as well.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    tao.ma@oracle.com
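
The arithmetic behind this fix, as a small sketch. The helper names are hypothetical and cluster sizes are assumed to be powers of two expressed as a shift count.

```c
#include <assert.h>
#include <stdint.h>

/* Length-only estimate: ignores where inside a cluster the write starts. */
uint32_t demo_clusters_by_len(uint64_t len, unsigned bits)
{
    return (uint32_t)((len + ((uint64_t)1 << bits) - 1) >> bits);
}

/* Count every cluster the byte range [pos, pos + len) actually touches. */
uint32_t demo_clusters_touched(uint64_t pos, uint64_t len, unsigned bits)
{
    assert(len > 0);
    uint64_t first = pos >> bits;
    uint64_t last = (pos + len - 1) >> bits;

    return (uint32_t)(last - first + 1);
}
```

With 4096-byte clusters (bits = 12), a 200-byte write at offset 4000 counts as 1 cluster by length alone but touches 2 clusters.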
     
  • For some mount option types, ocfs2_parse_options() will try to access
    sb->s_fs_info to get at the ocfs2 private superblock. Unfortunately, that
    hasn't been allocated yet and will cause a kernel crash.

    Fix this by storing options in a struct which can then get pushed into the
    ocfs2_super once it's been allocated later. If we need more options which
    store to the ocfs2_super in the future, we can just add fields to this struct.

    Signed-off-by: Tiger Yang
    Signed-off-by: Mark Fasheh

    Tiger Yang
     
  • Update documentation listing ocfs2 features to reflect the current state of
    the file system. Add missing descriptions for some mount options which ocfs2
    supports.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

11 Sep, 2007

2 commits

  • fsid_source decides where to get the 'fsid' number to
    return for a GETATTR based on the type of filehandle.
    It can be from the device, from the fsid, or from the
    UUID.

    It is possible for the filehandle to be inconsistent
    with the export information, so make sure the export information
    actually has the info implied by the value returned by
    fsid_source.

    Signed-off-by: Neil Brown
    Cc: "Luiz Fernando N. Capitulino"
    Signed-off-by: "J. Bruce Fields"
    Signed-off-by: Linus Torvalds

    Neil Brown
     
  • Recent changes in NFSd cause a directory which is mounted-on
    to not appear properly when the filesystem containing it is exported.

    *exp_get* now returns -ENOENT rather than NULL and when
    commit 5d3dbbeaf56d0365ac6b5c0a0da0bd31cc4781e1
    removed the NULL checks, it didn't add a check for -ENOENT.

    Signed-off-by: Neil Brown
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Linus Torvalds

    Neil Brown
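
The pitfall described here is that an error-encoded pointer is not NULL, so a leftover or missing NULL check no longer catches "not found". Below is a userspace imitation of that convention; the demo_* helpers are stand-ins written for this sketch and are not the kernel's macros or the nfsd lookup functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *demo_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Recover the errno value from an error-encoded pointer. */
static inline long demo_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Error pointers occupy the top DEMO_MAX_ERRNO addresses. */
static inline int demo_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_export {
    int id;
};

/* Hypothetical lookup: returns an error pointer, never NULL, when absent. */
struct demo_export *demo_exp_get(int key)
{
    static struct demo_export known = { 1 };

    if (key == 1)
        return &known;
    return demo_err_ptr(-ENOENT);
}

/* Returns 1 if found, 0 if simply absent, or a negative errno otherwise. */
int demo_lookup(int key)
{
    struct demo_export *exp = demo_exp_get(key);

    if (demo_is_err(exp)) {
        long err = demo_ptr_err(exp);

        return err == -ENOENT ? 0 : (int)err;
    }
    assert(exp != NULL);
    return 1;
}
```

A caller that only tested for NULL would treat the -ENOENT result as a valid export and dereference it.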