11 Oct, 2007

7 commits

  • The problem: proc_net files remember which network namespace they are
    against but do not hold a reference count (as that would pin
    the network namespace). So we currently have a small window where
    the reference count on a network namespace may be incremented when opening
    a /proc file when it has already gone to zero.

    To fix this introduce maybe_get_net and get_proc_net.

    maybe_get_net increments the network namespace reference count only if it is
    greater than zero, ensuring we don't increment a reference count after it
    has gone to zero.

    get_proc_net handles all of the magic to go from a proc inode to the network
    namespace instance and call maybe_get_net on it.

    PROC_NET, the old accessor, is removed so that we don't get confused and use
    the wrong helper function.

    Then I fix up the callers to use get_proc_net and handle the case
    where get_proc_net returns NULL. In that case I return -ENXIO because
    effectively the network namespace has already gone away so the files
    we are trying to access don't exist anymore.

    Signed-off-by: Eric W. Biederman
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Eric W. Biederman
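
A minimal userspace sketch of the "increment only if greater than zero" idea behind maybe_get_net, written with C11 atomics. The struct and function names below are hypothetical stand-ins and are not the kernel's definitions; the kernel has its own atomic helpers for this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct net; only the counter matters here. */
struct net_sketch {
    atomic_int count;
};

/* Take a reference only if the count is still above zero.
 * Returns the namespace on success, NULL if the count already reached zero. */
struct net_sketch *maybe_get_net_sketch(struct net_sketch *net)
{
    int old;

    assert(net != NULL);
    old = atomic_load(&net->count);
    while (old > 0) {
        /* On failure, old is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&net->count, &old, old + 1))
            return net;
    }
    return NULL;
}
```

A caller in the position of get_proc_net would treat a NULL result as "the namespace is gone" and return an error such as -ENXIO, as the commit message describes.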
     
  • Add the appropriate EXPORT_SYMBOLS for proc_net_create,
    proc_net_fops_create and proc_net_remove to fix errors when
    compiling allmodconfig

    Signed-off-by: Mark Nelson
    Acked-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • My bad.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch makes most of the generic device layer network
    namespace safe. This patch makes dev_base_head a
    network namespace variable, and then it picks up
    a few associated variables. The functions:
    dev_getbyhwaddr
    dev_getfirsthwbytype
    dev_get_by_flags
    dev_get_by_name
    __dev_get_by_name
    dev_get_by_index
    __dev_get_by_index
    dev_ioctl
    dev_ethtool
    dev_load
    wireless_process_ioctl

    were modified to take a network namespace argument, and
    deal with it.

    vlan_ioctl_set and brioctl_set were modified so their
    hooks will receive a network namespace argument.

    So basically anything in the core of the network stack that was
    affected by the change of dev_base was modified to handle
    multiple network namespaces. The rest of the network stack was
    simply modified to explicitly use &init_net, the initial network
    namespace. This can be fixed when those components of the network
    stack are modified to handle multiple network namespaces.

    For now the ifindex generator is left global.

    Fundamentally ifindex numbers are per namespace, or else
    we will have corner case problems with migration when
    we get that far.

    At the same time there are assumptions in the network stack
    that the ifindex of a network device won't change. Making
    the ifindex number global seems a good compromise until
    the network stack can cope with ifindex changes when
    you change namespaces, and the like.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
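
To illustrate what "take a network namespace argument" means for a lookup such as dev_get_by_name, here is a simplified userspace sketch in which the device list head lives inside the namespace object. All type and function names are hypothetical; the kernel's real structures, locking, and reference counting are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified device record. */
struct netdev_sketch {
    const char *name;
    int ifindex;                    /* still allocated globally in this model */
    struct netdev_sketch *next;
};

/* Hypothetical namespace object: the list head is per namespace. */
struct netns_sketch {
    struct netdev_sketch *dev_base_head;
};

/* Look a device up by name, but only within the given namespace. */
struct netdev_sketch *dev_get_by_name_sketch(struct netns_sketch *net,
                                             const char *name)
{
    assert(net != NULL && name != NULL);
    for (struct netdev_sketch *d = net->dev_base_head; d != NULL; d = d->next) {
        if (strcmp(d->name, name) == 0)
            return d;
    }
    return NULL;
}
```

Two namespaces can then each hold a device called "eth0" without the lookups interfering, while callers that have not been converted simply pass the initial namespace.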
     
  • Each netlink socket will live in exactly one network namespace;
    this includes the controlling kernel sockets.

    This patch updates all of the existing netlink protocols
    to only support the initial network namespace. Requests
    by clients in other namespaces will get -ECONNREFUSED,
    as they would if the kernel did not have the support for
    that netlink protocol compiled in.

    As each netlink protocol is updated to be multiple network
    namespace safe it can register multiple kernel sockets
    to acquire a presence in the rest of the network namespaces.

    The implementation in af_netlink is a simple filter implementation
    at hash table insertion and hash table look up time.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
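
The "filter at insertion and lookup time" approach can be sketched as follows. This is a toy model: a small linear table stands in for af_netlink's hash table, and every name here is hypothetical. The point is only that the namespace is part of the match key, so a lookup from another namespace finds nothing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NL_TABLE_SIZE 8

/* Hypothetical socket record; the namespace is compared by identity. */
struct nl_sock_sketch {
    const void *net;
    unsigned int pid;
    int in_use;
};

struct nl_table_sketch {
    struct nl_sock_sketch slots[NL_TABLE_SIZE];
};

/* Insert a socket; a duplicate only counts if namespace AND pid match. */
int nl_insert_sketch(struct nl_table_sketch *t, const void *net, unsigned int pid)
{
    struct nl_sock_sketch *free_slot = NULL;

    assert(t != NULL);
    for (size_t i = 0; i < NL_TABLE_SIZE; i++) {
        struct nl_sock_sketch *s = &t->slots[i];

        if (!s->in_use) {
            if (free_slot == NULL)
                free_slot = s;
        } else if (s->net == net && s->pid == pid) {
            return -EADDRINUSE;
        }
    }
    if (free_slot == NULL)
        return -ENOMEM;
    free_slot->net = net;
    free_slot->pid = pid;
    free_slot->in_use = 1;
    return 0;
}

/* Look a socket up; sockets in other namespaces are invisible. */
int nl_lookup_sketch(const struct nl_table_sketch *t, const void *net, unsigned int pid)
{
    assert(t != NULL);
    for (size_t i = 0; i < NL_TABLE_SIZE; i++) {
        const struct nl_sock_sketch *s = &t->slots[i];

        if (s->in_use && s->net == net && s->pid == pid)
            return (int)i;
    }
    return -ECONNREFUSED;
}
```

With only the initial namespace registered, a lookup from any other namespace fails the same way it would if the protocol were not compiled in.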
     
  • This patch makes /proc/net per network namespace. It modifies the global
    variables proc_net and proc_net_stat to be per network namespace.
    The proc_net file helpers are modified to take a network namespace argument,
    and all of their callers are fixed to pass &init_net for that argument.
    This ensures that all of the /proc/net files are only visible and
    usable in the initial network namespace until the code behind them
    has been updated to handle multiple network namespaces.

    Making /proc/net per namespace is necessary as at least some files
    in /proc/net depend upon the set of network devices which is per
    network namespace, and even more files in /proc/net have contents
    that are relevant to a single network namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The current implementation of dev_ifname makes maintenance difficult
    because updates to the implementation of the ioctl have to be made in two
    places. So this patch updates dev_ifname32 to do a classic 32/64
    structure conversion and call sys_ioctl like the rest of the
    compat calls do.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
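
The "classic 32/64 structure conversion" pattern mentioned above can be sketched like this: the compat entry point copies the 32-bit layout into the native one, calls the single native implementation, and copies the result back. The struct layouts and function names are hypothetical simplifications, and the index-to-name table is reduced to one hard-coded entry.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define IFNAMSIZ_SKETCH 16

/* Layout a 32-bit caller would pass in (hypothetical, simplified). */
struct ifreq32_sketch {
    char name[IFNAMSIZ_SKETCH];
    int32_t ifindex;
};

/* Native layout used by the single real implementation (hypothetical). */
struct ifreq_native_sketch {
    char name[IFNAMSIZ_SKETCH];
    long ifindex;
};

/* The one place that knows how to map an index to a name. */
int native_ifname_sketch(struct ifreq_native_sketch *req)
{
    assert(req != NULL);
    if (req->ifindex != 1)
        return -ENODEV;
    strcpy(req->name, "lo");
    return 0;
}

/* Compat entry point: convert in, call the native handler, convert out. */
int compat_ifname_sketch(struct ifreq32_sketch *req32)
{
    struct ifreq_native_sketch req;
    int err;

    assert(req32 != NULL);
    memset(&req, 0, sizeof(req));
    req.ifindex = req32->ifindex;

    err = native_ifname_sketch(&req);
    if (err)
        return err;

    memcpy(req32->name, req.name, sizeof(req32->name));
    return 0;
}
```

Because the lookup logic exists only in the native handler, a later change to it does not need to be repeated in the compat path.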
     

10 Oct, 2007

1 commit

  • The recent fix for a circular lock dependency unfortunately introduced a
    potential memory leak in the event where the call to nlmsvc_lookup_host
    fails for some reason.

    Thanks to Roel Kluin for spotting this.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     

09 Oct, 2007

1 commit

  • When the IOCB_FLAG_RESFD flag is set and iocb->aio_resfd is incorrect,
    the statement 'goto out_put_req' is executed. At label 'out_put_req',
    aio_put_req(..) is called, which requires 'req->ki_filp' to be set.

    Signed-off-by: Yan Zheng
    Cc: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yan Zheng
     

04 Oct, 2007

2 commits


03 Oct, 2007

2 commits


02 Oct, 2007

1 commit

  • Nick Piggin points out that splice isn't being good about the mmap
    semaphore: while two readers can nest inside each other, it does leave
    a possible deadlock if a writer (i.e. a new mmap()) comes in during that
    nesting.

    Original "just move the locking" patch by Nick, replaced by one by me
    based on an optimistic pagefault_disable(). And then Jens tested and
    updated that patch.

    Reported-by: Nick Piggin
    Tested-by: Jens Axboe
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Oct, 2007

1 commit

  • This reverts commit b394e43e995d08821588a22561c6a71a63b4ff27.

    Lachlan McIlroy says:
    It tried to fix an issue where log replay is replaying an inode cluster
    initialisation transaction that should not be replayed because the inode
    cluster on disk is more up to date. Since we don't log file sizes (we
    rely on inode flushing to get them to disk) then we can't just replay
    all the transactions in the log and expect the inode to be completely
    restored. We lose file size updates. Unfortunately this fix is causing
    more (serious) problems than it is fixing.

    SGI-PV: 969656
    SGI-Modid: xfs-linux-melb:xfs-kern:29804a

    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Tim Shimmin

    Tim Shimmin
     

29 Sep, 2007

1 commit

  • It doesn't look as if the NFS file name limit is being initialised correctly
    in the struct nfs_server. Make sure that we limit whatever is being set in
    nfs_probe_fsinfo() and nfs_init_server().

    Also ensure that readdirplus and nfs4_path_walk respect our file name
    limits.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Trond Myklebust
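
One plausible shape for "limit whatever is being set" is a clamp applied when the server-reported value is stored, plus a length check wherever names arrive from the server. The sketch below assumes that shape; the function names and the choice of -ENAMETOOLONG are illustrative assumptions, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Store a server-reported name limit, falling back to the protocol maximum
 * when the server reports nothing or something too large. */
unsigned int clamp_namelen_sketch(unsigned int reported, unsigned int proto_max)
{
    assert(proto_max > 0);
    if (reported == 0 || reported > proto_max)
        return proto_max;
    return reported;
}

/* Reject a name that exceeds the stored limit. */
int check_name_sketch(const char *name, unsigned int namelen_limit)
{
    assert(name != NULL);
    if (strlen(name) > namelen_limit)
        return -ENAMETOOLONG;
    return 0;
}
```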
     

27 Sep, 2007

1 commit

  • The problem is that the garbage collector for the 'host' structures,
    nlm_gc_hosts(), holds nlm_host_mutex while calling down to
    nlmsvc_mark_resources, which eventually takes the file->f_mutex.

    We cannot therefore call nlmsvc_lookup_host() from within
    nlmsvc_create_block, since the caller will already hold file->f_mutex, so
    the attempt to grab nlm_host_mutex may deadlock.

    Fix the problem by calling nlmsvc_lookup_host() outside the file->f_mutex.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     

26 Sep, 2007

2 commits


25 Sep, 2007

1 commit

  • Different types of ufs hold state in different places. To hide this
    complexity there is ufs_get_fs_state, which returns the state according to
    "UFS_SB(sb)->s_flags". But during mount ufs_get_fs_state is called before
    s_flags is set, which causes the message "fs need fsck" for ufs types
    like sun ufs, and a remount in read-only state.

    Signed-off-by: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Evgeniy Dushistov
     

23 Sep, 2007

1 commit


22 Sep, 2007

1 commit


21 Sep, 2007

6 commits

  • Johannes just found that we are missing a compat-ioctl
    declaration. The fix is trivial. As with previous patches for compat-ioctl,
    this should also go to stable.

    More info:
    http://marc.info/?l=linux-wireless&m=119029667902588&w=2

    Signed-off-by: Jean Tourrilhes
    Signed-off-by: John W. Linville

    Jean Tourrilhes
     
  • The ocfs2_vote_msg and ocfs2_response_msg structs needed to be
    packed to ensure similar sizeofs in 32-bit and 64-bit arches. Without this,
    we had inadvertently broken 32/64 bit cross mounts.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
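
Why packing matters for 32/64-bit interoperability can be shown with a small sketch. The field names below are hypothetical and do not reflect the real ocfs2_vote_msg or ocfs2_response_msg layouts; `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message with natural alignment. On x86-64 the compiler
 * typically pads this to 16 bytes, while on i386 it is typically 12. */
struct msg_natural_sketch {
    uint64_t cookie;
    uint32_t node;
};

/* The same fields, packed: no tail padding, so the size is ABI independent. */
struct msg_packed_sketch {
    uint64_t cookie;
    uint32_t node;
} __attribute__((packed));

/* Fail the build if the wire size ever drifts. */
static_assert(sizeof(struct msg_packed_sketch) == 12,
              "wire size must not depend on the ABI");

size_t msg_wire_size_sketch(void)
{
    return sizeof(struct msg_packed_sketch);
}
```

Two nodes that disagree on `sizeof` for a network message will misparse each other's traffic, which is the kind of cross-mount breakage the commit describes.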
     
  • The target page offsets were being incorrectly set a second time in
    ocfs2_prepare_page_for_write(), which was causing problems on a 16k page
    size kernel. Additionally, ocfs2_write_failure() was incorrectly using those
    parameters instead of the parameters for the individual page being cleaned
    up.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • This was broken for file systems whose cluster size is greater than page
    size. Pos needs to be incremented as we loop through the descriptors, and
    len needs to be capped to the size of a single cluster.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
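
The two requirements in this message, advancing pos on every iteration and capping len at one cluster, amount to splitting a byte range at cluster boundaries. A minimal sketch, with hypothetical names and no relation to the actual ocfs2 descriptor structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One piece of a write, confined to a single cluster (hypothetical type). */
struct chunk_sketch {
    uint64_t pos;
    uint64_t len;
};

/* Split [pos, pos + len) into pieces that never cross a cluster boundary.
 * Returns the number of pieces written to out (at most max). */
size_t split_by_cluster_sketch(uint64_t pos, uint64_t len, uint64_t cluster_size,
                               struct chunk_sketch *out, size_t max)
{
    size_t n = 0;

    assert(cluster_size > 0 && out != NULL);
    while (len > 0 && n < max) {
        /* Cap this piece at the end of the current cluster. */
        uint64_t this_len = cluster_size - (pos % cluster_size);

        if (this_len > len)
            this_len = len;
        out[n].pos = pos;
        out[n].len = this_len;
        n++;

        /* Advance to the next cluster for the following iteration. */
        pos += this_len;
        len -= this_len;
    }
    return n;
}
```

For example, a 10000-byte write starting at offset 1000 with 4096-byte clusters splits into three pieces of 3096, 4096, and 2808 bytes.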
     
  • The ocfs2 write code loops through a page much like the block code, except
    that ocfs2 allocation units can be any size, including larger than page
    size. Typically it's equal to or larger than page size - most kernels run 4k
    pages, the minimum ocfs2 allocation (cluster) size.

    Some changes introduced during 2.6.23 changed the way writes to pages are
    handled, and inadvertently broke support for > 4k page size. Instead of just
    writing one cluster at a time, we now handle the whole page in one pass.

    This means that multiple (small) separate allocations might happen in the
    same pass. The allocation code however typically optimizes by getting the
    maximum which was reserved. This triggered a BUG_ON in the extend code where
    it'd ask for a single bit (for one part of a > 4k page) and get back more
    than it asked for.

    Fix this by providing a variant of the high level allocation function which
    allows the caller to specify a maximum. The traditional function remains and
    just calls the new one with a maximum determined from the initial
    reservation.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
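
The fix described above, a variant that accepts a caller-specified maximum with the traditional function layered on top of it, can be sketched as follows. The reservation model and all names are hypothetical simplifications of the real allocator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reservation: just a count of bits still available. */
struct reservation_sketch {
    unsigned int bits_left;
};

/* New variant: grant at least min_bits and never more than max_bits.
 * Returns the number of bits granted, or 0 if min_bits cannot be met. */
unsigned int claim_bits_max_sketch(struct reservation_sketch *r,
                                   unsigned int min_bits, unsigned int max_bits)
{
    unsigned int granted;

    assert(r != NULL && min_bits <= max_bits);
    if (min_bits == 0 || r->bits_left < min_bits)
        return 0;
    granted = r->bits_left < max_bits ? r->bits_left : max_bits;
    r->bits_left -= granted;
    return granted;
}

/* Traditional behaviour: take as much as the reservation still allows. */
unsigned int claim_bits_sketch(struct reservation_sketch *r, unsigned int min_bits)
{
    assert(r != NULL);
    if (r->bits_left < min_bits)
        return 0;
    return claim_bits_max_sketch(r, min_bits, r->bits_left);
}
```

A caller that needs exactly one bit for part of a large page passes a maximum of one and can no longer be handed more than it asked for.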
     
  • This simplifies the signalfd code by no longer keeping it attached to the
    sighand during its lifetime.

    In this way, the signalfd remains attached to the sighand only during
    poll(2) (and select and epoll) and read(2). This also allows removing
    all the custom "tsk == current" checks in kernel/signal.c, since
    dequeue_signal() will only be called by "current".

    I think this is also what Ben was suggesting some time ago.

    The external effect of this is that a thread can extract only its own
    private signals and the group ones. I think this is acceptable
    behaviour, in that those are the signals the thread would be able to
    fetch without signalfd.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

20 Sep, 2007

6 commits

  • The new xlog_recover_do_reg_buffer checks call be16_to_cpu on di_gen which
    is a 32bit value so sparse rightly complains. Fortunately the warning is
    harmless because we don't care for the value, but only whether it's
    non-NULL. Due to that fact we can simply kill the endian swaps on this and
    the previous di_mode check entirely.

    SGI-PV: 969656
    SGI-Modid: xfs-linux-melb:xfs-kern:29709a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Tim Shimmin

    Christoph Hellwig
     
  • xfs_filestream_mount() sets up an mru cache with:

        err = xfs_mru_cache_create(&mp->m_filestream, lifetime, grp_count,
                (xfs_mru_cache_free_func_t)xfs_fstrm_free_func);

    but that cast is causing problems...

        typedef void (*xfs_mru_cache_free_func_t)(unsigned long, void*);

    but:

        void xfs_fstrm_free_func( xfs_ino_t ino, fstrm_item_t *item)

    so on a 32-bit box, it's casting (32, 32) args into (64, 32) and I assume
    it's getting garbage for *item, which subsequently causes an explosion.
    With this change the filestreams xfsqa tests don't oops on my 32-bit box.

    SGI-PV: 967795
    SGI-Modid: xfs-linux-melb:xfs-kern:29510a

    Signed-off-by: Eric Sandeen
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Eric Sandeen
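
One way to avoid this class of bug is to declare the callback with exactly the signature the cache expects and convert the arguments inside it, so that no function-pointer cast is needed. The sketch below shows that pattern with hypothetical names; it is not the text of the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Callback type the cache expects, matching the typedef quoted above. */
typedef void (*mru_free_func_sketch_t)(unsigned long key, void *data);

/* Hypothetical cached item. */
struct fstrm_item_sketch {
    uint64_t ino;
    int freed;
};

/* Declared with exactly the callback's signature, so no cast is needed. */
void fstrm_free_sketch(unsigned long key, void *data)
{
    struct fstrm_item_sketch *item = data;

    assert(item != NULL);
    item->ino = key;    /* record which key was released */
    item->freed = 1;
}

/* Stand-in for the cache invoking its free callback. */
void mru_evict_sketch(mru_free_func_sketch_t fn, unsigned long key, void *data)
{
    fn(key, data);
}
```

With matching signatures the compiler checks the call, whereas a cast silences the mismatch and leaves the argument layout to chance.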
     
  • * 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
    [XFS] Avoid replaying inode buffer initialisation log items if on-disk version is newer.
    [XFS] Ensure file size updates have been completed before writing inode to disk.
    [XFS] On-demand reaping of the MRU cache

    Linus Torvalds
     
  • The do_split() function for htree dir blocks is intended to split a leaf
    block to make room for a new entry. It sorts the entries in the original
    block by hash value, then moves the last half of the entries to the new
    block - without accounting for how much space this actually moves. (IOW,
    it moves half of the entry *count* not half of the entry *space*). If by
    chance we have both large & small entries, and we move only the smallest
    entries, and we have a large new entry to insert, we may not have created
    enough space for it.

    The patch below stores each record size when calculating the dx_map, and
    then walks the hash-sorted dx_map, calculating how many entries must be
    moved to more evenly split the existing entries between the old block and
    the new block, guaranteeing enough space for the new entry.

    The dx_map "offs" member is reduced to u16 so that the overall map size
    does not change - it is temporarily stored at the end of the new block, and
    if it grows too large it may be overwritten. By making offs and size both
    u16, we won't grow the map size.

    Also add a few comments to the functions involved.

    This fixes the testcase reported by hooanon05@yahoo.co.jp on the
    linux-ext4 list, "ext3 dir_index causes an error"

    Thanks to Andreas Dilger for discussing the problem & solution with me.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Andreas Dilger
    Tested-by: Junjiro Okajima
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
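
The difference between splitting by entry count and splitting by entry space can be sketched with a size-aware walk over the hash-sorted map. The loop below walks from the high-hash end and keeps moving entries until roughly half a block of space has been moved; that stopping rule is an assumption made for illustration, and the struct is a simplified stand-in for dx_map with 16-bit offs and size fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified map entry: offs and size are both 16 bits, so the entry
 * stays at 8 bytes in total. */
struct dx_map_entry_sketch {
    uint32_t hash;
    uint16_t offs;
    uint16_t size;
};

/* Given entries sorted by hash, decide how many to move from the tail
 * so that about half a block's worth of space ends up in the new block. */
unsigned int count_entries_to_move_sketch(const struct dx_map_entry_sketch *map,
                                          unsigned int count, unsigned int blocksize)
{
    unsigned int moved_bytes = 0;
    unsigned int move = 0;

    assert(map != NULL || count == 0);
    for (unsigned int i = count; i-- > 0; ) {
        /* Stop once taking this entry would overshoot half the block. */
        if (moved_bytes + map[i].size / 2 > blocksize / 2)
            break;
        moved_bytes += map[i].size;
        move++;
    }
    return move;
}
```

With two 400-byte entries followed by four 16-byte entries in a 1024-byte block, a count-based split would move three small entries and free only 48 bytes, while this walk moves five entries.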
     
  • NFS unregisters sysctls only if V4 support is compiled in. However, the
    sysctl table is not V4 specific, so unregister it always.

    Steps to reproduce:

    [build nfs.ko with CONFIG_NFS_V4=n]
    modprobe nfs
    rmmod nfs
    ls /proc/sys

    Unable to handle kernel paging request at ffffffff880661c0 RIP:
    [] proc_sys_readdir+0xd3/0x350
    PGD 203067 PUD 207063 PMD 7e216067 PTE 0
    Oops: 0000 [1] SMP
    CPU 1
    Modules linked in: lockd nfs_acl sunrpc
    Pid: 3335, comm: ls Not tainted 2.6.23-rc3-bloat #2
    RIP: 0010:[] [] proc_sys_readdir+0xd3/0x350
    RSP: 0018:ffff81007fd93e78 EFLAGS: 00010286
    RAX: ffffffff880661c0 RBX: ffffffff80466370 RCX: ffffffff880661c0
    RDX: 00000000000014c0 RSI: ffff81007f3ad020 RDI: ffff81007efd8b40
    RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: ffffffff802a8570 R12: ffffffff880661c0
    R13: ffff81007e219640 R14: ffff81007efd8b40 R15: ffff81007ded7280
    FS: 00002ba25ef03060(0000) GS:ffff81007ff81258(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffffffff880661c0 CR3: 000000007dfaf000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process ls (pid: 3335, threadinfo ffff81007fd92000, task ffff81007d8a0000)
    Stack: ffff81007f3ad150 ffffffff80283f30 ffff81007fd93f48 ffff81007efd8b40
    ffff81007ee00440 0000000422222222 0000000200035593 ffffffff88037e9a
    2222222222222222 ffffffff80466500 ffff81007e416400 ffff81007e219640
    Call Trace:
    [] filldir+0x0/0xf0
    [] filldir+0x0/0xf0
    [] vfs_readdir+0xa7/0xc0
    [] sys_getdents+0x96/0xe0
    [] system_call+0x7e/0x83

    Code: 41 8b 14 24 85 d2 74 dc 49 8b 44 24 08 48 85 c0 74 e7 49 3b
    RIP [] proc_sys_readdir+0xd3/0x350
    RSP
    CR2: ffffffff880661c0
    Kernel panic - not syncing: Fatal exception

    Signed-off-by: Alexey Dobriyan
    Acked-by: Trond Myklebust
    Cc: "J. Bruce Fields"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Convert asserts (BUGs) in dx_probe triggered by bad on-disk data into
    recoverable errors with helpful warnings. With help catching other asserts
    from Duane Griffin.

    Signed-off-by: Eric Sandeen
    Acked-by: Duane Griffin
    Acked-by: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
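
The general pattern of turning an assert on untrusted on-disk data into a recoverable error looks roughly like the sketch below. The fields checked, the warning texts, and the use of -EINVAL are illustrative assumptions rather than the actual dx_probe code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified view of the index root as read from disk. */
struct dx_root_sketch {
    uint8_t indirect_levels;
    uint16_t limit;
    uint16_t count;
};

/* Validate on-disk values; warn and return an error instead of crashing. */
int dx_check_root_sketch(const struct dx_root_sketch *root, uint16_t expected_limit)
{
    assert(root != NULL);   /* a NULL pointer is a caller bug, not bad disk data */

    if (root->indirect_levels > 1) {
        fprintf(stderr, "warning: unsupported hash depth %u\n",
                (unsigned int)root->indirect_levels);
        return -EINVAL;
    }
    if (root->limit != expected_limit) {
        fprintf(stderr, "warning: index limit %u, expected %u\n",
                (unsigned int)root->limit, (unsigned int)expected_limit);
        return -EINVAL;
    }
    if (root->count == 0 || root->count > root->limit) {
        fprintf(stderr, "warning: index count %u out of range\n",
                (unsigned int)root->count);
        return -EINVAL;
    }
    return 0;
}
```

The remaining assert guards only against programming errors; anything that a corrupted disk could produce goes through the error path.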
     

18 Sep, 2007

2 commits


17 Sep, 2007

1 commit

  • Instead of running the mru cache reaper all the time based on a timeout,
    we should only run it when the cache has active objects. This allows CPUs
    to sleep when there is no activity rather than be woken repeatedly just to
    check if there is anything to do.

    SGI-PV: 968554
    SGI-Modid: xfs-linux-melb:xfs-kern:29305a

    Signed-off-by: David Chinner
    Signed-off-by: Donald Douwsma
    Signed-off-by: Tim Shimmin

    David Chinner
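
On-demand reaping can be modelled as a small state machine: the reaper is armed when the first object is inserted and re-arms itself only while objects remain. The sketch below uses a counter in place of a real timer or workqueue, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache state; arm_count records how often the reaper
 * would have been scheduled. */
struct mru_cache_sketch {
    unsigned int active;
    int reaper_armed;
    unsigned int arm_count;
};

/* Inserting an object arms the reaper only if it is not already pending. */
void mru_insert_sketch(struct mru_cache_sketch *c)
{
    assert(c != NULL);
    c->active++;
    if (!c->reaper_armed) {
        c->reaper_armed = 1;
        c->arm_count++;
    }
}

/* One reaper run: expire some objects, then re-arm only if any remain. */
void mru_reap_sketch(struct mru_cache_sketch *c, unsigned int expired)
{
    assert(c != NULL);
    if (expired > c->active)
        expired = c->active;
    c->active -= expired;

    c->reaper_armed = 0;
    if (c->active > 0) {
        c->reaper_armed = 1;
        c->arm_count++;
    }
}
```

Once the cache is empty nothing is scheduled, so an idle CPU is not woken just to find there is no work.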
     

16 Sep, 2007

1 commit


15 Sep, 2007

1 commit


12 Sep, 2007

1 commit