13 Aug, 2012

1 commit

  • Here's a quote of the comment about the BUG macro from asm-generic/bug.h:

    Don't use BUG() or BUG_ON() unless there's really no way out; one
    example might be detecting data structure corruption in the middle
    of an operation that can't be backed out of. If the (sub)system
    can somehow continue operating, perhaps with reduced functionality,
    it's probably not BUG-worthy.

    If you're tempted to BUG(), think again: is completely giving up
    really the *only* solution? There are usually better options, where
    users don't need to reboot ASAP and can mostly shut down cleanly.

    In our case, the status flag of a ring buffer slot is managed from both sides,
    kernel space and user space. This means that even though the kernel side
    might work as expected, user space can screw up and change this flag in the
    window between send(2) being triggered, when the flag is changed to
    TP_STATUS_SENDING, and the given skb being destructed some time later. This
    will then hit the BUG macro. As David suggested, the best solution is to simply
    remove this statement, since it cannot be used for kernel-side internal
    consistency checks. I've tested it and the system still behaves /stable/ in
    this case, so in accordance with the above comment, we should rather remove it.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    danborkmann@iogearbox.net
     

11 Aug, 2012

3 commits


10 Aug, 2012

2 commits

  • commit 5d299f3d3c8a2fb (net: ipv6: fix TCP early demux) added a
    regression for ipv6_mapped case.

    [ 67.422369] SELinux: initialized (dev autofs, type autofs), uses
    genfs_contexts
    [ 67.449678] SELinux: initialized (dev autofs, type autofs), uses
    genfs_contexts
    [ 92.631060] BUG: unable to handle kernel NULL pointer dereference at
    (null)
    [ 92.631435] IP: [< (null)>] (null)
    [ 92.631645] PGD 0
    [ 92.631846] Oops: 0010 [#1] SMP
    [ 92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror
    dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp
    parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event
    snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm
    snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore
    snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd
    uhci_hcd
    [ 92.634294] CPU 0
    [ 92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
    [ 92.634294] RIP: 0010:[] [< (null)>]
    (null)
    [ 92.634294] RSP: 0018:ffff880245fc7cb0 EFLAGS: 00010282
    [ 92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX:
    0000000000000000
    [ 92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI:
    ffff88024827ad00
    [ 92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09:
    0000000000000000
    [ 92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12:
    ffff880254735380
    [ 92.634294] R13: ffff880254735380 R14: 0000000000000000 R15:
    7fffffffffff0218
    [ 92.634294] FS: 00007f4516ccd6f0(0000) GS:ffff880256600000(0000)
    knlGS:0000000000000000
    [ 92.634294] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4:
    00000000000007f0
    [ 92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
    0000000000000000
    [ 92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
    0000000000000400
    [ 92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000,
    task ffff880254b8cac0)
    [ 92.634294] Stack:
    [ 92.634294] ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8
    ffff880245fc7d68
    [ 92.634294] ffffffff81385083 00000000001d2680 ffff8802547353a8
    ffff880245fc7d18
    [ 92.634294] ffffffff8105903a ffff88024827ad60 0000000000000002
    00000000000000ff
    [ 92.634294] Call Trace:
    [ 92.634294] [] ? tcp_finish_connect+0x2c/0xfa
    [ 92.634294] [] tcp_rcv_state_process+0x2b6/0x9c6
    [ 92.634294] [] ? sched_clock_cpu+0xc3/0xd1
    [ 92.634294] [] ? local_clock+0x2b/0x3c
    [ 92.634294] [] tcp_v4_do_rcv+0x63a/0x670
    [ 92.634294] [] release_sock+0x128/0x1bd
    [ 92.634294] [] __inet_stream_connect+0x1b1/0x352
    [ 92.634294] [] ? lock_sock_nested+0x74/0x7f
    [ 92.634294] [] ? wake_up_bit+0x25/0x25
    [ 92.634294] [] ? lock_sock_nested+0x74/0x7f
    [ 92.634294] [] ? inet_stream_connect+0x22/0x4b
    [ 92.634294] [] inet_stream_connect+0x33/0x4b
    [ 92.634294] [] sys_connect+0x78/0x9e
    [ 92.634294] [] ? sysret_check+0x1b/0x56
    [ 92.634294] [] ? __audit_syscall_entry+0x195/0x1c8
    [ 92.634294] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [ 92.634294] [] system_call_fastpath+0x16/0x1b
    [ 92.634294] Code: Bad RIP value.
    [ 92.634294] RIP [< (null)>] (null)
    [ 92.634294] RSP
    [ 92.634294] CR2: 0000000000000000
    [ 92.648982] ---[ end trace 24e2bed94314c8d9 ]---
    [ 92.649146] Kernel panic - not syncing: Fatal exception in interrupt

    Fix this using inet_sk_rx_dst_set(), and export this function in case
    IPv6 is modular.

    Reported-by: Andrew Morton
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • commit be9f4a44e7d41cee (ipv4: tcp: remove per net tcp_sock) added a
    selinux regression, reported and bisected by John Stultz

    selinux_ip_postroute_compat() expects to find a valid sk->sk_security
    pointer, but this field is NULL for unicast_sock.

    It turns out that unicast_sock instances are really temporary objects that
    allow reusing part of the IP stack (ip_append_data()/ip_push_pending_frames()).

    The fact is that frames sent by ip_send_unicast_reply() should be orphaned
    so as not to fool the LSM.

    Note that IPv6 never had this problem, as tcp_v6_send_response() doesn't use a
    fake socket at all. I'll probably implement tcp_v4_send_response() to
    remove these unicast_sock in linux-3.7.

    Reported-by: John Stultz
    Bisected-by: John Stultz
    Signed-off-by: Eric Dumazet
    Cc: Paul Moore
    Cc: Eric Paris
    Cc: "Serge E. Hallyn"
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Aug, 2012

8 commits

  • We currently leak all tcp metrics at struct net dismantle time.

    tcp_net_metrics_exit() frees the hash table, we must first
    iterate it to free all metrics.
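
    The ordering described above can be sketched in plain userspace C. This
    is only an illustration under stated assumptions: struct table,
    struct entry, table_new(), table_add() and table_destroy() are
    hypothetical names and are not the kernel's tcp metrics code. The point
    is that every chained entry is freed before the bucket array itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical chained hash table, for illustration only. */
struct entry {
    struct entry *next;
    int key;
};

struct table {
    struct entry **buckets;
    size_t nbuckets;
};

static struct table *table_new(size_t nbuckets)
{
    struct table *t;

    assert(nbuckets > 0);
    t = malloc(sizeof(*t));
    if (t == NULL)
        return NULL;
    t->buckets = calloc(nbuckets, sizeof(*t->buckets));
    if (t->buckets == NULL) {
        free(t);
        return NULL;
    }
    t->nbuckets = nbuckets;
    return t;
}

/* Insert a key at the head of its bucket chain; 0 on success, -1 on failure. */
static int table_add(struct table *t, int key)
{
    struct entry *e = malloc(sizeof(*e));
    size_t i;

    if (e == NULL)
        return -1;
    i = (size_t)(unsigned int)key % t->nbuckets;
    e->key = key;
    e->next = t->buckets[i];
    t->buckets[i] = e;
    return 0;
}

/*
 * Free every entry first, then the bucket array, then the table.
 * Freeing only t->buckets would leak all chained entries.
 * Returns the number of entries that were freed.
 */
static size_t table_destroy(struct table *t)
{
    size_t i, freed = 0;

    for (i = 0; i < t->nbuckets; i++) {
        struct entry *e = t->buckets[i];

        while (e != NULL) {
            struct entry *next = e->next;

            free(e);
            freed++;
            e = next;
        }
    }
    free(t->buckets);
    free(t);
    return freed;
}
```

    Returning the count of freed entries is a choice made for this sketch so
    that a caller can check that nothing was left behind.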

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Do not leak memory by updating pointer with potentially NULL realloc return value.

    Found by Linux Driver Verification project (linuxtesting.org).
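
    The general pattern behind this kind of fix can be sketched in userspace
    C as follows. struct buf and buf_grow() are hypothetical names used only
    for illustration, not the driver's code: the result of realloc() goes
    into a temporary first, so the old block is still reachable if the call
    fails.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer, for illustration only. */
struct buf {
    char *data;
    size_t len;
};

/*
 * Grow b->data to new_len bytes and zero the newly added tail.
 * On failure the old buffer is left untouched and still owned by the
 * caller, so nothing leaks. Returns 0 on success, -1 on failure.
 */
static int buf_grow(struct buf *b, size_t new_len)
{
    char *tmp;

    assert(b != NULL && new_len > 0);

    /*
     * Buggy form: b->data = realloc(b->data, new_len);
     * if realloc() returns NULL, the only pointer to the old block is lost.
     */
    tmp = realloc(b->data, new_len);
    if (tmp == NULL)
        return -1;

    if (new_len > b->len)
        memset(tmp + b->len, 0, new_len - b->len);

    b->data = tmp;
    b->len = new_len;
    return 0;
}
```

    A caller that gets -1 back can still use or free b->data, which is
    exactly what the buggy form makes impossible.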

    Signed-off-by: Alexey Khoroshilov
    Signed-off-by: David S. Miller

    Alexey Khoroshilov
     
  • Memory is allocated for 'tt_change_node' with kmalloc().
    'tt_change_node' may go out of scope without really being used for anything
    (except having a few members initialized) if we hit the 'del:' label.
    This patch makes sure we free the memory in that case.

    Signed-off-by: Jesper Juhl
    Acked-by: Antonio Quartulli
    Signed-off-by: David S. Miller

    Jesper Juhl
     
  • [Resending again, as the text was corrupted by the email client]

    To speed up operations, QFQ internally divides classes into
    groups. Which group a class belongs to depends on the ratio between
    the maximum packet length and the weight of the class. Unfortunately
    the function qfq_change_class lacks the steps for changing the group
    of a class when the ratio max_pkt_len/weight of the class changes.

    For example, when the last of the following three commands is
    executed, the group of class 1:1 is not correctly changed:

    tc qdisc add dev XXX root handle 1: qfq
    tc class add dev XXX parent 1: qfq classid 1:1 weight 1
    tc class change dev XXX parent 1: classid 1:1 qfq weight 4

    Not changing the group of a class does not affect the long-term
    bandwidth guaranteed to the class, as the latter is independent of the
    maximum packet length, and correctly changes (only) if the weight of
    the class changes. In contrast, if the group of the class is not
    updated, the class is still guaranteed the short-term bandwidth and
    packet delay related to its old group, instead of the guarantees that
    it should receive according to its new weight and/or maximum packet
    length. This may also break service guarantees for other classes.
    This patch adds the missing operations.
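
    To make the dependency concrete, here is a deliberately simplified
    sketch. It assumes the group index is floor(log2(max_pkt_len / weight)),
    which is NOT the scheduler's real index computation, and every name in
    it (group_index_sketch(), struct qfq_class_sketch,
    change_class_sketch()) is hypothetical. It only shows that a weight
    change can move a class to another group, so a change operation has to
    recompute the group.

```c
#include <assert.h>

/*
 * Assumed, simplified mapping from the max_pkt_len/weight ratio to a
 * group index: floor(log2(ratio)). Illustration only.
 */
static unsigned int group_index_sketch(unsigned int max_pkt_len,
                                       unsigned int weight)
{
    unsigned int ratio, idx = 0;

    assert(weight > 0 && max_pkt_len >= weight);
    ratio = max_pkt_len / weight;
    while (ratio > 1) {
        ratio >>= 1;
        idx++;
    }
    return idx;
}

/* Hypothetical per-class state. */
struct qfq_class_sketch {
    unsigned int weight;
    unsigned int max_pkt_len;
    unsigned int group;
};

/*
 * Update weight and max_pkt_len, and also recompute the group.
 * Returns 1 if the class moved to a different group, 0 otherwise.
 */
static int change_class_sketch(struct qfq_class_sketch *cl,
                               unsigned int weight,
                               unsigned int max_pkt_len)
{
    unsigned int new_group = group_index_sketch(max_pkt_len, weight);
    int moved = (new_group != cl->group);

    cl->weight = weight;
    cl->max_pkt_len = max_pkt_len;
    cl->group = new_group; /* the kind of step the text says was missing */
    return moved;
}
```

    With an assumed max_pkt_len of 2048, going from weight 1 to weight 4
    divides the ratio by four, so the sketched index drops by two.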

    Signed-off-by: Paolo Valente
    Signed-off-by: David S. Miller

    Paolo Valente
     
  • While investigating network performance problems, I found this little
    gem:

    $ nm -v vmlinux | grep -1 dst_default_metrics
    ffffffff82736540 b busy.46605
    ffffffff82736560 B dst_default_metrics
    ffffffff82736598 b dst_busy_list

    Apparently, declaring a const array without an initializer puts it in the
    (writeable) bss section, in the middle of possibly often-dirtied cache
    lines.

    Since we really want dst_default_metrics to be const to avoid any possible
    false sharing and catch any buggy writes, I force a null initializer.

    ffffffff818a4c20 R dst_default_metrics

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • After IP route cache removal, I believe rcu_bh() has very little use and
    we should remove this RCU variant, since it adds some cycles in the fast
    path.

    Anyway, the call_rcu_bh() use in fib_trie is obviously wrong, since
    some users only assert rcu_read_lock().

    Signed-off-by: Eric Dumazet
    Cc: "Paul E. McKenney"
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Quiets the sparse warning:
    warning: Using plain integer as NULL pointer

    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Pull networking fixes from David Miller:

    1) Missed rcu_assign_pointer() in mac80211 scanning, from Johannes
    Berg.

    2) Allow devices to limit the number of segments that an individual
    TCP TSO packet can use at a time, to deal with device and/or driver
    specific limitations. From Ben Hutchings.

    3) Fix unexpected hard IPSEC expiration after setting the date. From
    Fan Du.

    4) Memory leak fix in bnx2x driver, from Jesper Juhl.

    5) Fix two memory leaks in libertas driver, from Daniel Drake.

    6) Fix deref of out-of-range array index in packet scheduler generic
    actions layer. From Hiroaki SHIMODA.

    7) Fix TX flow control errors in mlx4 driver, from Yevgeny Petrilin.

    8) Fix CRIS eth_v10.c driver build, from Randy Dunlap.

    9) Fix wrong SKB freeing in LLC protocol layer, from Sorin Dumitru.

    10) The IP output path checks neigh lookup errors incorrectly, it needs
    to use IS_ERR(). From Vasiliy Kulikov.

    11) An estimator leak leads to deref of freed memory in timer handler,
    fix from Hiroaki SHIMODA.

    12) TCP early demux in ipv6 needs to use DST cookies in order to
    validate the RX route properly. Fix from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
    net: ipv6: fix TCP early demux
    net: Use PTR_RET rather than if(IS_ERR(.. [1]
    net_sched: act: Delete estimator in error path.
    ip: fix error handling in ip_finish_output2()
    llc: free the right skb
    ixp4xx_eth: fix ptp_ixp46x build failure
    drivers/atm/iphase.c: fix error return code
    tcp_output: fix sparse warning for tcp_wfree
    drivers/net/phy/mdio-mux-gpio.c: drop devm_kfree of devm_kzalloc'd data
    batman-adv: select an internet gateway if none was chosen
    mISDN: Bugfix for layer2 fixed TEI mode
    igb: don't break user visible strings over multiple lines in igb_ethtool.c
    igb: correct hardware type (i210/i211) check in igb_loopback_test()
    igb: Fix for failure to init on some 82576 devices.
    cris: fix eth_v10.c build error
    cdc-ncm: tag Ericsson WWAN devices (eg F5521gw) with FLAG_WWAN
    isdnloop: fix and simplify isdnloop_init()
    hyperv: Move wait completion msg code into rndis_filter_halt_device()
    net/mlx4_core: Remove port type restrictions
    net/mlx4_en: Fixing TX queue stop/wake flow
    ...

    Linus Torvalds
     

07 Aug, 2012

7 commits

  • IPv6 needs a cookie in dst_check() call.

    We need to add rx_dst_cookie and provide a family independent
    sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Some action modules free struct tcf_common in their error path
    while the estimator is still active. This results in est_timer()
    dereferencing freed memory.
    Add gen_kill_estimator() in ipt, pedit and simple action.

    Signed-off-by: Hiroaki SHIMODA
    Signed-off-by: David S. Miller

    Hiroaki SHIMODA
     
  • __neigh_create() returns either a pointer to struct neighbour or an ERR_PTR() value.
    But the caller expects it to return either a pointer or NULL. Replace
    the NULL check with an IS_ERR() check.

    The bug was introduced in a263b3093641fb1ec377582c90986a7fd0625184
    ("ipv4: Make neigh lookups directly in output packet path.").
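
    The difference between the two checks can be shown with a userspace
    re-creation of the error-pointer helpers. The ERR_PTR(), PTR_ERR() and
    IS_ERR() definitions below are simplified stand-ins written for this
    sketch; struct neighbour is reduced to one field, and
    neigh_create_sketch() and output_sketch() are hypothetical names, not
    the kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace stand-ins for the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
    assert(error < 0 && error >= -MAX_ERRNO);
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    /* Error codes occupy the top MAX_ERRNO values of the address space. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Reduced stand-in type, for illustration only. */
struct neighbour {
    int id;
};

static struct neighbour table_entry = { 42 };

/* Hypothetical creator: returns a valid pointer or ERR_PTR(), never NULL. */
static struct neighbour *neigh_create_sketch(int fail)
{
    if (fail)
        return ERR_PTR(-ENOMEM);
    return &table_entry;
}

/* Hypothetical caller: returns the entry id, or a negative errno. */
static int output_sketch(int fail)
{
    struct neighbour *n = neigh_create_sketch(fail);

    /* A "!n" test would miss the failure: ERR_PTR(-ENOMEM) is not NULL. */
    if (IS_ERR(n))
        return (int)PTR_ERR(n);
    return n->id;
}
```

    With a NULL check in place of IS_ERR(), the failing case would fall
    through and dereference an encoded error value as if it were a pointer.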

    Signed-off-by: Vasily Kulikov
    Signed-off-by: David S. Miller

    Vasiliy Kulikov
     
  • We are freeing skb instead of nskb, resulting in a double
    free on skb and a leak from nskb.

    Signed-off-by: Sorin Dumitru
    Signed-off-by: David S. Miller

    Sorin Dumitru
     
  • Fix sparse warning:
    * symbol 'tcp_wfree' was not declared. Should it be static?

    Signed-off-by: Silviu-Mihai Popescu
    Signed-off-by: David S. Miller

    Silviu-Mihai Popescu
     
  • This is a regression introduced by: 2265c141086474bbae55a5bb3afa1ebb78ccaa7c
    ("batman-adv: gateway election code refactoring")

    Reported-by: Nicolás Echániz
    Signed-off-by: Marek Lindner
    Acked-by: Antonio Quartulli
    Signed-off-by: Antonio Quartulli
    Signed-off-by: David S. Miller

    Marek Lindner
     
  • libertas currently calls cfg80211_disconnected() when it is being
    brought down. This causes an event to be allocated, but since the
    wdev is already removed from the rdev by the time that the event
    processing work executes, the event is never processed or freed.
    http://article.gmane.org/gmane.linux.kernel.wireless.general/95666

    Fix this leak, and other possible situations, by processing the event
    queue when a device is being unregistered. Thanks to Johannes Berg for
    the suggestion.

    Signed-off-by: Daniel Drake
    Cc: stable@vger.kernel.org
    Reviewed-by: Johannes Berg
    Signed-off-by: John W. Linville

    Daniel Drake
     

04 Aug, 2012

1 commit


03 Aug, 2012

2 commits


02 Aug, 2012

9 commits

  • Restore the default state to the "beacon_found" flag when
    the channel flags are restored. Otherwise, we can end up
    with a channel that we can no longer transmit on even when
    we can see beacons on that channel.

    Signed-off-by: Paul Stewart
    Signed-off-by: Johannes Berg

    Paul Stewart
     
  • Currently the only way for wireless drivers to tell whether or not OFDM
    is allowed on the current channel is to check the regulatory
    information. However, this requires holding cfg80211_mutex, which is not
    visible to the drivers.

    Other regulatory restrictions are provided as flags in the channel
    definition, so let's do similarly with OFDM. This patch adds a new flag,
    IEEE80211_CHAN_NO_OFDM, to tell drivers that OFDM on a channel is not
    allowed. This flag is set on any channels for which regulatory indicates
    that OFDM is prohibited.

    Signed-off-by: Seth Forshee
    Tested-by: Arend van Spriel
    Signed-off-by: Johannes Berg

    Seth Forshee
     
  • Remove unused includes after IP cache removal

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • After an SA is set up, a timer is armed to detect soft/hard expiration;
    however, the timer handler uses xtime to do the math. As a result, after
    setting a new date with a big interval, hard expiration occurs before
    soft expiration, and the new child SA is deleted before a new one is
    rekeyed.

    Signed-off-by: Fan Du
    Signed-off-by: David S. Miller

    Fan Du
     
  • Cache the device gso_max_segs in sock::sk_gso_max_segs and use it to
    limit the size of TSO skbs. This avoids the need to fall back to
    software GSO for local TCP senders.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
     
  • A peer (or local user) may cause TCP to use a nominal MSS of as little
    as 88 (actual MSS of 76 with timestamps). Given that we have a
    sufficiently prodigious local sender and the peer ACKs quickly enough,
    it is nevertheless possible to grow the window for such a connection
    to the point that we will try to send just under 64K at once. This
    results in a single skb that expands to 861 segments.

    In some drivers with TSO support, such an skb will require hundreds of
    DMA descriptors; a substantial fraction of a TX ring or even more than
    a full ring. The TX queue selected for the skb may stall and trigger
    the TX watchdog repeatedly (since the problem skb will be retried
    after the TX reset). This particularly affects sfc, for which the
    issue is designated as CVE-2012-3412.

    Therefore:
    1. Add the field net_device::gso_max_segs holding the device-specific
    limit.
    2. In netif_skb_features(), if the number of segments is too high then
    mask out GSO features to force fall back to software GSO.
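
    The arithmetic and the proposed check can be sketched as follows. The
    feature bit FEATURE_GSO_SKETCH, the helper names and the simplified
    signature are assumptions made for illustration; the real
    netif_skb_features() works on an skb and the device feature set. The
    payload of 861 * 76 = 65436 bytes is likewise an assumed value that is
    consistent with the figures above; the exact send limit is not stated
    here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bit standing in for the GSO feature flags. */
#define FEATURE_GSO_SKETCH 0x1u

/* Number of MSS-sized segments needed to carry payload bytes. */
static unsigned int tso_segments(unsigned int payload, unsigned int mss)
{
    assert(mss > 0);
    return (payload + mss - 1) / mss;
}

/*
 * If the skb would need more segments than the device allows, mask out
 * the GSO feature so that the stack falls back to software GSO.
 */
static uint32_t features_for_skb(uint32_t features, unsigned int segs,
                                 unsigned int gso_max_segs)
{
    if (segs > gso_max_segs)
        features &= ~FEATURE_GSO_SKETCH;
    return features;
}
```

    For example, with an actual MSS of 76 a 65436-byte payload needs 861
    segments, so a device whose limit is lower than that (100 is used as a
    made-up limit in the checks for this sketch) would have the GSO bit
    cleared for this skb.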

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
     
  • The mesh path timer needs to be canceled when
    leaving the mesh as otherwise it could fire
    after the interface has been removed already.

    Cc: stable@vger.kernel.org
    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • There's a corner case that can happen when we
    suspend with a timer running, then resume and
    disconnect. If we connect again, suspend and
    resume we might start timers that shouldn't be
    running. Reset the timer flags to avoid this.

    This affects both mesh and managed modes.

    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

01 Aug, 2012

7 commits

  • Merge Andrew's second set of patches:
    - MM
    - a few random fixes
    - a couple of RTC leftovers

    * emailed patches from Andrew Morton: (120 commits)
    rtc/rtc-88pm80x: remove unneed devm_kfree
    rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
    mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
    tmpfs: distribute interleave better across nodes
    mm: remove redundant initialization
    mm: warn if pg_data_t isn't initialized with zero
    mips: zero out pg_data_t when it's allocated
    memcg: gix memory accounting scalability in shrink_page_list
    mm/sparse: remove index_init_lock
    mm/sparse: more checks on mem_section number
    mm/sparse: optimize sparse_index_alloc
    memcg: add mem_cgroup_from_css() helper
    memcg: further prevent OOM with too many dirty pages
    memcg: prevent OOM with too many dirty pages
    mm: mmu_notifier: fix freed page still mapped in secondary MMU
    mm: memcg: only check anon swapin page charges for swap cache
    mm: memcg: only check swap cache pages for repeated charging
    mm: memcg: split swapin charge function into private and public part
    mm: memcg: remove needless !mm fixup to init_mm when charging
    mm: memcg: remove unneeded shmem charge type
    ...

    Linus Torvalds
     
  • Pull random subsystem patches from Ted Ts'o:
    "This patch series contains a major revamp of how we collect entropy
    from interrupts for /dev/random and /dev/urandom.

    The goal is to address weaknesses discussed in the paper "Mining
    your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
    by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman,
    which will be published in the Proceedings of the 21st Usenix Security
    Symposium, August 2012. (See https://factorable.net for more
    information and an extended version of the paper.)"

    Fix up trivial conflicts due to nearby changes in
    drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}

    * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
    random: mix in architectural randomness in extract_buf()
    dmi: Feed DMI table to /dev/random driver
    random: Add comment to random_initialize()
    random: final removal of IRQF_SAMPLE_RANDOM
    um: remove IRQF_SAMPLE_RANDOM which is now a no-op
    sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
    board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
    isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
    pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
    uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
    drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
    xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
    n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
    pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
    i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
    input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
    mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
    ...

    Linus Torvalds
     
  • Pull second wave of NFS client updates from Trond Myklebust:

    - Patches from Bryan to allow splitting of the NFSv2/v3/v4 code into
    separate modules.

    - Fix Oopses in the NFSv4 idmapper

    - Fix a deadlock whereby rpciod tries to allocate a new socket and ends
    up recursing into the NFS code due to memory reclaim.

    - Increase the number of permitted callback connections.

    * tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    nfs: explicitly reject LOCK_MAND flock() requests
    nfs: increase number of permitted callback connections.
    SUNRPC: return negative value in case rpcbind client creation error
    NFS: Convert v4 into a module
    NFS: Convert v3 into a module
    NFS: Convert v2 into a module
    NFS: Keep module parameters in the generic NFS client
    NFS: Split out remaining NFS v4 inode functions
    NFS: Pass super operations and xattr handlers in the nfs_subversion
    NFS: Only initialize the ACL client in the v3 case
    NFS: Create a try_mount rpc op
    NFS: Remove the NFS v4 xdev mount function
    NFS: Add version registering framework
    NFS: Fix a number of bugs in the idmapper
    nfs: skip commit in releasepage if we're freeing memory for fs-related reasons
    sunrpc: clarify comments on rpc_make_runnable
    pnfsblock: bail out partial page IO

    Linus Torvalds
     
  • Pull networking update from David S. Miller:
    "I think Eric Dumazet and I have dealt with all of the known routing
    cache removal fallout. Some other minor fixes all around.

    1) Fix RCU of cached routes, particularly of output routes, which require
    liberation via call_rcu() instead of call_rcu_bh(). From Eric
    Dumazet.

    2) Make sure we purge net device references in cached routes properly.

    3) TG3 driver bug fixes from Michael Chan.

    4) Fix reported 'expires' value in ipv6 routes, from Li Wei.

    5) TUN driver ioctl leaks kernel bytes to userspace, from Mathias
    Krause."

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits)
    ipv4: Properly purge netdev references on uncached routes.
    ipv4: Cache routes in nexthop exception entries.
    ipv4: percpu nh_rth_output cache
    ipv4: Restore old dst_free() behavior.
    bridge: make port attributes const
    ipv4: remove rt_cache_rebuild_count
    net: ipv4: fix RCU races on dst refcounts
    net: TCP early demux cleanup
    tun: Fix formatting.
    net/tun: fix ioctl() based info leaks
    tg3: Update version to 3.124
    tg3: Fix race condition in tg3_get_stats64()
    tg3: Add New 5719 Read DMA workaround
    tg3: Fix Read DMA workaround for 5719 A0.
    tg3: Request APE_LOCK_PHY before PHY access
    ipv6: fix incorrect route 'expires' value passed to userspace
    mISDN: Bugfix only few bytes are transfered on a connection
    seeq: use PTR_RET at init_module of driver
    bnx2x: remove cast around the kmalloc in bnx2x_prev_mark_path
    ipv4: clean up put_child
    ...

    Linus Torvalds
     
  • Implement the new swapfile a_ops for NFS and hook up ->direct_IO. This
    will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under
    PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol
    ->connect() method.

    PF_MEMALLOC should allow the allocation of struct socket and related
    objects and the early (re)setting of SOCK_MEMALLOC should allow us to
    receive the packets required for the TCP connection buildup.

    [jlayton@redhat.com: Restore PF_MEMALLOC task flags in all cases]
    [dfeng@redhat.com: Fix handling of multiple swap files]
    [a.p.zijlstra@chello.nl: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch series is based on top of "Swap-over-NBD without deadlocking
    v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.

    When a user or administrator requires swap for their application, they
    create a swap partition or file, format it with mkswap and activate it
    with swapon. In diskless systems this is not an option, so if swap is
    required then swapping over the network is considered. The two likely
    scenarios are when blade servers are used as part of a cluster where the
    form factor or maintenance costs do not allow the use of disks and thin
    clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap but this is not always an option. There is no
    guarantee that the network attached storage (NAS) device is running Linux
    or supports NBD. However, it is likely that it supports NFS so there are
    users that want support for swapping over NFS despite any performance
    concern. Some distributions currently carry patches that support swapping
    over NFS but it would be preferable to support it in the mainline kernel.

    Patch 1 avoids a stream-specific deadlock that potentially affects TCP.

    Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
    reserves.

    Patch 3 adds three helpers for filesystems to handle swap cache pages.
    For example, page_file_mapping() returns page->mapping for
    file-backed pages and the address_space of the underlying
    swap file for swap cache pages.

    Patch 4 adds two address_space_operations to allow a filesystem
    to pin all metadata relevant to a swapfile in memory. Upon
    successful activation, the swapfile is marked SWP_FILE and
    the address space operation ->direct_IO is used for writing
    and ->readpage for reading in swap pages.

    Patch 5 notes that patch 3 is bolting
    filesystem-specific-swapfile-support onto the side and that
    the default handlers have different information to what
    is available to the filesystem. This patch refactors the
    code so that there are generic handlers for each of the new
    address_space operations.

    Patch 6 adds an API to allow a vector of kernel addresses to be
    translated to struct pages and pinned for IO.

    Patch 7 adds support for using highmem pages for swap by kmapping
    the pages before calling the direct_IO handler.

    Patch 8 updates NFS to use the helpers from patch 3 where necessary.

    Patch 9 avoids setting PG_private on PG_swapcache pages within NFS.

    Patch 10 implements the new swapfile-related address_space operations
    for NFS and teaches the direct IO handler how to manage
    kernel addresses.

    Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
    where appropriate.

    Patch 12 fixes a NULL pointer dereference that occurs when using
    swap-over-NFS.

    With the patches applied, it is possible to mount a swapfile that is on an
    NFS filesystem. Swap performance is not great, with a swap stress test
    taking roughly twice as long to complete as when the swap device is
    backed by NBD.

    This patch: netvm: prevent a stream-specific deadlock

    It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
    that we're over the global rmem limit. This will prevent SOCK_MEMALLOC
    buffers from receiving data, which will prevent userspace from running,
    which is needed to reduce the buffered data.

    Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once
    this change is applied, it is important that sockets that set
    SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
    If this happens, a warning is generated and the tokens reclaimed to avoid
    accounting errors until the bug is fixed.

    [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Acked-by: Rik van Riel
    Cc: Trond Myklebust
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In order to make sure pfmemalloc packets receive all memory needed to
    proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
    This is limited to a subset of protocols that are expected to be used for
    writing to swap. Taps are not allowed to use PF_MEMALLOC as these are
    expected to communicate with userspace processes which could be paged out.

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [jslaby@suse.cz: Lock imbalance fix]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman