03 Jun, 2014

11 commits

  • __sk_prepare_filter() was reworked in commit bd4cf0ed3 (net: filter:
    rework/optimize internal BPF interpreter's instruction set) so that it should
    have uncharged memory once things went wrong. However that work isn't complete.
    Error is handled only in __sk_migrate_filter() while memory can still leak in
    the error path right after sk_chk_filter().

    Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set")
    Signed-off-by: Leon Yu
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Leon Yu
     
  • The ec_bhf driver is specific to the Beckhoff CX embedded PC series.
    These are based on Intel x86 CPUs, so we can add a dependency on
    X86, with COMPILE_TEST as an alternative to still allow for broader
    build-testing.

    Signed-off-by: Jean Delvare
    Cc: Darek Marcinkiewicz
    Cc: David S. Miller
    Signed-off-by: David S. Miller

    Jean Delvare
     
  • This bug was discovered through a recent F-RTO issue on the tcpm list:
    https://www.ietf.org/mail-archive/web/tcpm/current/msg08794.html

    The bug is that currently F-RTO does not use DSACK to undo cwnd in
    certain cases: upon receiving an ACK after the RTO retransmission in
    F-RTO, and the ACK has DSACK indicating the retransmission is spurious,
    the sender only calls tcp_try_undo_loss() if some never-retransmitted
    data is SACKed (FLAG_ORIG_DATA_SACKED).

    The correct behavior is to unconditionally call tcp_try_undo_loss so
    the DSACK information is used properly to undo the cwnd reduction.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • It was possible to get a setuid root or setcap executable to write to
    its stdout or stderr (which had been made a netlink socket) and
    inadvertently reconfigure the networking stack.

    To prevent this we check that both the creator of the socket and
    the current application have permission to reconfigure the network
    stack.

    Unfortunately this breaks Zebra, which always uses sendto/sendmsg
    and creates its socket without any privileges.

    To keep Zebra working don't bother checking if the creator of the
    socket has privilege when a destination address is specified. Instead
    rely exclusively on the privileges of the sender of the socket.

    Note from Andy: This is exactly Eric's code except for some comment
    clarifications and formatting fixes. Neither I nor, I think, anyone
    else is thrilled with this approach, but I'm hesitant to wait on a
    better fix since 3.15 is almost here.

    Note to stable maintainers: This is a mess. An earlier series of
    patches in 3.15 fix a rather serious security issue (CVE-2014-0181),
    but they did so in a way that breaks Zebra. The offending series
    includes:

    commit aa4cf9452f469f16cea8c96283b641b4576d4a7b
    Author: Eric W. Biederman
    Date: Wed Apr 23 14:28:03 2014 -0700

    net: Add variants of capable for use on netlink messages

    If a given kernel version is missing that series of fixes, it's
    probably worth backporting it and this patch. If that series is
    present, then this fix is critical if you care about Zebra.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Andy Lutomirski
    Signed-off-by: David S. Miller

    Eric W. Biederman
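
The permission rule described in this commit message can be sketched as a small truth function. This is a user-space model only, not the kernel implementation; the function name and its three boolean parameters are hypothetical and stand in for "the sender holds the capability", "the socket creator holds the capability", and "a destination address was specified".

```python
def netlink_message_allowed(sender_capable, creator_capable, dst_specified):
    """Model of the check described in the commit message.

    The sender must always hold the capability. The socket creator's
    capability is only consulted when no destination address was given.
    """
    creator_ok = dst_specified or creator_capable
    return creator_ok and sender_capable


# An unprivileged user creates the socket and gets a setuid binary to
# write to it: no destination address, creator lacks the capability.
attack = netlink_message_allowed(
    sender_capable=True, creator_capable=False, dst_specified=False)

# Zebra: socket created without privileges, messages sent with sendto().
zebra = netlink_message_allowed(
    sender_capable=True, creator_capable=False, dst_specified=True)

print(attack, zebra)
```

Under this model the setuid scenario stays blocked while the Zebra pattern is accepted, which is the trade-off the message describes.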
     
  • Each iPad model has a different product id; this patch adds support for iPad 2
    (pid 0x12a2) and iPad 3 (pid 0x12a6). Note that iPad 2 must be jailbroken and a
    third-party app must be used for tethering to work. On iPad 3, tethering works
    out of the box (assuming your ISP is nice).

    Signed-off-by: Kristian Evensen
    Signed-off-by: David S. Miller

    Kristian Evensen
     
  • Currently it is not possible to set the MTU on a team device which has
    a port enslaved to it. The reason is that when team_change_mtu() calls
    dev_set_mtu() for the port device, the notifier for the
    NETDEV_PRECHANGEMTU event is called and team_device_event() returns
    NOTIFY_BAD, forbidding the change. Fix this by returning NOTIFY_DONE
    when the team itself is changing the MTU in team_change_mtu().

    Introduced-by: 3d249d4c "net: introduce ethernet teaming device"
    Signed-off-by: Jiri Pirko
    Acked-by: Flavio Leitner
    Signed-off-by: David S. Miller

    Jiri Pirko
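
The notifier interaction described in this commit can be modelled in a few lines. This is a simplified sketch, not the driver code: the classes, the `changing_mtu` flag, and the use of `PermissionError` to represent a vetoed change are all assumptions made for illustration.

```python
from enum import Enum


class Notify(Enum):
    DONE = "NOTIFY_DONE"
    BAD = "NOTIFY_BAD"


class Team:
    def __init__(self, mtu=1500):
        self.mtu = mtu
        self.ports = []
        self.changing_mtu = False  # hypothetical flag name

    def add_port(self, name):
        port = Port(name, self.mtu, self)
        self.ports.append(port)
        return port

    def on_port_prechange_mtu(self, port):
        # Allow the change only when the team itself initiated it.
        return Notify.DONE if self.changing_mtu else Notify.BAD

    def change_mtu(self, mtu):
        self.changing_mtu = True
        try:
            for port in self.ports:
                port.set_mtu(mtu)
        finally:
            self.changing_mtu = False
        self.mtu = mtu


class Port:
    def __init__(self, name, mtu, team):
        self.name, self.mtu, self.team = name, mtu, team

    def set_mtu(self, mtu):
        # Stand-in for dev_set_mtu(): consult the notifier first.
        if self.team.on_port_prechange_mtu(self) is Notify.BAD:
            raise PermissionError("MTU change vetoed for " + self.name)
        self.mtu = mtu


team = Team()
eth0, eth1 = team.add_port("eth0"), team.add_port("eth1")
team.change_mtu(9000)
print(team.mtu, eth0.mtu, eth1.mtu)
```

Changing a port's MTU directly still fails in this model, because the flag is only set for the duration of `change_mtu()`.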
     
  • I noticed we were sending wrong IPv4 ID in TCP flows when MTU discovery
    is disabled.
    Note how GSO/TSO packets do not have monotonically incrementing ID.

    06:37:41.575531 IP (id 14227, proto: TCP (6), length: 4396)
    06:37:41.575534 IP (id 14272, proto: TCP (6), length: 65212)
    06:37:41.575544 IP (id 14312, proto: TCP (6), length: 57972)
    06:37:41.575678 IP (id 14317, proto: TCP (6), length: 7292)
    06:37:41.575683 IP (id 14361, proto: TCP (6), length: 63764)

    It appears I introduced this bug in linux-3.1.

    inet_getid() must return the old value of peer->ip_id_count,
    not the new one.

    Let's revert this part, and remove the prevention of
    a null identification field in IPv6 Fragment Extension Header,
    which is dubious and not even done properly.

    Fixes: 87c48fa3b463 ("ipv6: make fragment identifications less predictable")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
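
The difference between returning the old and the new counter value can be seen with a toy model. This sketch assumes that a GSO packet with N segments reserves N consecutive IDs and that the returned value is used as the ID of the first segment; the class and function names are hypothetical, and 16-bit wraparound is ignored for brevity.

```python
class Peer:
    """Stand-in for the per-destination ID counter (peer->ip_id_count)."""

    def __init__(self, start):
        self.ip_id_count = start


def getid_return_old(peer, more):
    """Reserve more + 1 IDs and return the first reserved one."""
    old = peer.ip_id_count
    peer.ip_id_count = old + more + 1
    return old


def getid_return_new(peer, more):
    """Buggy variant: returns the counter value after the reservation."""
    peer.ip_id_count += more + 1
    return peer.ip_id_count


def ids_on_wire(getid, segments_per_packet, start=100):
    """IDs carried by every segment once each GSO packet is split."""
    peer = Peer(start)
    wire = []
    for nsegs in segments_per_packet:
        first = getid(peer, nsegs - 1)
        wire.extend(first + i for i in range(nsegs))
    return wire


good = ids_on_wire(getid_return_old, [3, 2, 1])
bad = ids_on_wire(getid_return_new, [3, 2, 1])
print(good)  # [100, 101, 102, 103, 104, 105]
print(bad)   # [103, 104, 105, 105, 106, 106]
```

With the old-value variant every segment gets a unique, increasing ID; with the new-value variant the last segment of one packet collides with the first segment of the next.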
     
  • This interface is unusable, as the cdc-wdm character device doesn't reply to
    any QMI command. Also, the out-of-tree Sierra Wireless GobiNet driver fully
    skips it.

    Signed-off-by: Aleksander Morgado
    Acked-by: Bjørn Mork
    Signed-off-by: David S. Miller

    Aleksander Morgado
     
  • A set of new VID/PIDs retrieved from the out-of-tree GobiNet/GobiSerial
    Sierra Wireless drivers.

    Signed-off-by: Aleksander Morgado
    Acked-by: Bjørn Mork
    Signed-off-by: David S. Miller

    Aleksander Morgado
     
  • br_handle_local_finish() allows an FDB entry to be inserted with a
    disallowed vlan. For example, when ports 1 and 2 are communicating in
    vlan 10, and even if vlan 10 is disallowed on port 3, port 3 can
    interfere with their communication by spoofing the source MAC address
    with vlan id 10.

    Note: Even if it is judged that a frame should not be learned, it should
    not be dropped, because it is destined not for the forwarding layer but
    for a higher layer. See IEEE 802.1Q-2011 8.13.10.

    Signed-off-by: Toshiaki Makita
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Toshiaki Makita
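
The rule in this commit (skip learning when the VLAN is not allowed on the ingress port, but still pass the frame up) can be sketched as follows. The `Bridge` class and its method are hypothetical simplifications, not the bridge code itself.

```python
class Bridge:
    def __init__(self, allowed_vlans):
        # allowed_vlans: port number -> set of VLAN ids allowed on that port
        self.allowed_vlans = allowed_vlans
        self.fdb = {}  # (mac, vid) -> port

    def handle_local(self, port, src_mac, vid):
        """Handle a frame destined for the bridge itself.

        Learning happens only if the VLAN is allowed on the ingress port.
        The frame is delivered to the higher layer either way.
        """
        if vid in self.allowed_vlans.get(port, set()):
            self.fdb[(src_mac, vid)] = port
        return True  # never dropped here


br = Bridge({1: {10}, 2: {10}, 3: {20}})
victim = "52:54:00:00:00:01"
br.handle_local(1, victim, 10)             # legitimate host on port 1
delivered = br.handle_local(3, victim, 10)  # spoofed source on port 3
print(br.fdb[(victim, 10)], delivered)
```

The spoofed frame from port 3 is still delivered, but the FDB entry for the victim keeps pointing at port 1.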
     
  • Any process is able to send netlink messages with leftover bytes.
    Make the warning rate-limited to prevent too much log spam.

    The warning is supposed to help find userspace bugs, so print the
    triggering command name to implicate the buggy program.

    [v2: Use pr_warn_ratelimited instead of printk_ratelimited.]

    Signed-off-by: Michal Schmidt
    Signed-off-by: David S. Miller

    Michal Schmidt
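
The effect of rate-limiting the warning can be illustrated with a simple interval/burst limiter. This is a generic sketch, not the kernel's ratelimit implementation; the interval and burst values, the class, and the message text are illustrative assumptions.

```python
class RateLimit:
    """Allow at most `burst` messages per `interval` seconds."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.printed = 0
        self.suppressed = 0

    def allow(self, now):
        # Start a new window when none is open or the old one expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.suppressed += 1
        return False


def warn_leftover(limiter, log, now, nbytes, comm):
    # Include the command name so the buggy program can be identified.
    if limiter.allow(now):
        log.append("netlink: %d bytes leftover after parsing attributes "
                   "in process `%s'." % (nbytes, comm))


limiter = RateLimit(interval=5.0, burst=2)
log = []
for t in (0.0, 1.0, 2.0, 6.0):
    warn_leftover(limiter, log, t, 4, "buggy-tool")
print(len(log), limiter.suppressed)  # 3 1
```

The timestamp is passed in explicitly so the behaviour is deterministic and easy to test.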
     

02 Jun, 2014

3 commits

  • There have been a number of incidents recently where customers running KVM
    have reported that VM hosts on different hypervisors are unreachable. Based
    on pcap traces we found that the bridge was broadcasting the ARP request out
    onto the network. However, some NICs have an inbuilt switch which on occasion
    broadcast the VM's ARP request back through the physical NIC on the
    hypervisor. This resulted in the bridge changing ports and incorrectly
    learning that the VM's MAC address was external. As a result the ARP reply
    was directed back onto the external network and the VM never updated its ARP
    cache. This patch notifies the bridge command after an fdb entry has been
    updated, so that such port toggling can be identified.

    Signed-off-by: Jon Maxwell
    Reviewed-by: Jiri Pirko
    Acked-by: Toshiaki Makita
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Jon Maxwell
     
  • Signed-off-by: Aleksander Morgado
    Acked-by: Bjørn Mork
    Signed-off-by: David S. Miller

    Aleksander Morgado
     
  • After 1e785f48d29a ("net: Start with correct mac_len in
    skb_network_protocol") skb->mac_len is used as a start of the
    calculation in skb_network_protocol() but that is not always correct. If
    skb->protocol == 8021Q/AD, usually the vlan header is already inserted
    in the skb (i.e. vlan reorder hdr == 0). Usually when the packet enters
    dev_hard_xmit it has mac_len == 0 so we take 2 bytes from the
    destination mac address (skb->data + VLAN_HLEN) as a type in
    skb_network_protocol() and return vlan_depth == 4. In the case where TSO is
    off, then the mac_len is set but it's == 18 (ETH_HLEN + VLAN_HLEN), so
    skb_network_protocol() returns a type from inside the packet and
    offset == 22. Also make vlan_depth unsigned as suggested before.
    As suggested by Eric Dumazet, move the while() loop in the if() so we
    can avoid additional testing in fast path.

    Here are a few netperf tests + debug printk's to illustrate:
    - Vlan -> device (reorder on, default, this case is okay)
    cat netperf.tso-on.reorder-on.bugged
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
    192.168.3.1 () port 0 AF_INET
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

     87380  16384  16384    10.00    7111.54
    [ 81.605435] skb->len 65226 skb->gso_size 1448 skb->proto 0x800
    skb->mac_len 0 vlan_depth 0 type 0x800

    - Vlan -> device (reorder off, bad)
    cat netperf.tso-on.reorder-off.bugged
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
    192.168.3.1 () port 0 AF_INET
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

     87380  16384  16384    10.00     241.35
    [ 204.578332] skb->len 1518 skb->gso_size 0 skb->proto 0x8100
    skb->mac_len 0 vlan_depth 4 type 0x5301
    0x5301 are the last two bytes of the destination mac.

    And if we stop TSO, we may get even the following:
    [ 83.343156] skb->len 2966 skb->gso_size 1448 skb->proto 0x8100
    skb->mac_len 18 vlan_depth 22 type 0xb84
    Because mac_len already accounts for VLAN_HLEN.

    After the fix:
    cat netperf.tso-on.reorder-off.fixed
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
    192.168.3.1 () port 0 AF_INET
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

     87380  16384  16384    10.01    5001.46
    [ 81.888489] skb->len 65230 skb->gso_size 1448 skb->proto 0x8100
    skb->mac_len 0 vlan_depth 18 type 0x800

    CC: Vlad Yasevich
    CC: Eric Dumazet
    CC: Daniel Borkmann
    CC: David S. Miller

    Fixes: 1e785f48d29a ("net: Start with correct mac_len in
    skb_network_protocol")
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
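
The offsets quoted in this commit (vlan_depth 18 for a single tag, whether mac_len is 0 or 18) can be reproduced with a user-space walk over raw frame bytes. This is a sketch under stated assumptions, not the kernel function: it assumes the first VLAN header is found at mac_len - VLAN_HLEN when mac_len is set and at ETH_HLEN otherwise, and `network_protocol` and `build_frame` are hypothetical helpers.

```python
import struct

ETH_HLEN = 14
VLAN_HLEN = 4
ETH_P_IP = 0x0800
ETH_P_8021Q = 0x8100
ETH_P_8021AD = 0x88A8
VLAN_TYPES = (ETH_P_8021Q, ETH_P_8021AD)


def network_protocol(frame, protocol, mac_len):
    """Return (network protocol, offset of the network header)."""
    depth = mac_len
    ptype = protocol
    if ptype in VLAN_TYPES:
        if depth:
            if depth < VLAN_HLEN:
                return 0, depth
            depth -= VLAN_HLEN  # mac_len already includes one VLAN header
        else:
            depth = ETH_HLEN    # mac_len not set: tag follows the MACs + type
        while True:
            if len(frame) < depth + VLAN_HLEN:
                return 0, depth  # truncated frame
            # VLAN header layout: 2 bytes TCI, 2 bytes encapsulated protocol
            _tci, ptype = struct.unpack_from("!HH", frame, depth)
            depth += VLAN_HLEN
            if ptype not in VLAN_TYPES:
                break
    return ptype, depth


def build_frame(tags, inner=ETH_P_IP):
    """Build dst + src + VLAN tags + inner type + dummy payload."""
    dst = bytes.fromhex("020000005301")
    src = bytes.fromhex("020000000002")
    body = b"".join(struct.pack("!HH", tpid, tci) for tpid, tci in tags)
    return dst + src + body + struct.pack("!H", inner) + bytes(20)


single = build_frame([(ETH_P_8021Q, 10)])
qinq = build_frame([(ETH_P_8021AD, 100), (ETH_P_8021Q, 10)])
print(network_protocol(single, ETH_P_8021Q, 0))   # (2048, 18)
print(network_protocol(single, ETH_P_8021Q, 18))  # (2048, 18)
print(network_protocol(qinq, ETH_P_8021AD, 0))    # (2048, 22)
```

Both starting conditions converge on the same type (0x800) and depth, which is the property the fix is meant to restore.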
     

01 Jun, 2014

1 commit

  • Included changes:
    - prevent NULL dereference in multicast code

    Antonio Quartulli says:

    ====================
    pull request net: batman-adv 20140527

    here you have another very small fix intended for net/linux-3.15.
    It prevents some multicast functions from dereferencing a NULL pointer.
    (Actually it was nothing more than a typo)
    I hope it is not too late for such a small patch.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

31 May, 2014

6 commits


27 May, 2014

1 commit

  • Commit a1ef7bd9fce8 ("can: rename LED trigger name on netdev renames") renames
    the led trigger names according to the changed netdevice name.

    As not every CAN driver supports and initializes the led triggers, checking for
    the CAN private data structure with safe_candev_priv() in the notifier chain is
    not enough.

    This patch adds a check when CONFIG_CAN_LEDS is enabled and the driver does not
    support led triggers.

    For stable 3.9+

    Cc: Fabio Baltieri
    Signed-off-by: Oliver Hartkopp
    Acked-by: Kurt Van Dijck
    Cc: linux-stable
    Signed-off-by: Marc Kleine-Budde

    Oliver Hartkopp
     

26 May, 2014

1 commit

  • Receiving an ICMP response to an IPIP packet in a non-linear skb could
    cause a kernel panic in __skb_pull.

    The problem was introduced in
    commit f2edb9f7706dcb2c0d9a362b2ba849efe3a97f5e ("ipvs: implement
    passive PMTUD for IPIP packets").

    Signed-off-by: Peter Christensen
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman

    Peter Christensen
     

25 May, 2014

3 commits


24 May, 2014

14 commits

  • Pull sparc fixes from David Miller:
    "A small bunch of bug fixes, in particular:

    1) On older cpus we need a different chunk of virtual address space
    to map the huge page TSB.

    2) Missing memory barrier in Niagara2 memcpy.

    3) trinity showed some places where fault validation was
    unnecessarily loud on sparc64

    4) Some sysfs printf's need a type adjustment, from Toralf Förster"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
    sparc64: fix format string mismatch in arch/sparc/kernel/sysfs.c
    sparc64: Add membar to Niagara2 memcpy code.
    sparc64: Fix huge TSB mapping on pre-UltraSPARC-III cpus.
    sparc64: Don't bark so loudly about 32-bit tasks generating 64-bit fault addresses.

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "It looks like a sizable collection but this is nearly 3 weeks of bug
    fixing while you were away.

    1) Fix crashes over IPSEC tunnels with NAT, the latter can reroute
    the packet through a non-IPSEC protected path and the code has to
    be able to handle SKBs attached to routes lacking an attached xfrm
    state. From Steffen Klassert.

    2) Fix OOPSs in ipv4 and ipv6 ipsec layers for unsupported
    sub-protocols, also from Steffen Klassert.

    3) Set local_df on fragmented netfilter skbs otherwise we won't be
    able to forward successfully, from Florian Westphal.

    4) cdc_mbim ipv6 neighbour code does __vlan_find_dev_deep without
    holding RCU lock, from Bjorn Mork.

    5) local_df test in ip_may_fragment is inverted, from Florian
    Westphal.

    6) jme driver doesn't check for DMA mapping failures, from Neil
    Horman.

    7) qlogic driver doesn't calculate number of TX queues properly, from
    Shahed Shaikh.

    8) fib_info_cnt can drift irreversibly positive if we fail to
    allocate the fi->fib_metrics array, from Sergey Popovich.

    9) Fix use after free in ip6_route_me_harder(), also from Sergey
    Popovich.

    10) When SYSCTL is disabled, we don't handle local_port_range and
    ping_group_range defaults properly at all, from Cong Wang.

    11) Unaccelerated VLAN tagged frames improperly handled by cdc_mbim
    driver, fix from Bjorn Mork.

    12) cassini driver needs nested lock annotations for TX locking, from
    Emil Goode.

    13) On init error ipv6 VTI driver can unregister pernet ops twice,
    oops. Fix from Mathias Krause.

    14) If macvlan device is down, don't propagate IFF_ALLMULTI changes,
    from Peter Christensen.

    15) Missing NULL pointer check while parsing netlink config options in
    ip6_tnl_validate(). From Susant Sahani.

    16) Fix handling of neighbour entries during ipv6 router reachability
    probing, from Duan Jiong.

    17) x86 and s390 JIT address randomization has some address
    calculation bugs leading to crashes, from Alexei Starovoitov and
    Heiko Carstens.

    18) Clear up those uglies with nop patching and net_get_random_once(),
    from Hannes Frederic Sowa.

    19) Option length miscalculated in ip6_append_data(), fix also from
    Hannes Frederic Sowa.

    20) A while ago we fixed a race during device unregistry when a
    namespace went down, turns out there is a second place that needs
    similar protection. From Cong Wang.

    21) In the new Altera TSE driver multicast filtering isn't working,
    disable it and just use promisc mode until the cause is found.
    From Vince Bridgers.

    22) When we disable router enabling in ipv6 we have to flush the
    cached routes explicitly, from Duan Jiong.

    23) NBMA tunnels should not cache routes on the tunnel object because
    the key is variable, from Timo Teräs.

    24) With stacked devices GRO information in skb->cb[] can be not setup
    properly, make sure it is in all code paths. From Eric Dumazet.

    25) Really fix stacked vlan locking, multiple levels of nesting with
    intervening non-vlan devices are possible. From Vlad Yasevich.

    26) Fallback ipip tunnel device's mtu is not setup properly, from
    Steffen Klassert.

    27) The packet scheduler's tcindex filter can crash because we
    structure copy objects with list_head's inside, oops. From Cong
    Wang.

    28) Fix CHECKSUM_COMPLETE handling for ipv6 GRE tunnels, from Eric
    Dumazet.

    29) In some configurations 'itag' in __mkroute_input() can end up
    being used uninitialized because of how fib_validate_source()
    works. Fix it by explicitly initializing itag to zero like all the
    other fib_validate_source() callers do, from Li RongQing"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
    batman: fix a bogus warning from batadv_is_on_batman_iface()
    ipv4: initialise the itag variable in __mkroute_input
    bonding: Send ALB learning packets using the right source
    bonding: Don't assume 802.1Q when sending alb learning packets.
    net: doc: Update references to skb->rxhash
    stmmac: Remove unbalanced clk_disable call
    ipv6: gro: fix CHECKSUM_COMPLETE support
    net_sched: fix an oops in tcindex filter
    can: peak_pci: prevent use after free at netdev removal
    ip_tunnel: Initialize the fallback device properly
    vlan: Fix build error wth vlan_get_encap_level()
    can: c_can: remove obsolete STRICT_FRAME_ORDERING Kconfig option
    MAINTAINERS: Pravin Shelar is Open vSwitch maintainer.
    bnx2x: Convert return 0 to return rc
    bonding: Fix alb mode to only use first level vlans.
    bonding: Fix stacked device detection in arp monitoring
    macvlan: Fix lockdep warnings with stacked macvlan devices
    vlan: Fix lockdep warning with stacked vlan devices.
    net: Allow for more then a single subclass for netif_addr_lock
    net: Find the nesting level of a given device by type.
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "The biggest commit is an irqtime accounting loop latency fix, the rest
    are misc fixes all over the place: deadline scheduling, docs, numa,
    balancer and a bad to-idle latency fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/numa: Initialize newidle balance stats in sd_numa_init()
    sched: Fix updating rq->max_idle_balance_cost and rq->next_balance in idle_balance()
    sched: Skip double execution of pick_next_task_fair()
    sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check
    sched/deadline: Fix memory leak
    sched/deadline: Fix sched_yield() behavior
    sched: Sanitize irq accounting madness
    sched/docbook: Fix 'make htmldocs' warnings caused by missing description

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "The biggest changes are fixes for races that kept triggering Trinity
    crashes, plus liblockdep build fixes and smaller misc fixes.

    The liblockdep bits in perf/urgent are a pull mistake - they should
    have been in locking/urgent - but by the time I noticed other commits
    were added and testing was done :-/ Sorry about that"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()
    perf: Prevent false warning in perf_swevent_add
    perf: Limit perf_event_attr::sample_period to 63 bits
    tools/liblockdep: Remove all build files when doing make clean
    tools/liblockdep: Build liblockdep from tools/Makefile
    perf/x86/intel: Fix Silvermont's event constraints
    perf: Fix perf_event_init_context()
    perf: Fix race in removing an event

    Linus Torvalds
     
  • Pull drm radeon and nouveau fixes from Dave Airlie:
    "Fixes for the other big two.

    The radeon VCE one is large but it fixes some userspace triggerable
    issues, otherwise it's black screens and oopses.

    Nouveau fixes a bleeding laptop panel issue when displayport is used
    sometimes"

    * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
    drm/radeon/pm: don't allow debugfs/sysfs access when PX card is off (v2)
    drm/radeon: avoid segfault on device open when accel is not working.
    drm/radeon: fix typo in finding PLL params
    drm/radeon: fix register typo on si
    drm/radeon: fix buffer placement under memory pressure v2
    drm/radeon: fix page directory update size estimation
    drm/radeon: handle non-VGA class pci devices with ATRM
    drm/radeon: fix DCE83 check for mullins
    drm/radeon: check VCE relocation buffer range v3
    drm/radeon: also try GART for CPU accessed buffers
    drm/gf119-/disp: fix nasty bug which can clobber SOR0's clock setup
    drm/nvd9/therm: handle another kind of PWM fan

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton:
    "9 fixes"

    * emailed patches from Andrew Morton:
    MAINTAINERS: add closing angle bracket to Vince Bridgers' email address
    Documentation: fix DOCBOOKS=... building
    ocfs2: fix double kmem_cache_destroy in dlm_init
    mm/memory-failure.c: fix memory leak by race between poison and unpoison
    wait: swap EXIT_ZOMBIE(Z) and EXIT_DEAD(X) chars in TASK_STATE_TO_CHAR_STR
    memcg: fix swapcache charge from kernel thread context
    mm: madvise: fix MADV_WILLNEED on shmem swapouts
    mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT
    hwpoison, hugetlb: lock_page/unlock_page does not match for handling a free hugepage

    Linus Torvalds
     
  • Signed-off-by: Tobias Klauser
    Cc: Vince Bridgers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     
  • Prior to commit 4266129964b8 ("[media] DocBook: Move all media docbook
    stuff into its own directory") it was possible to build only a single
    (or more) book(s) by calling, for example

    make htmldocs DOCBOOKS=80211.xml

    This now fails:

    cp: target `.../Documentation/DocBook//media_api' is not a directory

    Ignore errors from that copy to make this possible again.

    Fixes: 4266129964b8 ("[media] DocBook: Move all media docbook stuff into its own directory")
    Signed-off-by: Johannes Berg
    Acked-by: Randy Dunlap
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Berg
     
  • In dlm_init, if creating dlm_lockname_cache fails in
    dlm_init_master_caches, the previously created dlm_lockres_cache is
    destroyed twice. This causes the system to die when loading modules.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • When a memory error happens on an in-use page or (free and in-use)
    hugepage, the victim page is isolated with its refcount set to one.

    When you try to unpoison it later, unpoison_memory() calls put_page()
    for it twice in order to bring the page back to free page pool (buddy or
    free hugepage list). However, if another memory error occurs on the
    page which we are unpoisoning, memory_failure() returns without
    releasing the refcount which was incremented in the same call at first,
    which results in a memory leak and inconsistent num_poisoned_pages
    statistics. This patch fixes it.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: [2.6.32+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • In commit ad86622b478e ("wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide
    EXIT_TRACE from user-space") the order of the task state definitions was
    changed: EXIT_DEAD and EXIT_ZOMBIE were swapped. However, the characters
    for the states in the TASK_STATE_TO_CHAR_STR string were not updated. This
    patch synchronizes the string with the order of the definitions.

    Signed-off-by: Masatake YAMATO
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masatake YAMATO
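
Why the string has to follow the order of the definitions becomes clear once the lookup is written out. The sketch below assumes the character is selected by the position of the lowest set state bit plus one, and that the bit values and the two strings match the kernel of that era; treat all of these as assumptions rather than quotations from the source tree.

```python
# Assumed values after the swap: EXIT_DEAD = 16, EXIT_ZOMBIE = 32.
EXIT_DEAD = 16
EXIT_ZOMBIE = 32

# Assumed strings: before and after this patch.
CHARS_STALE = "RSDTtZXxKWP"
CHARS_FIXED = "RSDTtXZxKWP"


def state_char(state, chars):
    """Map a task state to its one-letter code.

    Index 0 is the running state; otherwise the index is the position
    of the lowest set bit plus one.
    """
    index = (state & -state).bit_length()
    return chars[index]


print(state_char(EXIT_ZOMBIE, CHARS_STALE), state_char(EXIT_ZOMBIE, CHARS_FIXED))
print(state_char(EXIT_DEAD, CHARS_STALE), state_char(EXIT_DEAD, CHARS_FIXED))
```

With the stale string a zombie is reported as `X` and a dead task as `Z`; the fixed string restores `Z` for zombies and `X` for dead tasks.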
     
  • Commit 284f39afeaa4 ("mm: memcg: push !mm handling out to page cache
    charge function") explicitly checks for page cache charges without any
    mm context (from kernel thread context[1]).

    This seemed to be the only possible case where memory could be charged
    without mm context so commit 03583f1a631c ("memcg: remove unnecessary
    !mm check from try_get_mem_cgroup_from_mm()") removed the mm check from
    get_mem_cgroup_from_mm(). This however caused another NULL ptr
    dereference during early boot when loopback kernel thread splices to
    tmpfs as reported by Stephan Kulow:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000360
    IP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
    Oops: 0000 [#1] SMP
    Modules linked in: btrfs dm_multipath dm_mod scsi_dh multipath raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod parport_pc parport nls_utf8 isofs usb_storage iscsi_ibft iscsi_boot_sysfs arc4 ecb fan thermal nfs lockd fscache nls_iso8859_1 nls_cp437 sg st hid_generic usbhid af_packet sunrpc sr_mod cdrom ata_generic uhci_hcd virtio_net virtio_blk ehci_hcd usbcore ata_piix floppy processor button usb_common virtio_pci virtio_ring virtio edd squashfs loop ppa]
    CPU: 0 PID: 97 Comm: loop1 Not tainted 3.15.0-rc5-5-default #1
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    __mem_cgroup_try_charge_swapin+0x40/0xe0
    mem_cgroup_charge_file+0x8b/0xd0
    shmem_getpage_gfp+0x66b/0x7b0
    shmem_file_splice_read+0x18f/0x430
    splice_direct_to_actor+0xa2/0x1c0
    do_lo_receive+0x5a/0x60 [loop]
    loop_thread+0x298/0x720 [loop]
    kthread+0xc6/0xe0
    ret_from_fork+0x7c/0xb0

    Also Branimir Maksimovic reported the following oops which is triggered
    for the swapcache charge path from the accounting code for kernel threads:

    CPU: 1 PID: 160 Comm: kworker/u8:5 Tainted: P OE 3.15.0-rc5-core2-custom #159
    Hardware name: System manufacturer System Product Name/MAXIMUSV GENE, BIOS 1903 08/19/2013
    task: ffff880404e349b0 ti: ffff88040486a000 task.ti: ffff88040486a000
    RIP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
    Call Trace:
    __mem_cgroup_try_charge_swapin+0x45/0xf0
    mem_cgroup_charge_file+0x9c/0xe0
    shmem_getpage_gfp+0x62c/0x770
    shmem_write_begin+0x38/0x40
    generic_perform_write+0xc5/0x1c0
    __generic_file_aio_write+0x1d1/0x3f0
    generic_file_aio_write+0x4f/0xc0
    do_sync_write+0x5a/0x90
    do_acct_process+0x4b1/0x550
    acct_process+0x6d/0xa0
    do_exit+0x827/0xa70
    kthread+0xc3/0xf0

    This patch fixes the issue by reintroducing the mm check into
    get_mem_cgroup_from_mm. We could do the same trick in
    __mem_cgroup_try_charge_swapin as we do for the regular page cache path
    but it is not worth the trouble. The check is not that expensive and it is
    better to have get_mem_cgroup_from_mm more robust.

    [1] - http://marc.info/?l=linux-mm&m=139463617808941&w=2

    Fixes: 03583f1a631c ("memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()")
    Reported-and-tested-by: Stephan Kulow
    Reported-by: Branimir Maksimovic
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • MADV_WILLNEED currently does not read swapped out shmem pages back in.

    Commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page
    cache radix trees") made find_get_page() filter exceptional radix tree
    entries but failed to convert all find_get_page() callers that WANT
    exceptional entries over to find_get_entry(). One of them is shmem swap
    readahead in madvise, which now skips over any swap-out records.

    Convert it to find_get_entry().

    Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
    Signed-off-by: Johannes Weiner
    Reported-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
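
The difference between the two lookups can be shown with a dictionary standing in for the page cache radix tree. This is a toy model: the `SwapEntry` class, the dictionary layout, and the `willneed` helper are assumptions for illustration, and only the filtering behaviour described in the commit message is modelled.

```python
class SwapEntry:
    """Stand-in for an exceptional (swap-out) radix tree entry."""

    def __init__(self, slot):
        self.slot = slot


def find_get_entry(mapping, index):
    """Return whatever is stored at index, exceptional entries included."""
    return mapping.get(index)


def find_get_page(mapping, index):
    """Return only real pages; exceptional entries are filtered out."""
    entry = mapping.get(index)
    return None if isinstance(entry, SwapEntry) else entry


def willneed(mapping, start, end, lookup):
    """Collect the swap slots that readahead would bring back in."""
    slots = []
    for index in range(start, end):
        entry = lookup(mapping, index)
        if isinstance(entry, SwapEntry):
            slots.append(entry.slot)
    return slots


mapping = {0: "page0", 1: SwapEntry(7), 3: SwapEntry(9)}
print(willneed(mapping, 0, 4, find_get_page))   # []
print(willneed(mapping, 0, 4, find_get_entry))  # [7, 9]
```

With the filtering lookup the loop never sees a swap record, so nothing is read back in; with the entry lookup both swapped-out pages are found.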
     
  • In some testing I ran today (some fio jobs that spread over two nodes),
    we end up spending 40% of the time in filemap_check_errors(). That
    smells fishy. Looking further, this is basically what happens:

    blkdev_aio_read()
        generic_file_aio_read()
            filemap_write_and_wait_range()
                if (!mapping->nr_pages)
                    filemap_check_errors()

    and filemap_check_errors() always attempts two test_and_clear_bit() on
    the mapping flags, thus dirtying it for every single invocation. The
    patch below tests each of these bits before clearing them, avoiding this
    issue. In my test case (4-socket box), performance went from 1.7M IOPS
    to 4.0M IOPS.

    Signed-off-by: Jens Axboe
    Acked-by: Jeff Moyer
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
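
The idea of testing a bit before doing a read-modify-write on it can be sketched with a counter that records how often the shared word is written. This is a model only: the `Flags` class, the bit numbers, and the two function names are hypothetical, and the real cost (cache line bouncing between sockets) is represented here just by the `rmw_ops` count.

```python
import errno

AS_EIO, AS_ENOSPC = 0, 1  # hypothetical bit numbers


class Flags:
    def __init__(self, value=0):
        self.value = value
        self.rmw_ops = 0  # read-modify-write operations on the shared word

    def test_bit(self, nr):
        return bool(self.value & (1 << nr))

    def test_and_clear_bit(self, nr):
        self.rmw_ops += 1
        was_set = self.test_bit(nr)
        self.value &= ~(1 << nr)
        return was_set


def check_errors_always_clear(flags):
    ret = 0
    if flags.test_and_clear_bit(AS_ENOSPC):
        ret = -errno.ENOSPC
    if flags.test_and_clear_bit(AS_EIO):
        ret = -errno.EIO
    return ret


def check_errors_test_first(flags):
    ret = 0
    # A plain read first; only write when the bit is actually set.
    if flags.test_bit(AS_ENOSPC) and flags.test_and_clear_bit(AS_ENOSPC):
        ret = -errno.ENOSPC
    if flags.test_bit(AS_EIO) and flags.test_and_clear_bit(AS_EIO):
        ret = -errno.EIO
    return ret


clean_a, clean_b = Flags(), Flags()
check_errors_always_clear(clean_a)
check_errors_test_first(clean_b)
print(clean_a.rmw_ops, clean_b.rmw_ops)  # 2 0
```

In the common case where no error bit is set, the test-first variant performs no writes at all, while error reporting is unchanged when a bit is set.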