27 Feb, 2019

2 commits

  • [ Upstream commit 59d49076ae3e6912e6d7df2fd68e2337f3d02036 ]

    Fix the refcounting of the authentication keys in the file locking code.
    The vnode->lock_key member points to a key on which it expects to be
    holding a ref, but it isn't always given an extra ref, however.

    Fixes: 0fafdc9f888b ("afs: Fix file locking")
    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin

    David Howells
     
  • [ Upstream commit 4882a27cec24319d10f95e978ecc80050e3e3e15 ]

    A cb_interest record is not necessarily attached to the vnode on entry to
    afs_validate(), which can cause an oops when we try to bring the vnode's
    cb_s_break up to date in the default case (ie. no current callback promise
    and the vnode has not been deleted).

    Fix this by simply removing the line, as vnode->cb_s_break will be set when
    needed by afs_register_server_cb_interest() when we next get a callback
    promise from RPC call.

    The oops looks something like:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    ...
    RIP: 0010:afs_validate+0x66/0x250 [kafs]
    ...
    Call Trace:
    afs_d_revalidate+0x8d/0x340 [kafs]
    ? __d_lookup+0x61/0x150
    lookup_dcache+0x44/0x70
    ? lookup_dcache+0x44/0x70
    __lookup_hash+0x24/0xa0
    do_unlinkat+0x11d/0x2c0
    __x64_sys_unlink+0x23/0x30
    do_syscall_64+0x4d/0xf0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: ae3b7361dc0e ("afs: Fix validation/callback interaction")
    Signed-off-by: Marc Dionne
    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin

    Marc Dionne
     

17 Dec, 2018

1 commit

  • [ Upstream commit ae3b7361dc0ee9a425bf7d77ce211f533500b39b ]

    When afs_validate() is called to validate a vnode (inode), there are two
    unhandled cases in the fastpath at the top of the function:

    (1) If the vnode is promised (AFS_VNODE_CB_PROMISED is set), the break
    counters match and the data has expired, then there's an implicit case
    in which the vnode needs revalidating.

    This has no consequences since the default "valid = false" set at the
    top of the function happens to do the right thing.

    (2) If the vnode is not promised and it hasn't been deleted
    (AFS_VNODE_DELETED is not set) then there's a default case we're not
    handling in which the vnode is invalid. If the vnode is invalid, we
    need to bring cb_s_break and cb_v_break up to date before we refetch
    the status.

    As a consequence, once the server loses track of the client
    (ie. sufficient time has passed since we last sent it an operation),
    it will send us a CB.InitCallBackState* operation when we next try to
    talk to it. This calls afs_init_callback_state() which increments
    afs_server::cb_s_break, but this then doesn't propagate to the
    afs_vnode record.

    The result being that every afs_validate() call thereafter sends a
    status fetch operation to the server.

    Clarify and fix this by:

    (A) Setting valid in all the branches rather than initialising it at the
    top so that the compiler catches where we've missed.

    (B) Restructuring the logic in the 'promised' branch so that we set valid
    to false if the callback is due to expire (or has expired) and so that
    the final case is that the vnode is still valid.

    (C) Adding an else-statement that ups cb_s_break and cb_v_break if the
    promised and deleted cases don't match.

    Fixes: c435ee34551e ("afs: Overhaul the callback handling")
    Signed-off-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    David Howells
     

27 Nov, 2018

1 commit

  • [ Upstream commit 4ac15ea53622272c01954461b4814892b7481b40 ]

    Fix afs_deliver_to_call() to handle -EIO being returned by the operation
    delivery function, indicating that the call found itself in the wrong
    state, by printing an error and aborting the call.

    Currently, an assertion failure will occur. This can happen, say, if the
    delivery function falls off the end without calling afs_extract_data() with
    the want_more parameter set to false to collect the end of the Rx phase of
    a call.

    The assertion failure looks like:

    AFS: Assertion failed
    4 == 7 is false
    0x4 == 0x7 is false
    ------------[ cut here ]------------
    kernel BUG at fs/afs/rxrpc.c:462!

    and is matched in the trace buffer by a line like:

    kworker/7:3-3226 [007] ...1 85158.030203: afs_io_error: c=0003be0c r=-5 CM_REPLY

    Fixes: 98bf40cd99fc ("afs: Protect call->state changes against signals")
    Reported-by: Marc Dionne
    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin

    David Howells
     

15 Oct, 2018

1 commit

  • The recent patch to fix the afs_server struct leak didn't actually fix the
    bug, but rather fixed some of the symptoms. The problem is that an
    asynchronous call that holds a resource pointed to by call->reply[0] will
    find the pointer cleared in the call destructor, thereby preventing the
    resource from being cleaned up.

    In the case of the server record leak, the afs_fs_get_capabilities()
    function in devel code sets up a call with reply[0] pointing at the server
    record that should be altered when the result is obtained, but this was
    being cleared before the destructor was called, so the put in the
    destructor does nothing and the record is leaked.

    Commit f014ffb025c1 removed the additional ref obtained by
    afs_install_server(), but the removal of this ref is actually used by the
    garbage collector to mark a server record as being defunct after the record
    has expired through lack of use.

    The offending clearance of call->reply[0] upon completion in
    afs_process_async_call() has been there from the origin of the code, but
    none of the asynchronous calls actually use that pointer currently, so it
    should be safe to remove (note that synchronous calls don't involve this
    function).

    Fix this by the following means:

    (1) Revert commit f014ffb025c1.

    (2) Remove the clearance of reply[0] from afs_process_async_call().

    Without this, afs_manage_servers() will suffer an assertion failure if it
    sees a server record that didn't get used because the usage count is not 1.

    Fixes: f014ffb025c1 ("afs: Fix afs_server struct leak")
    Fixes: 08e0e7c82eea ("[AF_RXRPC]: Make the in-kernel AFS filesystem use AF_RXRPC.")
    Signed-off-by: David Howells
    Cc: stable
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     

12 Oct, 2018

2 commits

  • Fix a leak of afs_server structs. The routine that installs them in the
    various lookup lists and trees gets a ref on leaving the function, whether
    it added the server or a server already exists. It shouldn't increment
    the refcount if it added the server.

    The effect of this that "rmmod kafs" will hang waiting for the leaked
    server to become unused.

    Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
    Signed-off-by: David Howells
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • Access to the list of cells by /proc/net/afs/cells has a couple of
    problems:

    (1) It should be checking against SEQ_START_TOKEN for the keying the
    header line.

    (2) It's only holding the RCU read lock, so it can't just walk over the
    list without following the proper RCU methods.

    Fix these by using an hlist instead of an ordinary list and using the
    appropriate accessor functions to follow it with RCU.

    Since the code that adds a cell to the list must also necessarily change,
    sort the list on insertion whilst we're at it.

    Fixes: 989782dcdc91 ("afs: Overhaul cell database management")
    Signed-off-by: David Howells
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     

08 Sep, 2018

1 commit

  • Fix the cell specification mechanism to allow cells to be pre-created
    without having to specify at least one address (the addresses will be
    upcalled for).

    This allows the cell information preload service to avoid the need to issue
    loads of DNS lookups during boot to get the addresses for each cell (500+
    lookups for the 'standard' cell list[*]). The lookups can be done later as
    each cell is accessed through the filesystem.

    Also remove the print statement that prints a line every time a new cell is
    added.

    [*] There are 144 cells in the list. Each cell is first looked up for an
    SRV record, and if that fails, for an AFSDB record. These get a list
    of server names, each of which then has to be looked up to get the
    addresses for that server. E.g.:

    dig srv _afs3-vlserver._udp.grand.central.org

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     

24 Aug, 2018

1 commit

  • Use new return type vm_fault_t for fault handler in struct
    vm_operations_struct. For now, this is just documenting that the
    function returns a VM_FAULT value rather than an errno. Once all
    instances are converted, vm_fault_t will become a distinct type.

    See 1c8f422059ae ("mm: change return type to vm_fault_t") for reference.

    Link: http://lkml.kernel.org/r/20180702152017.GA3780@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Cc: Matthew Wilcox
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

16 Aug, 2018

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    - Gustavo A. R. Silva keeps working on the implicit switch fallthru
    changes.

    - Support 802.11ax High-Efficiency wireless in cfg80211 et al, From
    Luca Coelho.

    - Re-enable ASPM in r8169, from Kai-Heng Feng.

    - Add virtual XFRM interfaces, which avoids all of the limitations of
    existing IPSEC tunnels. From Steffen Klassert.

    - Convert GRO over to use a hash table, so that when we have many
    flows active we don't traverse a long list during accumluation.

    - Many new self tests for routing, TC, tunnels, etc. Too many
    contributors to mention them all, but I'm really happy to keep
    seeing this stuff.

    - Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu.

    - Lots of cleanups and fixes in L2TP code from Guillaume Nault.

    - Add IPSEC offload support to netdevsim, from Shannon Nelson.

    - Add support for slotting with non-uniform distribution to netem
    packet scheduler, from Yousuk Seung.

    - Add UDP GSO support to mlx5e, from Boris Pismenny.

    - Support offloading of Team LAG in NFP, from John Hurley.

    - Allow to configure TX queue selection based upon RX queue, from
    Amritha Nambiar.

    - Support ethtool ring size configuration in aquantia, from Anton
    Mikaev.

    - Support DSCP and flowlabel per-transport in SCTP, from Xin Long.

    - Support list based batching and stack traversal of SKBs, this is
    very exciting work. From Edward Cree.

    - Busyloop optimizations in vhost_net, from Toshiaki Makita.

    - Introduce the ETF qdisc, which allows time based transmissions. IGB
    can offload this in hardware. From Vinicius Costa Gomes.

    - Add parameter support to devlink, from Moshe Shemesh.

    - Several multiplication and division optimizations for BPF JIT in
    nfp driver, from Jiong Wang.

    - Lots of prepatory work to make more of the packet scheduler layer
    lockless, when possible, from Vlad Buslov.

    - Add ACK filter and NAT awareness to sch_cake packet scheduler, from
    Toke Høiland-Jørgensen.

    - Support regions and region snapshots in devlink, from Alex Vesker.

    - Allow to attach XDP programs to both HW and SW at the same time on
    a given device, with initial support in nfp. From Jakub Kicinski.

    - Add TLS RX offload and support in mlx5, from Ilya Lesokhin.

    - Use PHYLIB in r8169 driver, from Heiner Kallweit.

    - All sorts of changes to support Spectrum 2 in mlxsw driver, from
    Ido Schimmel.

    - PTP support in mv88e6xxx DSA driver, from Andrew Lunn.

    - Make TCP_USER_TIMEOUT socket option more accurate, from Jon
    Maxwell.

    - Support for templates in packet scheduler classifier, from Jiri
    Pirko.

    - IPV6 support in RDS, from Ka-Cheong Poon.

    - Native tproxy support in nf_tables, from Máté Eckl.

    - Maintain IP fragment queue in an rbtree, but optimize properly for
    in-order frags. From Peter Oskolkov.

    - Improvde handling of ACKs on hole repairs, from Yuchung Cheng"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits)
    bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT"
    hv/netvsc: Fix NULL dereference at single queue mode fallback
    net: filter: mark expected switch fall-through
    xen-netfront: fix warn message as irq device name has '/'
    cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0
    net: dsa: mv88e6xxx: missing unlock on error path
    rds: fix building with IPV6=m
    inet/connection_sock: prefer _THIS_IP_ to current_text_addr
    net: dsa: mv88e6xxx: bitwise vs logical bug
    net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
    ieee802154: hwsim: using right kind of iteration
    net: hns3: Add vlan filter setting by ethtool command -K
    net: hns3: Set tx ring' tc info when netdev is up
    net: hns3: Remove tx ring BD len register in hns3_enet
    net: hns3: Fix desc num set to default when setting channel
    net: hns3: Fix for phy link issue when using marvell phy driver
    net: hns3: Fix for information of phydev lost problem when down/up
    net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
    net: hns3: Add support for serdes loopback selftest
    bnxt_en: take coredump_record structure off stack
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • Pull vfs lookup() updates from Al Viro:
    "More conversions of ->lookup() to d_splice_alias().

    Should be reasonably complete now - the only leftovers are in ceph"

    * 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    afs_try_auto_mntpt(): return NULL instead of ERR_PTR(-ENOENT)
    afs_lookup(): switch to d_splice_alias()
    afs: switch dynroot lookups to d_splice_alias()
    hpfs: fix an inode leak in lookup, switch to d_splice_alias()
    hostfs_lookup: switch to d_splice_alias()

    Linus Torvalds
     

06 Aug, 2018

3 commits


04 Aug, 2018

1 commit


21 Jun, 2018

1 commit

  • While __atomic_add_unless() was originally intended as a building-block
    for atomic_add_unless(), it's now used in a number of places around the
    kernel. It's the only common atomic operation named __atomic*(), rather
    than atomic_*(), and for consistency it would be better named
    atomic_fetch_add_unless().

    This lack of consistency is slightly confusing, and gets in the way of
    scripting atomics. Given that, let's clean things up and promote it to
    an official part of the atomics API, in the form of
    atomic_fetch_add_unless().

    This patch converts definitions and invocations over to the new name,
    including the instrumented version, using the following script:

    ----
    git grep -w __atomic_add_unless | while read line; do
    sed -i '{s/\/atomic_fetch_add_unless/}' "${line%%:*}";
    done
    git grep -w __arch_atomic_add_unless | while read line; do
    sed -i '{s/\/arch_atomic_fetch_add_unless/}' "${line%%:*}";
    done
    ----

    Note that we do not have atomic{64,_long}_fetch_add_unless(), which will
    be introduced by later patches.

    There should be no functional change as a result of this patch.

    Signed-off-by: Mark Rutland
    Reviewed-by: Will Deacon
    Acked-by: Geert Uytterhoeven
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Palmer Dabbelt
    Cc: Boqun Feng
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
     

16 Jun, 2018

1 commit

  • Pull AFS updates from Al Viro:
    "Assorted AFS stuff - ended up in vfs.git since most of that consists
    of David's AFS-related followups to Christoph's procfs series"

    * 'afs-proc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    afs: Optimise callback breaking by not repeating volume lookup
    afs: Display manually added cells in dynamic root mount
    afs: Enable IPv6 DNS lookups
    afs: Show all of a server's addresses in /proc/fs/afs/servers
    afs: Handle CONFIG_PROC_FS=n
    proc: Make inline name size calculation automatic
    afs: Implement network namespacing
    afs: Mark afs_net::ws_cell as __rcu and set using rcu functions
    afs: Fix a Sparse warning in xdr_decode_AFSFetchStatus()
    proc: Add a way to make network proc files writable
    afs: Rearrange fs/afs/proc.c to remove remaining predeclarations.
    afs: Rearrange fs/afs/proc.c to move the show routines up
    afs: Rearrange fs/afs/proc.c by moving fops and open functions down
    afs: Move /proc management functions to the end of the file

    Linus Torvalds
     

15 Jun, 2018

6 commits

  • At the moment, afs_break_callbacks calls afs_break_one_callback() for each
    separate FID it was given, and the latter looks up the volume individually
    for each one.

    However, this is inefficient if two or more FIDs have the same vid as we
    could reuse the volume. This is complicated by cell aliasing whereby we
    may have multiple cells sharing a volume and can therefore have multiple
    callback interests for any particular volume ID.

    At the moment afs_break_one_callback() scans the entire list of volumes
    we're getting from a server and breaks the appropriate callback in every
    matching volume, regardless of cell. This scan is done for every FID.

    Optimise callback breaking by the following means:

    (1) Sort the FID list by vid so that all FIDs belonging to the same volume
    are clumped together.

    This is done through the use of an indirection table as we cannot do
    an insertion sort on the afs_callback_break array as we decode FIDs
    into it as we subsequently also have to decode callback info into it
    that corresponds by array index only.

    We also don't really want to bubblesort afterwards if we can avoid it.

    (2) Sort the server->cb_interests array by vid so that all the matching
    volumes are grouped together. This permits the scan to stop after
    finding a record that has a higher vid.

    (3) When breaking FIDs, we try to keep server->cb_break_lock as long as
    possible, caching the start point in the array for that volume group
    as long as possible.

    It might make sense to add another layer in that list and have a
    refcounted volume ID anchor that has the matching interests attached
    to it rather than being in the list. This would allow the lock to be
    dropped without losing the cursor.

    Signed-off-by: David Howells

    David Howells
     
  • Alter the dynroot mount so that cells created by manipulation of
    /proc/fs/afs/cells and /proc/fs/afs/rootcell and by specification of a root
    cell as a module parameter will cause directories for those cells to be
    created in the dynamic root superblock for the network namespace[*].

    To this end:

    (1) Only one dynamic root superblock is now created per network namespace
    and this is shared between all attempts to mount it. This makes it
    easier to find the superblock to modify.

    (2) When a dynamic root superblock is created, the list of cells is walked
    and directories created for each cell already defined.

    (3) When a new cell is added, if a dynamic root superblock exists, a
    directory is created for it.

    (4) When a cell is destroyed, the directory is removed.

    (5) These directories are created by calling lookup_one_len() on the root
    dir which automatically creates them if they don't exist.

    [*] Inasmuch as network namespaces are currently supported here.

    Signed-off-by: David Howells

    David Howells
     
  • Remove the restriction on DNS lookup upcalls that prevents ipv6 addresses
    from being looked up.

    Signed-off-by: David Howells

    David Howells
     
  • Show all of a server's addresses in /proc/fs/afs/servers, placing the
    second plus addresses on padded lines of their own. The current address is
    marked with a star.

    Signed-off-by: David Howells

    David Howells
     
  • The AFS filesystem depends at the moment on /proc for configuration and
    also presents information that way - however, this causes a compilation
    failure if procfs is disabled.

    Fix it so that the procfs bits aren't compiled in if procfs is disabled.

    This means that you can't configure the AFS filesystem directly, but it is
    still usable provided that an up-to-date keyutils is installed to look up
    cells by SRV or AFSDB DNS records.

    Reported-by: Al Viro
    Signed-off-by: David Howells

    David Howells
     
  • Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
    "This is a late set of changes from Deepa Dinamani doing an automated
    treewide conversion of the inode and iattr structures from 'timespec'
    to 'timespec64', to push the conversion from the VFS layer into the
    individual file systems.

    As Deepa writes:

    'The series aims to switch vfs timestamps to use struct timespec64.
    Currently vfs uses struct timespec, which is not y2038 safe.

    The series involves the following:
    1. Add vfs helper functions for supporting struct timepec64
    timestamps.
    2. Cast prints of vfs timestamps to avoid warnings after the switch.
    3. Simplify code using vfs timestamps so that the actual replacement
    becomes easy.
    4. Convert vfs timestamps to use struct timespec64 using a script.
    This is a flag day patch.

    Next steps:
    1. Convert APIs that can handle timespec64, instead of converting
    timestamps at the boundaries.
    2. Update internal data structures to avoid timestamp conversions'

    Thomas Gleixner adds:

    'I think there is no point to drag that out for the next merge
    window. The whole thing needs to be done in one go for the core
    changes which means that you're going to play that catchup game
    forever. Let's get over with it towards the end of the merge window'"

    * tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
    pstore: Remove bogus format string definition
    vfs: change inode times to use struct timespec64
    pstore: Convert internal records to timespec64
    udf: Simplify calls to udf_disk_stamp_to_time
    fs: nfs: get rid of memcpys for inode times
    ceph: make inode time prints to be long long
    lustre: Use long long type to print inode time
    fs: add timespec64_truncate()

    Linus Torvalds
     

13 Jun, 2018

1 commit

  • The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
    patch replaces cases of:

    kmalloc(a * b, gfp)

    with:
    kmalloc_array(a * b, gfp)

    as well as handling cases of:

    kmalloc(a * b * c, gfp)

    with:

    kmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The tools/ directory was manually excluded, since it has its own
    implementation of kmalloc().

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kmalloc
    + kmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kmalloc(sizeof(THING) * C2, ...)
    |
    kmalloc(sizeof(TYPE) * C2, ...)
    |
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(C1 * C2, ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

07 Jun, 2018

3 commits

  • Pull networking updates from David Miller:

    1) Add Maglev hashing scheduler to IPVS, from Inju Song.

    2) Lots of new TC subsystem tests from Roman Mashak.

    3) Add TCP zero copy receive and fix delayed acks and autotuning with
    SO_RCVLOWAT, from Eric Dumazet.

    4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
    Brouer.

    5) Add ttl inherit support to vxlan, from Hangbin Liu.

    6) Properly separate ipv6 routes into their logically independant
    components. fib6_info for the routing table, and fib6_nh for sets of
    nexthops, which thus can be shared. From David Ahern.

    7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
    messages from XDP programs. From Nikita V. Shirokov.

    8) Lots of long overdue cleanups to the r8169 driver, from Heiner
    Kallweit.

    9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.

    10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.

    11) Plumb extack down into fib_rules, from Roopa Prabhu.

    12) Add Flower classifier offload support to igb, from Vinicius Costa
    Gomes.

    13) Add UDP GSO support, from Willem de Bruijn.

    14) Add documentation for eBPF helpers, from Quentin Monnet.

    15) Add TLS tx offload to mlx5, from Ilya Lesokhin.

    16) Allow applications to be given the number of bytes available to read
    on a socket via a control message returned from recvmsg(), from
    Soheil Hassas Yeganeh.

    17) Add x86_32 eBPF JIT compiler, from Wang YanQing.

    18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
    From Björn Töpel.

    19) Remove indirect load support from all of the BPF JITs and handle
    these operations in the verifier by translating them into native BPF
    instead. From Daniel Borkmann.

    20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.

    21) Allow XDP programs to do lookups in the main kernel routing tables
    for forwarding. From David Ahern.

    22) Allow drivers to store hardware state into an ELF section of kernel
    dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.

    23) Various RACK and loss detection improvements in TCP, from Yuchung
    Cheng.

    24) Add TCP SACK compression, from Eric Dumazet.

    25) Add User Mode Helper support and basic bpfilter infrastructure, from
    Alexei Starovoitov.

    26) Support ports and protocol values in RTM_GETROUTE, from Roopa
    Prabhu.

    27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
    Brouer.

    28) Add lots of forwarding selftests, from Petr Machata.

    29) Add generic network device failover driver, from Sridhar Samudrala.

    * ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
    strparser: Add __strp_unpause and use it in ktls.
    rxrpc: Fix terminal retransmission connection ID to include the channel
    net: hns3: Optimize PF CMDQ interrupt switching process
    net: hns3: Fix for VF mailbox receiving unknown message
    net: hns3: Fix for VF mailbox cannot receiving PF response
    bnx2x: use the right constant
    Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
    net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
    enic: fix UDP rss bits
    netdev-FAQ: clarify DaveM's position for stable backports
    rtnetlink: validate attributes in do_setlink()
    mlxsw: Add extack messages for port_{un, }split failures
    netdevsim: Add extack error message for devlink reload
    devlink: Add extack to reload and port_{un, }split operations
    net: metrics: add proper netlink validation
    ipmr: fix error path when ipmr_new_table fails
    ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
    net: hns3: remove unused hclgevf_cfg_func_mta_filter
    netfilter: provide udp*_lib_lookup for nf_tproxy
    qed*: Utilize FW 8.37.2.0
    ...

    Linus Torvalds
     
  • Pull overflow updates from Kees Cook:
    "This adds the new overflow checking helpers and adds them to the
    2-factor argument allocators. And this adds the saturating size
    helpers and does a treewide replacement for the struct_size() usage.
    Additionally this adds the overflow testing modules to make sure
    everything works.

    I'm still working on the treewide replacements for allocators with
    "simple" multiplied arguments:

    *alloc(a * b, ...) -> *alloc_array(a, b, ...)

    and

    *zalloc(a * b, ...) -> *calloc(a, b, ...)

    as well as the more complex cases, but that's separable from this
    portion of the series. I expect to have the rest sent before -rc1
    closes; there are a lot of messy cases to clean up.

    Summary:

    - Introduce arithmetic overflow test helper functions (Rasmus)

    - Use overflow helpers in 2-factor allocators (Kees, Rasmus)

    - Introduce overflow test module (Rasmus, Kees)

    - Introduce saturating size helper functions (Matthew, Kees)

    - Treewide use of struct_size() for allocators (Kees)"

    * tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    treewide: Use struct_size() for devm_kmalloc() and friends
    treewide: Use struct_size() for vmalloc()-family
    treewide: Use struct_size() for kmalloc()-family
    device: Use overflow helpers for devm_kmalloc()
    mm: Use overflow helpers in kvmalloc()
    mm: Use overflow helpers in kmalloc_array*()
    test_overflow: Add memory allocation overflow tests
    overflow.h: Add allocation size calculation helpers
    test_overflow: Report test failures
    test_overflow: macrofy some more, do more tests for free
    lib: add runtime test of check_*_overflow functions
    compiler.h: enable builtin overflow checkers and add fallback code

    Linus Torvalds
     
  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along
    with memory for some number of elements for that array. For example:

    struct foo {
    int stuff;
    void *entry[];
    };

    instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

    Instead of leaving these open-coded and prone to type mistakes, we can
    now use the new struct_size() helper:

    instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

    This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
    uses. It was done via automatic conversion with manual review for the
    "CHECKME" non-standard cases noted below, using the following Coccinelle
    script:

    // pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
    // sizeof *pkey_cache->table, GFP_KERNEL);
    @@
    identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
    expression GFP;
    identifier VAR, ELEMENT;
    expression COUNT;
    @@

    - alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
    + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

    // mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
    @@
    identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
    expression GFP;
    identifier VAR, ELEMENT;
    expression COUNT;
    @@

    - alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
    + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

    // Same pattern, but can't trivially locate the trailing element name,
    // or variable name.
    @@
    identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
    expression GFP;
    expression SOMETHING, COUNT, ELEMENT;
    @@

    - alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
    + alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)

    Signed-off-by: Kees Cook

    Kees Cook
     

06 Jun, 2018

1 commit

  • struct timespec is not y2038 safe. Transition vfs to use
    y2038 safe struct timespec64 instead.

    The change was made with the help of the following cocinelle
    script. This catches about 80% of the changes.
    All the header file and logic changes are included in the
    first 5 rules. The rest are trivial substitutions.
    I avoid changing any of the function signatures or any other
    filesystem specific data structures to keep the patch simple
    for review.

    The script can be a little shorter by combining different cases.
    But, this version was sufficient for my usecase.

    virtual patch

    @ depends on patch @
    identifier now;
    @@
    - struct timespec
    + struct timespec64
    current_time ( ... )
    {
    - struct timespec now = current_kernel_time();
    + struct timespec64 now = current_kernel_time64();
    ...
    - return timespec_trunc(
    + return timespec64_trunc(
    ... );
    }

    @ depends on patch @
    identifier xtime;
    @@
    struct \( iattr \| inode \| kstat \) {
    ...
    - struct timespec xtime;
    + struct timespec64 xtime;
    ...
    }

    @ depends on patch @
    identifier t;
    @@
    struct inode_operations {
    ...
    int (*update_time) (...,
    - struct timespec t,
    + struct timespec64 t,
    ...);
    ...
    }

    @ depends on patch @
    identifier t;
    identifier fn_update_time =~ "update_time$";
    @@
    fn_update_time (...,
    - struct timespec *t,
    + struct timespec64 *t,
    ...) { ... }

    @ depends on patch @
    identifier t;
    @@
    lease_get_mtime( ... ,
    - struct timespec *t
    + struct timespec64 *t
    ) { ... }

    @te depends on patch forall@
    identifier ts;
    local idexpression struct inode *inode_node;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier fn_update_time =~ "update_time$";
    identifier fn;
    expression e, E3;
    local idexpression struct inode *node1;
    local idexpression struct inode *node2;
    local idexpression struct iattr *attr1;
    local idexpression struct iattr *attr2;
    local idexpression struct iattr attr;
    identifier i_xtime1 =~ "^i_[acm]time$";
    identifier i_xtime2 =~ "^i_[acm]time$";
    identifier ia_xtime1 =~ "^ia_[acm]time$";
    identifier ia_xtime2 =~ "^ia_[acm]time$";
    @@
    (
    (
    - struct timespec ts;
    + struct timespec64 ts;
    |
    - struct timespec ts = current_time(inode_node);
    + struct timespec64 ts = current_time(inode_node);
    )

    i_xtime, &ts)
    + timespec64_equal(&inode_node->i_xtime, &ts)
    |
    - timespec_equal(&ts, &inode_node->i_xtime)
    + timespec64_equal(&ts, &inode_node->i_xtime)
    |
    - timespec_compare(&inode_node->i_xtime, &ts)
    + timespec64_compare(&inode_node->i_xtime, &ts)
    |
    - timespec_compare(&ts, &inode_node->i_xtime)
    + timespec64_compare(&ts, &inode_node->i_xtime)
    |
    ts = current_time(e)
    |
    fn_update_time(..., &ts,...)
    |
    inode_node->i_xtime = ts
    |
    node1->i_xtime = ts
    |
    ts = inode_node->i_xtime
    |
    ia_xtime ...+> = ts
    |
    ts = attr1->ia_xtime
    |
    ts.tv_sec
    |
    ts.tv_nsec
    |
    btrfs_set_stack_timespec_sec(..., ts.tv_sec)
    |
    btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
    |
    - ts = timespec64_to_timespec(
    + ts =
    ...
    -)
    |
    - ts = ktime_to_timespec(
    + ts = ktime_to_timespec64(
    ...)
    |
    - ts = E3
    + ts = timespec_to_timespec64(E3)
    |
    - ktime_get_real_ts(&ts)
    + ktime_get_real_ts64(&ts)
    |
    fn(...,
    - ts
    + timespec64_to_timespec(ts)
    ,...)
    )
    ...+>
    (

    )
    |
    - timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
    + timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
    |
    - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
    + timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
    |
    - timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
    + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
    |
    node1->i_xtime1 =
    - timespec_trunc(attr1->ia_xtime1,
    + timespec64_trunc(attr1->ia_xtime1,
    ...)
    |
    - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
    + attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
    ...)
    |
    - ktime_get_real_ts(&attr1->ia_xtime1)
    + ktime_get_real_ts64(&attr1->ia_xtime1)
    |
    - ktime_get_real_ts(&attr.ia_xtime1)
    + ktime_get_real_ts64(&attr.ia_xtime1)
    )

    @ depends on patch @
    struct inode *node;
    struct iattr *attr;
    identifier fn;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    expression e;
    @@
    (
    - fn(node->i_xtime);
    + fn(timespec64_to_timespec(node->i_xtime));
    |
    fn(...,
    - node->i_xtime);
    + timespec64_to_timespec(node->i_xtime));
    |
    - e = fn(attr->ia_xtime);
    + e = fn(timespec64_to_timespec(attr->ia_xtime));
    )

    @ depends on patch forall @
    struct inode *node;
    struct iattr *attr;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier fn;
    @@
    {
    + struct timespec ts;
    i_xtime);
    fn (...,
    - &node->i_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    fn (...,
    - &attr->ia_xtime,
    + &ts,
    ...);
    )
    ...+>
    }

    @ depends on patch forall @
    struct inode *node;
    struct iattr *attr;
    struct kstat *stat;
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier i_xtime =~ "^i_[acm]time$";
    identifier xtime =~ "^[acm]time$";
    identifier fn, ret;
    @@
    {
    + struct timespec ts;
    i_xtime);
    ret = fn (...,
    - &node->i_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(node->i_xtime);
    ret = fn (...,
    - &node->i_xtime);
    + &ts);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    ret = fn (...,
    - &attr->ia_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    ret = fn (...,
    - &attr->ia_xtime);
    + &ts);
    |
    + ts = timespec64_to_timespec(stat->xtime);
    ret = fn (...,
    - &stat->xtime);
    + &ts);
    )
    ...+>
    }

    @ depends on patch @
    struct inode *node;
    struct inode *node2;
    identifier i_xtime1 =~ "^i_[acm]time$";
    identifier i_xtime2 =~ "^i_[acm]time$";
    identifier i_xtime3 =~ "^i_[acm]time$";
    struct iattr *attrp;
    struct iattr *attrp2;
    struct iattr attr ;
    identifier ia_xtime1 =~ "^ia_[acm]time$";
    identifier ia_xtime2 =~ "^ia_[acm]time$";
    struct kstat *stat;
    struct kstat stat1;
    struct timespec64 ts;
    identifier xtime =~ "^[acmb]time$";
    expression e;
    @@
    (
    ( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
    |
    node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
    |
    node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
    |
    node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
    |
    stat->xtime = node2->i_xtime1;
    |
    stat1.xtime = node2->i_xtime1;
    |
    ( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
    |
    ( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
    |
    - e = node->i_xtime1;
    + e = timespec64_to_timespec( node->i_xtime1 );
    |
    - e = attrp->ia_xtime1;
    + e = timespec64_to_timespec( attrp->ia_xtime1 );
    |
    node->i_xtime1 = current_time(...);
    |
    node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
    - e;
    + timespec_to_timespec64(e);
    |
    node->i_xtime1 = node->i_xtime3 =
    - e;
    + timespec_to_timespec64(e);
    |
    - node->i_xtime1 = e;
    + node->i_xtime1 = timespec_to_timespec64(e);
    )

    Signed-off-by: Deepa Dinamani
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:

    Deepa Dinamani
     

05 Jun, 2018

2 commits

  • Sometimes an in-progress call will stop responding on the fileserver when
    the fileserver quietly cancels the call with an internally marked abort
    (RX_CALL_DEAD), without sending an ABORT to the client.

    This causes the client's call to eventually expire from lack of incoming
    packets directed its way, which currently leads to it being cancelled
    locally with ETIME. Note that it's not currently clear as to why this
    happens as it's really hard to reproduce.

    The rotation policy implement by kAFS, however, doesn't differentiate
    between ETIME meaning we didn't get any response from the server and ETIME
    meaning the call got cancelled mid-flow. The latter leads to an oops when
    fetching data as the rotation partially resets the afs_read descriptor,
    which can result in a cleared page pointer being dereferenced because that
    page has already been filled.

    Handle this by the following means:

    (1) Set a flag on a call when we receive a packet for it.

    (2) Store the highest packet serial number so far received for a call
    (bearing in mind this may wrap).

    (3) If, when the "not received anything recently" timeout expires on a
    call, we've received at least one packet for a call and the connection
    as a whole has received packets more recently than that call, then
    cancel the call locally with ECONNRESET rather than ETIME.

    This indicates that the call was definitely in progress on the server.

    (4) In kAFS, if the rotation algorithm sees ECONNRESET rather than ETIME,
    don't try the next server, but rather abort the call.

    This avoids the oops as we don't try to reuse the afs_read struct.
    Rather, as-yet ungotten pages will be reread at a later data.

    Also:

    (5) Add an rxrpc tracepoint to log detection of the call being reset.

    Without this, I occasionally see an oops like the following:

    general protection fault: 0000 [#1] SMP PTI
    ...
    RIP: 0010:_copy_to_iter+0x204/0x310
    RSP: 0018:ffff8800cae0f828 EFLAGS: 00010206
    RAX: 0000000000000560 RBX: 0000000000000560 RCX: 0000000000000560
    RDX: ffff8800cae0f968 RSI: ffff8800d58b3312 RDI: 0005080000000000
    RBP: ffff8800cae0f968 R08: 0000000000000560 R09: ffff8800ca00f400
    R10: ffff8800c36f28d4 R11: 00000000000008c4 R12: ffff8800cae0f958
    R13: 0000000000000560 R14: ffff8800d58b3312 R15: 0000000000000560
    FS: 00007fdaef108080(0000) GS:ffff8800ca680000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fb28a8fa000 CR3: 00000000d2a76002 CR4: 00000000001606e0
    Call Trace:
    skb_copy_datagram_iter+0x14e/0x289
    rxrpc_recvmsg_data.isra.0+0x6f3/0xf68
    ? trace_buffer_unlock_commit_regs+0x4f/0x89
    rxrpc_kernel_recv_data+0x149/0x421
    afs_extract_data+0x1e0/0x798
    ? afs_wait_for_call_to_complete+0xc9/0x52e
    afs_deliver_fs_fetch_data+0x33a/0x5ab
    afs_deliver_to_call+0x1ee/0x5e0
    ? afs_wait_for_call_to_complete+0xc9/0x52e
    afs_wait_for_call_to_complete+0x12b/0x52e
    ? wake_up_q+0x54/0x54
    afs_make_call+0x287/0x462
    ? afs_fs_fetch_data+0x3e6/0x3ed
    ? rcu_read_lock_sched_held+0x5d/0x63
    afs_fs_fetch_data+0x3e6/0x3ed
    afs_fetch_data+0xbb/0x14a
    afs_readpages+0x317/0x40d
    __do_page_cache_readahead+0x203/0x2ba
    ? ondemand_readahead+0x3a7/0x3c1
    ondemand_readahead+0x3a7/0x3c1
    generic_file_buffered_read+0x18b/0x62f
    __vfs_read+0xdb/0xfe
    vfs_read+0xb2/0x137
    ksys_read+0x50/0x8c
    do_syscall_64+0x7d/0x1a0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Note the weird value in RDI which is a result of trying to kmap() a NULL
    page pointer.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • Pull procfs updates from Al Viro:
    "Christoph's proc_create_... cleanups series"

    * 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits)
    xfs, proc: hide unused xfs procfs helpers
    isdn/gigaset: add back gigaset_procinfo assignment
    proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields
    tty: replace ->proc_fops with ->proc_show
    ide: replace ->proc_fops with ->proc_show
    ide: remove ide_driver_proc_write
    isdn: replace ->proc_fops with ->proc_show
    atm: switch to proc_create_seq_private
    atm: simplify procfs code
    bluetooth: switch to proc_create_seq_data
    netfilter/x_tables: switch to proc_create_seq_private
    netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data
    neigh: switch to proc_create_seq_data
    hostap: switch to proc_create_{seq,single}_data
    bonding: switch to proc_create_seq_data
    rtc/proc: switch to proc_create_single_data
    drbd: switch to proc_create_single
    resource: switch to proc_create_seq_data
    staging/rtl8192u: simplify procfs code
    jfs: simplify procfs code
    ...

    Linus Torvalds
     

03 Jun, 2018

1 commit


23 May, 2018

3 commits

  • Implement network namespacing within AFS, but don't yet let mounts occur
    outside the init namespace. An additional patch will be required propagate
    the network namespace across automounts.

    Signed-off-by: David Howells

    David Howells
     
  • The afs_net::ws_cell member is sometimes used under RCU conditions from
    within an seq-readlock. It isn't, however, marked __rcu and it isn't set
    using the proper RCU barrier-imposing functions.

    Fix this by annotating it with __rcu and using appropriate barriers to
    make sure accesses are correctly ordered.

    Without this, the code can produce the following warning:

    >> fs/afs/proc.c:151:24: sparse: incompatible types in comparison expression (different address spaces)

    Fixes: f044c8847bb6 ("afs: Lay the groundwork for supporting network namespaces")
    Reported-by: kbuild test robot
    Signed-off-by: David Howells

    David Howells
     
  • Sparse doesn't appear able to handle the conditionally-taken locks in
    xdr_decode_AFSFetchStatus(), even though the lock and unlock are both
    contingent on the same unvarying function argument.

    Deal with this by interpolating a wrapper function that takes the lock if
    needed and calls xdr_decode_AFSFetchStatus() on two separate branches, one
    with the lock held and one without.

    This allows Sparse to work out the locking.

    Signed-off-by: David Howells

    David Howells
     

18 May, 2018

4 commits


17 May, 2018

2 commits

  • In theory the AFS_VLSF_BACKVOL flag for a server in a vldb entry
    would indicate the presence of a backup volume on that server.

    In practice however, this flag is never set, and the presence of
    a backup volume is implied by the entry having AFS_VLF_BACKEXISTS set,
    for the server that hosts the read-write volume (has AFS_VLSF_RWVOL).

    Signed-off-by: Marc Dionne
    Signed-off-by: David Howells

    Marc Dionne
     
  • Doing faccessat("/afs/some/directory", 0) triggers a BUG in the permissions
    check code.

    Fix this by just removing the BUG section. If no permissions are asked
    for, just return okay if the file exists.

    Also:

    (1) Split up the directory check so that it has separate if-statements
    rather than if-else-if (e.g. checking for MAY_EXEC shouldn't skip the
    check for MAY_READ and MAY_WRITE).

    (2) Check for MAY_CHDIR as MAY_EXEC.

    Without the main fix, the following BUG may occur:

    kernel BUG at fs/afs/security.c:386!
    invalid opcode: 0000 [#1] SMP PTI
    ...
    RIP: 0010:afs_permission+0x19d/0x1a0 [kafs]
    ...
    Call Trace:
    ? inode_permission+0xbe/0x180
    ? do_faccessat+0xdc/0x270
    ? do_syscall_64+0x60/0x1f0
    ? entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Fixes: 00d3b7a4533e ("[AFS]: Add security support.")
    Reported-by: Jonathan Billings
    Signed-off-by: David Howells

    David Howells