11 Jun, 2014

1 commit


04 Jun, 2014

1 commit


03 Jun, 2014

2 commits

  • The current probe_filter_length() (the function that calculates the
    length of a test BPF filter) behavior is to declare the end of the
    filter as soon as it finds {0, *, *, 0}. This is actually a valid
    insn ("ld #0"), so any filter which includes "BPF_STMT(BPF_LD | BPF_IMM, 0)"
    fails (its length is cut short).

    We are changing probe_filter_length() so as to start from the end, and
    declare the end of the filter as the first instruction which is not
    {0, *, *, 0}. This solution produces a simpler patch than the
    alternative of using an explicit end-of-filter mark. It is technically
    incorrect if your filter ends with "ld #0", but that should not
    happen anyway.

    We also add a new test (LD_IMM_0) that includes ld #0 (does not work
    without this patch).

    Signed-off-by: Chema Gonzalez
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Chema Gonzalez
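
    The backward scan described above can be sketched in userspace as
    follows. This is only an illustration under stated assumptions: the
    struct insn_sketch type and the filter_length_from_end() name are
    hypothetical stand-ins, not the code from lib/test_bpf.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a classic BPF instruction {code, jt, jf, k}. */
struct insn_sketch {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/*
 * Walk a fixed-size, zero-padded instruction array from the end and return
 * the number of instructions up to and including the last one that is not
 * {0, *, *, 0}. A "ld #0" in the middle of the filter no longer cuts the
 * length short; only trailing {0, *, *, 0} entries are dropped.
 */
static size_t filter_length_from_end(const struct insn_sketch *fp, size_t max)
{
	size_t len;

	assert(fp != NULL);
	for (len = max; len > 0; len--)
		if (fp[len - 1].code != 0 || fp[len - 1].k != 0)
			break;
	return len;
}
```

    With an array holding "ld #0" followed by "ret #1" and zero padding,
    the sketch reports a length of 2, whereas a forward scan that stops
    at the first {0, *, *, 0} would report 0.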
     
  • Any process is able to send netlink messages with leftover bytes.
    Make the warning rate-limited to prevent too much log spam.

    The warning is supposed to help find userspace bugs, so print the
    triggering command name to implicate the buggy program.

    [v2: Use pr_warn_ratelimited instead of printk_ratelimited.]

    Signed-off-by: Michal Schmidt
    Signed-off-by: David S. Miller

    Michal Schmidt
     

02 Jun, 2014

4 commits

  • This check tests that overloading BPF_LD | BPF_ABS with an
    always-invalid BPF extension, namely SKF_AD_MAX, fails, to
    make sure classic BPF behaviour is correct in the filter checker.

    Also, we add a test for loading at packet offset SKF_AD_OFF-1,
    which should pass the filter checker but later fail at runtime.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Also add a test for the scratch memory store that first fills
    all slots and then successively reads all of them back adding
    up to A, and eventually returning A. This and the previous
    M[] test with alternating fill/spill will detect possible JIT
    errors on M[].

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This reverts commit 70a640d0dae3a9b1b222ce673eb5d92c263ddd61
    ("net/mlx4_en: Use affinity hint") and commit
    c8865b64b05b2f4eeefd369373e9c8aeb069e7a1 ("cpumask: Utility function
    to set n'th cpu - local cpu first") because these changes break
    the build when SMP is disabled amongst other things.

    Reported-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David S. Miller
     
  • This function sets the n'th CPU in the mask, picking local CPUs first.
    For example, on a 16-core server where the even-numbered CPUs are
    local, it yields the following values:
    cpumask_set_cpu_local_first(0, numa, cpumask) => cpu 0 is set
    cpumask_set_cpu_local_first(1, numa, cpumask) => cpu 2 is set
    ...
    cpumask_set_cpu_local_first(7, numa, cpumask) => cpu 14 is set
    cpumask_set_cpu_local_first(8, numa, cpumask) => cpu 1 is set
    cpumask_set_cpu_local_first(9, numa, cpumask) => cpu 3 is set
    ...
    cpumask_set_cpu_local_first(15, numa, cpumask) => cpu 15 is set

    Currently this function is used by multi-queue networking devices to
    calculate the IRQ affinity mask, so that as many local CPUs as
    possible are used to handle the mq device's IRQs.

    Signed-off-by: Amir Vadai
    Signed-off-by: David S. Miller

    Amir Vadai
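
    The "local first" ordering in the example above can be sketched in
    plain C. This is a userspace illustration only: the function name
    nth_cpu_local_first() and the use of a bitmask for the local CPUs are
    assumptions made here, not the kernel's cpumask API.

```c
#include <assert.h>

/*
 * Return the CPU chosen for index n when the CPUs whose bit is set in
 * local_mask are handed out first (in ascending order), followed by the
 * remaining CPUs (also ascending). Returns -1 if n is out of range.
 */
static int nth_cpu_local_first(unsigned int n, unsigned int nr_cpus,
			       unsigned long local_mask)
{
	unsigned int cpu, pass;

	assert(nr_cpus <= 8 * sizeof(local_mask));
	for (pass = 0; pass < 2; pass++) {
		for (cpu = 0; cpu < nr_cpus; cpu++) {
			int is_local = (int)((local_mask >> cpu) & 1UL);

			/* pass 0 visits local CPUs, pass 1 visits the others */
			if (is_local == (pass == 0) && n-- == 0)
				return (int)cpu;
		}
	}
	return -1;
}
```

    With nr_cpus = 16 and local_mask = 0x5555 (even CPUs local), indexes
    0..7 map to CPUs 0, 2, ..., 14 and indexes 8..15 map to CPUs
    1, 3, ..., 15, matching the table in the commit message.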
     

31 May, 2014

3 commits

  • John W. Linville says:

    ====================
    Please pull this batch of updates intended for 3.16...

    For the mac80211 bits, Johannes says:

    "Here I just have Heikki's rfkill GPIO cleanups.

    The ARM/tegra patch is OK with the maintainer (Stephen). Let me know of
    any problems."

    and;

    "We have a whole bunch of work on CSA by Andrei, Luca and Michal, but
    unfortunately it doesn't seem quite complete yet so it's still disabled.
    There's some TDLS work from Arik, and the rest is mostly minor fixes and
    cleanups."

    For the NFC bits, Samuel says:

    "This is the NFC pull request for 3.16. We have:

    - STMicroelectronics st21nfca support. The st21nfca is an HCI chipset and
    thus relies on the HCI stack. This submission provides support for tag
    reader/writer mode (including Type 5) and device tree bindings.

    - PM runtime support and a bunch of bug fixes for TI's trf7970a.

    - Device tree support for NXP's pn544. Legacy platform data support is
    obviously kept intact.

    - NFC Tag type 4B support to the NFC Digital stack.

    - SOCK_RAW type support to the raw NFC socket, and allow NCI
    sniffing from that. This can be extended to report HCI frames and also
    proprietary ones like e.g. the pn533 ones."

    For the iwlwifi bits, Emmanuel says:

    "Eran continues to work on new devices, Eyal is still digging in
    the rate control stuff, and Johannes added new functionality to the
    debug system we have in place now along with a few cleanups he made
    on the way. That's pretty much it."

    and;

    "Avri continues to work on the power code and Eran is improving the
    NVM handling in preparation for new devices on which he works
    with Liad. Luca cleans up the code a bit while working on CSA. I have
    the regular BT Coex stuff and a small lockdep fix. Johannes has his
    regular amount of clean ups and improvements, the main one is the
    ability to leave 2 chains open to improve diversity and hence the
    throughput in high attenuation scenarios."

    and;

    "The regular amount of housekeeping here. I merged iwlwifi-fixes.git to
    be able to add the patch you didn't want in wireless.git at that stage
    of the -rc cycle. Luca has a few preparations for CSA implementation
    and also what seems to be a bugfix for P2P but hasn't caused issues
    we could notice."

    For the Atheros bits, Kalle says:

    "For ath10k Michal did various small fixes on how we handle
    hardware/firmware problems and he also fixed two memory leaks."

    Also included are a couple of pulls from the wireless tree to
    avoid/resolve merge issues...
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch converts raw opcodes for tcpdump tests into
    BPF_STMT()/BPF_JUMP() combinations, which brings them in line
    with the rest of the tests and makes it easier to grasp
    what is going on in these particular test cases if they
    ever fail. Also arrange the payload of the jump+holes test
    in the same way as the other packet payloads in the test
    suite.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This test for classic BPF probes store and load combinations
    via X on all 16 slots of the scratch memory store. It
    initially loads the integer 100 and passes this value through
    each slot while incrementing it every time, so we expect
    116 as the result. Might be useful for JIT testing.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
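
    The arithmetic behind the expected result of 116 can be checked with
    a small userspace simulation. This is a sketch, not the test from
    lib/test_bpf.c: the run_scratch_chain() name and the exact order of
    operations per slot are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SCRATCH_SLOTS 16	/* classic BPF scratch memory M[0..15] */

/*
 * Pass a value through every scratch slot via X, adding 1 after each
 * round trip, and return the final accumulator.
 */
static uint32_t run_scratch_chain(uint32_t start)
{
	uint32_t M[SCRATCH_SLOTS] = { 0 };
	uint32_t A = start, X = 0;
	unsigned int i;

	/* keep the sketch free of wraparound */
	assert(start <= UINT32_MAX - SCRATCH_SLOTS);
	for (i = 0; i < SCRATCH_SLOTS; i++) {
		M[i] = A;	/* st  M[i] */
		X = M[i];	/* ldx M[i] */
		A = X;		/* txa      */
		A += 1;		/* add #1   */
	}
	return A;
}
```

    Starting from 100, sixteen increments give 116; a JIT that mishandles
    any one slot would be expected to produce a different value.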
     

28 May, 2014

1 commit


24 May, 2014

5 commits

  • Conflicts:
    drivers/net/bonding/bond_alb.c
    drivers/net/ethernet/altera/altera_msgdma.c
    drivers/net/ethernet/altera/altera_sgdma.c
    net/ipv6/xfrm6_output.c

    Several cases of overlapping changes.

    The xfrm6_output.c has a bug fix which overlaps the renaming
    of skb->local_df to skb->ignore_df.

    In the Altera TSE driver cases, the register access cleanups
    in net-next overlapped with bug fixes done in net.

    Similarly a bug fix to send ALB packets in the bonding driver using
    the right source address overlaps with cleanups in net-next.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch adds three more test cases:

    1) long jumps with holes of unreachable code
    2) ret x
    3) ldx + ret x

    All three tests are for classic BPF and make sure that
    changes do not break exotic behaviour that has probably
    existed for decades. The last two tests are expected to
    be rejected by the BPF checker already, as in classic BPF
    only K or A are allowed to be returned. Thus, there are
    now 52 test cases for BPF.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This patch simplifies and refactors the test case code a
    bit and also adds a summary of all tests that passed or
    failed in the kernel log, so that it's easier to spot if
    something has failed.

    Future work could further extend the test framework to also
    support different input 'stimuli' i.e. related structures
    to seccomp.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • The sk_unattached_filter_create() API is used by BPF filters that
    are not directly attached or related to sockets, and are used in
    team, ptp, xt_bpf, cls_bpf, etc. As such all users do their own
    internal management of obtaining filter blocks and thus already
    have them in kernel memory and set up before calling into
    sk_unattached_filter_create(). As a result, due to __user annotation
    in sock_fprog, sparse triggers false positives (incorrect type in
    assignment [different address space]) when filters are set up before
    passing them to sk_unattached_filter_create(). Therefore, let
    sk_unattached_filter_create() API use sock_fprog_kern to overcome
    this issue.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Older gcc's (mine is gcc-4.4.4) make a mess of this.

    lib/test_bpf.c:74: error: unknown field 'insns' specified in initializer
    lib/test_bpf.c:75: warning: missing braces around initializer
    lib/test_bpf.c:75: warning: (near initialization for 'tests[0]..insns[0]')
    lib/test_bpf.c:76: error: extra brace group at end of initializer
    lib/test_bpf.c:76: error: (near initialization for 'tests[0].')
    lib/test_bpf.c:76: warning: excess elements in union initializer
    lib/test_bpf.c:76: warning: (near initialization for 'tests[0].')
    lib/test_bpf.c:77: error: extra brace group at end of initializer

    Cc: Alexei Starovoitov
    Cc: David S. Miller
    Signed-off-by: Andrew Morton
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Andrew Morton
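
    One way to sidestep initializer errors of this kind is to give the
    union a name, so that designated initializers spell out the full
    path to the member. The sketch below assumes that approach; the
    struct test_sketch layout and the member name "u" are illustrative
    and not taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>

struct test_sketch {
	const char *descr;
	union {
		unsigned short insns[4];
		unsigned int insns_int[4];
	} u;	/* named, so initializers can say .u.insns explicitly */
};

static const struct test_sketch tests_sketch[] = {
	{ .descr = "first",  .u.insns = { 1, 2, 3, 4 } },
	{ .descr = "second", .u.insns_int = { 5, 6 } },
};

/* Return the first instruction word of a test entry. */
static unsigned short first_insn(const struct test_sketch *t)
{
	assert(t != NULL);
	return t->u.insns[0];
}
```

    The cost is one extra ".u" at each use; in exchange the initializer
    no longer depends on compiler support for designated initializers
    of anonymous union members.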
     

22 May, 2014

1 commit

  • Kernel API for classic BPF socket filters is:

    sk_unattached_filter_create() - validate classic BPF, convert, JIT
    SK_RUN_FILTER() - run it
    sk_unattached_filter_destroy() - destroy socket filter

    Clean up the internal BPF kernel API as follows:

    sk_filter_select_runtime() - final step of internal BPF creation.
    Try to JIT internal BPF program, if JIT is not available select interpreter
    SK_RUN_FILTER() - run it
    sk_filter_free() - free internal BPF program

    Disallow direct calls to BPF interpreter. Execution of the BPF program should
    be done with SK_RUN_FILTER() macro.

    Example of internal BPF create, run, destroy:

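    struct sk_filter *fp;

    /* allocate room for prog_len internal BPF instructions */
    fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
    if (!fp)
            return -ENOMEM;
    memcpy(fp->insnsi, prog, prog_len * sizeof(fp->insnsi[0]));
    fp->len = prog_len;

    /* try to JIT the program, otherwise select the interpreter */
    sk_filter_select_runtime(fp);

    SK_RUN_FILTER(fp, ctx);

    sk_filter_free(fp);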

    Sockets, seccomp, testsuite, tracing are using different ways to populate
    sk_filter, so first steps of program creation are not common.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

17 May, 2014

1 commit

  • This eliminates a 1-bit left shift in every single caller,
    and makes the inner loop of the CRC computation more efficient.

    Renamed crc7 to crc7_be (big-endian) since the interface changed.

    Also purged #include from files that don't use it at all.

    Signed-off-by: George Spelvin
    Reviewed-by: Pavel Machek
    Acked-by: Ulf Hansson
    Signed-off-by: John W. Linville

    George Spelvin
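
    To illustrate what keeping the CRC left-aligned means, here is a
    bitwise userspace sketch of a CRC-7 (polynomial x^7 + x^3 + 1) whose
    7-bit remainder lives in bits 7..1 of the result. The name
    crc7_be_sketch() is hypothetical, and the kernel implementation may
    well be table-driven rather than bitwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-7 with the remainder kept in the upper seven bits of the
 * returned byte; bit 0 is always zero. 0x12 is the polynomial 0x09
 * shifted left by one to match that alignment.
 */
static uint8_t crc7_be_sketch(uint8_t crc, const uint8_t *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	while (len--) {
		int bit;

		crc ^= *buf++;
		for (bit = 0; bit < 8; bit++)
			crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x12)
					   : (uint8_t)(crc << 1);
	}
	return crc;
}
```

    A caller that needs the CRC in the top seven bits followed by a
    trailing 1 bit can then write "crc | 1" instead of "(crc << 1) | 1".
    For the five bytes 0x40 0x00 0x00 0x00 0x00 the sketch returns 0x94,
    so the framed byte is 0x95.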
     

14 May, 2014

1 commit

  • Fix build when CONFIG_NET is not enabled.
    Fixes these build errors:

    WARNING: "sk_unattached_filter_destroy" [lib/test_bpf.ko] undefined!
    WARNING: "kfree_skb" [lib/test_bpf.ko] undefined!
    WARNING: "sk_unattached_filter_create" [lib/test_bpf.ko] undefined!
    WARNING: "sk_run_filter_int_skb" [lib/test_bpf.ko] undefined!
    WARNING: "__alloc_skb" [lib/test_bpf.ko] undefined!

    Signed-off-by: Randy Dunlap
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Randy Dunlap
     

12 May, 2014

2 commits

  • All tests should pass with and without JIT.

    Example output:
    test_bpf: #0 TAX 35 16 16 PASS
    test_bpf: #1 TXA 7 7 7 PASS
    test_bpf: #2 ADD_SUB_MUL_K 10 PASS
    test_bpf: #3 DIV_KX 33 PASS
    test_bpf: #4 AND_OR_LSH_K 10 10 PASS
    test_bpf: #5 LD_IND 8 8 8 PASS
    test_bpf: #6 LD_ABS 8 8 8 PASS
    test_bpf: #7 LD_ABS_LL 13 14 PASS
    test_bpf: #8 LD_IND_LL 12 12 12 PASS
    test_bpf: #9 LD_ABS_NET 10 12 PASS
    test_bpf: #10 LD_IND_NET 11 12 12 PASS
    ...

    Numbers are times in nsec per filter for given input data.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • The testsuite covers classic and internal BPF instructions.
    It is particularly useful for JIT compiler developers.
    Adds to "net" selftest target.

    The testsuite can be used as a set of micro-benchmarks.
    It measures execution time of each BPF program in nsec.

    This patch adds core framework.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

06 May, 2014

1 commit


19 Apr, 2014

1 commit

  • This appears to be a copy/paste error. Update the description to
    reflect extra rbtree debug and checks for the config option instead of
    duplicating CONFIG_DEBUG_VM.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

13 Apr, 2014

2 commits

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     
  • Pull audit updates from Eric Paris.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
    audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
    audit: do not cast audit_rule_data pointers pointlesly
    AUDIT: Allow login in non-init namespaces
    audit: define audit_is_compat in kernel internal header
    kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
    sched: declare pid_alive as inline
    audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
    syscall_get_arch: remove useless function arguments
    audit: remove stray newline from audit_log_execve_info() audit_panic() call
    audit: remove stray newlines from audit_log_lost messages
    audit: include subject in login records
    audit: remove superfluous new- prefix in AUDIT_LOGIN messages
    audit: allow user processes to log from another PID namespace
    audit: anchor all pid references in the initial pid namespace
    audit: convert PPIDs to the inital PID namespace.
    pid: get pid_t ppid of task in init_pid_ns
    audit: rename the misleading audit_get_context() to audit_take_context()
    audit: Add generic compat syscall support
    audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
    ...

    Linus Torvalds
     

09 Apr, 2014

1 commit

  • I got a bug report yesterday from Laszlo Ersek in which he states that
    his kvm instance fails to suspend. Laszlo bisected it down to this
    commit 1cf7e9c68fe8 ("virtio_blk: blk-mq support") where virtio-blk is
    converted to use the blk-mq infrastructure.

    After digging a bit, it became clear that the issue was with the queue
    drain. blk-mq tracks queue usage in a percpu counter, which is
    incremented on request alloc and decremented when the request is freed.
    The initial hunt was for an inconsistency in blk-mq, but everything
    seemed fine. In fact, the counter only returned crazy values when
    suspend was in progress.

    When a CPU is unplugged, the percpu counter merges that CPU's state with
    the general state. blk-mq takes care to register a hotcpu notifier with
    the appropriate priority, so we know it runs after the percpu counter
    notifier. However, the percpu counter notifier only merges the state
    when the CPU is fully gone. This leaves a state transition where the
    CPU going away is no longer in the online mask, yet it still holds
    private values. This means that in this state, percpu_counter_sum()
    returns invalid results, and the suspend then hangs waiting for
    abs(dead-cpu-value) requests to complete which of course will never
    happen.

    Fix this by clearing the state earlier, so we never have a case where
    the CPU isn't in the online mask but still holds private state. This bug
    has been there since forever; I guess we don't have a lot of users where
    percpu counters need to be reliable during the suspend cycle.

    Signed-off-by: Jens Axboe
    Reported-by: Laszlo Ersek
    Tested-by: Laszlo Ersek
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
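
    The window described above can be reproduced with a toy model of a
    per-CPU counter. Everything here is a simplified assumption for
    illustration (fixed CPU count, no locking, hypothetical names); it
    is not the kernel's percpu_counter code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4

struct pcpu_counter_sketch {
	long count;			/* shared, already merged part */
	long delta[NR_CPUS_SKETCH];	/* per-CPU private part */
	bool online[NR_CPUS_SKETCH];
};

/* Sum as a reader would: shared count plus deltas of online CPUs only. */
static long counter_sum(const struct pcpu_counter_sketch *c)
{
	long sum = c->count;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (c->online[cpu])
			sum += c->delta[cpu];
	return sum;
}

/* Fold one CPU's private delta into the shared count. */
static void counter_fold(struct pcpu_counter_sketch *c, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	c->count += c->delta[cpu];
	c->delta[cpu] = 0;
}
```

    If a CPU holding a delta of -1 is cleared from the online mask before
    counter_fold() runs, counter_sum() is off by one until the fold
    happens. Folding before (or together with) the mask update closes
    that window.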
     

08 Apr, 2014

5 commits

  • We define a check function in order to avoid trouble with the include
    files. Then the higher level __this_cpu macros are modified to invoke
    the preemption check.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Lameter
    Acked-by: Ingo Molnar
    Cc: Tejun Heo
    Tested-by: Grygorii Strashko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If the renamed symbol is defined, lib/iomap.c implements ioport_map and
    ioport_unmap, and currently (nearly) all platforms define the port
    accessor functions outb/inb and friends unconditionally. So
    HAS_IOPORT_MAP is the better name for this.

    Consequently NO_IOPORT is renamed to NO_IOPORT_MAP.

    The motivation for this change is to reintroduce a symbol HAS_IOPORT
    that signals if outb/inb et al are available. I will address that at
    least one merge window later though to keep surprises to a minimum and
    catch new introductions of (HAS|NO)_IOPORT.

    The changes in this commit were done using:

    $ git grep -l -E '(NO|HAS)_IOPORT' | xargs perl -p -i -e 's/\b((?:CONFIG_)?(?:NO|HAS)_IOPORT)\b/$1_MAP/'

    Signed-off-by: Uwe Kleine-König
    Acked-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • This can greatly aid in narrowing down the real source of initramfs
    problems such as failures related to the compression of the in-kernel
    initramfs when an external initramfs is in use as well. Existing errors
    are ambiguous as to which initramfs is a problem and why.

    [akpm@linux-foundation.org: use pr_debug()]
    Signed-off-by: Daniel M. Weeks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel M. Weeks
     
  • Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)

    The rcu_assign_pointer() ensures that the initialization of a structure
    is carried out before storing a pointer to that structure. And in the
    case of the NULL pointer, there is no structure to initialize.

    So, rcu_assign_pointer(p, NULL) can be safely converted to
    RCU_INIT_POINTER(p, NULL)

    Signed-off-by: Monam Agarwal
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Monam Agarwal
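
    The reasoning can be illustrated with C11 atomics as a rough
    userspace analogy. This is an assumption-laden sketch: a release
    store stands in for rcu_assign_pointer(), a relaxed store stands in
    for RCU_INIT_POINTER(), and an acquire load stands in for the reader
    side. The names below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct cfg_sketch {
	int value;
};

static _Atomic(struct cfg_sketch *) cur_cfg;

/*
 * Publishing a real object: the release store orders the object's
 * initialization before the pointer becomes visible to readers.
 */
static void publish_cfg(struct cfg_sketch *c)
{
	assert(c != NULL);
	atomic_store_explicit(&cur_cfg, c, memory_order_release);
}

/*
 * Clearing the pointer: NULL has no contents to order against, so a
 * store without ordering guarantees is sufficient.
 */
static void clear_cfg(void)
{
	atomic_store_explicit(&cur_cfg, (struct cfg_sketch *)NULL,
			      memory_order_relaxed);
}

/* Reader side: load the pointer before touching what it points to. */
static struct cfg_sketch *read_cfg(void)
{
	return atomic_load_explicit(&cur_cfg, memory_order_acquire);
}
```

    A reader either sees a fully initialized object or NULL; the cheaper
    store in clear_cfg() does not change that.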
     
  • Remove no longer used deprecated code, and make local functions
    static.

    Signed-off-by: Stephen Hemminger
    Acked-by: Jean Delvare
    Acked-by: Tejun Heo
    Cc: Jeff Layton
    Cc: Philipp Reisner
    Cc: Jens Axboe
    Cc: George Spelvin
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Hemminger
     

04 Apr, 2014

8 commits

  • Include the appropriate header file include/linux/decompress/inflate.h in
    lib/decompress_inflate.c because it declares the prototype of a
    function defined in lib/decompress_inflate.c.

    Also, fix the guard around the header file
    include/linux/decompress/inflate.h to use a more unique guard symbol.
    This avoids conflict with the INFLATE_H defined by
    zlib_inflate/inflate.h.

    This eliminates the following warning in lib/decompress_inflate.c:

    lib/decompress_inflate.c:35:17: warning: no previous prototype for `gunzip' [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rashika Kheria
     
  • Add prototype declarations of functions in lib/clz_ctz.c. These
    functions are required by GCC builtins and hence cannot be removed
    despite appearing unreferenced in the kernel source.

    This eliminates the following warning in lib/clz_ctz.c:

    lib/clz_ctz.c:16:12: warning: no previous prototype for `__ctzsi2' [-Wmissing-prototypes]
    lib/clz_ctz.c:22:12: warning: no previous prototype for `__clzsi2' [-Wmissing-prototypes]
    lib/clz_ctz.c:44:12: warning: no previous prototype for `__clzdi2' [-Wmissing-prototypes]
    lib/clz_ctz.c:50:12: warning: no previous prototype for `__ctzdi2' [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Acked-by: Chanho Min
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rashika Kheria
     
  • These are just some very minor and misc cleanups in the PRNG. In
    prandom_u32() we store the result in an unsigned long, which is
    unnecessary since prandom_u32_state() returns a u32.
    prandom_bytes_state()'s comment is written in kdoc style, so mark it
    as a kdoc comment as is done everywhere else. Also, use the
    normal comment style for the header comment. Last but not least, add
    some newlines for readability.

    Signed-off-by: Daniel Borkmann
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Borkmann
     
  • Having a discussion about sparse warnings in the kernel, and that we
    should clean them up, I decided to pick a random file to do so. This
    happened to be devres.c which gives the following warnings:

    CHECK lib/devres.c
    lib/devres.c:83:9: warning: cast removes address space of expression
    lib/devres.c:117:31: warning: incorrect type in return expression (different address spaces)
    lib/devres.c:117:31: expected void [noderef] *
    lib/devres.c:117:31: got void *
    lib/devres.c:125:31: warning: incorrect type in return expression (different address spaces)
    lib/devres.c:125:31: expected void [noderef] *
    lib/devres.c:125:31: got void *
    lib/devres.c:136:26: warning: incorrect type in assignment (different address spaces)
    lib/devres.c:136:26: expected void [noderef] *[assigned] dest_ptr
    lib/devres.c:136:26: got void *
    lib/devres.c:226:9: warning: cast removes address space of expression

    Mostly it's just the use of typecasting to void * without adding
    __force, or returning ERR_PTR(-ESOMEERR) without typecasting to a
    __iomem type.

    I added a helper macro IOMEM_ERR_PTR() that does the typecast to make
    the code a little nicer than adding ugly typecasts to the code.

    Signed-off-by: Steven Rostedt
    Cc: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
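
    The idea of the helper can be sketched outside the kernel as follows.
    This is an illustration under assumptions: __iomem and __force are
    sparse annotations, defined empty here so a plain C compiler accepts
    the code, and every name carrying a _sketch or _SKETCH suffix is
    hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Sparse-only annotations; empty for an ordinary compiler. */
#ifndef __iomem
#define __iomem
#endif
#ifndef __force
#define __force
#endif

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr_sketch(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO_SKETCH);
	return (void *)(intptr_t)error;
}

/* True if the pointer falls in the range reserved for error codes. */
static inline int is_err_sketch(const void __iomem *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

/* The cast lives in one place instead of at every return statement. */
#define IOMEM_ERR_PTR_SKETCH(err) \
	((__force void __iomem *)err_ptr_sketch(err))

/* Hypothetical mapping function that can fail. */
static void __iomem *map_region_sketch(int fail)
{
	static char fake_region[16];

	if (fail)
		return IOMEM_ERR_PTR_SKETCH(-ENOMEM);
	return (void __iomem *)fake_region;
}
```

    Under sparse, the macro is what keeps the error return in the __iomem
    address space; with a regular compiler the annotations vanish and
    only the error-pointer encoding remains.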
     
  • All in-kernel users of %n in format strings have now been removed and
    the %n directive is ignored. Remove the handling of %n so that it is
    treated the same as any other invalid format string directive. Keep a
    warning in place to deter new instances of %n in format strings.

    Signed-off-by: Ryan Mallon
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryan Mallon
     
  • It is only used by procfs and procfs cannot be a module.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Currently kobject_uevent has somewhat unpredictable semantics. The
    point is, since it may call a usermode helper and wait for it to execute
    (UMH_WAIT_EXEC), it is impossible to say for sure what lock dependencies
    it will introduce for the caller - strictly speaking it depends on what
    fs the binary is located on and the set of locks fork may take. There
    are quite a few kobject_uevent's users that do not take this into
    account and call it with various mutexes taken, e.g. rtnl_mutex,
    net_mutex, which might potentially lead to a deadlock.

    Since there is actually no reason to wait for the usermode helper to
    execute there, let's make kobject_uevent start the helper asynchronously
    with the aid of the UMH_NO_WAIT flag.

    Personally, I'm interested in this, because I really want kobject_uevent
    to be called under the slab_mutex in the slub implementation as it used
    to be some time ago, because it greatly simplifies synchronization and
    automatically fixes a kmemcg-related race. However, there was a
    deadlock detected on an attempt to call kobject_uevent under the
    slab_mutex (see https://lkml.org/lkml/2012/1/14/45), which was reported
    to be fixed by releasing the slab_mutex for kobject_uevent.

    Unfortunately, there was no information about who exactly blocked on the
    slab_mutex causing the usermode helper to stall, nor have I managed
    to find this out or reproduce the issue.

    BTW, this is not the first attempt to make kobject_uevent use
    UMH_NO_WAIT. The previous one was made by commit f520360d93cd ("kobject:
    don't block for each kobject_uevent"), but it was wrong (it passed
    arguments allocated on stack to async thread) so it was reverted in
    05f54c13cd0c ("Revert "kobject: don't block for each kobject_uevent".").
    It was aimed at speeding up the boot process, though.

    Signed-off-by: Vladimir Davydov
    Cc: Greg KH
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Previously, page cache radix tree nodes were freed after reclaim emptied
    out their page pointers. But now reclaim stores shadow entries in their
    place, which are only reclaimed when the inodes themselves are
    reclaimed. This is problematic for bigger files that are still in use
    after they have a significant amount of their cache reclaimed, without
    any of those pages actually refaulting. The shadow entries will just
    sit there and waste memory. In the worst case, the shadow entries will
    accumulate until the machine runs out of memory.

    To get this under control, the VM will track radix tree nodes
    exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
    rather than global because we expect the radix tree nodes themselves to
    be allocated node-locally and we want to reduce cross-node references of
    otherwise independent cache workloads. A simple shrinker will then
    reclaim these nodes on memory pressure.

    A few things need to be stored in the radix tree node to implement the
    shadow node LRU and allow tree deletions coming from the list:

    1. There is no index available that would describe the reverse path
    from the node up to the tree root, which is needed to perform a
    deletion. To solve this, encode in each node its offset inside the
    parent. This can be stored in the unused upper bits of the same
    member that stores the node's height at no extra space cost.

    2. The number of shadow entries needs to be counted in addition to the
    regular entries, to quickly detect when the node is ready to go to
    the shadow node LRU list. The current entry count is an unsigned
    int but the maximum number of entries is 64, so a shadow counter
    can easily be stored in the unused upper bits.

    3. Tree modification needs tree lock and tree root, which are located
    in the address space, so store an address_space backpointer in the
    node. The parent pointer of the node is in a union with the 2-word
    rcu_head, so the backpointer comes at no extra cost as well.

    4. The node needs to be linked to an LRU list, which requires a list
    head inside the node. This does increase the size of the node, but
    it does not change the number of objects that fit into a slab page.

    [akpm@linux-foundation.org: export the right function]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
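
    Point 2 of the list above, storing a second counter in the unused
    upper bits of the entry count, can be sketched like this. The field
    layout, constants, and helper names are assumptions chosen for
    illustration and are not copied from the radix tree code.

```c
#include <assert.h>

#define MAP_SHIFT_SKETCH   6				/* 64 slots per node */
#define MAP_SIZE_SKETCH    (1U << MAP_SHIFT_SKETCH)
#define COUNT_SHIFT_SKETCH (MAP_SHIFT_SKETCH + 1)	/* 7 bits hold 0..64 */
#define COUNT_MASK_SKETCH  ((1U << COUNT_SHIFT_SKETCH) - 1)

struct node_sketch {
	/* low 7 bits: regular entries; bits above: shadow entries */
	unsigned int count;
};

static unsigned int node_pages(const struct node_sketch *n)
{
	return n->count & COUNT_MASK_SKETCH;
}

static unsigned int node_shadows(const struct node_sketch *n)
{
	return n->count >> COUNT_SHIFT_SKETCH;
}

static void node_add_page(struct node_sketch *n)
{
	assert(node_pages(n) + node_shadows(n) < MAP_SIZE_SKETCH);
	n->count += 1U;
}

static void node_del_page(struct node_sketch *n)
{
	assert(node_pages(n) > 0);
	n->count -= 1U;
}

static void node_add_shadow(struct node_sketch *n)
{
	assert(node_pages(n) + node_shadows(n) < MAP_SIZE_SKETCH);
	n->count += 1U << COUNT_SHIFT_SKETCH;
}

/* A node holding only shadow entries is a candidate for the LRU list. */
static int node_only_shadows(const struct node_sketch *n)
{
	return node_pages(n) == 0 && node_shadows(n) > 0;
}
```

    Because both counters share one unsigned int, the check for "only
    shadow entries left" needs no extra field in the node.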