Skip to content

Broadcasting is much slower than necessary #400

Open
@ranocha

Description

@ranocha

Describe the bug 🐞

The documentation of ArrayPartition states

Broadcasting like f.(A) is efficient.

However, I found that broadcasting appears to be much slower than it could be - an order of magnitude in the example below.

Expected behavior

Broadcasting is as fast as the hand-written version below.

Minimal Reproducible Example 👇

using BenchmarkTools: @benchmark
using FastBroadcast: @..
using MuladdMacro: @muladd
using RecursiveArrayTools: ArrayPartition, npartitions

function run_benchmarks(num_arrays, num_elements)
    u0 = ArrayPartition(ntuple(_ -> randn(num_elements), num_arrays)...)
    u1 = ArrayPartition(ntuple(_ -> randn(num_elements), num_arrays)...)
    u2 = ArrayPartition(ntuple(_ -> randn(num_elements), num_arrays)...)
    u3 = ArrayPartition(ntuple(_ -> randn(num_elements), num_arrays)...)
    u4 = ArrayPartition(ntuple(_ -> randn(num_elements), num_arrays)...)

    a1 = randn()
    a2 = randn()
    a3 = randn()

    @info "Base broadcast"
    display(@benchmark base_broadcast!($u0, $u1, $u2, $u3, $u4, $a1, $a2, $a3))
    sleep(1)
    @info "FastBroadcast.jl"
    display(@benchmark fast_broadcast!($u0, $u1, $u2, $u3, $u4, $a1, $a2, $a3))
    sleep(1)
    @info "Own base broadcast"
    display(@benchmark own_base_broadcast!($u0, $u1, $u2, $u3, $u4, $a1, $a2, $a3))
    sleep(1)
    @info "Own fast broadcast"
    display(@benchmark own_fast_broadcast!($u0, $u1, $u2, $u3, $u4, $a1, $a2, $a3))

    return nothing
end

@muladd @noinline function base_broadcast!(u0, u1, u2, u3, u4, a1, a2, a3)
    @. u0 = a1 * u1 + a2 * u2 + a3 * u3 + u4
    return nothing
end

@muladd @noinline function fast_broadcast!(u0, u1, u2, u3, u4, a1, a2, a3)
    @.. u0 = a1 * u1 + a2 * u2 + a3 * u3 + u4
    return nothing
end

@muladd @noinline function own_base_broadcast!(u0, u1, u2, u3, u4, a1, a2, a3)
    for i in npartitions(u0, u1, u2, u3, u4)
        @. u0.x[i] = a1 * u1.x[i] + a2 * u2.x[i] + a3 * u3.x[i] + u4.x[i]
    end
    return nothing
end

@muladd @noinline function own_fast_broadcast!(u0, u1, u2, u3, u4, a1, a2, a3)
    for i in npartitions(u0, u1, u2, u3, u4)
        @.. u0.x[i] = a1 * u1.x[i] + a2 * u2.x[i] + a3 * u3.x[i] + u4.x[i]
    end
    return nothing
end

Error & Stacktrace ⚠️

julia> run_benchmarks(3, 1000)
[ Info: Base broadcast
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min  max):  2.018 μs    4.394 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.093 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.145 μs ± 216.232 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▅▁██▂                 ▁ ▁▂ ▂                               ▂
  ██████▄▁▁▁▁▁▁▁▁▃▄▁▁▁▆▆▇███████▇█▆▆▆▄▄▄▄▁▅▄▄▇▆▇▆▇▇▅▆▆▅▆▅▇▆█▇ █
  2.02 μs      Histogram: log(frequency) by time      3.16 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: FastBroadcast.jl
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min  max):  2.023 μs    4.519 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.093 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.107 μs ± 108.084 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▃▃  ▆█▇▅▁▂                                                 ▂
  ███▇▅██████▇▆▅▆▄▆▄▃▄▃▄▆▁▅▅▆▄▃▄▄▃▅▁▃▄▄▅▃▄▄▆▅▄▄▅▆▄▅▅▆▆▆▄▇▆▆▇▆ █
  2.02 μs      Histogram: log(frequency) by time      2.62 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Own base broadcast
BenchmarkTools.Trial: 10000 samples with 496 evaluations.
 Range (min  max):  221.101 ns  431.115 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     227.823 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   232.615 ns ±   8.997 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁█▄▄▂▂▃  ▁▁▃▁▁▁▂▁   ▁▁▁▁▁▁▁   ▁                       ▁
  ▆▃▂▅▃▂▃▄█████████████████████████████████████▇██▇▇▆▆▆▆▇▆▆▆▇▅▆ █
  221 ns        Histogram: log(frequency) by time        263 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Own fast broadcast
BenchmarkTools.Trial: 10000 samples with 511 evaluations.
 Range (min  max):  218.280 ns  419.112 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     224.967 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   229.016 ns ±   8.731 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂       █▇▄▅▃▃▂▂▂▂▄▂▂▃▂▁ ▂▁▁▁▂▂▁▁▁    ▁  ▁ ▁                  ▂
  █▁▃▄▁▃▁▄███████████████████████████▇███████████▇▇▇▇█▇▇▇▇▆▇▇▆▆ █
  218 ns        Histogram: log(frequency) by time        262 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Environment (please complete the following information):

  • Output of using Pkg; Pkg.status()
julia> using Pkg; Pkg.status()
Status `~/..../Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
  [7034ab61] FastBroadcast v0.3.5
  [46d2c3a1] MuladdMacro v0.2.4
  [e4faabce] PProf v3.1.0
  [c46f51b8] ProfileView v1.8.0
  [731186ca] RecursiveArrayTools v3.27.0
  • Output of using Pkg; Pkg.status(; mode = PKGMODE_MANIFEST)
  [1520ce14] AbstractTrees v0.4.5
  [7d9f7c33] Accessors v0.1.37
  [79e6a3ab] Adapt v4.0.4
  [4fba245c] ArrayInterface v7.15.0
  [6e4b80f9] BenchmarkTools v1.5.0
  [d1d4a3ce] BitFlags v0.1.9
  [62783981] BitTwiddlingConvenienceFunctions v0.1.6
  [e1450e63] BufferedStreams v1.2.2
  [fa961155] CEnum v0.5.0
  [2a0fbf3d] CPUSummary v0.2.6
  [159f3aea] Cairo v1.1.0
  [fb6a15b2] CloseOpenIntervals v0.1.13
  [944b1d66] CodecZlib v0.7.6
  [3da002f7] ColorTypes v0.11.5
  [5ae59095] Colors v0.12.11
  [f70d9fcc] CommonWorldInvalidations v1.0.0
  [34da2185] Compat v4.16.0
  [a33af91c] CompositionsBase v0.1.2
  [187b0558] ConstructionBase v1.5.6
  [adafc99b] CpuId v0.3.1
  [9a962f9c] DataAPI v1.16.0
  [e2d170a0] DataValueInterfaces v1.0.0
  [ffbed154] DocStringExtensions v0.9.3
  [4e289a0a] EnumX v1.0.4
  [e2ba6199] ExprTools v0.1.10
  [7034ab61] FastBroadcast v0.3.5
  [5789e2e9] FileIO v1.16.3
  [53c48c17] FixedPointNumbers v0.8.5
  [08572546] FlameGraphs v1.0.0
  [46192b85] GPUArraysCore v0.1.6
  [a2bd30eb] Graphics v1.1.2
⌃ [9db2cae5] Gtk4 v0.6.1
  [8710efd8] GtkObservables v2.1.1
  [615f187c] IfElse v0.1.1
  [9b13fd28] IndirectArrays v1.0.0
  [8197267c] IntervalSets v0.7.10
  [3587e190] InverseFunctions v0.1.16
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.5.0
  [682c06a0] JSON v0.21.4
  [10f19ff3] LayoutPointers v0.1.17
  [1d6d02ad] LeftChildRightSiblingTrees v0.2.0
  [1914dd2f] MacroTools v0.5.13
  [d125e4d3] ManualMemory v0.1.8
  [85b6ec6f] MethodAnalysis v0.4.13
  [46d2c3a1] MuladdMacro v0.2.4
  [77ba4419] NaNMath v1.0.2
  [510215fc] Observables v0.5.5
  [bac558e1] OrderedCollections v1.6.3
  [e4faabce] PProf v3.1.0
  [69de0a69] Parsers v2.8.1
  [f517fe37] Polyester v0.7.16
  [1d0040c9] PolyesterWeave v0.2.2
  [aea7be01] PrecompileTools v1.2.1
  [21216c6a] Preferences v1.4.3
  [c46f51b8] ProfileView v1.8.0
  [92933f4c] ProgressMeter v1.10.2
  [3349acd9] ProtoBuf v1.0.16
  [3cdcf5f2] RecipesBase v1.3.4
  [731186ca] RecursiveArrayTools v3.27.0
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.0
  [d5f540fe] RoundingIntegers v1.1.0
  [7e49a35a] RuntimeGeneratedFunctions v0.5.13
  [94e857df] SIMDTypes v0.1.0
  [6c6a2e73] Scratch v1.2.1
  [aedffcd0] Static v1.1.1
  [0d7ed370] StaticArrayInterface v1.8.0
  [1e83bf80] StaticArraysCore v1.4.3
  [7792a7ef] StrideArraysCore v0.5.7
  [2efcf032] SymbolicIndexingInterface v0.3.28
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.12.0
  [8290d209] ThreadingUtilities v0.5.2
  [3bb67fe8] TranscodingStreams v0.11.2
  [6e34b625] Bzip2_jll v1.0.8+1
  [83423d85] Cairo_jll v1.18.0+2
  [2702e6a9] EpollShim_jll v0.0.20230411+0
  [2e619515] Expat_jll v2.6.2+0
  [a3f928ae] Fontconfig_jll v2.13.96+0
  [d7e528f0] FreeType2_jll v2.13.2+0
  [559328eb] FriBidi_jll v1.0.14+0
  [6ebb71f1] GTK4_jll v4.12.5+0
  [78b55507] Gettext_jll v0.21.0+0
  [7746bdde] Glib_jll v2.80.2+0
  [75302f13] Graphene_jll v1.10.6+0
  [3b182d85] Graphite2_jll v1.3.14+0
  [3c863552] Graphviz_jll v2.50.0+1
  [2e76f6c2] HarfBuzz_jll v2.8.1+1
  [aacddb02] JpegTurbo_jll v3.0.3+0
⌅ [88015f11] LERC_jll v3.0.0+1
  [1d63c593] LLVMOpenMP_jll v17.0.6+0
  [dd4b983a] LZO_jll v2.10.2+0
  [42c93a91] Libepoxy_jll v1.5.10+0
⌅ [e9f186c6] Libffi_jll v3.2.2+1
  [d4300ac3] Libgcrypt_jll v1.8.11+0
  [7e76a0d4] Libglvnd_jll v1.6.0+0
  [7add5ba3] Libgpg_error_jll v1.49.0+0
  [94ce4f54] Libiconv_jll v1.17.0+0
  [4b2f31a3] Libmount_jll v2.40.1+0
  [925c91fb] Librsvg_jll v2.54.5+0
  [89763e89] Libtiff_jll v4.6.0+0
  [38a345b3] Libuuid_jll v2.40.1+0
  [36c8627f] Pango_jll v1.52.2+0
  [30392449] Pixman_jll v0.43.4+0
  [a2964d1f] Wayland_jll v1.21.0+1
  [2381bf8a] Wayland_protocols_jll v1.31.0+0
  [02c8fc9c] XML2_jll v2.13.1+0
  [aed1982a] XSLT_jll v1.1.41+0
  [ffd25f8a] XZ_jll v5.4.6+0
  [4f6342f7] Xorg_libX11_jll v1.8.6+0
  [0c0b7dd1] Xorg_libXau_jll v1.0.11+0
  [935fb764] Xorg_libXcursor_jll v1.2.0+4
  [0aeada51] Xorg_libXdamage_jll v1.1.5+4
  [a3789734] Xorg_libXdmcp_jll v1.1.4+0
  [1082639a] Xorg_libXext_jll v1.3.6+0
  [d091e8ba] Xorg_libXfixes_jll v5.0.3+4
  [a51aa0fd] Xorg_libXi_jll v1.7.10+4
  [d1454406] Xorg_libXinerama_jll v1.1.4+4
  [ec84b674] Xorg_libXrandr_jll v1.5.2+4
  [ea2f1a96] Xorg_libXrender_jll v0.9.11+0
  [14d82f49] Xorg_libpthread_stubs_jll v0.1.1+0
  [c7cfdc94] Xorg_libxcb_jll v1.17.0+0
  [cc61e674] Xorg_libxkbfile_jll v1.1.2+0
  [35661453] Xorg_xkbcomp_jll v1.4.6+0
  [33bec58e] Xorg_xkeyboard_config_jll v2.39.0+0
  [c5fb5394] Xorg_xtrans_jll v1.5.0+0
  [3161d3a3] Zstd_jll v1.5.6+0
  [b437f822] adwaita_icon_theme_jll v43.0.1+0
  [da03df04] gdk_pixbuf_jll v2.42.10+0
  [059c91fe] hicolor_icon_theme_jll v0.17.0+3
  [bf975903] iso_codes_jll v4.15.1+0
  [b53b4c65] libpng_jll v1.6.43+1
  [cf2c5f97] pprof_jll v1.0.1+0
  [d8fb68d0] xkbcommon_jll v1.4.1+1
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [b77e0a4c] InteractiveUtils
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.10.0
  [de0858da] Printf
  [9abbd945] Profile
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays v1.10.0
  [10745b16] Statistics v1.10.0
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.1.1+0
  [deac9b47] LibCURL_jll v8.4.0+0
  [e37daf67] LibGit2_jll v1.6.4+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.2+1
  [14a3606d] MozillaCACerts_jll v2023.1.10
  [4536629a] OpenBLAS_jll v0.3.23+4
  [05823500] OpenLibm_jll v0.8.1+2
  [efcefdf7] PCRE2_jll v10.42.0+1
  [bea87d4a] SuiteSparse_jll v7.2.1+1
  [83775a58] Zlib_jll v1.2.13+1
  [8e850b90] libblastrampoline_jll v5.8.0+1
  [8e850ede] nghttp2_jll v1.52.0+1
  [3f19e933] p7zip_jll v17.4.0+2
  • Output of versioninfo()
julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release

Additional context

Using Cthulhu.jl, the basic difference in the underlying method

https://github.com/JuliaLang/julia/blob/306cee71560e24d26585fd1143a2aacac41b5508/base/broadcast.jl#L961-L977

seems to be the type of bc

  • the current version with RecursiveArrayTools.jl gives bc::Base.Broadcast.Broadcasted{Nothing, Nothing, typeof(*), Tuple{Float64, Vector{Float64}}}
  • the manual version gives Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(*), Tuple{Float64, Vector{Float64}}}

For reference, the basic setup of the current broadcasting infrastructure seems to be laid in #41

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions