M
Modularβ€’8mo ago
Martin Dudek

Seeking Clarification on Current and Future Tensor Library Support in Mojo

I wonder if someone can clarify the current state and future direction of a Tensor library for Mojo. I understand that Tensor won't stay in the standard library and i think to understand the rationale behind it. We have also NuMojo, which looks very promising. It is currently based on Tensor but aims to have "Native array types" as a long-term goal. Not clear about the situation, I implemented my own vectorized yet simple Vector and Matrix structs for my KAN experiments. They work, my KAN implementaton outperforms the Python implementation i ported to Mojo (which is numpy based). However, I just found out these are extremely slow compared to torch.matmul, etc. I would greatly appreciate it if someone could clarify what is possible right now for Tensor-based applications and where Mojo is headed in that regard (let's say within this year). To be clear, I'm not demanding anything of course β€” just looking for some clarification on what is possible right now and where we are going as community. Also feedback how others handle the current situation would be great. Thx πŸ™
15 Replies
PhoToN 旭輝
PhoToN 旭輝‒8mo ago
@Martin Dudek I am not sure about the future of Tensor Library once Modular open sources it, so that's there. Since I am part of the NuMojo team, I will try my best to give you a reply based on our current work on NuMojo and the direction it is heading towards. 1) As you pointed out, the main branch of NuMojo is based on Tensors. But we are currently porting all the existing functionality and much more from numpy to our own Array type. Something like numojo.array(), numojo.ndarray() (we try to stick with numpy conventions as much as possible to make it easier to switch) will be made available very soon. 2) Although we will release our own array type, math functions, array creation and manipulation modules and more soon (which should be enough for a lot of basic functionalities), it'll take some time to build a full linear algebra library etc as we want to make sure our building block Array has enough basic functionalities before moving on. 3) Recently, Another community member has joined our "Mojo Numerics & Algorithsm Group" and is contributing to MojoSci which will include a lot of higher order functionalities like SciPy. You can find it here. (https://github.com/Mojo-Numerics-and-Algorithms-group/MojoSci). Future looks bright for Mojo and NuMojo:mojo:. Please feel free to ask any questions here or in DM. Cheers πŸͺ„
Martin Dudek
Martin DudekOPβ€’8mo ago
This sounds fantastic, and surely will be a big contribution for the Mojo community, thank you so much. :mojo: πŸ™ I don't know if this comparison makes sense at all, so i want to ask you. I compared my simple matmul implementation with the following pytorch based one, and the pytorch one is nearly 100 times faster. which came a bit of a shock. Have you done any comparison with pytorch or equivalent performant libs? Is this at all a realistic comparison. I have no clue about compiler/performance/optimization possibilities ... thanks a lot
import torch
import time

def measure_matmul_time(rows_a, cols_a, cols_b):
# Generate random matrices
A = torch.randn(rows_a, cols_a)
B = torch.randn(cols_a, cols_b)

# Warm-up
for _ in range(10):
_ = torch.matmul(A, B)

# Measure the time of a single matrix multiplication
start_time = time.time()
C = torch.matmul(A, B)
end_time = time.time()

elapsed_time = 1000. * (end_time - start_time )
print(f"Time {elapsed_time:.6f} ms")

_ = C[23,6]

measure_matmul_time(1024, 2048, 512)
import torch
import time

def measure_matmul_time(rows_a, cols_a, cols_b):
# Generate random matrices
A = torch.randn(rows_a, cols_a)
B = torch.randn(cols_a, cols_b)

# Warm-up
for _ in range(10):
_ = torch.matmul(A, B)

# Measure the time of a single matrix multiplication
start_time = time.time()
C = torch.matmul(A, B)
end_time = time.time()

elapsed_time = 1000. * (end_time - start_time )
print(f"Time {elapsed_time:.6f} ms")

_ = C[23,6]

measure_matmul_time(1024, 2048, 512)
PhoToN 旭輝
PhoToN 旭輝‒8mo ago
@Martin Dudek Sorry for my delay in reply. For anyone looking at this, I am still learning Mojo and therefore if any of the methods used for benchmarking is not right, please kindly let me know, I would be glad to learn. 1) If I remember correctly, using time() to benchmark pytorch calc is slightly inaccurate, so it's better to use pytorch benchmark utils. (You can read more about it here - https://pytorch.org/tutorials/recipes/recipes/benchmark.html) 2) Currently Mojo benchmark module for our matmul keeps crashing (it works randomly, but it's like playing russian roulette and is not reliable rn) and therefore we aren't able to use that, Instead I used the time() module in mojo (which won't be accurate as doing benchmark, but let's run with it for now and do average). You can find our benchmarks for other math funcs implemented here - https://github.com/Mojo-Numerics-and-Algorithms-group/NuMojo-Examples-and-Benchmarks/blob/main/bench_vect.txt. 3) These benchmarks are usually difficult to interpret without knowing what's going on inside the blackbox, so please take all benchmarks with a grain of NaCl. All these benchmarks were ran in my Macbook M2 - 16GB RAM. Getting those stuff out of the way, let's get to benchmarks. I am performing matmul operation on Tensors/array of size NxN as you have mentioned it. x axis of all the plots correspond to the value of N and y axis corresponds to the mean time value. I have shown Torch("cpu") and Torch("mps") and NuMojo (It's technically cpu as Mojo doesn't have GPU support yet). I will mainly compare NuMojo and Torch("cpu") to make apple to apple comparison of CPU performance and give Torch("mps") as only a reference.
No description
No description
No description
PhoToN 旭輝
PhoToN 旭輝‒8mo ago
Please refer to plots attached below few messages. 1) In the float16 plot, NuMojo implementation is on par and ever slightly so slightly faster than Torch("CPU"). 2) In the float32 and float64, Torch performs better than current NuMojo implementation. Especially in float32 since I think Torch is well optimized for that. There's still a lot of compile time optimizations and tricks we can do in Mojo to reduce overhead, we aim to take advantage of these and slowly optimize all our math implementations step by step. As of now, NuMojo in its early infancy is on par or performs ever so slightly better than Torch("cpu") in some cases and in other cases, it is only roughly an order of magnitude or so slower than Torch in some cases. With Mojo GPU support in future, better compile time optimizations and some tricks, we can hope for more improvements and close in on these differences and improve. NOTE:
1) There are small fluctuations in mean time value in every run, so I have plotted these multiple times and irrespective of fluctuations, the Behaviour stays constant. 2) There are parameters in the matmul that can affect the time such as unroll factor, parallelize size etc. I have fixed certain set of values and went along, I haven't tried optimizing these values. These plots could scale differently in different systems depending on the parameters. 3) I honestly don't know if torch.float and DType.float are similar in representation, so it might not exactly be an apple to apple comparision. If anyone knows more details about this, please do share. "Update: There was error in above plots, so I have updated this text to reflect the correct plots attached below." The code I used for Pytorch benchmark is the following:
import torch.utils.benchmark as benchmark
def batched_dot_mul_sum(a, b):
'''Computes batched dot by multiplying and summing'''
return torch.matmul(a,b)

def measure_matmul_time_pytorch(rows_a, cols_a, cols_b, dev, dtype):

# Input for benchmarking
x = torch.randn(rows_a, cols_a, dtype=dtype).to(dev)
y = torch.randn(cols_a, cols_b, dtype=dtype).to(dev)

t0 = benchmark.Timer(
stmt='batched_dot_mul_sum(x, y)',
setup='from __main__ import batched_dot_mul_sum',
globals={'x': x, 'y':y})

print("done")
return t0.timeit(50).mean

size= [10, 100, 500, 1000, 2000, 5000, 10000]
dtype = torch.float64
times = [measure_matmul_time_pytorch(s, s, s, "cpu", dtype) for s in size]
times_mps = [measure_matmul_time_pytorch(s, s, s, "mps", dtype) for s in size]
import torch.utils.benchmark as benchmark
def batched_dot_mul_sum(a, b):
'''Computes batched dot by multiplying and summing'''
return torch.matmul(a,b)

def measure_matmul_time_pytorch(rows_a, cols_a, cols_b, dev, dtype):

# Input for benchmarking
x = torch.randn(rows_a, cols_a, dtype=dtype).to(dev)
y = torch.randn(cols_a, cols_b, dtype=dtype).to(dev)

t0 = benchmark.Timer(
stmt='batched_dot_mul_sum(x, y)',
setup='from __main__ import batched_dot_mul_sum',
globals={'x': x, 'y':y})

print("done")
return t0.timeit(50).mean

size= [10, 100, 500, 1000, 2000, 5000, 10000]
dtype = torch.float64
times = [measure_matmul_time_pytorch(s, s, s, "cpu", dtype) for s in size]
times_mps = [measure_matmul_time_pytorch(s, s, s, "mps", dtype) for s in size]
and for NuMojo is the following,
import numojo
fn main() raises:
var size:VariadicList[Int] = VariadicList[Int](10,100,500,1000,2000,5000,10000)
var times:List[Float64] = List[Float64]()
alias type:DType = DType.float16
measure_time[type](size, times)

fn measure_time[dtype:DType](size:VariadicList[Int], inout times:List[Float64]) raises:

for i in range(size.__len__()):
var arr1 = numojo.array[dtype](size[i], size[i], random=True)
var arr2 = numojo.array[dtype](size[i], size[i], random=True)

var t0 = time.now()
for _ in range(50):
var arr_mul = matmul[dtype](arr1, arr2)
keep(arr_mul.unsafe_ptr())
times.append(((time.now()-t0)/1e9)/50)

for i in range(size.__len__()):
print(times[i])
import numojo
fn main() raises:
var size:VariadicList[Int] = VariadicList[Int](10,100,500,1000,2000,5000,10000)
var times:List[Float64] = List[Float64]()
alias type:DType = DType.float16
measure_time[type](size, times)

fn measure_time[dtype:DType](size:VariadicList[Int], inout times:List[Float64]) raises:

for i in range(size.__len__()):
var arr1 = numojo.array[dtype](size[i], size[i], random=True)
var arr2 = numojo.array[dtype](size[i], size[i], random=True)

var t0 = time.now()
for _ in range(50):
var arr_mul = matmul[dtype](arr1, arr2)
keep(arr_mul.unsafe_ptr())
times.append(((time.now()-t0)/1e9)/50)

for i in range(size.__len__()):
print(times[i])
Hope that helps! Cheers fellow mojicians πŸͺ„
Martin Dudek
Martin DudekOPβ€’8mo ago
Wow, thank you so much for all your efforts , this looks all very promising. I hope others also see these results. Right now numojo.arrayis not in the repo yet (correct me if i am wrong), any plans when you guys might make it available? I was honestly getting a bit unsure if Mojo is right language for me right now for the projects i want to implement (like KANs), but seeing this makes me looking at Mojo in a more positive light again Thanks a lot for that :mojo:
PhoToN 旭輝
PhoToN 旭輝‒8mo ago
@Martin Dudek you are right, we haven’t released the array (NDArray) yet as we are polishing it and doing tests to clear out edge cases. We will be releasing it soon (pretty exciting! 😁) along with many other functionalities and get some community feedback. Looking forward to see Mojo and NuMojo evolve! πŸ”₯. @Martin Dudek Update: As expected (I fell for my own trap xD, gotta be careful when benchmarking), I made a mistake in parallelization which leads to the reduction in time for larger sizes. I am adding the new plots, please find it here. I will update the text above in accordance with new plots.
PhoToN 旭輝
PhoToN 旭輝‒8mo ago
No description
No description
No description
Martin Dudek
Martin DudekOPβ€’8mo ago
thanks a lot for the update, really looking forward to learn how you guys implemented these functions to be so close to torch.cpu . :mojo:
PhoToN 旭輝
PhoToN 旭輝‒8mo ago
Thank you @Martin Dudek. We are still implementing mostly basic routines available in Mojo which can be optimized better I guess, So Mojo is doing the heavy lifting. Slowly, but surely we will close in on these differences for Matmul especially for float32, float64 operations and hopefully try to be faster someday. Other element wise operations such as add, sub, mul, exp etc are already on par or sometimes faster compared to Torch operations as you can see in the attached plot (I was careful not to make benchmarking mistake again xD) where I measured torch.mul vs numojo.mul for different NxN size of Arrays/Tensors with same specifications. The solid lines show float16, dashed shows float32, dotted shows float64 comparision for Torch("cpu") and NuMojo. All other math operations show similar trends too. Note for anyone looking at the following graph: I forgot to change plot title, it should be mul and not matmul.
No description
Martin Dudek
Martin DudekOPβ€’8mo ago
I am sure the community with its Mojo Gurus will be able to give you valuable feedback once it is published. What you are implementing is just so central for many applications πŸ™ Thanks for keeping me updated here but please dont take too much time for that. I am hooked anyway πŸ˜‰
benny
bennyβ€’8mo ago
so slightly related and I can’t give more info, but specific to matmul there is some news coming next month that will definitely interest you expect more info july 1 mind sharing the code for how you generated this graph?
PhoToN 旭輝
PhoToN 旭輝‒8mo ago
Hi @benny for benchmarking, it’s the same as that I have shared above except for the matplotlib code. Do you want the plotting part of the code?
benny
bennyβ€’8mo ago
The plotting part and where are you getting the matmul function?
PhoToN 旭輝
PhoToN 旭輝‒8mo ago
@benny , the following is the plotting code
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np

mpl.rcParams["text.latex.preamble"] = r"\usepackage{mathpazo}"
plt.rcParams["axes.linewidth"] = 2
plt.rc("text", usetex=True)
plt.rc("font", family="serif")
plt.rcParams["axes.linewidth"] = 2

fig = plt.figure(figsize=(10,6))
fig.tight_layout()

time_torch = [[],[],[]] # times for float16, float32, float64
time_mojo = [[],[],[]] # times for float16, float32, float64
size = [16, 256, 512, 1024, 2048]

# Bar width
bar_width = 0.35
index = np.arange(len(size))

# Adjusted x-axis labels for clarity
x_labels = [f"{s}" for s in size]

# Plot for float16
plt.bar(index - bar_width/2, time_torch[0], bar_width, label='Torch ("cpu)', align='center')
plt.bar(index + bar_width/2, time_mojo[0], bar_width, label='NuMojo', align='center', fill=False)

# Plot for float32
plt.bar(index - bar_width/2, time_torch[1], bar_width, align='center')
plt.bar(index + bar_width/2, time_mojo[1], bar_width, align='center', fill=False)

# Plot for float64
plt.bar(index - bar_width/2, time_torch[2], bar_width, align='center')
plt.bar(index + bar_width/2, time_mojo[2], bar_width, align='center', fill=False)

plt.xlabel('Size')
plt.ylabel('Time')
plt.yscale('log')
plt.xticks(index, x_labels) # Use adjusted x-axis labels
plt.legend()
plt.show()
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np

mpl.rcParams["text.latex.preamble"] = r"\usepackage{mathpazo}"
plt.rcParams["axes.linewidth"] = 2
plt.rc("text", usetex=True)
plt.rc("font", family="serif")
plt.rcParams["axes.linewidth"] = 2

fig = plt.figure(figsize=(10,6))
fig.tight_layout()

time_torch = [[],[],[]] # times for float16, float32, float64
time_mojo = [[],[],[]] # times for float16, float32, float64
size = [16, 256, 512, 1024, 2048]

# Bar width
bar_width = 0.35
index = np.arange(len(size))

# Adjusted x-axis labels for clarity
x_labels = [f"{s}" for s in size]

# Plot for float16
plt.bar(index - bar_width/2, time_torch[0], bar_width, label='Torch ("cpu)', align='center')
plt.bar(index + bar_width/2, time_mojo[0], bar_width, label='NuMojo', align='center', fill=False)

# Plot for float32
plt.bar(index - bar_width/2, time_torch[1], bar_width, align='center')
plt.bar(index + bar_width/2, time_mojo[1], bar_width, align='center', fill=False)

# Plot for float64
plt.bar(index - bar_width/2, time_torch[2], bar_width, align='center')
plt.bar(index + bar_width/2, time_mojo[2], bar_width, align='center', fill=False)

plt.xlabel('Size')
plt.ylabel('Time')
plt.yscale('log')
plt.xticks(index, x_labels) # Use adjusted x-axis labels
plt.legend()
plt.show()
where the time_mojo is generated from the mojo benchmark code above. time_torch is from torch running in the same .py file. As for the matmul, we are using almost same implementation as Modular one (https://docs.modular.com/mojo/notebooks/Matmul), except for the changes in store[], load[] methods where we use single index to reduce some overhead instead of the two index used in modular version. ah now I see the confusion! I'm so sorry @benny. The above graph is for torch.mul and not torch.matmul, I think the label wasn't changed while changing graph from torch.matmul. these are the plots corresponding to matmul.
benny
bennyβ€’8mo ago
got it, I see thanks for clarifying πŸ‘πŸ»

Did you find this page helpful?