Modular•2mo ago
TilliFe

Endia

Scientific Computing in Mojo :mojo: Docs/Website GitHub
Endia
Endia - Scientific Computing in Mojo 🔥
Endia: A PyTorch-like ML library with AutoDiff, complex number support, JIT-compilation and a first-class functional (JAX-like) API.
GitHub
GitHub - endia-org/Endia: A dynamic Array Library for Scientific Co...
A dynamic Array Library for Scientific Computing 🔥 - endia-org/Endia
38 Replies
benny•2mo ago
This is incredible! I was worried you had left since I haven't seen you here in so long, but I'm glad you were working on this in the background 🔥
Jack Clayton•2mo ago
Super awesome @TilliFe!
Chris Lattner•2mo ago
This is amazing work @TilliFe!
TilliFe•2mo ago
MAX and Mojo are amazing! You are creating wonderful software. Thank you.
Martin Dudek•2mo ago
super cool @TilliFe 🔥 I just ran the benchmarks on macOS

❯ max --version
max 24.4.0 (59977802) Modular version 24.4.0-59977802-release
❯ mojo --version
mojo 24.4.0 (59977802)

and noticed the loss is significantly smaller with MAX JIT compilation:

Running MLP benchmark in eager mode.
Iter: 1000 Loss: 0.22504070401191711
Total: 0.0069106340000000023 Fwd: 0.00096554800000000021 Bwd: 0.0015250959999999984 Optim: 0.0023210129999999963

Running MLP benchmark in a functional eager mode with grad:
Iter: 1000 Loss: 0.25778612494468689
Total: 0.0048792460000000003 Value_and_Grad: 0.0027390779999999994 Optim: 0.0021332430000000025

Running MLP benchmark with MAX JIT compilation:
JIT compiling a new subgraph...
Iter: 1000 Loss: 0.061800424009561539
Total: 0.022694156999999975 Value_and_Grad: 0.020552729000000027 Optim: 0.0021339400000000013
TilliFe•2mo ago
The weight initializations of the neural networks (randHe initialization) might be a bit unstable. If you run the benchmarks a couple of times, there shouldn't be any outliers on average. Can you try that? Let me know if you keep encountering these inconsistencies.
Martin Dudek•2mo ago
Here are the results of 10 runs... the loss of MAX JIT is not always the lowest; it seems to depend on the random weight initialization, as you already said. If you want to extend the benchmarks to calculate averages over multiple runs, I am happy to run another test.
TilliFe•2mo ago
Great, please go for it! Averages would be most valuable. If you feel like adding these benchmarks to the Endia nightly branch afterwards, feel free to make a pull request, preferably as separate files alongside what is already there. On a small tangent: you can also run the JIT-compiled version without MAX, which then uses the same built-in caching mechanisms but does not send the graph to the MAX compiler; the graph is run directly with Endia's ops. At the moment this should match the speed of eager execution. 🚀
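As a rough sketch of that tangent: assuming the compile_with_MAX flag (shown further down in this thread) also accepts False, running the JIT path without handing the graph to the MAX compiler might look like the line below. This is an assumption about the API, not a confirmed Endia default.
# Assumption: compile_with_MAX=False keeps Endia's built-in graph caching
# but executes the captured graph with Endia's own ops instead of MAX.
value_and_grad_fwd = nd.jit(nd.value_and_grad(fwd), compile_with_MAX=False)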
Martin Dudek•2mo ago
I can give it a try, but I would need to dig into your implementation. Basically the weights would need to be initialized for each of the 1000 loops, I assume. That seems straightforward for mlp_func and mlp_imp; for JIT, I will try 😉
TilliFe•2mo ago
You don't need to dig into the JIT mechanism. The functional version and the JIT-ed version are one and the same; the implementations differ in only a single line of code. Let's clarify. The regular functional setup works as follows:
0. Initialize all parameters with e.g. rand_he initialization (i.e. a List of nd.Arrays, with a bias and a weight array for each layer).
1. Define the forward function (fwd: a regular Mojo function that takes a List of Arrays as its argument).
2. Pass this fwd function into the nd.value_and_grad(...) function, which returns a Callable that can compute the logits/loss and the gradients of all inputs at the same time.
3. Pass all initialized params (from step 0) into this Callable. This actually does the work and returns Arrays (i.e. the loss and the gradients).
So the initialization of the weights happens before you do all the work. The only difference between this functional mode and the JIT mode is that we pass the value_and_grad Callable from step 2 into the nd.jit(...) function. So you don't need to worry about this, actually. Step 2 explicitly, the regular way (line 73 in this file):
value_and_grad_fwd = nd.value_and_grad(fwd)
vs. the JIT way (line 73 in this file):
value_and_grad_fwd = nd.jit(nd.value_and_grad(fwd), compile_with_MAX=True)
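Putting the four steps together, a minimal sketch of the whole flow might look roughly like this. Only nd.value_and_grad and nd.jit are taken from the snippets above; the parameter-initialization helper rand_he_params and the exact argument shapes are hypothetical placeholders for illustration, not verified Endia API.
# Step 0: initialize params (rand_he_params is a hypothetical helper returning a List of nd.Arrays).
params = rand_he_params(layer_sizes)
# Step 1: fwd is a regular Mojo function taking that List of Arrays and returning the loss.
# Step 2: wrap it once; the commented-out line is the single change needed for JIT mode.
value_and_grad_fwd = nd.value_and_grad(fwd)
# value_and_grad_fwd = nd.jit(nd.value_and_grad(fwd), compile_with_MAX=True)
# Step 3: call the Callable with the initialized params to get the loss and the gradients.
loss_and_grads = value_and_grad_fwd(params)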
Martin Dudek•2mo ago
@TilliFe Just committed a PR with a simple implementation of multiple runs of the benchmarks to calculate average results. Feel free to use it or modify it as needed. If it doesn't fit, just ignore it. 😉
James Usevitch•2mo ago
@TilliFe This library looks awesome! Just curious: there's a lot of focus on JIT compilation in the docs. Are there any limitations on AOT compilation for Endia?
Martin Dudek•2mo ago
Eager Mode: Iter: 10 x 1000, Avg Loss: 0.22449450194835663
Functional Eager Mode with Grad: Iter: 10 x 1000, Avg Loss: 0.28279870748519897
JIT: Iter: 10 x 1000, Avg Loss: 0.099444642663002014
TilliFe•2mo ago
@James Usevitch Thanks. AOT vs. JIT in Endia: all computation-graph-related work in Endia is fundamentally done JIT. However, Mojo itself takes a hybrid approach of mainly AOT compilation, with the possibility to do JIT compilation. When using Endia in eager mode, the main building blocks, i.e. most primitive operations (matmul, add, relu, ...), are compiled AOT and then chained together at run time. I spent a lot of time thinking about how to design Endia to be as modular as possible: you can now define (differentiable) operations and make them as large and complex as you wish. For example, have a look at the mul operation in the functional module. It is easy to see that we can define more complex functions like fma etc. with the same approach. These primitive submodules are then compiled AOT. Compare that to doing JIT compilation with MAX, where we merely capture the operations that need to be performed, send this Endia graph to the MAX compiler, let it do its magic, and then take the compiled MAX graph as a new Callable and cache it for later reuse.
GitHub
Endia/endia/functional/binary_ops/mul_op.mojo at main · endia-org/E...
Scientific Computing in Mojo 🔥. Contribute to endia-org/Endia development by creating an account on GitHub.
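To illustrate the composability point above, here is a sketch of how a larger differentiable function like fma could be built from primitives, assuming Endia exposes its primitives as functions such as nd.mul and nd.add (the fma signature and the raises annotation are assumptions, not code from the repo):
import endia as nd  # alias as used elsewhere in this thread

# Hypothetical composite op: fused multiply-add expressed via the (assumed)
# mul and add functionals, following the same approach as the primitive ops.
fn fma(a: nd.Array, b: nd.Array, c: nd.Array) raises -> nd.Array:
    return nd.add(nd.mul(a, b), c)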
TilliFe•2mo ago
Thank you for that! I looked into it and could not find any obvious reasons for these differences. However, I realized that the way the loss has been reported so far was somewhat flawed. Until now, it was simply averaged from the first to the last iteration, and since the loss can be extremely high in the first couple of iterations, the average printed at the end is not representative at all of how the loss evolved and decreased over time. In the new nightly we print out more intermediate loss values and can see that the loss actually falls fairly evenly in both the MAX execution mode and the Endia execution mode, ending up at around the same values of roughly 0.001-0.01. Nonetheless, there is still a very slight difference, and the loss of the MAX execution still tends to fall a tiny bit faster. This might have something to do with the internal implementation of the MAX ops (which I would really like to know more about). Re: your pull request, I will try to merge/integrate your changes as soon as possible. I like it.
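A minimal sketch of that reporting change (illustrative only, not the actual benchmark code), assuming the per-iteration loss values are collected in a List:
from collections import List

fn report_losses(losses: List[Float64], every: Int):
    # Print intermediate losses instead of a single average over the whole run,
    # which the very high early losses would otherwise distort.
    for i in range(len(losses)):
        if (i + 1) % every == 0:
            print("Iter:", i + 1, "Loss:", losses[i])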