Fastest way to build string from dynamic values!
In Mojo what is the fastest and most efficient way to concatenate and build string from list of dynamic string values (string literal & string objects both)?
23 Replies
I was just trying to figure out the same thing! I don't think there is currently a way? But I could be completely wrong
while the exact implementation would matter, a few ideas immediately are memcpy and gather. you can get the underlying data from a string fairly easily, after you do that it’s just a matter of joining it.
this is 100% possible though, if you give me some example code I can illustrate what I mean better
I am working on the datetime library. For that, I want to write a parser and formatter. There are also other libraries I am considering developing in Mojo. I need that functionality. I think I must try building something like the String Builder found in other languages. Is anyone working on similar project?
https://github.com/maniartech/mojo-datetime
GitHub
GitHub - maniartech/mojo-datetime: DateTime library in pure Mojo la...
DateTime library in pure Mojo language. Contribute to maniartech/mojo-datetime development by creating an account on GitHub.
I am particularly interested in how you would fetch the underlying data, handle memory arithmetic, and perform concatenation in Mojo. If you can provide some hints, I can develop it further!
that loads it into a buffer, but the data from the bitcast might also be enough
it’s a little different for String instead of StringLiteral, but it’s similar
I see, thanks
The slow bits to avoid when doing this are extra data copies and extra allocations. So you want to get the length of the concatenated result, add 1 for the null terminator, and then allocate a
DTypePointer[DType.int8]
of that size. Then you can memcpy each String or StringLiteral's pointer. A String's pointer is at ._as_ptr()
and a StringLiteral's pointer is at .data()
. You can make your result a String using its initilizer from a pointer and length.
Ignoring Variant to handle both String and StringLiteral, it is like this:
this is a 100% perfect explanation, thanks michael
could you use gather with the string pointers? i have no experience with that function
No, I don't think gather is useful in this situation. Gather is for when you want to "gather" elements together that are not next to each stored next to each other and usually following no pattern you can iterate over. Like if you want to work with the the weights in a vector whose corresponding gradients had the highest value in last pass. Find the indices you want, pass it to gather. Here it is just whole strings being copied so best to iterate over the vector and copy entire string at a time.
If you really wanted to be fast here you can actually store the offsets in the first pass getting lengths and the memcopy in parallel since none of the writes will overlap. That could look like scattering into the destination since you don't know which blocks fill first but the code will just look like linear iteration.
It prints junk bytes!
"�^(l�
When I run the following code, it prints junk.
However, when I run the following code! It prints properly! Looks like issue with String._as_ptr()
function!
The junk bytes are probably because your pointer is being created in the String’s last use. So the String is destroyed and the pointer freed before copy. If you want the original String destroyed, you can use ‘_stealptr()’. Or just assign the String to ‘’ after the memo y is done.
Underscores are being removed in my post from mobile.
I don't think that's the case! Here is whole program derived from your
contact_str
function. Or maybe I am not able to see it!
This looks like a bug were a String stored in a DynamicVector does not act like a String not held in a DynamicVector. Some sort of ownership fight I guess. A workaround, that defeats the purpose because it makes a copy, is just to use
var tmp = String(vec[i])
and then not usetmp
where you now use vec[i]
. There may be more elaborate ways to make it give up the String's pointer for copying by constructing a DTypePointer from the DynamicVector's AnyPointer but I am not sure.
You can also see whatever issue this is by trying Reference(vec[i])
which fails while you can easily create a Reference(tmp)
when tmp
is a copy of vec[i]
.
It is also possible that this method is not slow since you aren't actually mutating the String from the DynamicVector, but I think the compiler is not that smart as treats looking at the String's pointers as a mutation which forces the copy.
This is a more elaborate way to make the DynamicVector give you the string without a copy, let tmp = __get_address_as_lvalue((vec.data + i).value)
. I think it creates an immutable shallow copy so it will be fast.Wow! It works!
Still there are some questions to be answered to craft the perfect solution. Here are some of the points that I am considering!
* How does
DynamicVector
work? One of the area I am concerned about memory resize when new strings are pushed to the StringBuilder
. If DynamicVector
is not right data structure for this purpose, we can alternatively utilize resizable Buffer
. That will give good control over the strategy we employ for memory resizing!
* My second concern is how to handle various String types. StringLiteral
, String
, StringRef
etc... What about Stringable
?
Also, when I benchmarked this code against concatenation using + operator. It failed significantly!
Benchmark Code
Result
When I moved to benchmark code into the StringBuilder.__str__
function. I got this result!
That means, calling through __str__
poses significant overheadI think String is built on DynamicVector which can hold anything. So even though String's data is
I think the problem will just disappear once you can specialize by conformance. Then all DynamicVectors holding trivial types will just use memcpy.
int8
it uses slow copying by iterating through each element rather than memcpy
. The slowdown is much worse for longer Strings.I think the problem will just disappear once you can specialize by conformance. Then all DynamicVectors holding trivial types will just use memcpy.
What do you mean by "specializing by conformance", can you elaborate? I am not getting the context!
I updated the code and benchmarks! It now shows that the
StringBuilder
is around 10x faster than + concatenation!
Code
https://github.com/maniartech/mojo-strings/blob/master/strings/builder.mojo
Benchmark
Benchmark Result
GitHub
mojo-strings/strings/builder.mojo at master · maniartech/mojo-strings
Strings manipulation library in Mojo language. Contribute to maniartech/mojo-strings development by creating an account on GitHub.
@Aamir This is great! I gave it a try and I'm also seeing 10x. I'm playing around with Int8 instead of String for the internal vector and I'm seeing 22x, but it's not really tested yet
Congrats @toasty, you just advanced to level 2!
Great, good to hear that!
@Aamir , did a bug report ever get opened for this thow working with simple
vec[i]
?Nope, I didn't do that!
Just finalized the string builder library! Let me know if anyone of you see any improvement areas!
https://github.com/maniartech/mojo-stringbuilder
GitHub
GitHub - maniartech/mojo-stringbuilder: Strings manipulation librar...
Strings manipulation library in Mojo language. Contribute to maniartech/mojo-stringbuilder development by creating an account on GitHub.