Firebolt: In-progress implementation of Apache Arrow in Mojo

16 Replies
Krisztian Szucs · 2mo ago
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format.
GnU So Cute · 2mo ago
You should put the test folder outside of the source folder.
Krisztian Szucs · 2mo ago
The test runner is able to discover the test cases there as well, and it has been my preference for Python projects.
GnU So Cute · 2mo ago
I mean the test folder should not be placed inside the library folder, so that people can reduce the package size when they use it.
Krisztian Szucs · 2mo ago
Well, the implementation is not there yet.
Darin Simmons · 2mo ago
Looking forward to seeing all the things go brrrr. I hope that the Apache folks agree with you and make Mojo first-class. Props on all the Mojo contributions. Oh yeah, the name writes itself, very nice 🙂 One comment: could you add something about the PyArrow requirements to the readme, or even a requirements.txt? For example, if I don't use the C Data Interface, is PyArrow optional or mandatory?
Krisztian Szucs · 2mo ago
Entirely optional; it is only used for testing the zero-copy exchange interface. I am a maintainer of apache/arrow. Once Mojo gets adopted enough and the Arrow implementation gets mature enough, then it will make sense to push it upstream. That is a long-term goal, though.
guidorice · 2mo ago
@kszucs cool, and interesting! I am curious what you think about this proposal I opened: https://github.com/modularml/mojo/issues/1515 It seems to me that Mojo will need to be enhanced to allow zero-copy interactions with Arrow-formatted data and to have any kind of interoperability with the rest of the Arrow ecosystem.
GitHub
[Feature Request] memoryview builtin and support for python buffer ...
guidorice · 2mo ago
I also quote from the Arrow documentation in that issue. I still need to read through it all, but it sounds like you may have solved the zero-copy use case.
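For readers unfamiliar with the Python buffer protocol referenced in the linked issue, here is a minimal sketch of what "zero-copy" means on the Python side. It uses only the standard library (`array` and `memoryview`); it is not Firebolt code and says nothing about how Mojo would expose such a view.

```python
import array

# A contiguous buffer of four C ints owned by `values`.
values = array.array("i", [1, 2, 3, 4])

# memoryview exposes the same memory through the buffer protocol; no copy is made.
view = memoryview(values)

# Slicing a memoryview yields another view onto the same memory.
tail = view[2:]

# Writing through the view is visible in the original buffer,
# which shows that both names refer to one allocation.
tail[0] = 30
shared = values[2] == 30
```

A consumer that understands the buffer protocol can read `values` in place this way instead of receiving a copy, which is the property the proposal asks Mojo to support.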
Krisztian Szucs · 2mo ago
The Python buffer protocol is pretty similar to the Arrow C Data Interface. I think both are really important. For now the exchange only partially works: it goes in one direction, with Mojo as the consumer, because Mojo callbacks cannot be passed to the C side. Also, the C layout of the structs involved is not guaranteed, but hopefully these issues are going to be sorted out in Mojo soon enough.
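To make the two constraints above concrete (a fixed C struct layout and a producer-supplied callback), here is a rough sketch of the `ArrowArray` struct from the Arrow C Data Interface, mirrored with Python's `ctypes`. The field order follows the published specification; the helper `read_int32`, the callback `release_array`, and the sample data are hypothetical and only for illustration. This is not how Firebolt implements it.

```python
import ctypes

class ArrowArray(ctypes.Structure):
    pass  # fields are assigned below because the struct refers to itself

# The producer supplies a release callback; the consumer calls it when done.
RELEASE = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

# Field order and types must match the C definition exactly.
ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", RELEASE),
    ("private_data", ctypes.c_void_p),
]

released_lengths = []

@RELEASE
def release_array(array_ptr):
    # Hypothetical producer callback: it only records the call. A real
    # producer would also free its buffers and set `release` to NULL.
    released_lengths.append(array_ptr.contents.length)

# Producer side: an int32 array [1, 2, 3] with no nulls.
data = (ctypes.c_int32 * 3)(1, 2, 3)
# Primitive arrays carry two buffers: validity bitmap (NULL here) and data.
buffers = (ctypes.c_void_p * 2)(None, ctypes.addressof(data))

arr = ArrowArray()
arr.length = 3
arr.null_count = 0
arr.offset = 0
arr.n_buffers = 2
arr.n_children = 0
arr.buffers = ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p))
arr.release = release_array

def read_int32(a):
    # Consumer side: read the data buffer in place, without copying it first.
    ptr = ctypes.cast(a.buffers[1], ctypes.POINTER(ctypes.c_int32))
    return [ptr[a.offset + i] for i in range(a.length)]

result = read_int32(arr)
# The consumer signals that it is finished by invoking the producer's callback.
arr.release(ctypes.pointer(arr))
```

Consuming only requires reading the fields and calling `release`, whereas producing requires handing a callable function pointer to the other side. That is consistent with the one-directional limitation described above.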
sa-code · 2mo ago
@Maxim worked on generating the FlatBuffers schema files for Arrow in Mojo here, as part of a different effort for Arrow in Mojo: https://github.com/mojo-data/arrow-schema Just wanted to let you know in case it's useful to you. I'm also open to collaborating if you are!
Krisztian Szucs · 2mo ago
That will be required for the IPC format, along with Mojo JSON for the integration tests. Yes, ideally we should join efforts.
Maxim · 2mo ago
🙏
sa-code · 2mo ago
Awesome, DMing you to coordinate
guidorice · 2mo ago
Hi @Krisztian Szucs, if I can make a humble suggestion, would you consider publishing release versions that track the tagged Mojo releases, e.g. v24.4? This is similar to how https://github.com/endia-org/Endia does it, for example. It makes it easier to get started as a package user. I am interested in using Firebolt to build a GeoArrow ( https://geoarrow.org ) integration and, after that, to create a rasterization package that converts geo vector data into Mojo Tensors.