Microsoft DataFrame example not working
Hello there, I'm trying to understand Microsoft DataFrames for a C# project where I need to sum the prices of items that share the same name inside a CSV (sum the costs for each entry of "apple", "banana", etc.). In my Python version I used pandas for that and pivoted after dropping the unneeded columns, which got me what I wanted in a few lines of code.
But now I'm already stuck trying to follow the examples Microsoft provides for DataFrames. I tried to copy the code from the "Combine Data Sources" section, but I'm getting an error saying that the column "id" doesn't exist. Does anyone know how good the Microsoft website is for getting into DataFrames, or is there a better place or solution to achieve what I want?
https://learn.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/getting-started-dataframe
Getting started with DataFrames - ML.NET
Learn how to use DataFrame to manipulate and prepare data.
Any particular reason you need to use machine learning for that, instead of a simple LINQ query?
Thanks for the quick response. On the Microsoft page it states that DataFrames can also be used for data manipulation, and my Google search for a pandas equivalent pointed me towards it, so I thought that was alright too, since I used pandas DataFrames in Python as well... I also stumbled upon LINQ and tried it a bit, but that didn't work out either. I sadly already deleted my LINQ attempts, so I can't provide them right now. I think I tried to use a group by statement on the itemName column of the CSV, but the code threw an exception.
The csv looks like this, just bigger and with more items and entries:
As I said, in pandas I just made a DataFrame, dropped all columns besides itemString, itemName, quantity and price, made a pivot table with itemString and itemName as the index, and used a sum operation for the quantity and price columns. That was fairly simple for me as a beginner. But I don't insist on using Microsoft DataFrame if a LINQ solution is easier.
Angius
REPL Result: Failure
Exception: CompilationErrorException
Compile: 608.568ms | Execution: 0.000ms
That's what I get for writing code on Discord lol
Angius
REPL Result: Success
Result: List<ValueTuple<string, int>>
Compile: 625.373ms | Execution: 77.272ms
'Ere
Could probably make it shorter with .Aggregate()
Oh my god, that's working, and I tried for hours yesterday. Thank you so much! Now I need to see what's different here from what I tried and properly understand it. Appreciated!
If you have questions about anything in that query, feel free to ask
Ah okay if you offer that...
So as far as I understand, we make a "selection" variable r that gets filled with the content of each row split by the separator ",", so it basically becomes an array, I guess. We then make a new selection of r, selecting only the 2nd [1] and the 5th [4] column and name them name and price. We also make sure that the price column gets parsed as an integer so that the Sum method works.
Then we group by the name column of the r select variable and make a new selection variable called g. Does g's value get filled automatically by whatever GroupBy spits out before? I also don't understand why we write "name: g.Key". How do I know what .Key is? I hover over it in Visual Studio and read "gets the key of the IGrouping<out TKey, out TElement>", but it confuses rather than enlightens me. I think IGrouping is a type of object that gets created by the GroupBy statement? And <out TKey, out TElement> correspond to the name and price columns. If I remember correctly, keys of tables and their entries are used, e.g., when we want to join tables.
I also don't understand why we need to write g.Sum(i => i.price) instead of just g.Sum(price) or g.Sum(int.Parse(r[4])) like we did before in the select statement.
What I think I understand, too, is that => is a lambda and is used inside a LINQ query if we execute the statement later and not immediately. But maybe I misunderstand; I'm just trying to remember things I read online when I tried some stuff yesterday.
Also why do we need to make a new select variable "g" in the last select?
selecting only the 2nd [1] and the 5th [4] column and name them name and price
We turn them into a named ValueTuple, to be more precise
Does g's value get filled automatically by whatever GroupBy spits out before?
.GroupBy() spits out a grouping. Basically, a
would result in something like
with Values being available straight from that grouping thanks to it implementing the IEnumerable interface.
Key will be what you group on, in this case "even" or "odd". Values will be the items that fit that group
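The snippets from this message did not survive, so here is a minimal sketch of what an "even"/"odd" grouping could look like. The input array and the variable names are assumptions made for illustration, not the original code.

```csharp
using System;
using System.Linq;

// Hypothetical input: the numbers 1 through 6.
int[] numbers = { 1, 2, 3, 4, 5, 6 };

// GroupBy returns one IGrouping<string, int> per distinct key.
// The lambda decides which key ("even" or "odd") each number belongs to.
var groups = numbers.GroupBy(n => n % 2 == 0 ? "even" : "odd");

foreach (var g in groups)
{
    // g.Key is the value we grouped on.
    // g itself can be enumerated to get the members of that group.
    Console.WriteLine($"{g.Key}: {string.Join(", ", g)}");
}
// prints:
// odd: 1, 3, 5
// even: 2, 4, 6
```

Groups come out in the order their key is first seen, which is why "odd" is listed first here.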
I also don't understand why we need to write g.Sum(i => i.price) instead of just g.Sum(price)
Because g at this point is an IEnumerable<(string name, int price)> and it's just the prices we need to sum. Just price means nothing here. Sum() takes a lambda whose parameter is each consecutive item of the IEnumerable you're summing up
or g.Sum(int.Parse(r[4]))
r does not exist here
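A small sketch of the selector lambda, assuming a hypothetical array of (name, price) tuples standing in for one group:

```csharp
using System;
using System.Linq;

// Hypothetical items shaped like the tuples in the query.
var items = new (string name, int price)[]
{
    ("apple", 3),
    ("apple", 4),
};

// Sum has to be told which part of each tuple to add up.
// The lambda receives one item at a time (named i here) and returns its price.
int total = items.Sum(i => i.price);

Console.WriteLine(total); // prints 7
```

The name i only exists inside the lambda, the same way r only exists inside the earlier Select lambdas.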
Also why do we need to make a new select variable "g" in the last select?
Otherwise you'd get an IEnumerable<IGrouping<string, (string name, int price)>>. We want just an IEnumerable<(string name, int price)> where price is the total
Just to be clear, "we want just IEnumerable<[...]" or "just IGrouping<[...]"? Because Visual Studio doesn't say g is IEnumerable, but IGrouping.
So does the first Select(r => work with the return value provided by csv.Split? The second Select(r => works with the value returned by the first Select(r =>, the GroupBy works with that returned value, and so on?
Again, a big thank you for explaining that to me further. I wonder how I would learn that stuff the best if I don't have somebody who can help or explain it to me? Reading Microsoft's documentation for everything?
g is an IGrouping, yes
We select just the key and the sum of prices from it, into a tuple
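The query itself was not preserved in this thread, so the following is a reconstruction from the description above, not the original code. The CSV content is made up, and it assumes no header row, the name in column [1] and an integer price in column [4].

```csharp
using System;
using System.Linq;

// Hypothetical CSV rows: name in column [1], price in column [4].
string csv = string.Join("\n",
    "1,apple,fruit,2,3",
    "2,banana,fruit,1,5",
    "3,apple,fruit,4,4");

var totals = csv
    .Split('\n')                                            // one string per row
    .Select(r => r.Split(','))                              // each row becomes a string[] of columns
    .Select(r => (name: r[1], price: int.Parse(r[4])))      // keep only name and price, as a named tuple
    .GroupBy(r => r.name)                                   // one IGrouping per distinct name
    .Select(g => (name: g.Key, price: g.Sum(i => i.price))) // collapse each group to (name, total)
    .ToList();

foreach (var (name, price) in totals)
{
    Console.WriteLine($"{name}: {price}");
}
// prints:
// apple: 7
// banana: 5
```

Each step works on whatever the previous step returned, which is why r means a string[] in the second Select but a tuple in GroupBy.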
how I would learn that stuff the best
Microsoft LINQ documentation: https://learn.microsoft.com/en-us/dotnet/csharp/linq/ Though it favours the less-used SQL-like query syntax for it