❔ Format parsing problem with character collision
Hi everyone, assume I have the following bencode list
l4:spam5:helloe
(where the last e signifies the end of the list)
Apart from reading each element until there is an e, which would create code duplication for me, is there any way to easily identify the difference between an e inside of a string and an e as the end of the list object? (or other objects such as integers for that matter)
this is an example of what I mean
important to note, here I can be sure that the next e will be the end of the object
https://wiki.theory.org/BitTorrentSpecification#Bencoding <- the format I'm working with
Not really. Bencoding relies on the length of each item, as I'm sure you know
the only reason we know that e ends the list is that hello was length-prefixed
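To make the length-prefix point concrete, here is a minimal Python sketch. It is not code from this thread, and `parse_string` is a hypothetical helper name. It reads the digits before the colon, then consumes exactly that many bytes, so any e inside the payload is never inspected as a delimiter.

```python
from typing import Tuple


def parse_string(data: bytes, i: int) -> Tuple[bytes, int]:
    """Parse a bencoded byte string starting at index i.

    Returns the payload and the index just past it.
    """
    colon = data.index(b":", i)      # the length digits end at the first ':'
    length = int(data[i:colon])      # e.g. b"5" -> 5
    start = colon + 1
    end = start + length
    if end > len(data):
        raise ValueError("declared length runs past the end of the input")
    return data[start:end], end


data = b"l4:spam5:helloe"
first, pos = parse_string(data, 1)   # index 1 skips the leading 'l'
second, pos = parse_string(data, pos)
# The 'e' in "hello" was consumed as payload, so the byte now at `pos`
# can only be the list terminator.
```

After both calls, `pos` points at the final e, which is the only e the list parser ever has to look at.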
Does your parser not handle this for you?
I'm building one from scratch
That's fair, but again, why are you not handling this? 😄
Not sure how this would result in code duplication. A simple parser combinator would handle bencoding fairly well I imagine
well, I have 4 separate functions for parsing each type of object: str, int, dict & list
sure, yep
string and int were easy enough obviously
but a list can contain a list
in which case, I'd prob need to call the function recursively
Makes sense, yep
the way I built my functions, they get a substring that they parse in their own way and return (my version) some BencodeObject derived class
for example, in the list parse function, this is how I parse a string
list is the list that will eventually be returned from this function
if that makes sense
so assume that i=0 is right where the list starts and i = str.length - 1 is an e that ends the list
A fairly common approach to parsers is to return two things
first the result of the parse (the list, string, int etc) and secondly.. the rest of the string
ie, where our parser stopped
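A rough Python sketch of that "return the result plus where we stopped" idea, assuming the whole input is held in memory as `bytes`. The function name `parse` and the choice to return an index instead of the remaining substring are illustrative assumptions, not anyone's actual code from this thread. Because every call reports where it stopped, the list branch can simply call `parse` again for nested values.

```python
from typing import Any, Tuple


def parse(data: bytes, i: int = 0) -> Tuple[Any, int]:
    """Parse one bencoded value at index i; return (value, index after it)."""
    lead = data[i:i + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if lead == b"l":                      # list: l<values>e
        items = []
        i += 1
        while data[i:i + 1] != b"e":
            item, i = parse(data, i)      # recursion handles nested lists
            items.append(item)
        return items, i + 1               # step over the closing 'e'
    if lead == b"d":                      # dictionary: d<key><value>...e
        result = {}
        i += 1
        while data[i:i + 1] != b"e":
            key, i = parse(data, i)       # keys are byte strings in valid input
            value, i = parse(data, i)
            result[key] = value
        return result, i + 1
    if lead.isdigit():                    # byte string: <length>:<payload>
        colon = data.index(b":", i)
        end = colon + 1 + int(data[i:colon])
        return data[colon + 1:end], end
    raise ValueError(f"unexpected byte at index {i}")


value, rest = parse(b"l4:spamli42eee")
```

Returning an index avoids copying the remaining input on every call; returning the leftover substring would work the same way.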
I assume this is where I do i = j + numCharacters, but by way of a return statement
sort of, I guess
Let me just say that I'm in no way an expert on parsers, and I've only ever written parser combinators before
but it's a very neat way of doing it
I will mention I don't know what parser combinators are
but you did give me a different idea
well, you can google it and I suggest you do, there are plenty of articles and videos on the topic. It's really interesting too, imho
I was thinking of possibly having a recursive function for every element and it combines it into the root
but the idea is that we can write a parser that parses a single character
and if we combine several of those, we can parse a word
etc
so you build progressively larger parsers, by combining smaller ones
it eventually leads to some very pretty code, where you declare a list as "zero or more valid bencode tokens"
and a "bencode token" is declared as "a bencoded list, dictionary, byte string or int"
etc
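As a hedged illustration of building bigger parsers out of smaller ones, here is a toy combinator set in Python. All names (`char`, `many`, `any_of`, `between`, `token`, `bencode_list`) are made up for this sketch and are not Pidgin's API. Dictionaries are left out to keep it short. Each parser takes `(data, index)` and returns `(value, next_index)`, or `None` on failure.

```python
from typing import Any, Callable, Optional, Tuple

Result = Optional[Tuple[Any, int]]
Parser = Callable[[bytes, int], Result]


def char(c: bytes) -> Parser:
    """Match exactly one given byte."""
    def run(data: bytes, i: int) -> Result:
        return (c, i + 1) if data[i:i + 1] == c else None
    return run


def many(p: Parser) -> Parser:
    """Apply p zero or more times and collect the results."""
    def run(data: bytes, i: int) -> Result:
        out = []
        while True:
            r = p(data, i)
            if r is None:
                return out, i
            out.append(r[0])
            i = r[1]
    return run


def any_of(*parsers: Parser) -> Parser:
    """Return the result of the first parser that succeeds."""
    def run(data: bytes, i: int) -> Result:
        for p in parsers:
            r = p(data, i)
            if r is not None:
                return r
        return None
    return run


def between(opening: Parser, inner: Parser, closing: Parser) -> Parser:
    """Run opening, inner, closing in sequence; keep only inner's value."""
    def run(data: bytes, i: int) -> Result:
        r = opening(data, i)
        if r is None:
            return None
        r = inner(data, r[1])
        if r is None:
            return None
        value, i = r
        r = closing(data, i)
        return None if r is None else (value, r[1])
    return run


def byte_string(data: bytes, i: int) -> Result:
    """<length>:<payload>"""
    colon = data.find(b":", i)
    if colon == -1 or not data[i:colon].isdigit():
        return None
    end = colon + 1 + int(data[i:colon])
    return (data[colon + 1:end], end) if end <= len(data) else None


def integer(data: bytes, i: int) -> Result:
    """i<digits>e"""
    end = data.find(b"e", i)
    if data[i:i + 1] != b"i" or end == -1:
        return None
    try:
        return int(data[i + 1:end]), end + 1
    except ValueError:
        return None


def token(data: bytes, i: int) -> Result:
    # Looked up at call time, so the mutual recursion with bencode_list works.
    return any_of(integer, byte_string, bencode_list)(data, i)


# "A list is zero or more valid bencode tokens" between 'l' and 'e'.
bencode_list = between(char(b"l"), many(token), char(b"e"))
```

The last line reads almost like the prose definition above it, which is the main appeal of the approach.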
that sounds like what I need
I'm gonna look at some articles
might try my idea as well but this seems like it's perfectly suited for me
there is a library for C# called Pidgin that helps you write parsers
but it's also entirely doable to make something from scratch, if you want to
I'll see that as well
Got it, using Pidgin 🙂
mind sharing your code? I won't copy as I want to do this myself, haven't had the time to look at Pidgin yet either
just as a general idea
my models that I'm parsing into
my parsers
fixed
and here is some test code
the Between combinator was awesome
do you really have to use e? you are using utf8, right?
the e is set by the encoding, so yes
you can't just come up with your own characters and still claim to be fully bencode compliant
it's like replacing the } in json :p
aaaaa ok i didn't know it was a standard
also because if it was a standard, i thought, what is the problem, it's all already done
because there would be already an escaping method or something
As Asher said, he is implementing a parser from scratch. There are ofc several existing parsers.