C#•2y ago

❔ Format parsing problem with character collision

Hi everyone, assume I have the following bencode listl4:spam5:helloe (where the last e signifies the end of the list) Apart from reading each element until there is an e which would create code duplication for me, is there any way to easily identify the difference between an e inside of a string and an e as the end of the list object? (or other objects such as integers for that matter)

27 Replies

AsherOP•2y ago

int indexOfNextE = str.IndexOf('e', i);
                    // TODO: add 0 check
                    BencodeInteger newInteger = DecodeInteger(str.Substring(i + 1, indexOfNextE - 1));
                    list.Add(newInteger);

int indexOfNextE = str.IndexOf('e', i);
                    // TODO: add 0 check
                    BencodeInteger newInteger = DecodeInteger(str.Substring(i + 1, indexOfNextE - 1));
                    list.Add(newInteger);

this is an example of what I mean important to note, here I can be sure that the next e will be the end of the object https://wiki.theory.org/BitTorrentSpecification#Bencoding <- the format I'm working with

Pobiega•2y ago

Not really. bencoding relies on the length of each item, as I'm sure you know the only reason we know that e ends the list is that hello was length-prefixed Does your parser not handle this for you?

AsherOP•2y ago

I'm building one from scratch

Pobiega•2y ago

Thats fair, but again, why are you not handling this? 😄 Not sure how this would result in code duplication. A simple parser combinator would handle bencoding fairly well I imagine

AsherOP•2y ago

well, I have 4 separate functions for parsing each type of object str int dict & list

Pobiega•2y ago

sure, yep

AsherOP•2y ago

string and int were easy enough obviously but a list can contain a list in which case, I'd prob need to call the function recursively

Pobiega•2y ago

Makes sense, yep

AsherOP•2y ago

the way I built my functions, they get a substring that they parse in their own way and return (my version) some BencodeObject derived class for example, in the list parse function, this is how I parse a string

            for (int i = 0; i < str.Length; i++)
            {
                char c = str[i];

                // BencodeString
                if(char.IsNumber(c))
                {
                    int j = i;
                    while(str[j] != ':')
                    {
                        j++;
                    }

                    int numCharacters = int.Parse(str.Substring(i, j - 1));

                    BencodeString newString = DecodeString(str.Substring(i, j + numCharacters));
                    list.Add(newString);
                    i = j + numCharacters;
                }
                ... // rest of the objects

            for (int i = 0; i < str.Length; i++)
            {
                char c = str[i];

                // BencodeString
                if(char.IsNumber(c))
                {
                    int j = i;
                    while(str[j] != ':')
                    {
                        j++;
                    }

                    int numCharacters = int.Parse(str.Substring(i, j - 1));

                    BencodeString newString = DecodeString(str.Substring(i, j + numCharacters));
                    list.Add(newString);
                    i = j + numCharacters;
                }
                ... // rest of the objects

list is the list that will eventually be returned from this function if that makes sense so assume that i=0 is right where the list starts and i = str.length - 1 is an e that ends the list

Pobiega•2y ago

A fairly common approach to parsers is to return two things first the result of the parse (the list, string, int etc) and secondly.. the rest of the string ie, where our parser stopped

AsherOP•2y ago

I assume this is where I do i = j + numCharacters but in way of a return statement

Pobiega•2y ago

sort of, I guess Let me just say that I'm in no way an expert on parsers, and I've only ever written parser combinators before but its a very neat way of doing it

AsherOP•2y ago

I will mention I don't know what parser combinators are but you did give me a different idea

Pobiega•2y ago

well, you can google it and I suggest you do, there are plenty of articles and videos on the topic. Its really interesting too, imho

AsherOP•2y ago

I was thinking of possibly having a recursive function for every element and it combines it into the root

Pobiega•2y ago

but the idea is that we can write a parser that parses a single character and if we combine several of those, we can parse a word etc so you build progressively larger parsers, by combining smaller ones it eventually leads to some very pretty code, where you declare a list as "zero or more valid bencode tokens" and a "bencode token" is declared as "a bencoded list, dictionary, byte, string or int" etc

AsherOP•2y ago

that sounds like what I need I'm gonna look at some articles might try my idea aswell but this seems like it's perfectly suited for me

Pobiega•2y ago

there is a library for C# called Pidgin that helps you write parsers but its also entirely doable to make something from scratch, if you want to

AsherOP•2y ago

I'll see that aswell

Pobiega•2y ago

Got it, using Pidgin 🙂

dictionary: d4:spam4:eggs2:hii16ee
spam: "eggs"
hi: 16

dictionary: d4:spam4:eggs2:hii16ee
spam: "eggs"
hi: 16

AsherOP•2y ago

mind sharing your code? I won't copy as I want to do this myself, haven't had the time to look at Pidgin yet either just as a general idea

Pobiega•2y ago

public abstract record BToken
{
}

public record BString(string Value) : BToken
{
    public override string ToString() => $"\"{Value}\"";
}

public record BInteger(int Value) : BToken
{
    public override string ToString() => Value.ToString();
}

public record BList(ImmutableArray<BToken> Elements) : BToken;

public record BDictionary(ImmutableDictionary<string, BToken> Elements) : BToken;

public abstract record BToken
{
}

public record BString(string Value) : BToken
{
    public override string ToString() => $"\"{Value}\"";
}

public record BInteger(int Value) : BToken
{
    public override string ToString() => Value.ToString();
}

public record BList(ImmutableArray<BToken> Elements) : BToken;

public record BDictionary(ImmutableDictionary<string, BToken> Elements) : BToken;

my models that Im parsing into

private static readonly Parser<char, char> _i = Char('i');
private static readonly Parser<char, char> _l = Char('l');
private static readonly Parser<char, char> _d = Char('d');
private static readonly Parser<char, char> _e = Char('e');
private static readonly Parser<char, char> _colon = Char(':');

private static Parser<char, string> StringWithLength(int length) => Any.RepeatString(length);

private static readonly Parser<char, int> _int = UnsignedInt(10);
private static readonly Parser<char, int> BencodeLength = _int.Before(_colon);

public static readonly Parser<char, BToken> BencodeString =
    BencodeLength.Bind(StringWithLength).Select<BToken>(x => new BString(x));

public static readonly Parser<char, BToken> BencodeInt =
    _i.Then(_int).Before(_e).Select<BToken>(x => new BInteger(x));

public static readonly Parser<char, BToken> Bencoded =
    BencodeString.Or(BencodeInt).Or(Rec(() => BencodeList)).Or(Rec(() => BencodeDictionary));

public static readonly Parser<char, BToken> BencodeList =
    Bencoded.Many().Between(_l, _e).Select<BToken>(x => new BList(x.ToImmutableArray()));

private static readonly Parser<char, KeyValuePair<string, BToken>> KeyValuePair =
    BencodeString
        .Then(Bencoded, (name, val) => new KeyValuePair<string, BToken>((name as BString)!.Value, val));

public static readonly Parser<char, BToken> BencodeDictionary =
    KeyValuePair.Many().Between(_d, _e).Select<BToken>(x => new BDictionary(x.ToImmutableDictionary()));

private static readonly Parser<char, char> _i = Char('i');
private static readonly Parser<char, char> _l = Char('l');
private static readonly Parser<char, char> _d = Char('d');
private static readonly Parser<char, char> _e = Char('e');
private static readonly Parser<char, char> _colon = Char(':');

private static Parser<char, string> StringWithLength(int length) => Any.RepeatString(length);

private static readonly Parser<char, int> _int = UnsignedInt(10);
private static readonly Parser<char, int> BencodeLength = _int.Before(_colon);

public static readonly Parser<char, BToken> BencodeString =
    BencodeLength.Bind(StringWithLength).Select<BToken>(x => new BString(x));

public static readonly Parser<char, BToken> BencodeInt =
    _i.Then(_int).Before(_e).Select<BToken>(x => new BInteger(x));

public static readonly Parser<char, BToken> Bencoded =
    BencodeString.Or(BencodeInt).Or(Rec(() => BencodeList)).Or(Rec(() => BencodeDictionary));

public static readonly Parser<char, BToken> BencodeList =
    Bencoded.Many().Between(_l, _e).Select<BToken>(x => new BList(x.ToImmutableArray()));

private static readonly Parser<char, KeyValuePair<string, BToken>> KeyValuePair =
    BencodeString
        .Then(Bencoded, (name, val) => new KeyValuePair<string, BToken>((name as BString)!.Value, val));

public static readonly Parser<char, BToken> BencodeDictionary =
    KeyValuePair.Many().Between(_d, _e).Select<BToken>(x => new BDictionary(x.ToImmutableDictionary()));

my parsers fixed and here is some test code

var test = "d4:spam4:eggs2:hii16ee";
Console.WriteLine($"dictionary: {test}");

var result = BencodeDictionary.ParseOrThrow(test) as BDictionary;

foreach (var token in result.Elements)
{
    Console.WriteLine($"{token.Key}: {token.Value}");
}

var test = "d4:spam4:eggs2:hii16ee";
Console.WriteLine($"dictionary: {test}");

var result = BencodeDictionary.ParseOrThrow(test) as BDictionary;

foreach (var token in result.Elements)
{
    Console.WriteLine($"{token.Key}: {token.Value}");
}

the Between combinator was awesome

Ꜳåąɐȁặⱥᴀᴬ•2y ago

do you really have to use e? you are using utf8, right?

Pobiega•2y ago

the e is set by the encoding, so yes you can't just come up with your own characters and still claim to be fully bencode compliant its like replacing the } in json :p

Ꜳåąɐȁặⱥᴀᴬ•2y ago

aaaaa ok i didn't know it was a standard also because if it was a standard, i thought, what is the problem, it's all already done because there would be already an escaping method or something

Pobiega•2y ago

As Asher said, he is implementing a parser from scratch. There are ofc several existing parsers.

Accord•2y ago

Was this issue resolved? If so, run /close - otherwise I will mark this as stale and this post will be archived until there is new activity.

Gaming

Programming

❔ Format parsing problem with character collision

Did you find this page helpful?