A generic and efficient way to feed in text for a lexer or scanner
In the past I was lazy and always just shoved a string into my lexical analyzer as input, but what's the "ideal" way to read source text? It should be relatively efficient as the lexer observes each character. It should also support peeking forward without consumption. The thing is, source code can come from many different places, like a file, a console REPL as lines, edit diffs from a language client, ...
I'd like some way to read tokens from essentially any source without paying for virtual calls for each character read or something.
12 Replies
Maybe I'd need a view-type like a ReadOnlyMemory or even ReadOnlySpan?
Thing is, we have so many options: string, IEnumerable<char>, ReadOnlyMemory<char>, ReadOnlySpan<char>, TextReader, IO pipelines, ...a stream with an internal buffer?
Something I am doing is making use of a TextReader; it reads a stream of data going forwards only. There is a small problem with it though: sometimes when you call Peek() on it, it will act as if there is nothing left to look for, even though if you were to call Read() it would return the next character. What I did was create a simple wrapper around it with a small internal buffer it can use to peek into.

A stream + an internal buffer doesn't sound bad, but in practice it's a pain to work with streams IIRC
So far it was pretty painless; you just need to know what you are doing.
TextReader is a whole different topic, I'm talking about Stream here. Readers are a way friendlier way to tackle this problem.

I did make some extra functions to make it a little bit easier to use, for example peeking for a specific character and moving the reader forwards if it matches.
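Roughly like this, just a sketch; PeekingReader and the method names are made up here, not my exact code:

```cs
using System.Collections.Generic;
using System.IO;

// Sketch of a peek buffer over a forward-only TextReader (placeholder names).
public sealed class PeekingReader
{
    private readonly TextReader _reader;
    private readonly List<char> _lookahead = new List<char>();

    public PeekingReader(TextReader reader) => _reader = reader;

    // Look ahead 'offset' characters (0 = the next one) without consuming anything.
    public bool TryPeek(int offset, out char ch)
    {
        // Pull characters into the buffer until the peek can be satisfied.
        while (_lookahead.Count <= offset)
        {
            int next = _reader.Read();
            if (next < 0) { ch = default; return false; }
            _lookahead.Add((char)next);
        }
        ch = _lookahead[offset];
        return true;
    }

    // Consume the next character, taking it from the buffer if one was peeked.
    public bool TryRead(out char ch)
    {
        if (!TryPeek(0, out ch)) return false;
        _lookahead.RemoveAt(0);
        return true;
    }

    // The "peek for a specific character and advance if it matches" helper.
    public bool TryMatch(char expected)
    {
        if (!TryPeek(0, out char ch) || ch != expected) return false;
        _lookahead.RemoveAt(0);
        return true;
    }
}
```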
Sure, you just preallocate a buffer and you got all those functions
I'm still not sure if it's the ideal thing to use but I might not have anything better
Is there any reason you could not use a reader?
Oh absolutely, I'm planning to have non-sequential reads and I'd be happy if I didn't have to reimplement my lexer for that
Thing is, I'd want to extend my lexer to be optionally incremental, and then I can't afford re-reading the entire source
A Stream can at least seek in this case.

What I would do is, instead of going over the source file again, go over it once, tokenize it, and structure it into a tree immediately. Then if you want to do multiple passes you can go to the tree instead and transform it.
That has no relation to the problem, yes streaming lexers are a thing
For now I've decided to roll with the following interface:
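Something along these lines, as a sketch; ISourceReader and the member names are placeholders, not necessarily the final shape:

```cs
// Sketch only: placeholder names.
public interface ISourceReader
{
    // Current character offset; the setter throws on sources that can't seek.
    int Position { get; set; }

    // Look ahead 'offset' characters without consuming anything; false past the end.
    bool TryPeek(int offset, out char ch);

    // Consume and return the next character; false past the end.
    bool TryRead(out char ch);
}
```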
And I'll make my lexer take a generic parameter, and wrap my source readers as structs so I can guarantee the compiler will inline each type, no vcalls
Setting the position will just throw on sources that can't support it; I think that's reasonable, as some configurations just can't allow for that
My lexer will be a very thin type with a source reader and a single Next() method anyway, then I'll wrap that for different configurations, like incremental/non-incremental and streaming/non-streaming
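To sketch what I mean by no vcalls (Lexer, StringSource, and TokenKind are made-up names, and this assumes the ISourceReader shape above): since the reader is a struct and the lexer is generic over it, the JIT specializes the lexer per reader type, so the per-character calls are direct rather than virtual.

```cs
// Sketch: placeholder names, not the actual types from this project.
public enum TokenKind { EndOfInput, Identifier, Number, Unknown }

// A struct reader over an in-memory string.
public struct StringSource : ISourceReader
{
    private readonly string _text;
    private int _pos;

    public StringSource(string text) { _text = text; _pos = 0; }

    // An in-memory source can support repositioning.
    public int Position { get => _pos; set => _pos = value; }

    public bool TryPeek(int offset, out char ch)
    {
        int i = _pos + offset;
        if (i < _text.Length) { ch = _text[i]; return true; }
        ch = default; return false;
    }

    public bool TryRead(out char ch)
    {
        if (_pos < _text.Length) { ch = _text[_pos++]; return true; }
        ch = default; return false;
    }
}

// The thin lexer: generic over the reader. Because StringSource is a struct,
// the JIT specializes Lexer<StringSource> and the reader calls are direct,
// not virtual, and can be inlined.
public struct Lexer<TReader> where TReader : ISourceReader
{
    private TReader _reader;

    public Lexer(TReader reader) => _reader = reader;

    // Stand-in token rule: classify a single character. Returns only the kind
    // to keep the sketch short; a real Next() would produce a full token.
    public TokenKind Next()
    {
        if (!_reader.TryRead(out char ch)) return TokenKind.EndOfInput;
        return char.IsDigit(ch) ? TokenKind.Number
             : char.IsLetter(ch) ? TokenKind.Identifier
             : TokenKind.Unknown;
    }
}
```

Usage would be something like new Lexer<StringSource>(new StringSource(source)); a forward-only source (like the TextReader wrapper further up) would just be another struct implementing the same interface, with the Position setter throwing.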