C#, 3y ago
LPeter1997

A generic and efficient way to feed in text for a lexer or scanner

In the past I was lazy and always just shoved a string into my lexical analyzer as input, but what's the "ideal" way to read source text? It should be relatively efficient, as the lexer observes each character. It should also support peeking forward without consumption. The thing is, source code can come from many different places, like a file, a console REPL as lines, edit diffs from a language client, ... I'd like some way to read tokens from essentially any source without paying for a virtual call on every character read.
12 Replies
LPeter1997 (OP), 3y ago
Maybe I'd need a view-type like a ReadOnlyMemory or even ReadOnlySpan? Thing is, we have so many options. string, IEnumerable<char>, ReadOnlyMemory<char>, ReadOnlySpan<char>, TextReader, IO pipelines, ...
Chiyoko_S, 3y ago
a stream with an internal buffer?
Pheubel, 3y ago
Something I am doing is making use of a TextReader, which reads a stream of data going forwards only. There is a small problem with it though: sometimes when you call Peek() on it, it will report that there is nothing left to read, even though calling Read() would return the next character. What I did was create a simple wrapper around it with a small internal buffer it can use to peek into.
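A minimal sketch of what such a wrapper might look like. This is not Pheubel's actual code: the class name `PeekableReader` and its members are hypothetical. The idea is to never call `TextReader.Peek()` at all and instead pull characters with `Read()` into a small circular buffer whenever lookahead is requested.

```csharp
using System;
using System.IO;

// Hypothetical wrapper (not from the thread): buffers characters that were
// read ahead, so peeking never depends on TextReader.Peek().
public sealed class PeekableReader
{
    private readonly TextReader reader;
    private char[] buffer = new char[16]; // circular lookahead buffer
    private int start;                    // index of the oldest buffered character
    private int count;                    // number of buffered characters

    public PeekableReader(TextReader reader) => this.reader = reader;

    // Returns the character `offset` positions ahead without consuming it,
    // or `default` if the input ends before that position.
    public char Peek(int offset = 0, char @default = '\0')
    {
        while (count <= offset)
        {
            int next = reader.Read();
            if (next < 0) return @default;
            if (count == buffer.Length) Grow();
            buffer[(start + count) % buffer.Length] = (char)next;
            count++;
        }
        return buffer[(start + offset) % buffer.Length];
    }

    // Consumes one character; returns false at the end of input.
    public bool TryRead(out char ch)
    {
        if (count == 0)
        {
            int next = reader.Read();
            if (next < 0) { ch = '\0'; return false; }
            ch = (char)next;
            return true;
        }
        ch = buffer[start];
        start = (start + 1) % buffer.Length;
        count--;
        return true;
    }

    // Doubles the buffer, unrolling the circular contents to index 0.
    private void Grow()
    {
        var bigger = new char[buffer.Length * 2];
        for (int i = 0; i < count; i++)
            bigger[i] = buffer[(start + i) % buffer.Length];
        buffer = bigger;
        start = 0;
    }
}
```

The buffer only grows as far as the deepest lookahead ever requested, which for most lexers is a handful of characters.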
LPeter1997 (OP), 3y ago
A stream + an internal buffer doesn't sound bad but in practice it's a pain to work with streams IIRC
Pheubel, 3y ago
so far it was pretty painless, you just need to know what you are doing.
LPeter1997 (OP), 3y ago
TextReader is a whole different topic, I'm talking about Stream here. Readers are a much friendlier way to tackle this problem.
Pheubel, 3y ago
i did make some extra functions to make it a little bit easier to use, for example peek for a specific character and move the reader forwards if it matches
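A helper of the kind described here could look roughly like the sketch below. The name `TryConsume` is made up for illustration, and for brevity it is written directly against `TextReader.Peek()`, so it is only as reliable as the underlying reader's `Peek()` (a `StringReader` is fine). In the buffered-wrapper setup it would sit on the wrapper instead.

```csharp
using System.IO;

public static class TextReaderExtensions
{
    // Hypothetical helper: consumes the next character only if it equals
    // `expected`, and reports whether it did. Peek() returns -1 at the end
    // of input, which never equals a char, so the end is handled for free.
    public static bool TryConsume(this TextReader reader, char expected)
    {
        if (reader.Peek() != expected) return false;
        reader.Read();
        return true;
    }
}
```

This makes two-character tokens easy to recognize: after reading `=`, a call like `reader.TryConsume('=')` tells you whether you are looking at `==` or a plain `=`.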
LPeter1997 (OP), 3y ago
Sure, you just preallocate a buffer and you've got all those functions. I'm still not sure if it's the ideal thing to use, but I might not have anything better.
Pheubel, 3y ago
is there any reason you could not use a reader?
LPeter1997 (OP), 3y ago
Oh absolutely, I'm planning to have non-sequential reads, and I'd be happy if I didn't have to reimplement my lexer for that. Thing is, I'd want to extend my lexer to be optionally incremental, and then I can't afford re-reading the entire source. A Stream can at least seek in this case.
Pheubel, 3y ago
what i would do is instead of going over the source file again, go over it once and tokenize it and structure it in a tree immediately. then if you want to do multiple passes you can go to the tree instead and transform it
LPeter1997 (OP), 3y ago
That has no relation to the problem; yes, streaming lexers are a thing. For now I've decided to roll with the following interface:
public interface ISourceReader
{
    public bool IsEnd { get; }
    // Settable to be able to seek on incremental lexing
    public int Position { get; set; }
    public char Peek(int offset = 0, char @default = '\0');
    public void Advance(int length = 1);
}
And I'll make my lexer take a generic parameter, and wrap my source readers as structs so the compiler specializes the lexer for each reader type, no vcalls. Setting the position will just throw on sources that can't support it. I think that's reasonable, as some configurations just can't allow for that. My lexer will be a very thin type with a source reader and a single Next() method anyway, then I'll wrap that for different configurations, like incremental/non-incremental and streaming/non-streaming.
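A rough sketch of how that plan might fit together, under a few assumptions. The interface is repeated here with the method spelled `Advance`. `StringSourceReader` and `WordLexer` are hypothetical names, and the "lexer" just splits on whitespace to keep the example short. Because `TReader` is constrained to `struct`, the runtime generates separate code for each reader type, so calls on it do not need interface dispatch. Whether a given call is actually inlined is still up to the JIT.

```csharp
using System;
using System.Text;

public interface ISourceReader
{
    bool IsEnd { get; }
    // Settable to be able to seek on incremental lexing
    int Position { get; set; }
    char Peek(int offset = 0, char @default = '\0');
    void Advance(int length = 1);
}

// Hypothetical in-memory source, declared as a struct so a generic lexer
// gets its own specialized instantiation for it.
public struct StringSourceReader : ISourceReader
{
    private readonly string text;
    private int position;

    public StringSourceReader(string text)
    {
        this.text = text;
        this.position = 0;
    }

    public bool IsEnd => position >= text.Length;

    public int Position
    {
        get => position;
        set
        {
            if (value < 0 || value > text.Length)
                throw new ArgumentOutOfRangeException(nameof(value));
            position = value;
        }
    }

    public char Peek(int offset = 0, char @default = '\0')
    {
        int index = position + offset;
        return index >= 0 && index < text.Length ? text[index] : @default;
    }

    public void Advance(int length = 1) =>
        position = Math.Min(position + length, text.Length);
}

// Hypothetical thin lexer: yields whitespace-separated words.
public struct WordLexer<TReader> where TReader : struct, ISourceReader
{
    // Deliberately not readonly: the reader is a mutable struct, and calls
    // on a readonly field would mutate a defensive copy instead.
    private TReader reader;

    public WordLexer(TReader reader) => this.reader = reader;

    // Returns the next word, or an empty string at the end of input.
    public string Next()
    {
        while (!reader.IsEnd && char.IsWhiteSpace(reader.Peek()))
            reader.Advance();
        var word = new StringBuilder();
        while (!reader.IsEnd && !char.IsWhiteSpace(reader.Peek()))
        {
            word.Append(reader.Peek());
            reader.Advance();
        }
        return word.ToString();
    }
}
```

One thing to watch with struct readers is copying: passing the reader or the lexer by value duplicates its position, so each copy advances independently.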
