A generic and efficient way to feed in text for a lexer or scanner
In the past I was lazy and always just shoved a string into my lexical analyzer as input, but what's the "ideal" way to read source text? It should be relatively efficient as the lexer observes each character. It should also support peeking forward without consumption. The thing is, source code can come from many different places, like a file, a console REPL as lines, edit diffs from a language client, ...
I'd like some way to read tokens from essentially any source without paying for virtual calls for each character read or something.
12 Replies
Maybe I'd need a view-type like a ReadOnlyMemory or even ReadOnlySpan?
Thing is, we have so many options: string, IEnumerable<char>, ReadOnlyMemory<char>, ReadOnlySpan<char>, TextReader, IO pipelines, ...a stream with an internal buffer?
Something I am doing is making use of a TextReader; it reads a stream of data going forwards only. There is a small problem with it though: sometimes when you call Peek() on it, it will act as if there is nothing left to look for, even though if you were to call Read() it would return the next character. What I did was create a simple wrapper around it with a small internal buffer it can use to peek into.

A stream + an internal buffer doesn't sound bad, but in practice it's a pain to work with streams IIRC
So far it was pretty painless; you just need to know what you are doing.
TextReader is a whole different topic, I'm talking about Stream here. Readers are a way friendlier way to tackle this problem.

I did make some extra functions to make it a little bit easier to use, for example peeking for a specific character and moving the reader forwards if it matches.
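Roughly like this, just a sketch; PeekingReader and the method names are made up here, not my exact code:

```cs
using System.Collections.Generic;
using System.IO;

// Sketch of a peek buffer over a forward-only TextReader (placeholder names).
public sealed class PeekingReader
{
    private readonly TextReader _reader;
    private readonly List<char> _lookahead = new List<char>();

    public PeekingReader(TextReader reader) => _reader = reader;

    // Look ahead 'offset' characters (0 = the next one) without consuming anything.
    public bool TryPeek(int offset, out char ch)
    {
        // Pull characters into the buffer until the peek can be satisfied.
        while (_lookahead.Count <= offset)
        {
            int next = _reader.Read();
            if (next < 0) { ch = default; return false; }
            _lookahead.Add((char)next);
        }
        ch = _lookahead[offset];
        return true;
    }

    // Consume the next character, taking it from the buffer if one was peeked.
    public bool TryRead(out char ch)
    {
        if (!TryPeek(0, out ch)) return false;
        _lookahead.RemoveAt(0);
        return true;
    }

    // The "peek for a specific character and advance if it matches" helper.
    public bool TryMatch(char expected)
    {
        if (!TryPeek(0, out char ch) || ch != expected) return false;
        _lookahead.RemoveAt(0);
        return true;
    }
}
```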
Sure, you just preallocate a buffer and you got all those functions
I'm still not sure if it's the ideal thing to use but I might not have anything better
Is there any reason you could not use a reader?
Oh absolutely, I'm planning to have non-sequential reads and I'd be happy if I didn't have to reimplement my lexer for that
Thing is, I'd want to extend my lexer to be optionally incremental, and then I can't afford re-reading the entire source
A Stream can at least seek in this case.

What I would do is, instead of going over the source file again, go over it once, tokenize it, and structure it into a tree immediately. Then if you want to do multiple passes you can go to the tree instead and transform it.
That has no relation to the problem, yes streaming lexers are a thing
For now I've decided to roll with the following interface:
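Something along these lines, as a sketch; ISourceReader and the member names are placeholders, not necessarily the final shape:

```cs
// Sketch only: placeholder names.
public interface ISourceReader
{
    // Current character offset; the setter throws on sources that can't seek.
    int Position { get; set; }

    // Look ahead 'offset' characters without consuming anything; false past the end.
    bool TryPeek(int offset, out char ch);

    // Consume and return the next character; false past the end.
    bool TryRead(out char ch);
}
```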
And I'll make my lexer take a generic parameter, and wrap my source readers as structs so I can guarantee the compiler will inline each type, no vcalls
Setting the position will just throw on sources that can't support it; I think that's reasonable, as some configurations just can't allow for that
My lexer will be a very thin type with a source reader and a single Next() method anyway, then I'll wrap that for different configurations, like incremental/non-incremental and streaming/non-streaming
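To sketch what I mean by no vcalls (Lexer, StringSource, and TokenKind are made-up names, and this assumes the ISourceReader shape above): since the reader is a struct and the lexer is generic over it, the JIT specializes the lexer per reader type, so the per-character calls are direct rather than virtual.

```cs
// Sketch: placeholder names, not the actual types from this project.
public enum TokenKind { EndOfInput, Identifier, Number, Unknown }

// A struct reader over an in-memory string.
public struct StringSource : ISourceReader
{
    private readonly string _text;
    private int _pos;

    public StringSource(string text) { _text = text; _pos = 0; }

    // An in-memory source can support repositioning.
    public int Position { get => _pos; set => _pos = value; }

    public bool TryPeek(int offset, out char ch)
    {
        int i = _pos + offset;
        if (i < _text.Length) { ch = _text[i]; return true; }
        ch = default; return false;
    }

    public bool TryRead(out char ch)
    {
        if (_pos < _text.Length) { ch = _text[_pos++]; return true; }
        ch = default; return false;
    }
}

// The thin lexer: generic over the reader. Because StringSource is a struct,
// the JIT specializes Lexer<StringSource> and the reader calls are direct,
// not virtual, and can be inlined.
public struct Lexer<TReader> where TReader : ISourceReader
{
    private TReader _reader;

    public Lexer(TReader reader) => _reader = reader;

    // Stand-in token rule: classify a single character. Returns only the kind
    // to keep the sketch short; a real Next() would produce a full token.
    public TokenKind Next()
    {
        if (!_reader.TryRead(out char ch)) return TokenKind.EndOfInput;
        return char.IsDigit(ch) ? TokenKind.Number
             : char.IsLetter(ch) ? TokenKind.Identifier
             : TokenKind.Unknown;
    }
}
```

Usage would be something like new Lexer<StringSource>(new StringSource(source)); a forward-only source (like the TextReader wrapper further up) would just be another struct implementing the same interface, with the Position setter throwing.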