How do I get element innerHTML using HTMLRewriter?

Here's my code:

    const rewriter = new HTMLRewriter().on('#element', {
        element(element) {
            // Somehow get element innerHTML
        }
    }).transform(new Response(html))

How do I get the element's innerHTML?

1 Reply

James•10mo ago

HTMLRewriter is a streaming parser, so there's no guarantee that you'll have the element's full contents the first time your handler runs. You would have to do something like run the rewriter on *, check tagName, note the point at which your target element starts, and then keep track of your own tree of nodes as new elements come in. By combining this with onEndTag you could probably build a pretty accurate representation of the contents. If you just want the text inside an element, you can do something like this (pseudo):

    class elementHandler {
        element(element) {
            this.buffer = ''; // initialise the text buffer for this element
        }
        text(text) {
            this.buffer += text.text; // append the new chunk to the buffer
            if (text.lastInTextNode) {
                // This is the last chunk of the text node, so the buffer now holds
                // the node's full text. Search and replace, then emit the result.
                text.replace(this.buffer.replace(/cat/g, 'dog'), { html: true });
                this.buffer = '';
            } else {
                // This wasn't the last chunk, and we don't know whether it will
                // participate in a match. Remove it so the client doesn't see it.
                text.remove();
            }
        }
    }

But if you're looking to parse or scrape HTML from specific elements, HTMLRewriter probably isn't the best tool for the job. A more traditional parser such as cheerio will work better once the document is loaded into memory.
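
If you only need to read the text inside the element rather than rewrite it, the buffering idea above can be combined with onEndTag. The following is a minimal sketch, not a full innerHTML: it collects text chunks only and ignores nested tags. TextCollector, onComplete and collectText are hypothetical names made up for this example; only element.onEndTag, text.text, on() and transform() come from HTMLRewriter itself. It also assumes the matched element has a closing tag, since otherwise the end tag handler would not fire.

```javascript
// Hypothetical helper: buffers the text chunks streamed for one matched element
// and hands the full string to `onComplete` once the closing tag is seen.
class TextCollector {
    constructor(onComplete) {
        this.onComplete = onComplete; // callback supplied by the caller (assumption)
        this.buffer = '';
    }

    element(element) {
        this.buffer = ''; // start fresh for each matched element
        // The end tag handler runs after all of the element's content has streamed by.
        element.onEndTag(() => {
            this.onComplete(this.buffer);
        });
    }

    text(text) {
        // A text node may arrive split across several chunks, so keep appending.
        this.buffer += text.text;
    }
}

// Hypothetical wrapper for runtimes that provide HTMLRewriter (e.g. Cloudflare Workers).
async function collectText(html, selector) {
    let result = null;
    const collector = new TextCollector((text) => { result = text; });
    const response = new HTMLRewriter().on(selector, collector).transform(new Response(html));
    await response.text(); // the handlers only run while the body is being consumed
    return result; // stays null if the selector never matched
}
```

Something like `await collectText(html, '#element')` would then resolve to the collected text once the whole response has been read. For anything closer to real innerHTML (nested tags, attributes), an in-memory parser is still the simpler route.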
