How do I parse this using python and langchain's WebBaseLoader?

Hi, I'm trying to do the following: create an or condition for a SoupStrainer, passed via bs_kwargs to WebBaseLoader, so that it matches either an html tag or a class name on any html tag. So far I have only found a way to make an and condition, which searches for html tags that also have the class name. My code looks like this:
current_url = 'https://example.org'

# Create a SoupStrainer for <main> tags
main_tag_strainer = SoupStrainer(name="main")

# Create a SoupStrainer for elements with the class "entry-content"
class_strainer = SoupStrainer(class_="entry-content")

combined_strainer = main_tag_strainer | class_strainer

loader = WebBaseLoader(
    web_paths=[current_url],
    bs_kwargs=dict(
        parse_only=combined_strainer
    )
)
However, this raises the following error:
TypeError: unsupported operand type(s) for |: 'SoupStrainer' and 'SoupStrainer'
I'm looking to combine two SoupStrainers into one without creating an and condition. I'm not quite sure whether this is possible.
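For what it's worth, SoupStrainer has no | operator, but bs4 does accept a function as the first (name) argument; if I'm reading the bs4 source right, during parsing that function is called with the tag's name and its raw attribute dict, which makes an or condition possible. A minimal sketch under that assumption (the function name is my own, not from the thread):

```python
from bs4 import BeautifulSoup, SoupStrainer

def main_or_entry_content(tag_name, attrs):
    # Called once per tag during parsing with the tag name and its attribute
    # dict; return True to keep the tag (and everything inside it).
    classes = attrs.get("class", "") or ""
    if not isinstance(classes, str):   # some builders may pre-split the value
        classes = " ".join(classes)
    return tag_name == "main" or "entry-content" in classes.split()

strainer = SoupStrainer(main_or_entry_content)

html = (
    '<main><p>keep me</p></main>'
    '<div class="entry-content wide">also keep</div>'
    '<div class="sidebar">drop</div>'
)
soup = BeautifulSoup(html, "html.parser", parse_only=strainer)
```

The same strainer could then go into bs_kwargs=dict(parse_only=strainer) for WebBaseLoader, so the page is fetched only once.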
Simbaclaws (OP) · 6d ago
It's kind of ridiculous that you can't find docs about SoupStrainer in the bs4 documentation, though. It's literally the library that implemented the class. (screenshot attached)
Simbaclaws (OP) · 6d ago
I have looked at the implementation, but I don't yet fully understand how to do it with an or condition. This is the implementation in bs4:
Simbaclaws (OP) · 6d ago
I know I need name as well as class_, but providing both just searches for <element class="classname">, where I provided element and classname. The langchain documentation for it here: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html only shows this:
bs_kwargs (Dict[str, Any] | None) – kwargs for beatifulsoup4 web page parsing
which isn't very descriptive or helpful either.
mukomo · 6d ago
Damn, if only you could've used a lambda in there instead
Simbaclaws (OP) · 6d ago
I'd have to use an html parser with a lambda then, but that would still be better than only being able to provide a beautifulsoup4 class. I don't know whether any other classes can be used; there are literally no examples or information on what I can pass in. I guess I'll look at the bs_kwargs implementation. Hmm... it seems the implementation does this:
return BeautifulSoup(html_doc.text, parser, **(bs_kwargs or {}))
I'll check the BeautifulSoup docs. Hmm, it looks like I can pass in a list of arguments if I'm not mistaken, let me check. Nope, doesn't work.
mukomo · 6d ago
I wonder if you might be able to convert the strainers to a list or dict and combine them that way if they're not duplicate elements
Simbaclaws (OP) · 6d ago
I'll give that a try; perhaps you can input a list with multiple strainers. Thank you
parser = [SoupStrainer(name="main"), SoupStrainer(class_="entry-content")]
loader = WebBaseLoader(
    web_paths=[current_url],
    bs_kwargs=dict(
        parse_only=parser
    )
)
This fails with the following message: 'list' object has no attribute 'text'. I'm kind of confused about what it means.
mukomo · 6d ago
It's assuming whatever you pass in to the parse_only has a .text to it
Simbaclaws (OP) · 6d ago
Which is weird, because putting in a single SoupStrainer with either of the values does seem to have a .text value. So I assume the SoupStrainer class has a .text property it can use, but passing in two of them means it's a list of two objects that each have .text. I need to figure out a way to use both somehow. Sorry, I've only been using Python for a couple of months so far. I could create a class that extends SoupStrainer, perhaps.
mukomo · 6d ago
Is there a way to pull the items out of the strainer? Or is that what the parser does?
Simbaclaws (OP) · 6d ago
I think that's exactly what the parser does, it uses the parser to parse the text that's provided from the WebBaseLoader
mukomo · 6d ago
Then why not run the parser twice for now? Once for each strainer?
Simbaclaws (OP) · 6d ago
because there is a rate limit applied to the site that I'm scraping, so if I run it twice, it means double the timeouts
mukomo · 6d ago
Ah gotcha
Simbaclaws (OP) · 6d ago
maybe I can supply a lambda that goes over the parsers? since it accepts Any?
Solution
Simbaclaws · 6d ago
perhaps I could try this:
class CustomSoupStrainer(SoupStrainer):
    def _matches(self, markup_name, d=None, markup_class=None):
        # Check for 'main' tag or 'entry-content' class
        return (markup_name == "main") or ("entry-content" in (d.get("class", []) if d else []))
Simbaclaws (OP) · 6d ago
that worked I think 😄
Simbaclaws (OP) · 6d ago
Oops, that seems to just scrape all of the content; I might've implemented it incorrectly. Can I "unmark" it?
Simbaclaws (OP) · 6d ago
I have the first part figured out:
SoupStrainer('div', {'class': lambda L: 'entry-content' in L.split()}),
This causes it to search for a div element with the class entry-content. SoupStrainer's plain class match doesn't work when a div element has multiple classes, so the lambda looks for entry-content among all of the classes. This works for my first criterion. I'm looking through Stack Overflow, which says the following: https://stackoverflow.com/questions/27713802/can-soupstrainer-have-two-arguments You can apparently give a list of different criteria. However, doing the following:
SoupStrainer(['main', ['div', {'class': lambda L: 'entry-content' in L.split()}]]),
doesn't work.
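If I read that Stack Overflow answer right, the list form only ORs tag names; an attribute filter can't be nested inside the list, which would explain why the line above fails. A sketch of what does seem to be supported, with the class lambda guarded so tags that have no class attribute (where the lambda receives None) don't crash (variable names are mine):

```python
from bs4 import BeautifulSoup, SoupStrainer

# A list as the first argument ORs tag *names* only, e.g. <main> OR <article>:
names_only = SoupStrainer(["main", "article"])

# An attribute condition has to live in its own strainer; guard against
# tags with no class attribute at all:
entry_content = SoupStrainer(
    "div", class_=lambda c: bool(c) and "entry-content" in c.split()
)

soup = BeautifulSoup(
    '<div class="entry-content wide">x</div><div class="sidebar">y</div><div>z</div>',
    "html.parser",
    parse_only=entry_content,
)
```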
Simbaclaws (OP) · 6d ago
I think I'll try a different method of turning this into a langchain Document, using an html parsing library and requests. I gave up on using SoupStrainer inside WebBaseLoader and went with parsing the page in Beautiful Soup from an html request instead; I can then just build the langchain Document object myself. I went ahead and turned it into:
parser = BeautifulSoup(content, 'html.parser')
docContent = ''
main = parser.find('main')

if main:
    # Extract text from the <main> element
    main_text = main.get_text()

    # Check for a nested <article> within <main>
    article_in_main = main.find('article')
    if article_in_main:
        # If an <article> is found inside <main>, use its content instead of <main>'s
        main_text = article_in_main.get_text()

    docContent += main_text

# Check for a standalone <article> outside of <main>
if not main or (main and not main.find('article')):
    article = parser.find('article')
    if article:
        docContent += article.get_text()

# Check for 'entry-content'
entry_content = parser.find('div', {'class': lambda L: 'entry-content' in L.split()})
if entry_content:
    docContent += entry_content.get_text()

if docContent == '':
    docContent = parser.get_text()  # Fall back to extracting text from the entire document
    print(f"Following URL fell back to full-page content parsing (needs manual review): {current_url}")

doc = Document(id=index, metadata={"source": current_url}, page_content=docContent, type="Document")
write_to_file(doc, data_directory)
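For future readers, the priority order in that script (article inside main, then main, then a standalone article, then entry-content, then the whole page) can be wrapped in a small helper so it's testable against inline HTML without hitting the rate-limited site. The function name is my own invention, not part of the original script, and the class lambda is guarded against tags with no class attribute:

```python
from bs4 import BeautifulSoup

def extract_main_text(html: str) -> str:
    """Pull page text using the same priority order as the script above."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []

    main = soup.find("main")
    if main:
        # An <article> nested inside <main> wins over <main> itself
        article_in_main = main.find("article")
        parts.append((article_in_main or main).get_text())

    # Standalone <article> outside of <main>
    if not main or not main.find("article"):
        article = soup.find("article")
        if article:
            parts.append(article.get_text())

    # 'entry-content' div; bool(c) guards tags with no class attribute
    entry = soup.find("div", class_=lambda c: bool(c) and "entry-content" in c.split())
    if entry:
        parts.append(entry.get_text())

    # Fall back to the whole document's text if nothing matched
    return "".join(parts) or soup.get_text()
```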
