Simbaclaws Comments - Answer Overflow

Simbaclaws

•Created by Simbaclaws on 1/3/2025 in #questions

How do I parse this using python and langchain's WebBaseLoader?

I went ahead and turned it into:

# Assumed imports; content, index, current_url, data_directory and write_to_file come from the surrounding scraping loop
from bs4 import BeautifulSoup
from langchain_core.documents import Document

parser = BeautifulSoup(content, 'html.parser')
docContent = ''
main = parser.find('main')

if main:
    # Extract text from the <main> element
    main_text = main.get_text()

    # Check for nested <article> within <main>
    article_in_main = main.find('article')
    if article_in_main:
        # If an <article> is found inside <main>, use its content instead of <main>'s
        main_text = article_in_main.get_text()

    docContent += main_text

# Check for standalone <article> outside of <main>
if not main or not main.find('article'):
    article = parser.find('article')
    if article:
        docContent += article.get_text()

# Check for 'entry-content'; the guard skips <div> elements that have no class attribute
entry_content = parser.find('div', {'class': lambda L: L is not None and 'entry-content' in L.split()})
if entry_content:
    docContent += entry_content.get_text()

if docContent == '':
    docContent = parser.get_text()  # Fall back to extracting text from the entire document
    print(f"Following URL fell back to full-document parsing (needs manual review): {current_url}")

doc = Document(id=index, metadata={"source": current_url}, page_content=docContent, type="Document")
write_to_file(doc, data_directory)
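
One thing to watch with this approach: when the entry-content div sits inside the article or main that was already picked, its text ends up in the result twice. Below is a minimal sketch of one way to avoid that, not the code from the post. The helper names `is_entry_content` and `extract_text` are made up for illustration, and unlike the original it always considers the first article in the document.

```python
from bs4 import BeautifulSoup


def is_entry_content(tag):
    """Match <div> elements whose class list includes 'entry-content'."""
    return tag.name == "div" and "entry-content" in (tag.get("class") or [])


def extract_text(html):
    """Collect text from <main>, <article> and entry-content without repeats."""
    soup = BeautifulSoup(html, "html.parser")
    picked = []

    main = soup.find("main")
    if main:
        # Prefer a nested <article> over the whole <main>, as in the post.
        picked.append(main.find("article") or main)

    for element in (soup.find("article"), soup.find(is_entry_content)):
        if element is None:
            continue
        # Skip elements that are already picked or sit inside a picked one.
        covered = any(
            element is chosen or any(parent is chosen for parent in element.parents)
            for chosen in picked
        )
        if not covered:
            picked.append(element)

    if not picked:
        return soup.get_text()  # fall back to the whole document
    return "".join(element.get_text() for element in picked)
```

The identity checks (`is`) are deliberate: Beautiful Soup compares tags by content with `==`, so two different elements with the same markup would otherwise count as the same one.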

42 replies

TTCTheo's Typesafe Cult

I can then just build the Document object with langchain myself instead

I gave up on using SoupStrainer inside WebBaseLoader. I went with fetching the HTML with a request and parsing it in Beautiful Soup instead

I think I'll try a different method of turning this into a langchain document, using a custom HTML parsing library and requests

I have the first part figured out:

SoupStrainer('div', {'class': lambda L: L is not None and 'entry-content' in L.split()})

This makes it match a div element whose class list includes entry-content. A plain class match in SoupStrainer doesn't work when the div has multiple classes, so the lambda splits the class attribute and looks for entry-content among all of the classes. This works for my first criterion. For the second one I'm looking through Stack Overflow, which says the following: https://stackoverflow.com/questions/27713802/can-soupstrainer-have-two-arguments You can apparently give a list of different criteria. However, doing the following:

SoupStrainer(['main', ['div', {'class': lambda L: 'entry-content' in L.split()}]]),

doesn't work
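
As far as I can tell, a list passed as the name is treated as a list of alternative tag names, and the name and attribute filters both have to match, so a single strainer can't say "main, or a div with entry-content". A minimal sketch of one alternative, assuming you parse the whole page first and select afterwards with a CSS selector (the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = (
    "<nav>Menu</nav>"
    "<main><p>Main text</p></main>"
    '<div class="post entry-content">Entry text</div>'
    '<div class="sidebar">Side</div>'
)

soup = BeautifulSoup(html, "html.parser")

# "main, div.entry-content" reads as: <main> OR <div> with class entry-content.
matches = soup.select("main, div.entry-content")
texts = [element.get_text() for element in matches]
```

The class selector matches a div that carries other classes too. If an entry-content div is nested inside main, both elements match and the shared text shows up twice, so some de-duplication may be needed.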

can I "unmark" it?

oops, that seems to just scrape all of the content, I might've implemented that incorrectly

that worked I think 😄

perhaps I could try this:

class CustomSoupStrainer(SoupStrainer):
    def _matches(self, markup_name, d=None, markup_class=None):
        # Check for 'main' tag or 'entry-content' class
        return (markup_name == "main") or ("entry-content" in (d.get("class", []) if d else []))

since it accepts Any?

maybe I can supply a lambda that goes over the parsers?

because there is a rate limit applied to the site that I'm scraping, so if I run it twice, it means double the timeouts

I think that's exactly what the parser does: it parses the text that the WebBaseLoader provides

I could create a class that extends from SoupStrainer perhaps

sorry I've only been using python for a couple of months so far

which is weird, because a single SoupStrainer with either of the values does seem to have a .text value. So I assume the SoupStrainer class has a .text property it can use, but passing in 2 of them means it gets a list of 2 objects that each have .text. I need to figure out a way to use both somehow

kind of confused what it means

parser = [SoupStrainer(name="main"), SoupStrainer(class_="entry-content")]
loader = WebBaseLoader(
    web_paths=[current_url],
    bs_kwargs=dict(
        parse_only=parser
    )
)

this fails with the following message: 'list' object has no attribute 'text'
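
The likely cause: bs_kwargs is handed on to BeautifulSoup, and parse_only there expects a single SoupStrainer rather than a list, which would explain why a .text lookup lands on the list. A minimal sketch of one workaround, assuming the HTML has already been fetched once (so the rate limit is only hit once) and is then parsed once per strainer. The helper names `has_entry_content` and `strained_text` and the sample markup are made up for illustration:

```python
from bs4 import BeautifulSoup, SoupStrainer


def has_entry_content(value):
    """Return True if a class attribute value contains 'entry-content'.

    Depending on the bs4 version the value may be None, a string, or a list.
    """
    if not value:
        return False
    classes = value.split() if isinstance(value, str) else value
    return "entry-content" in classes


def strained_text(html, strainers):
    """Parse the same HTML once per strainer and join the extracted text."""
    parts = []
    for strainer in strainers:
        soup = BeautifulSoup(html, "html.parser", parse_only=strainer)
        parts.append(soup.get_text())
    return "".join(parts)


# Hypothetical page; in the thread the HTML would come from a single request.
html = (
    "<html><body><nav>Menu</nav>"
    "<main><p>Main text</p></main>"
    '<div class="post entry-content"><p>Entry text</p></div>'
    "</body></html>"
)
strainers = [
    SoupStrainer("main"),
    SoupStrainer("div", attrs={"class": has_entry_content}),
]
text = strained_text(html, strainers)
```

Parsing a string twice is cheap compared to a second request, so this keeps the number of fetches at one per URL.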
