Simbaclaws
TTCTheo's Typesafe Cult
Created by Simbaclaws on 1/3/2025 in #questions
How do I parse this using python and langchain's WebBaseLoader?
I went ahead and turned it into:
from bs4 import BeautifulSoup
from langchain_core.documents import Document  # import path assumed; adjust to your langchain version

parser = BeautifulSoup(content, 'html.parser')
docContent = ''
main = parser.find('main')

if main:
    # Extract text from the <main> element
    main_text = main.get_text()

    # Check for nested <article> within <main>
    article_in_main = main.find('article')
    if article_in_main:
        # If an <article> is found inside <main>, use its content instead of <main>'s
        main_text = article_in_main.get_text()

    docContent += main_text

# Check for standalone <article> outside of <main>
if not main or not main.find('article'):
    article = parser.find('article')
    if article:
        docContent += article.get_text()

# Check for 'entry-content'; the "L and" guard skips divs without a class attribute (bs4 passes None)
entry_content = parser.find('div', {'class': lambda L: L and 'entry-content' in L.split()})
if entry_content:
    docContent += entry_content.get_text()

if docContent == '':
    docContent = parser.get_text()  # Fallback to extracting text from the entire document
    print(f"Following URL fell back to full-document parsing (needs manual review): {current_url}")

doc = Document(id=index, metadata={"source": current_url}, page_content=docContent, type="Document")
write_to_file(doc, data_directory)
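A more compact variant of the same idea, as a rough sketch rather than a drop-in replacement: instead of concatenating every match, it tries a list of CSS selectors in priority order and stops at the first hit. The names `CONTENT_SELECTORS` and `extract_content` are made up for illustration, and the selector order is an assumption about which container should win.

```python
from bs4 import BeautifulSoup

# Hypothetical priority list: the first selector that matches wins.
CONTENT_SELECTORS = ["main article", "main", "article", "div.entry-content"]


def extract_content(html: str) -> tuple[str, bool]:
    """Return (text, used_fallback) for one HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in CONTENT_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            return node.get_text(), False
    # Nothing matched: fall back to the whole document and flag it for review.
    return soup.get_text(), True
```

The boolean lets the caller log URLs that need manual review, the same way the print statement above does. Note that `div.entry-content` matches a div that carries other classes as well.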
42 replies
I can then just build the Document object with langchain myself instead
I gave up on using SoupStrainer inside WebBaseLoader and went with parsing in Beautiful Soup from a plain HTTP request instead
I think I'll try a different method of turning this into a langchain document: a custom HTML parsing library plus requests
I have the first part figured out:
SoupStrainer('div', {'class': lambda L: 'entry-content' in L.split()}),
This makes it look for a div element whose classes include entry-content. A plain class match in SoupStrainer doesn't work when the div has multiple classes, so the lambda checks for entry-content among all of the classes. This works for my first criterion. I'm now looking through Stack Overflow, which says the following: https://stackoverflow.com/questions/27713802/can-soupstrainer-have-two-arguments You can apparently give a list of different criteria. However, doing the following:
SoupStrainer(['main', ['div', {'class': lambda L: 'entry-content' in L.split()}]]),
doesn't work
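One way to express "either a main tag or a div with the entry-content class" without a strainer is to parse the whole page and pass a function to `find_all`. This is only a sketch of that alternative, not a fix for the nested-list strainer above; `is_content_tag` and `sample_html` are made-up names for the example.

```python
from bs4 import BeautifulSoup


def is_content_tag(tag) -> bool:
    """True for <main>, or for a <div> whose class list includes entry-content."""
    if tag.name == "main":
        return True
    # Tag.get returns the class attribute as a list of names, or the default when absent.
    return tag.name == "div" and "entry-content" in tag.get("class", [])


sample_html = """
<body>
  <main>Main text</main>
  <div class="post entry-content">Entry text</div>
  <div class="sidebar">Ignored</div>
</body>
"""

soup = BeautifulSoup(sample_html, "html.parser")
matches = soup.find_all(is_content_tag)
texts = [tag.get_text(strip=True) for tag in matches]
print(texts)  # ['Main text', 'Entry text']
```

If the entry-content div sits inside main, both tags match and the text shows up twice, so some de-duplication would be needed in that case.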
can I "unmark" it?
oops, that seems to just scrape all of the content, I might've implemented that incorrectly
that worked I think 😄
perhaps I could try this:
class CustomSoupStrainer(SoupStrainer):
    def _matches(self, markup_name, d=None, markup_class=None):
        # Check for 'main' tag or 'entry-content' class
        return (markup_name == "main") or ("entry-content" in (d.get("class", []) if d else []))
since it accepts Any?
maybe I can supply a lambda that goes over the parsers?
because there is a rate limit applied to the site that I'm scraping, so if I run it twice, it means double the timeouts
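To avoid hitting the rate limit twice, the page can be downloaded once and every query run against that single response. The sketch below assumes a caller-supplied `fetch` function; `extract_sections` and the dictionary keys are hypothetical names chosen for the example.

```python
from typing import Callable

from bs4 import BeautifulSoup


def extract_sections(url: str, fetch: Callable[[str], str]) -> dict[str, str]:
    """Download `url` once via `fetch`, then run every query on that one response."""
    html = fetch(url)  # the only request made for this URL
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    main = soup.find("main")
    if main is not None:
        sections["main"] = main.get_text(strip=True)
    # class_ matches a single class name even when the div carries several classes.
    entry = soup.find("div", class_="entry-content")
    if entry is not None:
        sections["entry-content"] = entry.get_text(strip=True)
    return sections
```

In practice `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`, assuming the requests library is what does the downloading.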
I think that's exactly what the parser does: it parses the text that WebBaseLoader provides
I could create a class that extends from SoupStrainer perhaps
sorry I've only been using python for a couple of months so far
which is weird, because a single SoupStrainer with either of the values does seem to have a .text value, so I assume the SoupStrainer class has a .text property it can use. But passing in 2 of them means it gets a list of 2 objects that each have .text, and the list itself doesn't. I need to figure out a way to use both somehow
kind of confused what it means
parser = [SoupStrainer(name="main"), SoupStrainer(class_="entry-content")]
loader = WebBaseLoader(
    web_paths=[current_url],
    bs_kwargs=dict(
        parse_only=parser
    )
)
this fails with the following message: 'list' object has no attribute 'text'
thank you
I'll give that a try, perhaps you can input a list with multiple strainers