Simbaclaws
Theo's Typesafe Cult (TTC)
Created by Simbaclaws on 1/3/2025 in #questions
How do I parse this using python and langchain's WebBaseLoader?
Hi, I'm trying to build an OR condition with SoupStrainer, passed as bs_kwargs to WebBaseLoader, so that it matches either a specific HTML tag or any tag with a given class name. So far I have only found a way to express an AND condition, which matches tags of a given name that also have that class. My code looks like this:
from bs4 import SoupStrainer
from langchain_community.document_loaders import WebBaseLoader

current_url = 'https://example.org'

# Create a SoupStrainer for <main> tags
main_tag_strainer = SoupStrainer(name="main")

# Create a SoupStrainer for elements with the class "entry-content"
class_strainer = SoupStrainer(class_="entry-content")

# This is the line that fails: SoupStrainer does not define the | operator
combined_strainer = main_tag_strainer | class_strainer

loader = WebBaseLoader(
    web_paths=[current_url],
    bs_kwargs=dict(
        parse_only=combined_strainer
    )
)
However, this fails at runtime with the following error:
unsupported operand type(s) for |: 'SoupStrainer' and 'SoupStrainer'
I'd like to combine two SoupStrainers into one that matches either condition, rather than requiring both. I'm not sure whether this is possible.
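For reference, here is a minimal sketch of the OR condition I'm after, written against plain BeautifulSoup rather than SoupStrainer. It parses the whole document first and then filters with a predicate passed to find_all, so it does not give the parse-time savings of parse_only, and it is not wired into WebBaseLoader. The names is_main_or_entry_content, extract_matching_text, and sample_html are made up for illustration.

```python
from bs4 import BeautifulSoup


def is_main_or_entry_content(tag):
    """Return True for <main> tags OR any tag carrying the class "entry-content"."""
    classes = tag.get("class") or []
    return tag.name == "main" or "entry-content" in classes


def extract_matching_text(html):
    """Parse the full document, then keep the text of the outermost matching tags."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(is_main_or_entry_content)
    # Drop matches nested inside another match so their text is not repeated.
    outermost = [
        m for m in matches
        if not any(is_main_or_entry_content(parent) for parent in m.parents)
    ]
    return [m.get_text(" ", strip=True) for m in outermost]


# Hypothetical sample page, only for demonstration.
sample_html = """
<html><body>
  <nav>Menu</nav>
  <main><p>Main text</p></main>
  <div class="entry-content"><p>Entry text</p></div>
  <footer>Footer</footer>
</body></html>
"""

print(extract_matching_text(sample_html))
```

The predicate is an ordinary Python function, so the OR is just the `or` in its return statement. Whether the same effect can be had at parse time through parse_only is exactly the part I'm unsure about.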