How do I parse this using python and langchain's WebBaseLoader?

Hi, I'm trying to do the following: create an or condition for a SoupStrainer, passed via bs_kwargs to WebBaseLoader, so that it matches either an html tag or a class name on any html tag. So far I have only found a way to make an and condition, which searches for html tags that also have the class name. My code looks like this:
current_url = 'https://example.org'

# Create a SoupStrainer for <main> tags
main_tag_strainer = SoupStrainer(name="main")

# Create a SoupStrainer for elements with the class "entry-content"
class_strainer = SoupStrainer(class_="entry-content")

combined_strainer = main_tag_strainer | class_strainer

loader = WebBaseLoader(
    web_paths=[current_url],
    bs_kwargs=dict(
        parse_only=combined_strainer
    )
)
However, this raises the following error:
TypeError: unsupported operand type(s) for |: 'SoupStrainer' and 'SoupStrainer'
I'm looking to combine two SoupStrainers into one without creating an and condition. I'm not quite sure whether this is possible.
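For what it's worth, SoupStrainer has no | operator, but bs4 does accept a function as the first (name) argument; if I'm reading the bs4 source right, during parsing that function is called with the tag's name and its raw attribute dict, which makes an or condition possible. A minimal sketch under that assumption (the function name is my own, not from the thread):

```python
from bs4 import BeautifulSoup, SoupStrainer

def main_or_entry_content(tag_name, attrs):
    # Called once per tag during parsing with the tag name and its attribute
    # dict; return True to keep the tag (and everything inside it).
    classes = attrs.get("class", "") or ""
    if not isinstance(classes, str):   # some builders may pre-split the value
        classes = " ".join(classes)
    return tag_name == "main" or "entry-content" in classes.split()

strainer = SoupStrainer(main_or_entry_content)

html = (
    '<main><p>keep me</p></main>'
    '<div class="entry-content wide">also keep</div>'
    '<div class="sidebar">drop</div>'
)
soup = BeautifulSoup(html, "html.parser", parse_only=strainer)
```

The same strainer could then go into bs_kwargs=dict(parse_only=strainer) for WebBaseLoader, so the page is fetched only once.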
Simbaclaws (OP) · 6d ago
It's kind of ridiculous that you can't find docs about SoupStrainer in the bs4 documentation, though. It's literally the library that implemented the class. (screenshot attached)
Simbaclaws (OP) · 6d ago
I have looked at the implementation, but I don't yet fully understand how to do it with an or condition. This is the implementation in bs4:
Simbaclaws (OP) · 6d ago
I know I need name as well as class_, but providing both just searches for <element class="classname">, where I provided element and classname. The langchain documentation for it here: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html only shows this:
bs_kwargs (Dict[str, Any] | None) – kwargs for beatifulsoup4 web page parsing
which isn't very descriptive or helpful either.
mukomo · 6d ago
Damn, if only you could've used a lambda in there instead
Simbaclaws (OP) · 6d ago
I'd have to use an html parser with a lambda then, but that would still be better than only being able to provide a beautifulsoup4 class. I don't know whether any other classes can be used; there are literally no examples or information on what I can pass in. I guess I'll look at the bs_kwargs implementation. Hmm... it seems the implementation does this:
return BeautifulSoup(html_doc.text, parser, **(bs_kwargs or {}))
I'll check the BeautifulSoup docs. Hmm, it looks like I can pass in a list of arguments if I'm not mistaken, let me check. Nope, doesn't work.
mukomo · 6d ago
I wonder if you might be able to convert the strainers to a list or dict and combine them that way if they're not duplicate elements
Simbaclaws (OP) · 6d ago
I'll give that a try; perhaps you can input a list with multiple strainers. Thank you
parser = [SoupStrainer(name="main"), SoupStrainer(class_="entry-content")]
loader = WebBaseLoader(
    web_paths=[current_url],
    bs_kwargs=dict(
        parse_only=parser
    )
)
This fails with the following message: 'list' object has no attribute 'text'. I'm kind of confused about what it means.
mukomo · 6d ago
It's assuming whatever you pass in to the parse_only has a .text to it
Simbaclaws (OP) · 6d ago
Which is weird, because putting in a single SoupStrainer with either of the values does seem to have a .text value. So I assume the SoupStrainer class has a .text property it can use, but passing in two of them means it's a list of two objects that each have .text. I need to figure out a way to use both somehow. Sorry, I've only been using Python for a couple of months so far. I could create a class that extends SoupStrainer, perhaps.
mukomo · 6d ago
Is there a way to pull the items out of the strainer? Or is that what the parser does?
Simbaclaws (OP) · 6d ago
I think that's exactly what the parser does, it uses the parser to parse the text that's provided from the WebBaseLoader
mukomo · 6d ago
Then why not run the parser twice for now? Once for each strainer?
Simbaclaws (OP) · 6d ago
because there is a rate limit applied to the site that I'm scraping, so if I run it twice, it means double the timeouts
mukomo · 6d ago
Ah gotcha
Simbaclaws (OP) · 6d ago
maybe I can supply a lambda that goes over the parsers? since it accepts Any?
Solution
Simbaclaws · 6d ago
perhaps I could try this:
class CustomSoupStrainer(SoupStrainer):
    def _matches(self, markup_name, d=None, markup_class=None):
        # Check for 'main' tag or 'entry-content' class
        return (markup_name == "main") or ("entry-content" in (d.get("class", []) if d else []))
Simbaclaws (OP) · 6d ago
that worked I think 😄
Simbaclaws (OP) · 6d ago
Oops, that seems to just scrape all of the content; I might've implemented it incorrectly. Can I "unmark" it?
Simbaclaws (OP) · 6d ago
I have the first part figured out:
SoupStrainer('div', {'class': lambda L: 'entry-content' in L.split()}),
This causes it to search for a div element with the class entry-content. SoupStrainer's plain class match doesn't work when a div element has multiple classes, so the lambda looks for entry-content among all of the classes. This works for my first criterion. I'm looking through Stack Overflow, which says the following: https://stackoverflow.com/questions/27713802/can-soupstrainer-have-two-arguments You can apparently give a list of different criteria. However, doing the following:
SoupStrainer(['main', ['div', {'class': lambda L: 'entry-content' in L.split()}]]),
doesn't work.
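If I read that Stack Overflow answer right, the list form only ORs tag names; an attribute filter can't be nested inside the list, which would explain why the line above fails. A sketch of what does seem to be supported, with the class lambda guarded so tags that have no class attribute (where the lambda receives None) don't crash (variable names are mine):

```python
from bs4 import BeautifulSoup, SoupStrainer

# A list as the first argument ORs tag *names* only, e.g. <main> OR <article>:
names_only = SoupStrainer(["main", "article"])

# An attribute condition has to live in its own strainer; guard against
# tags with no class attribute at all:
entry_content = SoupStrainer(
    "div", class_=lambda c: bool(c) and "entry-content" in c.split()
)

soup = BeautifulSoup(
    '<div class="entry-content wide">x</div><div class="sidebar">y</div><div>z</div>',
    "html.parser",
    parse_only=entry_content,
)
```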
Simbaclaws (OP) · 6d ago
I think I'll try a different method of turning this into a langchain Document, using an html parsing library and requests. I gave up on using SoupStrainer inside WebBaseLoader and went with parsing the page in Beautiful Soup from an html request instead; I can then just build the langchain Document object myself. I went ahead and turned it into:
parser = BeautifulSoup(content, 'html.parser')
docContent = ''
main = parser.find('main')

if main:
    # Extract text from the <main> element
    main_text = main.get_text()

    # Check for a nested <article> within <main>
    article_in_main = main.find('article')
    if article_in_main:
        # If an <article> is found inside <main>, use its content instead of <main>'s
        main_text = article_in_main.get_text()

    docContent += main_text

# Check for a standalone <article> outside of <main>
if not main or (main and not main.find('article')):
    article = parser.find('article')
    if article:
        docContent += article.get_text()

# Check for 'entry-content'
entry_content = parser.find('div', {'class': lambda L: 'entry-content' in L.split()})
if entry_content:
    docContent += entry_content.get_text()

if docContent == '':
    docContent = parser.get_text()  # Fall back to extracting text from the entire document
    print(f"Following URL fell back to full-page content parsing (needs manual review): {current_url}")

doc = Document(id=index, metadata={"source": current_url}, page_content=docContent, type="Document")
write_to_file(doc, data_directory)
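For future readers, the priority order in that script (article inside main, then main, then a standalone article, then entry-content, then the whole page) can be wrapped in a small helper so it's testable against inline HTML without hitting the rate-limited site. The function name is my own invention, not part of the original script, and the class lambda is guarded against tags with no class attribute:

```python
from bs4 import BeautifulSoup

def extract_main_text(html: str) -> str:
    """Pull page text using the same priority order as the script above."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []

    main = soup.find("main")
    if main:
        # An <article> nested inside <main> wins over <main> itself
        article_in_main = main.find("article")
        parts.append((article_in_main or main).get_text())

    # Standalone <article> outside of <main>
    if not main or not main.find("article"):
        article = soup.find("article")
        if article:
            parts.append(article.get_text())

    # 'entry-content' div; bool(c) guards tags with no class attribute
    entry = soup.find("div", class_=lambda c: bool(c) and "entry-content" in c.split())
    if entry:
        parts.append(entry.get_text())

    # Fall back to the whole document's text if nothing matched
    return "".join(parts) or soup.get_text()
```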
