How do I parse this using python and langchain's WebBaseLoader?
Hi, I'm trying to do the following:
Create an OR condition for the SoupStrainer passed in bs_kwargs to WebBaseLoader, so that it matches either a given HTML tag or a given class name on any HTML tag.
Currently I have only found a way to do an AND condition, where it matches the HTML tag only when it also has the class name.
My code looks like this:
However, this causes the following error when I run it:
I'm looking to combine two SoupStrainers into one without creating an AND condition.
I'm not quite sure if this is possible.
Solution:
perhaps I could try this:
```python
class CustomSoupStrainer(SoupStrainer):
    def _matches(self, markup_name, d=None, markup_class=None):
        # Check for 'main' tag or 'entry-content' class...
```
it's kind of ridiculous that you can't find docs about SoupStrainer in the bs4 documentation though....
It's literally the library that implemented the class....
https://cdn.discordapp.com/attachments/651676691722141706/1324822871159341078/image.png?ex=67798cf2&is=67783b72&hm=6fbe09d76f693c8510db65db152274a458dcc2efa5f1b66199cd373ce65ee667&
I have looked at the implementation, but I don't yet fully understand how to do it with an OR condition. This is the implementation in bs4:
I know I need: name as well as class_, but providing both would just search for: <element class="classname">
where I provided element and classname
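To illustrate the AND behaviour described above, here is a minimal sketch. The HTML snippet and the names `main`, `entry-content` and `sidebar` are made up for the example; only `BeautifulSoup`, `SoupStrainer` and `parse_only` come from bs4 itself.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: one <main> tag and two <div> tags with different classes.
html = """
<html><body>
  <main>Main text</main>
  <div class="entry-content">Entry text</div>
  <div class="sidebar">Sidebar text</div>
</body></html>
"""

# A tag name plus class_ means a tag must satisfy BOTH conditions to be kept.
strainer = SoupStrainer("div", class_="entry-content")
soup = BeautifulSoup(html, "html.parser", parse_only=strainer)

# List the tags that survived the strainer and the text they contain.
kept_tags = [tag.name for tag in soup.find_all(True)]
print(kept_tags)
print(soup.get_text(strip=True))
```

The `<main>` tag is dropped here even though it is one of the two things the question wants to keep, which is why a single name-plus-class strainer cannot express the OR.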
the langchain documentation for it here:
https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html
Only shows this:
Which isn't very descriptive or helpful either
Damn, if only you could've used a lambda in there instead
I'd have to use an HTML parser with a lambda then
but still, that would be better than only being able to provide a beautifulsoup4 class. I don't know whether any other classes can be used; there are literally no examples or information on what I can input
I guess I'll try to look at the bs_kwargs implementation
hmmm... Seems like the implementation does this:
I'll check the BeautifulSoup docs
hmmm it looks like I can pass in a list of arguments if I'm not mistaken, let me check
nope, doesn't work
I wonder if you might be able to convert the strainers to a list or dict and combine them that way if they're not duplicate elements
I'll give that a try, perhaps you can input a list with multiple strainers
thank you
this fails with the following message: 'list' object has no attribute 'text'
kind of confused what it means
It's assuming whatever you pass in to parse_only has a .text attribute
which is weird, because putting in a single SoupStrainer with either of the values does seem to have a .text value. So I assume the SoupStrainer class has a .text property it can use, but passing in 2 of them means it gets a list of 2 objects that each have .text, and the list itself doesn't. I need to figure out a way to use both somehow
sorry I've only been using python for a couple of months so far
I could create a class that extends from SoupStrainer perhaps
Is there a way to pull the items out of the strainer?
Or is that what the parser does?
I think that's exactly what the parser does, it uses the parser to parse the text that's provided from the WebBaseLoader
Then why not run the parser twice for now? Once for each strainer?
because there is a rate limit applied to the site that I'm scraping, so if I run it twice, it means double the timeouts
Ah gotcha
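For what it's worth, running two filters does not have to mean two requests. A rough sketch, assuming the page HTML has already been downloaded once into a string: parse it a single time and run both searches on the same soup. The function name `extract_both` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_both(html: str) -> tuple[list[str], list[str]]:
    """Parse the HTML once, then search the same soup twice."""
    soup = BeautifulSoup(html, "html.parser")
    # First criterion: every <main> tag.
    by_tag = [tag.get_text(strip=True) for tag in soup.find_all("main")]
    # Second criterion: any tag whose class list contains "entry-content".
    by_class = [
        tag.get_text(strip=True) for tag in soup.find_all(class_="entry-content")
    ]
    return by_tag, by_class


# Hypothetical sample; in practice this would be the body of one HTTP response.
sample = (
    "<body><main>Main text</main>"
    '<div class="entry-content wide">Entry text</div>'
    "<footer>Footer</footer></body>"
)
main_texts, entry_texts = extract_both(sample)
print(main_texts, entry_texts)
```

This sidesteps parse_only entirely, so it only helps if the HTML is fetched outside of WebBaseLoader.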
maybe I can supply a lambda that goes over the parsers?
since it accepts Any?
Solution
perhaps I could try this:
that worked I think 😄
oops, that seems to just scrape all of the content, I might've implemented that incorrectly
can I "unmark" it?
I have the first part figured out:
This makes it look for a div element with the class entry-content. A plain class name in SoupStrainer didn't match when the div element has multiple classes, so the lambda looks for entry-content among all of the classes.
This works for my first criterion, and I'm looking through Stack Overflow, which says the following:
https://stackoverflow.com/questions/27713802/can-soupstrainer-have-two-arguments
You can apparently give a list of different criteria, however... Doing the following:
doesn't work
I think I'll try a different method of turning this into a langchain document: using a custom HTML parsing library and requests
I gave up on using SoupStrainer inside WebBaseLoader. I went with fetching the page with an HTTP request and parsing it in Beautiful Soup instead
I can then just build the langchain Document object myself
I went ahead and turned it into:
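The original snippet is not preserved in the thread. As a rough sketch of that approach (not the poster's actual code), assuming the goal is "a `main` tag OR any tag with the class `entry-content`": the helper names `wanted`, `extract_text` and `load_document` are made up, and `requests` and `langchain_core` are assumed to be installed for the last step.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def wanted(tag: Tag) -> bool:
    """OR condition: a <main> tag, or any tag carrying the entry-content class."""
    return tag.name == "main" or "entry-content" in (tag.get("class") or [])


def extract_text(html: str) -> str:
    """Return the text of every matching element, skipping nested duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(wanted)
    # A match inside another match would repeat its text, so keep outermost only.
    outermost = [t for t in matches if not any(wanted(p) for p in t.parents)]
    return "\n\n".join(t.get_text(" ", strip=True) for t in outermost)


def load_document(url: str):
    """Fetch the page once and wrap the extracted text in a langchain Document."""
    # Imported here so the parsing helpers above work without these packages.
    import requests
    from langchain_core.documents import Document

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return Document(
        page_content=extract_text(response.text),
        metadata={"source": url},
    )
```

Because the predicate is an ordinary function passed to `find_all`, the OR lives in plain Python rather than in SoupStrainer arguments, and the page is still requested only once.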