C# 14mo ago
Natro

❔ Parsing inconsistent PDFs

Hey, I am trying to parse product offer catalogs from supermarkets and small shops. Example PDF: (https://letak.billa.sk/64070/1622928/pdfs/301ccba5-fc73-4da6-99b4-c09f5ca5c789.pdf?response-content-disposition=attachment%3B+filename%2A%3DUTF-8%27%27BILLA.sk%2520-%2520BILLA%2520letak%252019%2520Akciov%25C3%25A1%2520ponuka%2520plat%25C3%25AD%2520od%252010.%25205.%2520do%252016.%25205.%25202023..pdf) Sadly, when I parse a file like this there are a lot of inconsistencies: some text ends up in an unrelated position in my parsed string, and so on, and there is no pattern that I can find. Any other ideas on how to approach something like this? I was thinking of going for an OCR solution, but I am scared that I am biting off more than I can chew with the weird layouts that can happen. Added an image for those who are scared of that daunting link.
2 Replies
teauxfu 14mo ago
Yeah, that's one of the downsides of PDF: it's really hard to scrape text from a PDF unless it was specifically produced with the intention of supporting that. An OCR approach might be your best bet if you're expecting a high degree of variance. I know AWS and Azure both offer some document recognition services, but I imagine that gets expensive pretty fast.
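Whichever route is taken (a PDF library that exposes text positions, or OCR that returns bounding boxes), the misplaced-text symptom can often be reduced by rebuilding the reading order from coordinates instead of trusting the order in which the text comes out. Below is a minimal, language-agnostic sketch in Python; the same logic ports directly to C#. The `Fragment` type, the `rebuild_lines` function, and the `y_tolerance` value are all hypothetical names chosen for illustration, and the sketch assumes `y` grows downward from the top of the page, which is not true of every library.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text plus its position (hypothetical shape, not a real library type)."""
    text: str
    x: float
    y: float  # assumption: y grows downward, so the top of the page is y = 0


def rebuild_lines(fragments, y_tolerance=3.0):
    """Group fragments into visual lines, then order each line left to right."""
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: f.y):
        # Join the current line if the fragment is within y_tolerance of the
        # line's first fragment; otherwise start a new line.
        if lines and abs(frag.y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [
        " ".join(f.text for f in sorted(members, key=lambda f: f.x))
        for _, members in lines
    ]


# Fragments in a scrambled order, as text extraction might yield them.
sample = [
    Fragment("1.99", 120, 50.5),
    Fragment("Milk", 10, 50),
    Fragment("0.89", 120, 70),
    Fragment("Bread", 10, 70.8),
]
print(rebuild_lines(sample))
```

This only handles a single column. A catalog page with several product tiles side by side would likely need the fragments split into regions first (for example by large horizontal gaps) before the line grouping is applied to each region.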
Accord 14mo ago
Was this issue resolved? If so, run /close - otherwise I will mark this as stale and this post will be archived until there is new activity.