Dictionary translation
So I need to build an app that transfers a whole PDF dictionary into a database, and I've run into some problems. The problem is reading and parsing the data. I decided to use regex for that, but it is very hard to tune: sometimes I get good results for a few words, and a lot of the time it just breaks. Here is what needs to be transferred and how.
The example is in the photo.
This is an English-to-Serbian dictionary.
Does anyone have any idea how I can do this with minimal errors?
5 Replies
You could start by assuming bold text is English and non-bold text is the translation.
But this will break on inflected forms like kneel, knelt and so on, and also on plurals like knavery.
You could then try to do more passes detecting and handling as many edge cases as possible.
But somehow I don't think you will get around having to clean up your data manually in the end.
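As a rough sketch of that first pass, assuming the text has already been extracted as a sequence of (text, is_bold) spans (a hypothetical input shape, not something confirmed in this thread), the grouping could look like this:

```python
from typing import List, Tuple

Span = Tuple[str, bool]  # (text, is_bold); hypothetical input shape


def group_entries(spans: List[Span]) -> List[Tuple[str, str]]:
    """Pair each run of bold text with the non-bold text that follows it."""
    entries: List[Tuple[str, str]] = []
    headword: List[str] = []
    translation: List[str] = []
    for text, is_bold in spans:
        text = text.strip()
        if not text:
            continue
        if is_bold:
            # Bold text after some translation text starts a new entry.
            if translation:
                entries.append((" ".join(headword), " ".join(translation)))
                headword, translation = [], []
            headword.append(text)
        elif headword:
            translation.append(text)
        # Non-bold text before the first headword (e.g. a page header) is skipped.
    if headword:
        entries.append((" ".join(headword), " ".join(translation)))
    return entries


# Placeholder strings stand in for the real translations.
spans = [
    ("kneel,", True), ("knelt", True), ("<translation 1>", False),
    ("knavery", True), ("<translation 2>", False),
]
print(group_entries(spans))
```

Note that adjacent bold spans such as "kneel," and "knelt" end up merged into one headword. That is exactly the kind of edge case a later pass, or manual cleanup, would have to deal with.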
Can you give an example of the data you are reading from the PDF?
What is the structure of the table?
Original word, pronunciation, translation.
That's what I need to extract from this.
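For illustration, here is a minimal sketch of storing those three fields, assuming SQLite as the database (the thread does not say which DB is used). The table name, column names and the Entry class are all made up for the example:

```python
import sqlite3
from dataclasses import astuple, dataclass
from typing import Iterable


@dataclass
class Entry:
    word: str            # original (English) word
    pronunciation: str   # may be empty if not found
    translation: str     # Serbian translation


def save_entries(conn: sqlite3.Connection, entries: Iterable[Entry]) -> None:
    """Create the (hypothetical) entries table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "word TEXT NOT NULL, pronunciation TEXT, translation TEXT NOT NULL)"
    )
    # Parameterized inserts avoid quoting problems with apostrophes etc.
    conn.executemany(
        "INSERT INTO entries (word, pronunciation, translation) VALUES (?, ?, ?)",
        [astuple(e) for e in entries],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path for a real database
save_entries(conn, [Entry("kneel", "<pronunciation>", "<translation>")])
print(conn.execute("SELECT word FROM entries").fetchall())
```

Keeping the parsed entries in a small typed structure like this makes it easier to review or correct them by hand before they go into the database.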
Are you parsing raw text from the PDF or something more refined?
Yeah, if reading raw PDF data, one might be able to extract font weight. If using OCR, that might be more difficult.
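A hedged sketch of the font-weight idea: it assumes the PDF library returns a nested dict of blocks, lines and spans similar to what PyMuPDF's page.get_text("dict") produces, and that bit 4 of a span's "flags" marks bold. Both are assumptions to check against whatever library is actually used. The function itself only walks plain dicts:

```python
from typing import Any, Dict, List, Tuple

BOLD_FLAG = 1 << 4  # assumption: bit 4 of span["flags"] means bold


def spans_with_weight(page_dict: Dict[str, Any]) -> List[Tuple[str, bool]]:
    """Flatten a blocks/lines/spans dict into (text, is_bold) pairs."""
    result: List[Tuple[str, bool]] = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                flagged = bool(span.get("flags", 0) & BOLD_FLAG)
                # Fallback: many bold fonts have "Bold" in their name.
                named = "bold" in span.get("font", "").lower()
                result.append((span.get("text", ""), flagged or named))
    return result


# Hand-built sample in the assumed shape; real data would come from the PDF.
sample = {"blocks": [{"lines": [{"spans": [
    {"text": "kneel", "font": "Times-Bold", "flags": 16},
    {"text": "<translation>", "font": "Times-Roman", "flags": 0},
]}]}]}
print(spans_with_weight(sample))
```

With OCR output there is usually no such flag, so the bold/non-bold split would have to come from somewhere else.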
Seems to me like regular grammars are just barely able to express your format; maybe a context-free grammar would be better suited. I'd imagine using a tool like ANTLR for this task would be feasible.
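To give a feel for the grammar-based approach without pulling in ANTLR, here is a small hand-written tokenizer and parser in Python. The entry format is purely hypothetical, since the real layout is only visible in the photo: headwords separated by commas, then a pronunciation in square brackets, then senses separated by semicolons.

```python
import re
from typing import Dict, List

# Hypothetical grammar:
#   entry     := headwords PRON senses
#   headwords := words ("," words)*
#   senses    := text (";" text)*
TOKEN = re.compile(r"\s*(\[[^\]]*\]|[,;]|[^\s,;\[\]]+)")


def tokenize(line: str) -> List[str]:
    tokens, pos, line = [], 0, line.rstrip()
    while pos < len(line):
        match = TOKEN.match(line, pos)
        if match is None:  # e.g. an unbalanced bracket
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_entry(line: str) -> Dict[str, object]:
    tokens = tokenize(line)
    i, headwords, current = 0, [], []
    # headwords: everything before the bracketed pronunciation
    while i < len(tokens) and not tokens[i].startswith("["):
        if tokens[i] == ",":
            headwords.append(" ".join(current))
            current = []
        elif tokens[i] == ";":
            raise ValueError("';' is not allowed before the pronunciation")
        else:
            current.append(tokens[i])
        i += 1
    if i == len(tokens):
        raise ValueError("missing [pronunciation]")
    headwords.append(" ".join(current))
    pronunciation = tokens[i][1:-1]  # strip the brackets
    # senses: split the remaining tokens on ";"
    senses, current = [], []
    for tok in tokens[i + 1:]:
        if tok == ";":
            senses.append(" ".join(current).replace(" ,", ","))
            current = []
        else:
            current.append(tok)
    senses.append(" ".join(current).replace(" ,", ","))
    return {
        "headwords": [h for h in headwords if h],
        "pronunciation": pronunciation,
        "senses": [s for s in senses if s],
    }


print(parse_entry("kneel, knelt [<pron>] <sense one>, <variant>; <sense two>"))
```

The benefit over one large regex is that each rule is handled in one place and malformed lines raise an error instead of silently matching the wrong thing. A parser generator such as ANTLR would produce the equivalent code from a grammar file.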