Bulgarian OCR 1.0 is a desktop program that extracts and recognizes Cyrillic characters from a scanned document.
It processes a document in the following steps:
- read the document
- convert the image to a PGM image
- segment the characters using depth-first search with a filter. Search for separate parts of the same segment. Skip a segment if it is not proper (that is, if it is judged not to be a character).
- sort the segments by coordinates
- resize the segments using the nearest-neighbor algorithm
- recognize the segments using a previously trained artificial neural network
- compute the Levenshtein distance between each word in the document and a previously compiled list of words in a MySQL database. This is the method used to check for mistakes in the hypotheses of the neural network.
- output editable text in a .docx document.
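
Three of the steps above (segmentation, resizing and the dictionary check) can be sketched in plain Java. This is only an illustration under stated assumptions, not the code that ships with the program: the class name `OcrSketch`, the method names, the 4-connected neighbourhood and the `minPixels` size filter are all hypothetical, and the image is assumed to be already loaded into an array.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative sketch only; every name in this class is hypothetical. */
final class OcrSketch {

    /**
     * Finds connected groups of dark pixels with an iterative depth-first search.
     * Each kept segment is returned as a bounding box {minX, minY, maxX, maxY}.
     * Segments with fewer than minPixels pixels are skipped (an assumed filter).
     */
    static List<int[]> findSegments(boolean[][] dark, int minPixels) {
        int h = dark.length, w = dark[0].length;
        boolean[][] seen = new boolean[h][w];
        List<int[]> boxes = new ArrayList<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!dark[y][x] || seen[y][x]) continue;
                int[] box = {x, y, x, y};
                int count = 0;
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[] {x, y});
                seen[y][x] = true;
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    count++;
                    box[0] = Math.min(box[0], p[0]);
                    box[1] = Math.min(box[1], p[1]);
                    box[2] = Math.max(box[2], p[0]);
                    box[3] = Math.max(box[3], p[1]);
                    // Visit the 4-connected neighbours (an assumption; 8-connected also works).
                    int[][] next = {{p[0] + 1, p[1]}, {p[0] - 1, p[1]}, {p[0], p[1] + 1}, {p[0], p[1] - 1}};
                    for (int[] n : next) {
                        if (n[0] >= 0 && n[0] < w && n[1] >= 0 && n[1] < h
                                && dark[n[1]][n[0]] && !seen[n[1]][n[0]]) {
                            seen[n[1]][n[0]] = true;
                            stack.push(n);
                        }
                    }
                }
                if (count >= minPixels) boxes.add(box);
            }
        }
        return boxes;
    }

    /** Nearest-neighbor resize: each target pixel copies the source pixel it maps onto. */
    static int[][] resizeNearest(int[][] src, int newW, int newH) {
        int h = src.length, w = src[0].length;
        int[][] dst = new int[newH][newW];
        for (int y = 0; y < newH; y++) {
            int sy = y * h / newH;
            for (int x = 0; x < newW; x++) {
                dst[y][x] = src[sy][x * w / newW];
            }
        }
        return dst;
    }

    /** Levenshtein distance with two rolling rows (insert, delete, substitute all cost 1). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Compare a recognized word against a dictionary word.
        System.out.println(levenshtein("kitten", "sitting"));
    }
}
```

In a pipeline like the one described, a recognized word could be compared against each candidate from the word list, and the candidate with the smallest distance taken as the likely correction. How the program itself picks the candidate is not stated here, so treat that as one possible design.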
Programming languages and technologies used:
- JMathPlot (plot the error of the neural network)