Bulgarian OCR 1.0 is a desktop program that extracts and recognizes Cyrillic characters from a scanned document.

The process:

  1. read the document
  2. convert the image to a PGM image
  3. segment the characters using depth-first search with a filter. Search for separate parts of each segment. Skip a segment if it is not proper (that is, if the filter judges it unlikely to be a character).
  4. sort the segments by coordinates
  5. resize the segments using the nearest-neighbor algorithm
  6. recognize the segments using a previously trained artificial neural network
  7. compute the Levenshtein distance between each recognized word in the document and the entries of a previously compiled word list stored in a MySQL database. This is our method for catching mistakes in the hypotheses of the neural network.
  8. output editable text in a .docx document.
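
Step 2 can be illustrated with a minimal sketch. The class name PgmWriter, the binary "P5" variant of PGM, and the luma weights used for the grayscale conversion are assumptions made for illustration; the program's actual conversion may differ.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: encodes an image as a binary (P5) PGM byte array. */
final class PgmWriter {
    static byte[] toPgm(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // PGM header: magic number, width and height, maximum gray value.
        byte[] header = ("P5\n" + w + " " + h + "\n255\n")
                .getBytes(StandardCharsets.US_ASCII);
        out.write(header, 0, header.length);

        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Assumed weighting (ITU-R BT.601 luma); one gray byte per pixel.
                int gray = (299 * r + 587 * g + 114 * b) / 1000;
                out.write(gray);
            }
        }
        return out.toByteArray();
    }
}
```

The returned bytes can be written to a .pgm file as-is, or parsed back into a two-dimensional array of gray values for the later steps.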
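
The segmentation in step 3 might look roughly like the sketch below. It assumes the PGM image is already loaded as a two-dimensional array of gray values, treats pixels darker than a threshold as ink, and uses a minimum pixel count as the filter. The names SegmentFinder and Segment, the 8-connected neighborhood, and the filter rule are all hypothetical, and the search for separate parts of a segment is not covered here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of character segmentation by depth-first search. */
final class SegmentFinder {
    /** Bounding box and size of one connected group of dark pixels. */
    static final class Segment {
        int minX, minY, maxX, maxY, pixels;
        Segment(int x, int y) { minX = maxX = x; minY = maxY = y; }
        int width()  { return maxX - minX + 1; }
        int height() { return maxY - minY + 1; }
    }

    /** Returns every dark region that has at least minPixels pixels. */
    static List<Segment> findSegments(int[][] gray, int threshold, int minPixels) {
        int h = gray.length;
        int w = (h == 0) ? 0 : gray[0].length;
        boolean[][] seen = new boolean[h][w];
        List<Segment> result = new ArrayList<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (seen[y][x] || gray[y][x] >= threshold) continue;
                Segment s = flood(gray, seen, x, y, threshold);
                // Assumed filter: tiny regions are treated as noise and skipped.
                if (s.pixels >= minPixels) result.add(s);
            }
        }
        return result;
    }

    /** Iterative DFS with an explicit stack, so large regions cannot overflow the call stack. */
    private static Segment flood(int[][] gray, boolean[][] seen,
                                 int startX, int startY, int threshold) {
        int h = gray.length;
        int w = gray[0].length;
        Segment s = new Segment(startX, startY);
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[] {startX, startY});
        seen[startY][startX] = true;
        while (!stack.isEmpty()) {
            int[] p = stack.pop();
            int x = p[0];
            int y = p[1];
            s.pixels++;
            s.minX = Math.min(s.minX, x);
            s.maxX = Math.max(s.maxX, x);
            s.minY = Math.min(s.minY, y);
            s.maxY = Math.max(s.maxY, y);
            // Visit the 8 neighbors; the center pixel is already marked as seen.
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int nx = x + dx;
                    int ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                    if (seen[ny][nx] || gray[ny][nx] >= threshold) continue;
                    seen[ny][nx] = true;
                    stack.push(new int[] {nx, ny});
                }
            }
        }
        return s;
    }
}
```

Each returned bounding box gives the coordinates needed for the sorting in step 4 and the crop that is resized in step 5.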
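
For step 5, nearest-neighbor resizing maps each target pixel back to the closest source pixel and copies its value. The following is a minimal sketch, assuming a segment is stored as a two-dimensional array of gray values; the class name NearestNeighbor is hypothetical.

```java
/** Hypothetical sketch of nearest-neighbor scaling for a grayscale segment. */
final class NearestNeighbor {
    static int[][] resize(int[][] src, int dstW, int dstH) {
        if (src.length == 0 || src[0].length == 0 || dstW <= 0 || dstH <= 0) {
            throw new IllegalArgumentException("image and target size must be non-empty");
        }
        int srcH = src.length;
        int srcW = src[0].length;
        int[][] dst = new int[dstH][dstW];
        for (int y = 0; y < dstH; y++) {
            // Integer division picks the source row that covers this target row.
            int sy = y * srcH / dstH;
            for (int x = 0; x < dstW; x++) {
                int sx = x * srcW / dstW;
                dst[y][x] = src[sy][sx];
            }
        }
        return dst;
    }
}
```

Scaling every segment to one fixed size gives the neural network an input vector of constant length, whatever the size of the character in the scan.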
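
The check in step 7 can be sketched as follows. An in-memory list stands in for the MySQL word table, and the class name WordCorrector and the method closest are hypothetical; how the program actually queries the database and picks a correction is not shown here.

```java
import java.util.List;

/** Hypothetical sketch: pick the dictionary word closest to an OCR hypothesis. */
final class WordCorrector {
    /** Levenshtein distance computed with two rolling rows of the DP table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the first dictionary entry with the smallest distance, or null if the list is empty. */
    static String closest(String word, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            int d = levenshtein(word, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

For example, if the network reads "копка" and the word list contains "котка", the distance is 1 (one substituted letter), so "котка" becomes a strong candidate for the correction.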

Programming languages and technologies used:
  • Java
  • MySQL
  • Swing
  • JMathPlot (to plot the error of the neural network)