Entering the Fourth Dimension of OCR with Tesseract

by | Jan 27, 2019 | Voxxed Days Bucharest 2019

Hanno Embregts is a Java Developer, Speaker and Teacher at Info Support (the Netherlands). He has over 11 years of experience with both front- and back-end development, with a special interest in automating the software development process to the fullest. He likes his work best when it is fast-paced and versatile, which is why he juggles Java development, public speaking and teaching courses at Info Support’s Knowledge Centre. When Hanno doesn’t have access to any kind of computer – which can only be called the most desperate of times – he plays in a band as lead singer and guitar player. He is also a passionate fan of alternative rock band Switchfoot and Dutch football club Feyenoord. Last but not least: he has been told off repeatedly for using Star Wars quotes at work (and replying “I find your lack of faith disturbing” didn’t improve matters much).

Optical Character Recognition (OCR) has come a long way since the first image-scanning inventions in the early 1900s. Nowadays, accuracy rates of over 90% are easily achievable on high-quality text scans. Many OCR engines capable of reaching these rates exist today, one of which is Tesseract.

Tesseract has become quite popular amongst software developers because of its accuracy, its open-source status and its active development by Google. With the Tess4J JNA wrapper, it is easy to integrate into your Java project.

During this session, I will introduce Tesseract, its pros and cons, and how and when to use it. I will also demonstrate a Java application that uses Tesseract and Tess4J to process some of my favorite books from Google Books, so you’ll be able to assess its accuracy for yourself.
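
If you want to put a number on “accuracy” before the session, one common way is character accuracy: one minus the edit distance between the expected text and the OCR output, divided by the length of the expected text. The sketch below is only an illustration under that assumption; the class and method names (OcrAccuracy, editDistance, characterAccuracy) are hypothetical, it is not the demo application from the talk, and it does not call Tesseract or Tess4J itself. It simply compares two strings you supply.

```java
// Hypothetical helper for scoring OCR output against a known-good transcription.
// Assumption: accuracy = 1 - (Levenshtein distance / length of expected text).
final class OcrAccuracy {

    private OcrAccuracy() {}

    /** Levenshtein distance: minimum number of single-character edits from a to b. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Character accuracy in the range [0.0, 1.0]; clamped at 0 for very noisy output. */
    public static double characterAccuracy(String expected, String recognized) {
        if (expected.isEmpty()) {
            return recognized.isEmpty() ? 1.0 : 0.0;
        }
        double accuracy = 1.0 - (double) editDistance(expected, recognized) / expected.length();
        return Math.max(0.0, accuracy);
    }

    public static void main(String[] args) {
        // Made-up sample strings: the second simulates typical OCR confusions (l/1, g/q).
        String expected = "I find your lack of faith disturbing";
        String recognized = "I find your 1ack of faith disturbinq";
        System.out.println("Character accuracy: " + characterAccuracy(expected, recognized));
    }
}
```

Character-level scoring is a deliberately simple choice: it treats every misread character equally, whereas a word-level score would penalise a single wrong letter as a whole wrong word.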

In geometry, a ‘tesseract’ is the four-dimensional analog of a cube. So will the Tesseract OCR library live up to its name and help your project to ‘enter the fourth dimension’?

Join me for this session and find out for yourself!