Over the last few days I was asked how to handle the indexing of PDF files that contain scanned content. In these files the content often consists only of images, so an OCR step is needed to make the content readable and accessible to the crawler.
From my point of view, we have two options to answer the question. The first is a Flow in which we use the ElasticOCR connector. The connector is currently in preview, but you can already get a trial license for your tests. The connector works by creating a new version of the document whose content the crawler can read. It is a good approach and it does what I expected.
But there should also be another approach to answer the question. In environments that run on-premises we cannot use Microsoft Flow. In addition, the Flow connector first copies the file and its content to another location, does the processing there, and then moves the results back to our SharePoint or OneDrive library.
There are several development packages for OCR available; I tested with IronOcr. My approach is very simple: in the library where the document is stored, I create a hidden text field in which I store the text content of the file once the OCR process is done. The SharePoint crawler picks up the content of this field and stores the necessary information in the search index. Searching for any text from the document then shows the document in the search results.
The following code is just the result of this proof of concept, nothing more. The first part is just the field definition, where the text content will be stored after OCR.
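To make the field definition concrete, here is a minimal sketch that builds a SharePoint field schema for a hidden multi-line text column. It is written in Python for illustration and is not the code from the proof of concept. The internal name `OcrText`, the display name and the `NumLines` value are assumptions; the resulting XML is the kind of string that could be passed to a field-creation call such as `AddFieldAsXml`.

```python
import xml.etree.ElementTree as ET


def build_ocr_field_xml(internal_name="OcrText", display_name="OCR Text"):
    """Return a field schema for a hidden, plain-text, multi-line column.

    The names used here are hypothetical; pick whatever fits your library.
    """
    field = ET.Element("Field", {
        "Type": "Note",            # multiple lines of text
        "Name": internal_name,
        "StaticName": internal_name,
        "DisplayName": display_name,
        "NumLines": "6",           # assumed value, only affects form rendering
        "RichText": "FALSE",       # store plain text only
        "Hidden": "TRUE",          # keep the column out of forms and views
    })
    return ET.tostring(field, encoding="unicode")


if __name__ == "__main__":
    print(build_ocr_field_xml())
```

Building the schema with an XML library rather than string concatenation keeps attribute values properly escaped if the display name ever contains special characters.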
The second part is a very, very simple command-line program that takes the item ID of a document as its parameter, runs OCR on the document and stores the readable text in the text field of the file's list item.
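The flow of such a program can be sketched as follows. This is a Python outline under stated assumptions, not the original IronOcr-based program: the three steps (download, recognize, save) are passed in as functions, and the `demo_*` functions and the in-memory dictionaries are placeholders standing in for SharePoint and the OCR engine.

```python
import argparse
import sys
from typing import Callable, Dict

# Placeholder in-memory "library"; a real program would talk to SharePoint.
DEMO_FILES: Dict[int, bytes] = {7: b"scanned invoice 2024"}
DEMO_FIELDS: Dict[int, str] = {}


def demo_download(item_id: int) -> bytes:
    return DEMO_FILES[item_id]


def demo_recognize(data: bytes) -> str:
    # Pretend OCR: a real implementation would call an OCR library here.
    return data.decode("ascii").upper()


def demo_save(item_id: int, text: str) -> None:
    DEMO_FIELDS[item_id] = text


def ocr_item(item_id: int,
             download: Callable[[int], bytes],
             recognize: Callable[[bytes], str],
             save_text: Callable[[int, str], None]) -> str:
    """Download one document, run OCR on it and store the text on its item."""
    text = recognize(download(item_id)).strip()
    save_text(item_id, text)
    return text


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="OCR a single library item")
    parser.add_argument("item_id", type=int, help="list item ID of the document")
    args = parser.parse_args(argv)
    try:
        text = ocr_item(args.item_id, demo_download, demo_recognize, demo_save)
    except KeyError:
        print(f"item {args.item_id} not found")
        return 1
    print(f"stored {len(text)} characters for item {args.item_id}")
    return 0


if __name__ == "__main__":
    if len(sys.argv) > 1:
        raise SystemExit(main(sys.argv[1:]))
    print("usage: ocr_item.py ITEM_ID")
```

Keeping the three steps as separate functions means the OCR package or the SharePoint access layer can be swapped without touching the rest of the program.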
So, to handle these documents in the real world, we can use a remote event receiver for SharePoint (Online) or just a simple remote timer job. As always, it depends on the environment we are working in.
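For the timer job variant, one pass could look roughly like the sketch below: select the PDF items whose OCR field is still empty and hand each one to the processing routine. The field name `OcrText` is an assumption, `process` is a hypothetical callback, and the items are plain dictionaries standing in for list items; `FileLeafRef` and `ID` are the usual SharePoint internal names for file name and item ID.

```python
from typing import Callable, Iterable, List, Mapping


def needs_ocr(item: Mapping[str, object], field: str = "OcrText") -> bool:
    """True for PDF items whose OCR text field is still empty."""
    name = str(item.get("FileLeafRef", "")).lower()
    return name.endswith(".pdf") and not item.get(field)


def run_once(items: Iterable[Mapping[str, object]],
             process: Callable[[int], None],
             field: str = "OcrText") -> List[int]:
    """Process every pending item once and return the IDs that were handled."""
    handled = []
    for item in items:
        if needs_ocr(item, field):
            item_id = int(item["ID"])
            process(item_id)
            handled.append(item_id)
    return handled


if __name__ == "__main__":
    library = [
        {"ID": 1, "FileLeafRef": "scan.pdf", "OcrText": ""},
        {"ID": 2, "FileLeafRef": "notes.docx", "OcrText": None},
        {"ID": 3, "FileLeafRef": "old.pdf", "OcrText": "already done"},
    ]
    print(run_once(library, lambda item_id: None))
```

One caveat of this selection rule: a scan in which OCR finds no text at all keeps an empty field and would be picked up again on every run, so a real job would probably need an additional marker for processed items.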