FAQ | Forum
Adjusting the output:
OPTICAL CHARACTER RECOGNITION (OCR)
NOTE: In v2.50, k2pdfopt is compiled with Tesseract v4.0.0.
Since v1.50, k2pdfopt can use one of two OCR engines to convert bitmapped text to
native ASCII characters so that the text in the output file can be searched
or copied and pasted into other applications. And in v1.63, bitmapped text
from any language that Tesseract supports (including, for example, Chinese) is converted
to Unicode-16 values and can be copied and pasted into Unicode-aware applications
(e.g. most web browsers and modern word processing software).
See the examples below.
UPDATE: Make sure you really need to perform OCR first.
With k2pdfopt v2.x, if the source PDF document has
searchable or highlightable text (e.g. if it is computer-generated or scanned but has
an OCR layer), then k2pdfopt output of either type (native PDF or the default
re-flowed text mode) should also have searchable text without having to resort
to time-consuming OCR. OCR should only be necessary if the source document is
scanned and does not already have a text/OCR layer.
(k2pdfopt -ocr pooh.pdf)
OCR ENGINE CHOICE: TESSERACT VS. GOCR
OCR is not turned on by default. You must select it with the -ocr command-line option,
or via "oc" in the interactive text menu, or as shown below in the Windows GUI:
You can choose from two different OCR engines to do the conversion to text.
The best and
default is Google's open-source Tesseract. It requires support files to be installed on your PC
(see below). The other option is GOCR,
which requires no additional files and is slightly faster than Tesseract, but should
really only be used as a last resort because it is not very accurate--see below.
Tesseract by default runs multi-threaded to make use of multi-core CPUs in order to
run faster. Also note that some training files perform OCR faster than others.
Tesseract also supports unicode and multiple languages (GOCR only supports English / ASCII).
See the examples below where I've copied and pasted the selected text from a k2pdfopt output
file into Microsoft Word.
Tesseract is clearly far superior to GOCR (which is why it is the only selection the GUI offers). GOCR should only be used a last resort.
Converted from pooh.pdf
Conversion time: 6.8 s
k2pdfopt -ocr pooh.pdf
Conversion time: 4.5 s
k2pdfopt -ocr g pooh.pdf
TESSERACT INSTALLATION (UPDATED FOR TESSERACT v4.0)
See the Tesseract Wiki for an explanation of the Tesseract project and how to install language training files.
To use the Tesseract OCR engine built into k2pdfopt, you only have to install the Tesseract language training file for your language (see example below for English). You do not need to install the Tesseract engine! You can install multiple language files if you want to be able to OCR documents in different lanugages.
I am hoping to eventually offer an easier way of installing Tesseract's language training files,
but for now you'll need to download the one(s) you want from the
Tesseract data download page
(the English training file is circled below):
You can choose the language you prefer (e.g. English is circled above).
Fortunately, in Tesseract v4.0, there is only one training file per language rather
than multiple ones as there were in v3.0.
Tesseract 4 uses what they call
LSTM (Long Short-Term Memory) training data. The .traineddata file may have LSTM
data for Tesseract 4 and/or training data compatible with Tesseract 3, and there
are, confusingly, a number of English ones you can find if you poke around--ones
that are optmized for speed, for accuracy, and for backwards compatibility with
Tesseract 3 (see the bottom of the page linked above).
The English training file circled in the folder linked
above is compatible with Tesseract 4 and 3 as of March 2020).
If you want to see what training data is in each file, run
k2pdfopt -ocrlang ? or access the Help menu from the GUI:
Here's what comes up on my system--I have downloaded English, Chinese, and Greek training files.
Note that English is the default because it has the most recent date stamp.
A location with some older training files (for Tesseract v3.x)
is on sourceforge.
You'll need to download the training file to the right folder, e.g.
you'll want to set the environment variable TESSDATA_PREFIX to point to the parent folder, e.g. c:\tesseract-ocr as follows (no trailing slash necessary in the latest versions of k2pdfopt):
(You can see how to set an enviroment variable here.)
NOTE: Though the Tesseract folks use the convention above, where the
training data files should be stored in the "tessdata" subfolder of the folder that is pointed to
by TESSDATA_PREFIX, with k2pdfopt (v2.5x and up), you can store the training files
directly in the folder pointed to by TESSDATA_PREFIX. You do not have to put them
in a "tessdata" subfolder. k2pdfopt will find them in either location.
To check that you have the training file(s) in the right place, check the k2pdfopt GUI help
menu as shown above or run k2pdofpt -ocrlang ?
Also, if you have correctly set up Tesseract, you'll see the Tesseract banner when you run k2pdfopt
with OCR turned on, and the selected language will also show:
... or in the GUI conversion window on the first conversion:
UNICODE-16 ALTERNATE LANGUAGE EXAMPLE (SIMPLIFIED CHINESE)
In k2pdfopt v1.63, any language Tesseract OCR supports can be converted to Unicode-16
characters. The example below shows the OCR results on simplified Chinese using
Tesseract v4.0.0 with the latest LSTM training files as of March 2020.
Use the -ocrlang option to select your language. If no language is specified, the
most recently dated training file in the Tesseract training folder is used. Note
that if you use -ocrvis t with a language like Chinese, as an example,
the text will not look right as displayed by the PDF file because k2pdfopt does
not embed any Chinese fonts (or other non-standard fonts) into the PDF file.
But if you copy and paste the text into a Unicode-16 compatible application, it will
come out as Chinese characters.
(Source PDF file)
k2pdfopt -mode copy -ocr t -ocrlang chi_tra crouching_tiger.pdf
(Copied and pasted into translate.google.com)
You can specify multiple languages for OCR if you use Tesseract,
e.g. English and Chinese (Example PDF) using
Results with different options (as of k2pdfopt v2.51 / Tesseract v3.0.5 and v4.0.0):
TESSERACT v4.0 OPTIMUM RESOLUTION
An interesting aside about Tesseract v4.0.0--there appears to be an optimum letter height,
in pixels, for getting the best OCR results. I posted this study about it to the Tesseract
google groups forum
in late 2018.
Because of this, k2pdfopt v2.51 and up includes a -ocrdpi option which sets the
dpi of bitmaps passed to Tesseract. The default value should work well in general.