OPTICAL CHARACTER RECOGNITION (OCR)
NOTE 1: In v1.63, k2pdfopt adds Unicode-16 support to OCR.
NOTE 2: In v1.51, the -wc command-line option has been replaced with -ocrvis.
As of v1.50, k2pdfopt can use one of two OCR engines to convert bitmapped text to
native ASCII characters so that the text in the output file can be searched
or copied and pasted into other applications. As of v1.63, bitmapped text
from any language that Tesseract supports (for example, Chinese) is converted
to Unicode-16 values and can be copied and pasted into Unicode-aware applications
(e.g. most web browsers and modern word processing software).
See the examples below.
UPDATE: With k2pdfopt v2.x, if the source PDF document has
searchable or highlightable text (e.g. if it is computer-generated or scanned but has
an OCR layer), then k2pdfopt output of either type (native PDF or the default
re-flowed text mode) should also have searchable text without having to resort
to time-consuming OCR. OCR should only be necessary if the source document is
scanned and does not already have a text/OCR layer.
(k2pdfopt -ocr pooh.pdf)
OCR ENGINE CHOICE: TESSERACT VS. GOCR
OCR is not turned on by default. You must select it with the -ocr command-line option
(or via "oc" in the interactive menu).
You can choose from two different OCR engines to do the conversion to text. The
default is Google's open-source Tesseract. It requires support files to be installed on your PC
(see below). The other option is GOCR.
GOCR requires no additional files and is faster than Tesseract
by more than a factor of ten, but Tesseract is
far more accurate and still reasonably fast (~25 words per second on a modern PC) and
also supports multiple languages (GOCR only supports English / ASCII).
Because of this, I decided to make Tesseract the default.
See the examples below. The -ocrvis t option (new in v1.51) causes only the OCR'd text to show:
Conversion time: 15 s
k2pdfopt -ocr -ocrhmax 0.5 -ocrvis t pooh.pdf
Conversion time: 3 s
k2pdfopt -ocr g -ocrhmax 0.5 -ocrvis t pooh.pdf
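If you run comparisons like the two above from a script, the command lines can be assembled programmatically. The following is only a minimal Python sketch: build_ocr_command is a hypothetical helper (not part of k2pdfopt), and it uses only the options shown on this page (-ocr, -ocrhmax, -ocrvis). It always names the engine explicitly ("t" or "g"), whereas the first command above relies on Tesseract being the default.

```python
def build_ocr_command(pdf_path, engine="t", hmax=None, visibility=None):
    """Return the k2pdfopt argument list for an OCR run.

    engine: "t" for Tesseract or "g" for GOCR, as in the commands above.
    hmax / visibility: values for -ocrhmax and -ocrvis, omitted when None.
    """
    if engine not in ("t", "g"):
        raise ValueError("engine must be 't' (Tesseract) or 'g' (GOCR)")
    args = ["k2pdfopt", "-ocr", engine]
    if hmax is not None:
        args += ["-ocrhmax", str(hmax)]
    if visibility is not None:
        args += ["-ocrvis", visibility]
    args.append(pdf_path)
    return args


if __name__ == "__main__":
    # Print the two commands being compared, one per engine.
    for engine in ("t", "g"):
        print(" ".join(build_ocr_command("pooh.pdf", engine, 0.5, "t")))
```

The returned list can be handed to subprocess.run, assuming the k2pdfopt executable is on your PATH.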
UNICODE-16 ALTERNATE LANGUAGE EXAMPLE (SIMPLIFIED CHINESE)
In k2pdfopt v1.63, any language Tesseract OCR supports can be converted to Unicode-16
characters. The example below shows the OCR results on simplified Chinese using
Tesseract's simplified Chinese training data. Use the
-ocrlang option to select your language. If no language is specified, the
most recently dated training file in the Tesseract training folder is used. Note
that if you use -ocrvis t with a language such as Chinese,
the text will not display correctly in the PDF file because k2pdfopt does
not embed any Chinese fonts (or other non-standard fonts) into the PDF file.
But if you copy and paste the text into a Unicode-16 compatible application, it will
come out as Chinese characters.
(Source PDF file)
k2pdfopt -ocr t -ocrlang chi_sim -col 1 crouching_tiger.pdf
(Try copying and pasting the text from the PDF file.)
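Because the default language depends on which training file is "most recently dated", it can help to check which file that would be before relying on it. The Python sketch below is an approximation, not k2pdfopt's actual selection code: it assumes "most recently dated" means the newest file modification time and that training files end in .traineddata. The function name newest_training_language is hypothetical.

```python
from pathlib import Path


def newest_training_language(tessdata_dir, suffix=".traineddata"):
    """Return the language code of the most recently modified training file.

    Returns None when the folder holds no matching files.
    """
    files = [p for p in Path(tessdata_dir).iterdir()
             if p.is_file() and p.name.endswith(suffix)]
    if not files:
        return None
    # Assumption: "most recently dated" is taken to mean newest mtime.
    newest = max(files, key=lambda p: p.stat().st_mtime)
    # Strip the suffix, e.g. "chi_sim.traineddata" -> "chi_sim".
    return newest.name[: -len(suffix)]
```

If the result is not the language you want, pass -ocrlang explicitly, as in the command above.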
You can specify multiple languages for OCR if you use Tesseract,
e.g. English and Chinese (Example PDF) using
Results with different options (as of v2.36):
See the Tesseract Wiki for an explanation of the Tesseract project and how to install language training files.
To use the Tesseract OCR engine built into k2pdfopt, you only have to install the Tesseract language training file for your language (see example below for English). You do not need to install the Tesseract engine! You can install multiple language files if you want to be able to OCR documents in different languages.
I am hoping to eventually offer an easier way of installing Tesseract's language training files,
but for now you'll need to download the one(s) you want from the
Tesseract data download page:
You can choose the training data for the language you prefer, for example, the
English language data files are circled in the image above (eng...).
(An alternate location for these files
is on SourceForge:
a slightly older, but mostly equivalent set, from what I can tell.)
You'll need to download them all to a single folder, e.g.
You'll also want to set the environment variable TESSDATA_PREFIX to point to the parent folder, e.g. c:\tesseract-ocr, as follows (no trailing slash is necessary in the latest versions of k2pdfopt):
(You can see how to set an environment variable here.)
NOTE! Your actual training data files must go in a subfolder named "tessdata" within the
TESSDATA_PREFIX folder, e.g. c:\tesseract-ocr\tessdata (to match the example above). I know this is confusing! I've
even confused myself with this.
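Since the folder layout is easy to get wrong, a quick check can confirm that TESSDATA_PREFIX points at the parent folder and that the "tessdata" subfolder actually contains files. This is a minimal Python sketch under those assumptions only; training_files is a hypothetical helper, and it does not verify that the files are valid Tesseract training data.

```python
import os
from pathlib import Path


def training_files(prefix=None):
    """List the files in <prefix>/tessdata, or raise if the layout is wrong.

    When prefix is None, the TESSDATA_PREFIX environment variable is used.
    """
    if prefix is None:
        prefix = os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        raise RuntimeError("TESSDATA_PREFIX is not set")
    tessdata = Path(prefix) / "tessdata"
    if not tessdata.is_dir():
        raise RuntimeError(f"no 'tessdata' subfolder inside {prefix}")
    # Return file names only, sorted for stable output.
    return sorted(p.name for p in tessdata.iterdir() if p.is_file())


if __name__ == "__main__":
    try:
        print(training_files())
    except RuntimeError as err:
        print("Setup problem:", err)
```

With the example layout above, calling training_files(r"c:\tesseract-ocr") lists whatever is in c:\tesseract-ocr\tessdata.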
If you have correctly set up Tesseract, you'll see the Tesseract banner when you run k2pdfopt
with OCR turned on, and the selected language will also show (as of v1.63):