Willus.com's K2pdfopt Help Page

NOTE 1: In v1.63, k2pdfopt adds Unicode-16 support to OCR.
NOTE 2: In v1.51, the -wc command-line option has been replaced with -ocrvis.

As of v1.50, k2pdfopt can use one of two OCR engines to convert bitmapped text to native ASCII characters so that the text in the output file can be searched or copied and pasted into other applications. And in v1.63, bitmapped text from any language that Tesseract supports (including, for example, Chinese) is converted to Unicode-16 values and can be copied and pasted into Unicode-aware applications (e.g. most web browsers and modern word processing software). See the examples below.

UPDATE: With k2pdfopt v2.x, if the source PDF document has searchable or highlightable text (e.g. if it is computer-generated or scanned but has an OCR layer), then k2pdfopt output of either type (native PDF or the default re-flowed text mode) should also have searchable text without having to resort to time-consuming OCR. OCR should only be necessary if the source document is scanned and does not already have a text/OCR layer.

(k2pdfopt -ocr pooh.pdf)

OCR is not turned on by default. You must select it with the -ocr command-line option (or via "oc" in the interactive menu). You can choose from two different OCR engines to do the conversion to text. The default is Google's open-source Tesseract, which requires support files to be installed on your PC (see below). The other option is GOCR, which requires no additional files and is faster than Tesseract by more than a factor of ten. Tesseract, however, is far more accurate, is still reasonably fast (~25 words per second on a modern PC), and supports multiple languages (GOCR only supports English / ASCII). Because of this, I decided to make Tesseract the default. See the examples below. The -ocrvis t option (new in v1.51) causes only the OCR'd text to show:

Tesseract 3.01 (conversion time: 15 s):
   k2pdfopt -ocr -ocrhmax 0.5 -ocrvis t pooh.pdf

GOCR 0.49 (conversion time: 3 s):
   k2pdfopt -ocr g -ocrhmax 0.5 -ocrvis t pooh.pdf
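
The two command lines above differ only in the argument that follows -ocr. As a minimal sketch, the same command lines can be assembled in Python from the options shown on this page. The helper name ocr_command and its parameter names are hypothetical and are not part of k2pdfopt.

```python
def ocr_command(pdf, engine=None, hmax=None, text_only=False):
    """Build a k2pdfopt argument list from options shown on this page.

    engine: None (default engine, Tesseract), "t" (Tesseract) or "g" (GOCR).
    hmax: value passed to -ocrhmax, or None to leave the option out.
    text_only: if True, add "-ocrvis t" so only the OCR'd text shows.
    """
    if engine not in (None, "t", "g"):
        raise ValueError("engine must be None, 't' or 'g'")
    args = ["k2pdfopt", "-ocr"]
    if engine is not None:
        args.append(engine)
    if hmax is not None:
        args += ["-ocrhmax", str(hmax)]
    if text_only:
        args += ["-ocrvis", "t"]
    args.append(pdf)
    return args


# Rebuild the two example command lines for comparison.
tesseract_cmd = ocr_command("pooh.pdf", hmax=0.5, text_only=True)
gocr_cmd = ocr_command("pooh.pdf", engine="g", hmax=0.5, text_only=True)
print(" ".join(tesseract_cmd))
print(" ".join(gocr_cmd))
```

If k2pdfopt is on your PATH, either list can be handed to Python's subprocess.run to start the conversion.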

In k2pdfopt v1.63, any language Tesseract OCR supports can be converted to Unicode-16 characters. The example below shows the OCR results on simplified Chinese using Tesseract's simplified Chinese training data. Use the -ocrlang option to select your language. If no language is specified, the most recently dated training file in the Tesseract training folder is used. Note that if you use -ocrvis t with a language such as Chinese, the text will not look right when the PDF file is displayed, because k2pdfopt does not embed any Chinese fonts (or other non-standard fonts) into the PDF file. But if you copy and paste the text into a Unicode-16 compatible application, it will come out as Chinese characters.

(Source PDF file)
k2pdfopt -ocr t -ocrlang chi_sim -col 1 crouching_tiger.pdf
(Try copying and pasting the text from the PDF file.)
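
To see which language would be picked when -ocrlang is omitted, you can look for the newest file in your training folder yourself. The sketch below is not k2pdfopt's own code. It assumes that "most recently dated" means the file modification time and that the language code is the part of the file name before the first dot. Both function names are hypothetical.

```python
import os


def newest_training_file(tessdata_dir):
    """Return the name of the most recently modified file in tessdata_dir.

    Returns None if the folder contains no files.
    """
    newest_name, newest_time = None, None
    for entry in os.scandir(tessdata_dir):
        if not entry.is_file():
            continue  # skip subfolders
        mtime = entry.stat().st_mtime
        if newest_time is None or mtime > newest_time:
            newest_name, newest_time = entry.name, mtime
    return newest_name


def guess_default_language(tessdata_dir):
    """Guess the language code from the newest file name (assumption:
    the code is everything before the first dot)."""
    name = newest_training_file(tessdata_dir)
    return None if name is None else name.split(".", 1)[0]
```

Calling guess_default_language on your tessdata folder gives a quick hint, but passing -ocrlang explicitly avoids depending on file dates at all.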

You can specify multiple languages for OCR if you use Tesseract, e.g. English and Chinese (Example PDF) using

   -ocrlang eng+chi_tra
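
If you generate your command lines from a script, the value for -ocrlang is just the language codes joined with "+". A minimal sketch follows. The helper name ocrlang_args is hypothetical, and the checks on the codes are my own precaution rather than rules taken from k2pdfopt.

```python
def ocrlang_args(codes):
    """Return ["-ocrlang", "<code1>+<code2>+..."] for the given language codes."""
    codes = list(codes)
    if not codes:
        raise ValueError("at least one language code is required")
    for code in codes:
        # Reject codes that would break the "+"-separated value.
        if not code or "+" in code or " " in code:
            raise ValueError("bad language code: %r" % (code,))
    return ["-ocrlang", "+".join(codes)]


# English plus traditional Chinese, as in the example above.
print(" ".join(ocrlang_args(["eng", "chi_tra"])))
```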

Results with different options (as of v2.36):

See the Tesseract Wiki for an explanation of the Tesseract project and how to install language training files.

NOTE! To use the Tesseract OCR engine built into k2pdfopt, you only have to install the Tesseract language training file for your language (see example below for English). You do not need to install the Tesseract engine! You can install multiple language files if you want to be able to OCR documents in different languages.

I am hoping to eventually offer an easier way of installing Tesseract's language training files, but for now you'll need to download the one(s) you want from the Tesseract data download page:
You can choose the training data for the language you prefer. For example, the English language data files are circled in the image above (eng...). (An alternate location for these files is on SourceForge, which has a slightly older but mostly equivalent set, from what I can tell.)

You'll need to download them all to a single folder, e.g. c:\tesseract-ocr\tessdata. Then you'll want to set the environment variable TESSDATA_PREFIX to point to the parent folder, e.g. c:\tesseract-ocr as follows (no trailing slash necessary in the latest versions of k2pdfopt):
(You can see how to set an environment variable here.)
NOTE! Your actual training data files must go in a subfolder named "tessdata" within the TESSDATA_PREFIX folder, e.g. c:\tesseract-ocr\tessdata (to match the example above). I know this is confusing! I've even confused myself with this.
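
Since the folder layout is easy to get wrong, here is a minimal Python sketch that checks it before you run k2pdfopt. It is not part of k2pdfopt, and the function name find_training_files is hypothetical. It assumes that the training files for a language start with the language code followed by a dot.

```python
import os


def find_training_files(lang, env=None):
    """List the files for `lang` in the tessdata subfolder of TESSDATA_PREFIX.

    env: mapping to read TESSDATA_PREFIX from (defaults to os.environ).
    Raises RuntimeError if the variable or the tessdata subfolder is missing.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        raise RuntimeError("TESSDATA_PREFIX is not set")
    # The training files belong in <TESSDATA_PREFIX>/tessdata, not in the
    # prefix folder itself.
    tessdata = os.path.join(prefix, "tessdata")
    if not os.path.isdir(tessdata):
        raise RuntimeError("no 'tessdata' subfolder inside " + prefix)
    # Assumption: file names for a language begin with "<lang>.".
    return sorted(n for n in os.listdir(tessdata) if n.startswith(lang + "."))


if __name__ == "__main__":
    try:
        print(find_training_files("eng"))
    except RuntimeError as err:
        print("Setup problem:", err)
```

An empty list means the folder structure is right but no files for that language were found in it.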

If you have correctly set up Tesseract, you'll see the Tesseract banner when you run k2pdfopt with OCR turned on, and the selected language will also show (as of v1.63):


This page last modified
Sunday, 28-May-2017 10:09:24 PDT