6 JAN 2023 -- TESSERACT ACCURACY
Since 2018, I have been testing Tesseract's OCR engine against the resolution of the
text. I wrote a script to auto-generate PDF files with different resolution text in six
different fonts (Helvetica, Times-Roman, Courier, Palatino, Bookman, and Helvetica-Narrow).
I then run Tesseract on the different PDFs and determine the accuracy of the OCR.
I characterize the resolution by the height of a typical capital letter in pixels.
It turns out that there is a sweet spot for Tesseract of about 30 pixels for the height
of a capital letter (seems strange to me that it would not continue to improve at higher
and higher resolutions, but okay). See the plot below. My software k2pdfopt uses this
result and tries to optimize OCR text size to be in this "sweet spot."
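To make the "sweet spot" idea concrete, here is a minimal sketch of the arithmetic: given the measured height of a capital letter in pixels, compute the scale factor (and the equivalent rendering DPI) that would bring it to about 30 pixels. The function names and the 150-dpi example are hypothetical and chosen only for illustration; this is not k2pdfopt's actual code.

```python
# Sketch only: the names below are hypothetical, not taken from k2pdfopt.
TARGET_CAP_HEIGHT_PX = 30.0  # approximate Tesseract sweet spot reported above


def scale_for_ocr(cap_height_px: float,
                  target_px: float = TARGET_CAP_HEIGHT_PX) -> float:
    """Return the factor by which to scale a bitmap so that a typical
    capital letter ends up about target_px pixels tall."""
    if cap_height_px <= 0:
        raise ValueError("cap_height_px must be positive")
    return target_px / cap_height_px


def scaled_dpi(source_dpi: float, cap_height_px: float,
               target_px: float = TARGET_CAP_HEIGHT_PX) -> float:
    """Return the DPI at which the page would need to be rendered to get
    the target capital-letter height (assumes height scales linearly with DPI)."""
    return source_dpi * scale_for_ocr(cap_height_px, target_px)


if __name__ == "__main__":
    # Example: capitals measured at 12 pixels tall in a 150-dpi render.
    print(scale_for_ocr(12.0))      # scale factor to apply to the bitmap
    print(scaled_dpi(150.0, 12.0))  # equivalent DPI for a fresh render
```

Text that is already taller than 30 pixels gets a factor below 1, i.e. it would be scaled down rather than up.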
6 JAN 2023 -- GCC 12.2 / CLANG 15 / K2PDFOPT / TESSERACT BENCHMARK
I re-compiled k2pdfopt v2.54 with gcc 12.2.0 and clang 15. Here are results on both a Windows-11 PC with an Intel Core i9-9900 CPU and on a Mac Mini with an M1 ARM-64 CPU. The results also compare Tesseract v4.1 with Tesseract v5.3.
14 MAY 2022 -- GCC 11.3 / CLANG 14 AND K2PDFOPT BENCHMARK
I re-compiled k2pdfopt v2.53 with gcc 11.3.0 and clang 14. Here are the results.
12 FEB 2022 -- COMPRESSION UTILITY BENCHMARK
I visited Fabrice Bellard's website
(Bellard is a brilliant French programmer)
for the first time in a while. His very first entry intrigued me, linking to the
Large Text Compression Benchmark.
This is a comprehensive comparison of how well various compression algorithms do at
compressing a 1-GB XML dump (from 2006) of Wikipedia. (Of course Bellard's own program, nncp, had the best result as of the last update to the site, which was August 2021, though nncp takes over 2 days to get that result, and that's using a GPU.)
So I went through the list and thought I'd try out
Mathieu Chartier's mcm
entry since it seemed to have the best combination of speed and performance. I compiled
it with MinGW gcc 11 and ran
my own benchmark of nearly the same uncompressed size: my Win32/64 package for MinGW gcc 11,
which has a tar-ball size of 1,032,924,160 bytes. The results,
along with results from several other standard compression utilities, are below.
Indeed, mcm gets the best compression, but not by much over xz. The widely used 7-zip also turns in a very respectable score with a good blend of speed and
compression performance. If you are interested in trying mcm in Windows, here is a
Win64 mcm .exe file (command-line based).
Program  Flags               Run Time (s)  Compressed Size (bytes)  Ratio (bits/byte)  Speed (MB/s)  Compression
mcm      -x11                255.6         94,643,694               0.733              3.85          9.16%
mcm      -h11                235.1         95,544,308               0.740              4.19          9.25%
mcm      -x10                244.5         95,928,885               0.743              4.03          9.29%
mcm      -m11                218.5         96,805,951               0.750              4.51          9.37%
mcm      -h10                222.2         96,837,087               0.750              4.43          9.38%
xz       -9                  264.9         97,335,200               0.754              3.72          9.42%
mcm      -x9                 235.8         97,362,670               0.754              4.18          9.43%
mcm      -m10                199.5         98,080,521               0.760              4.94          9.50%
mcm      -h9                 217.3         98,276,574               0.761              4.53          9.51%
mcm      -m9                 193.1         99,489,105               0.771              5.10          9.63%
mcm      -x8                 242.4         103,514,788              0.802              4.06          10.02%
mcm      -h8                 211.7         104,457,794              0.809              4.65          10.11%
mcm      -m8                 193.7         105,651,443              0.818              5.09          10.23%
mcm      -t11                131.9         108,643,785              0.841              7.47          10.52%
7z       -t7z -mx=9 -ms=on   186.9         108,736,435              0.842              5.27          10.53%
mcm      -t10                130.3         110,483,748              0.856              7.56          10.70%
mcm      -t9                 127.5         112,022,467              0.868              7.73          10.85%
mcm      -t8                 127.8         120,184,123              0.931              7.71          11.64%
7z       (none)              150.8         169,579,190              1.313              6.53          16.42%
xz       -0                  51.1          307,871,492              2.384              19.29         29.81%
bzip2    --best              84.1          328,167,083              2.542              11.71         31.77%
bzip2    --fast              77.2          340,557,458              2.638              12.75         32.97%
gzip     --best              107.1         354,146,627              2.743              9.20          34.29%
zip      (none)              49.5          363,251,119              2.813              19.88         35.17%
gzip     --fast              19.2          391,682,939              3.034              51.31         37.92%
(Run times are on a Core i9-9900 PC running Windows 11.)
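For anyone who wants to check the derived numbers in the benchmark results, here is a small sketch of how they appear to follow from the compressed size, the run time, and the 1,032,924,160-byte tar-ball. The function name is hypothetical, and the assumption that "MB" means 2^20 bytes is my inference from the listed values, not something stated in the entry.

```python
# Sketch only: compression_metrics is a hypothetical helper name.
ORIGINAL_BYTES = 1_032_924_160  # uncompressed tar-ball size quoted above


def compression_metrics(compressed_bytes: int, run_time_s: float,
                        original_bytes: int = ORIGINAL_BYTES):
    """Return (bits per byte, percent of original size, speed in MB/s)."""
    # Compressed bits needed per uncompressed byte.
    bits_per_byte = 8.0 * compressed_bytes / original_bytes
    # Compressed size as a percentage of the original size.
    percent = 100.0 * compressed_bytes / original_bytes
    # Uncompressed data processed per second.
    # Assumption: 1 MB = 2**20 bytes, which reproduces the listed speeds.
    speed_mb_s = original_bytes / 2**20 / run_time_s
    return bits_per_byte, percent, speed_mb_s


if __name__ == "__main__":
    # The mcm -x11 result: 94,643,694 bytes in 255.6 seconds.
    bpb, pct, speed = compression_metrics(94_643_694, 255.6)
    print(f"{bpb:.3f} bits/byte, {pct:.2f}%, {speed:.2f} MB/s")
```

Plugging in the mcm -x11 figures gives 0.733 bits per byte, 9.16%, and 3.85 MB/s, which matches the values listed for that run.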
12 JUN 2020 -- "1 MONTH AGO" -- EVEN GITHUB?
Wow, even GitHub, a site for hardcore programmers,
is using the nebulous date format I blogged about before, so I
get to see a whole bunch of files dated "last month" or "last year."
How incredibly unhelpful.
Sorry, but I just lost a lot of respect for GitHub.
Thank you to SourceForge for not following suit.
16 JAN 2019 -- "1 MONTH AGO" <RANT>
Why do sites like YouTube, Reddit, etc. all list the dates that comments are posted
as imprecise and relative times like "1 month ago" and "2 years ago"? Do they really
think I need help doing date-stamp arithmetic in my head, or that I don't want to know
the posting date more precisely than +/-50%?
The genius who thought of this should be dragged by a herd of flatulent wildebeests
and then neutered so that he or she cannot reproduce.
Thank you to StackOverflow for not jumping on this bandwagon. </RANT>
16 JAN 2019 -- MAKE WAY FOR THE HOLIDAYS
I'm a Christmas music junkie, and I got pretty excited about the Starbucks holiday song
I heard this last Christmas, but I didn't find it until just now.
That is partly because the lyrics have yet to be
posted on the web (that I can find), so I'm posting them here to help
people find the song, since searching by lyrics is a common way I look for a song. I should have
thought earlier to click through to (and read the comments on) the YouTube video of the commercial that I found
on this unhelpful and misleading page.
Make Way for the Holidays, by Le Bon (Back story)
See the shining lights.
Feel the cold outside.
Looking forward to the winter.
We're busy making plans,
packing up our bags,
counting down the New Year.
And I just can't wait for the holidays
and all the happiness it brings.
Home with family,
Reliving memories,
Always filled with love and laughter.
A time when we surprise,
Give someone else some time,
Getting back to what matters.
And I just can't wait for the holidays
and all the happiness it brings.
So let's all make way for the holidays
and all the happiness it brings.
It's a season for giving
and for loving everyone.
And it's time to remember
everything you have and all you are grateful for.
See the shining lights.
Feel the cold outside.
Looking forward to the winter.
We're busy making plans,
packing up our bags,
counting down the New Year.
And I just can't wait for the holidays
and all the happiness it brings.
So let's all make way for the holidays
and all the happiness it brings.
And I just can't wait for the holidays
and all the happiness it brings.
So let's all make way for the holidays
and all the happiness it brings.
And all the happiness it brings.
9 JUN 2018 -- MADE UP PLOTS
I won't call out the company that put up this idiotic plot on their web site,
but the dollars listed on the horizontal axis are supposed to be represented by the blue
curve on the plot. See anything wrong?
That's right. The dollars are perfectly linear -- increasing by $336 every year,
whereas the plotted blue curve is not (nor is the orange one).
On a linear vertical scale, the correctly plotted dollar values should look like this:
I guess the company didn't think the correctly plotted curve looked elegant enough, but why
on earth would they want you to think that their prices go up nonlinearly over
time, increasing more and more every year? I was thinking
of looking into this company's service, but if they can't get a simple plot right, they probably
can't get their service right, either.
11 FEB 2018 -- PLAYBUZZ IS A SCAM
I decided I wanted to vent a bit about Playbuzz and realized I had no real place to put
such a vent, hence, my first ever general "blog" entry on willus.com after nearly 20 years of
hosting the site. And I only wanted to vent because when I google "playbuzz is a scam," there are
surprisingly few relevant results. Are people really so oblivious?
I'm a natural skeptic, so when I first got a perfect score on a Playbuzz quiz, I immediately
re-took it and intentionally clicked on several wrong answers. The results: I got a perfect
score again! Or I would get 9 of 10 or 14 of 15.
"Empty Al" (see his comment here) had the exact same experience.
Just try taking a Playbuzz quiz and intentionally getting several answers wrong. And do you
really think you're a genius for knowing that a crocodile is different from a lion, tiger,
and bear? Or that a triangle is different from a square, rectangle, and parallelogram?
Come on. Playbuzz is feel-good click bait with completely bogus scoring and ridiculous
claims. Which leads me to my main question:
Is everybody in on it and okay with it (i.e., harmless feel-good fun)? Or do
people really not know?