
Willus.com's Blog
My random thoughts
See also my 64-bit Chronicles and Windows 7 Tips.

 
6 JAN 2023 -- TESSERACT ACCURACY
Since 2018, I have been testing the accuracy of Tesseract's OCR engine as a function of text resolution. I wrote a script to auto-generate PDF files with text at different resolutions in six different fonts (Helvetica, Times-Roman, Courier, Palatino, Bookman, and Helvetica-Narrow). I then run Tesseract on the different PDFs and determine the accuracy of the OCR. I characterize the resolution by the height of a typical capital letter in pixels. It turns out that there is a sweet spot for Tesseract of about 30 pixels for the height of a capital letter (seems strange to me that it would not continue to improve at higher and higher resolutions, but okay). See the plot below. My software k2pdfopt uses this result and tries to optimize OCR text size to be in this "sweet spot."
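To make the "sweet spot" idea concrete, here is a minimal Python sketch of how one might pick a resize factor so that capital letters end up about 30 pixels tall before OCR. This is not k2pdfopt's actual code; the function names, and the assumption that the capital-letter height has already been measured, are purely illustrative.

```python
# Illustrative only: names and structure are assumptions, not k2pdfopt's code.
TARGET_CAP_HEIGHT_PX = 30.0  # approximate sweet spot reported in this entry


def scale_for_ocr(cap_height_px, target_px=TARGET_CAP_HEIGHT_PX):
    """Return the resize factor that makes capital letters ~target_px tall."""
    if cap_height_px <= 0:
        raise ValueError("cap_height_px must be positive")
    return target_px / cap_height_px


def scaled_size(width, height, cap_height_px, target_px=TARGET_CAP_HEIGHT_PX):
    """Return the (width, height) of the page image after applying that factor."""
    s = scale_for_ocr(cap_height_px, target_px)
    return round(width * s), round(height * s)
```

For example, a page whose capital letters measure 15 pixels would be enlarged by a factor of 2, while one with 60-pixel capitals would be shrunk by half.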



6 JAN 2023 -- GCC 12.2 / CLANG 15 / K2PDFOPT / TESSERACT BENCHMARK
I re-compiled k2pdfopt v2.54 with gcc 12.2.0 and clang 15. Here are results on both a Windows-11 PC with an Intel Core i9-9900 CPU and on a Mac Mini with an M1 ARM-64 CPU. The results also compare Tesseract v4.1 with Tesseract v5.3.



14 MAY 2022 -- GCC 11.3 / CLANG 14 AND K2PDFOPT BENCHMARK
I re-compiled k2pdfopt v2.53 with gcc 11.3.0 and clang 14. Here are the results.



12 FEB 2022 -- COMPRESSION UTILITY BENCHMARK
I visited Fabrice Bellard's website (Bellard is a brilliant French programmer) for the first time in a while. His very first entry intrigued me, linking to the Large Text Compression Benchmark. This is a comprehensive comparison of how well various compression algorithms do at compressing a 1-GB XML dump (from 2006) of Wikipedia. (Of course Bellard's own program, nncp, had the best result as of the last update to the site, which was August 2021, though nncp takes over 2 days to get that result, and that's using a GPU.)

So I went through the list and thought I'd try out Mathieu Chartier's mcm entry since it seemed to have the best combination of speed and performance. I compiled it with MinGW gcc 11 and ran my own benchmark of nearly the same uncompressed size: my Win32/64 package for MinGW gcc 11, which has a tar-ball size of 1,032,924,160 bytes. The results, along with results from several other standard compression utilities, are below. Indeed, mcm gets the best compression, but not by much over xz. The widely used 7-zip also turns in a very respectable score with a good blend of speed and compression performance. If you are interested in trying mcm in Windows, here is a Win64 mcm .exe file (command-line based).

Program  Flags                 Run time (s)  Compressed size (bytes)  Ratio (bits/byte)  Speed (MB/s)  Compressed size (%)
mcm      -x11                  255.6         94,643,694               0.733              3.85          9.16%
mcm      -h11                  235.1         95,544,308               0.740              4.19          9.25%
mcm      -x10                  244.5         95,928,885               0.743              4.03          9.29%
mcm      -m11                  218.5         96,805,951               0.750              4.51          9.37%
mcm      -h10                  222.2         96,837,087               0.750              4.43          9.38%
xz       -9                    264.9         97,335,200               0.754              3.72          9.42%
mcm      -x9                   235.8         97,362,670               0.754              4.18          9.43%
mcm      -m10                  199.5         98,080,521               0.760              4.94          9.50%
mcm      -h9                   217.3         98,276,574               0.761              4.53          9.51%
mcm      -m9                   193.1         99,489,105               0.771              5.10          9.63%
mcm      -x8                   242.4         103,514,788              0.802              4.06          10.02%
mcm      -h8                   211.7         104,457,794              0.809              4.65          10.11%
mcm      -m8                   193.7         105,651,443              0.818              5.09          10.23%
mcm      -t11                  131.9         108,643,785              0.841              7.47          10.52%
7z       -t7z -mx=9 -ms=on     186.9         108,736,435              0.842              5.27          10.53%
mcm      -t10                  130.3         110,483,748              0.856              7.56          10.70%
mcm      -t9                   127.5         112,022,467              0.868              7.73          10.85%
mcm      -t8                   127.8         120,184,123              0.931              7.71          11.64%
7z       (none)                150.8         169,579,190              1.313              6.53          16.42%
xz       -0                    51.1          307,871,492              2.384              19.29         29.81%
bzip2    --best                84.1          328,167,083              2.542              11.71         31.77%
bzip2    --fast                77.2          340,557,458              2.638              12.75         32.97%
gzip     --best                107.1         354,146,627              2.743              9.20          34.29%
zip      (none)                49.5          363,251,119              2.813              19.88         35.17%
gzip     --fast                19.2          391,682,939              3.034              51.31         37.92%
(Run times are on a Core i9-9900 PC running Windows 11.)
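
For anyone who wants to repeat this with their own files, the derived columns (bits per byte, speed, and compressed size as a percentage of the original) follow directly from the compressed size and the run time. Below is a minimal Python sketch; the function name is illustrative, and it assumes "MB" means 2^20 bytes, which is my inference from the fact that it reproduces the speeds listed above.

```python
ORIGINAL_BYTES = 1_032_924_160  # tar-ball size quoted in this entry


def compression_stats(compressed_bytes, run_time_s, original_bytes=ORIGINAL_BYTES):
    """Derive the table's ratio, speed, and percentage from raw measurements."""
    fraction = compressed_bytes / original_bytes
    return {
        # Compressed bits needed per original byte (8.0 means no compression).
        "bits_per_byte": 8.0 * fraction,
        # Assumption: 1 MB = 2**20 bytes of original data processed per second.
        "speed_mb_s": original_bytes / 2**20 / run_time_s,
        # Compressed size as a percentage of the original size.
        "percent": 100.0 * fraction,
    }


stats = compression_stats(94_643_694, 255.6)  # the mcm -x11 row
print(f"{stats['bits_per_byte']:.3f} {stats['speed_mb_s']:.2f} {stats['percent']:.2f}%")
```

Run as-is, this reproduces the mcm -x11 row: 0.733 bits per byte, 3.85 MB/s, and 9.16%.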


12 JUN 2020 -- "1 MONTH AGO" -- EVEN GITHUB?
Wow, even GitHub, a site for hardcore programmers, is using the nebulous date format I blogged about before, so I get to see a whole bunch of files dated "last month" or "last year." How incredibly unhelpful. Sorry, but I just lost a lot of respect for GitHub. Thank you to SourceForge for not following suit.

16 JAN 2019 -- "1 MONTH AGO"
<RANT>
Why do sites like YouTube, Reddit, etc. all list the dates that comments are posted as imprecise and relative times like "1 month ago" and "2 years ago"? Do they really think I need help doing date-stamp arithmetic in my head, or that I don't want to know the posting date more precisely than +/-50%? The genius who thought of this should be dragged by a herd of flatulent wildebeests and then neutered so that he or she cannot reproduce. Thank you to StackOverflow for not jumping on this bandwagon.
</RANT>


16 JAN 2019 -- MAKE WAY FOR THE HOLIDAYS
I'm a Christmas music junkie, and I got pretty excited about the Starbucks holiday song I heard this last Christmas, but I didn't find it until just now. That is partly because the lyrics have yet to be posted on the web (that I can find), so I'm posting them here to help people find the song, since searching by lyrics is a common way I look for a song. I should have thought earlier to click through to (and read the comments on) the YouTube video of the commercial that I found on this unhelpful and misleading page.
Make Way for the Holidays, by Le Bon (Back story)

See the shining lights.
Feel the cold outside.
Looking forward to the winter.

We're busy making plans,
packing up our bags,
counting down the New Year.

And I just can't wait for the holidays
and all the happiness it brings.

Home with family,
Reliving memories,
Always filled with love and laughter.
A time when we surprise,
Give someone else some time,
Getting back to what matters.

And I just can't wait for the holidays
and all the happiness it brings.
So let's all make way for the holidays
and all the happiness it brings.

It's a season for giving
and for loving everyone.
And it's time to remember
everything you have and all you are grateful for.

See the shining lights.
Feel the cold outside.
Looking forward to the winter.

We're busy making plans,
packing up our bags,
counting down the New Year.

And I just can't wait for the holidays
and all the happiness it brings.
So let's all make way for the holidays
and all the happiness it brings.

And I just can't wait for the holidays
and all the happiness it brings.
So let's all make way for the holidays
and all the happiness it brings.

And all the happiness it brings.


9 JUN 2018 -- MADE UP PLOTS
I won't call out the company that put up this idiotic plot on their web site, but the dollars listed on the horizontal axis are supposed to be represented by the blue curve on the plot. See anything wrong?



That's right. The dollars are perfectly linear -- increasing by $336 every year, whereas the plotted blue curve is not (nor is the orange one). On a linear vertical scale, the correctly plotted dollar values should look like this:



I guess the company didn't think the correctly plotted curve looked elegant enough, but why on earth would they want you to think that their prices go up nonlinearly over time, increasing more and more every year? I was thinking of looking into this company's service, but if they can't get a simple plot right, they probably can't get their service right, either.
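
A quick way to sanity-check a plot like this is to look at the year-to-year differences of the listed values: if they are all the same, the data are linear and should plot as a straight line on a linear scale. Here is a minimal Python sketch. The $336 step is from the plot described above, but the starting value of 1000 is a made-up placeholder, since the point is only the constant step.

```python
def first_differences(values):
    """Return the change between each pair of consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def is_linear(values, tol=1e-9):
    """True if every consecutive difference matches the first one (within tol)."""
    diffs = first_differences(values)
    return all(abs(d - diffs[0]) <= tol for d in diffs)


YEARLY_STEP = 336   # dollars per year, as stated in this entry
START = 1000        # hypothetical starting value, for illustration only
dollars = [START + YEARLY_STEP * year for year in range(6)]

print(first_differences(dollars))
print(is_linear(dollars))
```

Any curve that bends upward, like the blue one on the company's plot, would fail this check because its differences grow from year to year.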


11 FEB 2018 -- PLAYBUZZ IS A SCAM
I decided I wanted to vent a bit about PlayBuzz and realized I had no real place to put such a vent, hence, my first ever general "blog" entry on willus.com after nearly 20 years of hosting the site. And I only wanted to vent because when I google "playbuzz is a scam," there are a surprisingly small number of relevant results. Are people really so oblivious? I'm a natural skeptic, so when I first got a perfect score on a PlayBuzz quiz, I immediately re-took it and intentionally clicked on several wrong answers. The results: I got a perfect score again! Or I would get 9 of 10 or 14 of 15. "Empty Al" (see his comment here) had the exact same experience. Just try taking a PlayBuzz quiz and intentionally getting several answers wrong. And do you really think you're a genius for knowing that a crocodile is different from a lion, tiger, and bear? Or that a triangle is different from a square, rectangle, and parallelogram? Come on. PlayBuzz is feel-good click bait with completely bogus scoring and ridiculous claims. Which leads me to my main question:

Is everybody in on it and they're all okay with it--i.e. harmless feel-good fun? Or do people really not know?
 
