Best Online OCR Software for Chinese Characters

OCR (光学字符识别 in Chinese) on Chinese scans and movie subtitles

Update May 1, 2015: (a9t9) launched its very own free and open-source Online OCR service - try it out and let us know how it compares.
Think English language OCR is hard? Then try Chinese. This is what I did for this review. When I reviewed Online OCR services for English, there were 5 OCR surprises. Now I am back, looking at OCR software for Chinese characters.

Why is Chinese OCR difficult?

Why is Chinese OCR more difficult than, say, English or German document OCR?

Optical character recognition by itself is still hard. It is not a solved problem, at least not for the software available to end users. The results of our English OCR benchmark were mixed. But Chinese language OCR takes the challenge to another level. Here is why:

(1) Number of characters: A typical Western alphabet has around 24-30 characters, whereas Chinese OCR software has to learn far more than that. To be useful, it needs to know at least the 6,763 simplified Chinese characters of the GB 2312 standard. Then add another ~5,000 or so traditional versions of characters to the mix. So we have at least 10,000 characters. This also means that some rare characters may not be recognized simply because they are not in the database, something that cannot happen in English.

(2) Every new character the software supports is another character that might potentially result in a false positive match, so there is a limit for the sake of accuracy.

(3) In Chinese, a character (or two) corresponds to an English word. Example: 手机 = mobile phone (literally: hand machine). In this example 2 Chinese characters = 11 English letters. So the information density in Chinese texts is much higher. This also means the text size needs to be greater. Typical lower limits for OCR software are 15 pixels for Western languages or 20 pixels for East Asian languages.
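
Point (1) can be checked with a few lines of Python. This is only a sketch: it uses the gb2312 codec from Python's standard library as a stand-in for an OCR engine's character database, which is an assumption made for illustration (real OCR engines ship their own character sets). The function names are made up for this example.

```python
def in_gb2312(ch: str) -> bool:
    """Return True if the character can be encoded in GB 2312."""
    try:
        ch.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False


def count_gb2312_hanzi() -> int:
    """Count the Chinese characters defined in GB 2312 (rows 16 to 87)."""
    count = 0
    for lead in range(0xB0, 0xF8):        # lead bytes of rows 16..87
        for trail in range(0xA1, 0xFF):   # 94 cells per row
            try:
                bytes([lead, trail]).decode("gb2312")
                count += 1
            except UnicodeDecodeError:
                pass                      # unassigned cell in the standard
    return count


if __name__ == "__main__":
    print("Chinese characters in GB 2312:", count_gb2312_hanzi())
    # 手 and 机 are simplified characters; 國 is the traditional form of 国.
    for ch in "手机國":
        print(ch, "in GB 2312" if in_gb2312(ch) else "not in GB 2312")
```

A traditional character such as 國 falls outside GB 2312, which is why software limited to that set would need the additional traditional characters mentioned above.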

OCR Software Benchmark for Chinese

For this review a Chinese OCR benchmark consisting of six images was created: a document scan of a magazine article in three different qualities (300 dpi, 100 dpi, 75 dpi), a Lumia 535 smartphone image of the article, and screenshots of two movie subtitles (more about the movie subtitles later).

1. High-Quality Chinese Scan (300 dpi)

Test 1: Chinese OCR, 300dpi

OCR Service	Result	Output (Excerpt)
Abbyy FineReader	100%	在中国，餐厅里的菜通常很特别，
Google Docs OCR	100%	在中國,餐斤里的菜通常很特別,
OnlineOCR	100%	在中国，餐厅里的菜通常很特别，
i2 OCR	Good	在中国, 餐厅里的菜通常壕艮特另u,
NewOCR	100%	在中国，餐厅里的菜通常很特别,

The first test uses a high-quality scan of a magazine article. All services that can recognize Chinese characters did well in this test: four with no errors (100%) and one with almost no errors (Good).
Having said that, some services that I reviewed for English language OCR do not support Chinese and are thus not part of this review.
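
For readers who want to put a number on results like these, one common approach is the character error rate: the edit distance between the OCR output and a reference text, divided by the reference length. The sketch below is not how the grades in the tables were assigned (those reflect my judgement of the whole page). It also assumes the excerpt returned by the 100% services is the correct reference, and it ignores punctuation and spacing so that comma style does not count as an error.

```python
def normalize(text: str) -> str:
    """Keep only letters and digits so punctuation and spacing do not count."""
    return "".join(ch for ch in text if ch.isalnum())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the number of reference characters."""
    ref, hyp = normalize(reference), normalize(ocr_output)
    if not ref:
        raise ValueError("reference contains no characters to compare")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "在中国，餐厅里的菜通常很特别，"  # assumed ground truth
    outputs = {
        "OnlineOCR": "在中国，餐厅里的菜通常很特别，",
        "i2 OCR": "在中国, 餐厅里的菜通常壕艮特另u,",
    }
    for name, text in outputs.items():
        print(name, round(char_error_rate(reference, text), 3))
```

Note that a plain comparison like this would also count traditional characters (as in the Google Docs output) as errors unless they are first mapped to their simplified forms.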

2. Low-Quality Chinese Scan (100 dpi)

Test 2: Chinese OCR, 100dpi Scan

OCR Service	Result	Output (Excerpt)
Abbyy FineReader	Good	在中“，枝厅里的芡通常很持别
Google Docs OCR	Good	在中山,餐訂里的菜通常很特別,
OnlineOCR	Good	在中囚,各厅里的菜通常很特别
i2 OCR	Fail	在中山l 鲁汀里的菜通常很牺易ul,
NewOCR	Poor	在中凹. 稷汀呈的菜逆常很持别l

In the 100 dpi scan the text is still easily readable for a human, even though it is somewhat blurred. For OCR systems this resolution is close to the limit of what the technology can handle. Three competitors barely earned a Good grade, while one service is unusable.
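
To see why low scan resolutions run into the pixel limits quoted earlier (15 pixels for Western, 20 pixels for East Asian text), it helps to convert font size and scan resolution into pixels. The sketch below assumes those limits refer to character height and uses a hypothetical 12 pt body font. The actual font size of the magazine article was not measured.

```python
POINTS_PER_INCH = 72.0


def glyph_height_px(font_size_pt: float, dpi: float) -> float:
    """Approximate character height in pixels for a font size and scan resolution."""
    return font_size_pt / POINTS_PER_INCH * dpi


def min_dpi(font_size_pt: float, min_height_px: float) -> float:
    """Scan resolution needed so that characters reach a minimum pixel height."""
    return min_height_px * POINTS_PER_INCH / font_size_pt


if __name__ == "__main__":
    font_size_pt = 12.0  # hypothetical font size, not taken from the test scan
    for dpi in (300, 100, 75):
        print(f"{dpi} dpi: {glyph_height_px(font_size_pt, dpi):.1f} px")
    print(f"dpi needed for 20 px: {min_dpi(font_size_pt, 20):.0f}")
```

Under these assumptions the 300 dpi scan is comfortably above the limit, while the 100 dpi and 75 dpi scans fall below it.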

3. Very low-Quality Chinese Scan (75 dpi)

Test 3: Chinese OCR, 75dpi Scan

OCR Service	Result	Output (Excerpt)
Abbyy FineReader	Fail	(no text)
Google Docs OCR	Fail	(no text)
OnlineOCR	Fail	往中工肠泞王的共诵常很特别
i2 OCR	Fail	仕申重. g厅虫的翼矗薰蟹麟颤」,
NewOCR	Fail	仨中重. 器焘虫的翼邃篱氰麝蒽上

Humans can still read this text, even though it involves guessing a few characters from the context. Not so our PC: every single OCR service flunked this test.

4. Smartphone image

Test 4: Smartphone camera Chinese image OCR

OCR Service	Result	Output (Excerpt)
Abbyy FineReader	Good	在中国，说厅里的菜通常很特別，
Google Docs OCR	Fail	OCR not triggered
OnlineOCR	Good	在中国，偿厅里的菜通常很特别，
i2 OCR	Fail	在口口五，餐厅里的粟遇常抒艮持另u,
NewOCR	Good	在中国, 餐厅里的菜通常很特别,

Using your mobile phone as a scanner? Sure, that works. Three services deliver a good conversion result despite the yellowish background and somewhat oblique text. Surprisingly, Google OCR fails this test: Google Docs no longer has a dedicated “Start OCR” button and the automatic OCR fails to trigger.

5/6: Chinese movie subtitles

This is not your average OCR task. The challenge here is the background. Off-the-shelf OCR systems have a very hard time distinguishing text from the background.

Test 5: Movie Subtitle 1 Chinese Movie OCR

OCR Service	Result	Output (Excerpt)
Abbyy FineReader	Poor	1 ^跳，彳II见上面的字吗
Google Docs OCR	Poor	行得現上面的字
OnlineOCR	Fail	).ir-iv一目日口
i2 OCR	Fail	(no text)
NewOCR	Poor	唰得见上面的学吗

In Subtitle 1 (from the movie Anchoring Seattle, not that this matters...) the subtitle is white on green. No OCR software reads it correctly, but Abbyy FineReader, Google OCR and NewOCR detect at least a few characters correctly.

Test 6: Movie Subtitle 2 Chinese OCR

OCR Service	Result	Output (Excerpt)
Abbyy FineReader	Fail	(no text)
Google Docs OCR	Fail	(no text)
OnlineOCR	Fail	．叫口圈团口睡鼠喻戒…
i2 OCR	Fail	(no text)
NewOCR	Fail	莪问大z 之… > .二_…

In Subtitle 2 (from the movie Good for Nothing Heroes) we have a street scene and the subtitle is white on a greyish background. No OCR software can read the characters, despite their large size (the screenshot is from a full-screen YouTube playback of the Chinese movie).

Abbyy Cloud SDK works better than Abbyy FineReader for complex backgrounds, so for tests 5 and 6 the Abbyy Cloud SDK was used.

In my February OCR review, Abbyy OCR did a great job reading numbers from a gas meter image, so I was surprised it did not do better here. The reason: for this review I had used Abbyy FineReader instead of the Abbyy Cloud SDK, as it is easier to use and showed no significant difference in recognition rates in a pre-test. So I went back to the Cloud SDK for the subtitle recognition. And indeed, the Abbyy Cloud SDK seems to have a more powerful recognition engine and/or background removal. It recognized Subtitles 1 and 2 at least partly. I would love to know why all OCR services completely miss roughly the first half of both subtitles.
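
What such a background removal step might look like is easy to sketch for white subtitles: keep only the pixels that are bright in all three color channels and turn everything else into plain background. This is purely an illustration and not what Abbyy or any other reviewed service is known to do. The function name and the threshold of 200 are assumptions, and the image is represented as a plain list of RGB tuples to keep the example free of third-party libraries.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255


def isolate_white_text(image: List[List[Pixel]], threshold: int = 200) -> List[List[int]]:
    """Return a binary image: 0 (black text) for near-white pixels, 255 otherwise.

    A pixel counts as near-white only if its darkest channel reaches the
    threshold, so a saturated green background is rejected even though
    its green channel is bright.
    """
    return [
        [0 if min(r, g, b) >= threshold else 255 for (r, g, b) in row]
        for row in image
    ]


if __name__ == "__main__":
    green = (0, 160, 0)
    grey = (90, 90, 90)
    tiny_frame = [
        [green, (255, 255, 255), green],
        [(250, 250, 250), grey, (240, 255, 245)],
    ]
    for row in isolate_white_text(tiny_frame):
        print(row)
```

The result is black text on a white background, which is the kind of input OCR engines handle best. A fixed threshold like this works only as long as the background is clearly darker than the subtitle, so bright patches in a greyish street scene like the one in Subtitle 2 would still leak through.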

Summary - Best OCR (web) software for Chinese characters

Ranking	Score	Scan1	Scan2	Scan3	Mobile	Sub1	Sub2
Abbyy FineReader	8	++	+	-	+	0	-
OnlineOCR	7	++	+	-	+	-	-
Google Docs OCR	6	++	+	-	-	0	-
NewOCR	6	++	0	-	+	0	-
i2 OCR	2	+	-	-	-	-	-

Just like with English, Google OCR disappoints and is clumsy to use. It seems to be some kind of black magic whether or not the OCR conversion is triggered. If it fails, no error message guides the user. And if it works, the results are only average. The top 3 spots are exactly the same as in the English language OCR test: Abbyy wins the review, and the “no-name” OnlineOCR service comes in second and is the best free OCR service. Google ranks only 3rd, along with NewOCR. So the surprise of this OCR for Chinese characters review is: no surprises. In other words, we learned that whatever OCR software did well for English also did well for Chinese.

Chinese language version of this article: 最棒的在线中文OCR软件-评介

References

Reviewed online OCR services:

Abbyy OCR SDK/API
OnlineOCR
Google Docs OCR
i2OCR
FreeOnlineOCR
Tesseract Online Demo (flunked even the 300 dpi OCR test)

OCR benchmark - test images:

Chinese magazine scan, 300 dpi
Chinese magazine scan, 100 dpi
Chinese magazine scan, 75 dpi
Chinese magazine smartphone “scan”
Chinese movie subtitle 1 (green background)
Chinese movie subtitle 2 (grey background)
All test images are available on GitHub: OCR Benchmark
