The lack of a two-em dash in Unicode

What if your dashes are all broken?

You probably know that in Microsoft Word, typing three hyphens gives you a dash. But what if the reverse were true: even if you chose an em-dash from Insert Symbols or the Character Chooser, you still ended up with only three hyphens, complete with the unsightly spaces in between? What would you do? Do you think this is a crazy suggestion? Quite the contrary: millions of Chinese people live with a similar problem every day when they need to type a Chinese dash.

Chinese uses (something very close to) two-em dashes

To understand this problem, we first need to understand what a dash used by the Chinese language looks like.

In the Chinese language, in correct orthography, a grammatical dash is a single unbroken line spanning two fullwidth spaces. We could say that, grammatically, dashes in Chinese are two-em dashes instead of em-dashes, if we ignore the differences between the em and the fullwidth space for now.

No standard way to represent Chinese dashes

Since Unicode aims to be able to represent all writing systems of the entire world, it would seem natural that Unicode must have a code point reserved for the Chinese dash. Or at least its close approximation, the two-em dash. Or at least one half of the Chinese dash, so that if we put two side by side we will have a dash. Unfortunately, this is not the case: there is no official code point assigned to the Chinese dash, to the two-em dash, or even to one half of the Chinese dash.

With Latin fonts, if we need a two-em dash, we just put two em-dashes side by side, and we get a two-em dash with no gap in between. With CJK fonts, however, things get unpredictable.

Figure 1 shows what two em-dashes side by side look like when set in PMingLiU, a Windows system font. As you can see from the figure, there is a very obvious gap in the middle. (In fact, this is what a lot of web pages and Word documents look like.)

At this point we might suspect a font quality problem, since PMingLiU is not exactly a very high-quality font. So we shall re-attempt the experiment with Kozuka Gothic Pro, a much higher-quality Japanese font. Figure 2 shows the result: not only is there still a very obvious gap in the middle, the composite dash also has the wrong width.

If we hop to http://www.edu.tw/files/site_content/m0001/hau/h10.htm (the page on the dash in the actual standard of how punctuation marks are used in Taiwan), we will find that the dash (──) actually looks solid in the browser. So we will do a simple copy-and-paste of that dash and re-set our example, and we finally get a solid dash (Figure 3).

So what Unicode character is actually used to set that dash? Examining the character using the Glyph window in Illustrator reveals that the character used is U+2500, the “box drawings light horizontal” character (and in the original web page it is character 0xA277 in the Big5 encoding). Is there an easy way to type this character? Since we don’t even use line-drawing characters these days, the answer is, quite naturally, no.
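The same inspection can be done without Illustrator. The following is only a minimal sketch in Python using the standard unicodedata module; the helper names describe and big5_to_text are made up for illustration, and the Big5 step assumes that Python's built-in "big5" codec maps the byte pair 0xA277 the same way the original web page does.

```python
import unicodedata

# The two stand-ins for a Chinese dash discussed in this article.
TWO_EM_DASHES = "\u2014\u2014"  # two em-dash characters
TWO_BOX_LINES = "\u2500\u2500"  # two line-drawing characters, as pasted from edu.tw


def describe(text):
    """Return a (code point label, Unicode name) pair for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def big5_to_text(raw):
    """Decode Big5 bytes with Python's built-in 'big5' codec (an assumption:
    other Big5 variants may map some byte pairs differently)."""
    return raw.decode("big5")


if __name__ == "__main__":
    for label, sample in (("em-dashes", TWO_EM_DASHES), ("box lines", TWO_BOX_LINES)):
        print(label, describe(sample))
    # 0xA277 is the Big5 code this article reports for the edu.tw dash.
    print("Big5 0xA277", describe(big5_to_text(b"\xa2\x77")))
```

Running describe on any dash copied from a web page or document shows at a glance which of the two representations its author used, even when both render as similar horizontal lines.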

(The use of line-drawing characters goes back to the DBCS (double-byte character set) days, when the Big5 family of encodings was standard. The Chinese dash was always represented by two line-drawing characters, because using two line-drawing characters was the only way to ensure that there was no gap within the dash. The only difference between then and now is that systems back then often provided macros (keyboard shortcuts) for typing dashes.)

Figure 1: Setting two em-dashes with PMingLiU
Figure 2: Setting two em-dashes with Kozuka Gothic
Figure 3: Setting edu.tw’s dash with Kozuka Gothic

Despite the unexpected character used, it still looks like we have finally found a reliable way to represent a Chinese dash. However, this is in fact not the case. Figure 4 shows the same text (with the copied-and-pasted dash) set in Apple LiGothic Medium, an Apple-provided system font. As you can see, the gap returns.

In modern Unicode-based fonts, it is simply naïve to assume that line-drawing characters will actually line up without gaps.

Our experiments demonstrate two things clearly: first, that there is simply no reliable way to represent a Chinese dash, that is, a single unbroken line spanning two fullwidth spaces; second, that there is a serious oversight in the design of Unicode, since relying on line-drawing characters to represent a punctuation mark as frequently used as the dash is obviously unreasonable.

Figure 4: Setting edu.tw’s dash with Apple LiGothic Medium

Conclusion

In summary, there is no standard way to type the Chinese dash, because the Unicode standard has not specified this character, nor the two-em dash that closely approximates it, nor a character that corresponds to one half of a Chinese dash. The normal user simply has no reliable way to type this frequently used punctuation mark and has to resort to line-drawing characters; however, even line-drawing characters do not guarantee an unbroken dash.

Since the two-em dash actually exists in Western typography, it comes as a shock that Unicode has not defined it, especially when it could have served as a close approximation of the Chinese dash.

Of course, to be really correct, Unicode should have defined two such dashes, since the fullwidth space measures differently depending on writing direction. However, the lack of even a two-em dash shows that the Unicode Consortium never understood how a dash is typeset in the Chinese language.