How do I recognize where Thai words begin and end?
The Thai language does not use spaces between the words in a sentence. This poses unique challenges for the beginning
student whose first language does utilize word spacing, but with practice, reading Thai becomes second nature.
If you reflect on how you read English, for example, you might be surprised to learn that we recognize words as a whole,
rather than by parsing the individual letters from which they are composed. Doing the latter would make reading much slower!
Rest assured that the same will hold for Thai. Like a first-grader, you begin by following each individual letter, but once
you can read at a more advanced grade-school level, you will recognize entire words, and you won't even notice that
there are no spaces between them.
Because Thai is "designed" to be written without spaces between the words, it will be easier to read Thai than it is to read this:
Englishwritingprovidesfewercluesaboutwordbreaking.
That's because Thai has rules such as the following:
- The preposed vowels (เ แ โ ใ ไ) start a syllable.
- ะ ends a syllable, unless it is followed by a consonant bearing the symbol อ์, as in the word เคราะห์. Such exceptions are rare. (The symbol อ์ is called การันต์ /gaaM ranM/, or gaaran.)
- Except for European loan words (such as กอล์ฟ /gaawpH/), gaaran ends a syllable.
- A syllable starting with ใ or ไ is an open syllable.
- อั and อ็ do not appear over a syllable-final consonant.
- Sometimes two consonants form an initial cluster together; a tone mark, if any, will appear on the second consonant of such a cluster.
- อำ ends a syllable.
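Several of the rules above are mechanical enough to sketch in code. The following Python fragment is only a rough illustration, not a real segmenter: it applies the rules for preposed vowels, ะ, gaaran, and อำ, and returns the positions where a syllable boundary is suggested. The function names (boundary_hints, split_at_hints) are invented for this example, and treating the code points from ก to ฮ as "consonants" is a simplifying assumption. The sketch has no lexicon, so it cannot detect the loan-word exception (it will wrongly cut กอล์ฟ after the gaaran), and it ignores the rules about open syllables, อั and อ็, and initial clusters. It also finds syllable boundaries, not word boundaries, and it will miss many of them.

```python
# Heuristic syllable-boundary hints for Thai text (illustrative sketch only).
PREPOSED_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")  # เ แ โ ใ ไ
SARA_A = "\u0e30"   # ะ
SARA_AM = "\u0e33"  # ำ
GAARAN = "\u0e4c"   # ์ (the mark written above a silenced consonant)


def is_consonant(ch):
    # Assumption: anything from ก (U+0E01) to ฮ (U+0E2E) counts as a consonant.
    return "\u0e01" <= ch <= "\u0e2e"


def boundary_hints(text):
    """Return sorted indices i where a syllable break is suggested before text[i]."""
    hints = set()
    n = len(text)
    for i, ch in enumerate(text):
        if ch in PREPOSED_VOWELS:
            # A preposed vowel starts a syllable.
            hints.add(i)
        elif ch == SARA_A:
            # ะ ends a syllable unless the next consonant carries gaaran.
            silenced = (
                i + 2 < n
                and is_consonant(text[i + 1])
                and text[i + 2] == GAARAN
            )
            if not silenced:
                hints.add(i + 1)
        elif ch in (SARA_AM, GAARAN):
            # ำ and gaaran end a syllable (loan-word exceptions are not handled).
            hints.add(i + 1)
    # Positions at the very start or end of the text carry no information.
    return sorted(h for h in hints if 0 < h < n)


def split_at_hints(text):
    """Cut the text at every hinted position."""
    cuts = [0] + boundary_hints(text) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


print(split_at_hints("\u0e08\u0e30\u0e44\u0e1b\u0e41\u0e25\u0e49\u0e27"))  # จะไปแล้ว
```

On จะไปแล้ว the hints fall after ะ and before ไ and แ, giving จะ, ไป, and แล้ว. On เคราะห์ no break is suggested after ะ, because the following ห carries the gaaran.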
These rules go a long way, and will be helpful until you start finding word boundaries by recognizing whole words. Having said all this, Thai
does use spaces in some cases. See Bryan's article, Spacing in the Thai Language, for more information.
Thai Word- and Sentence-Breaking
Because Thai doesn't use spaces between words, the task of automatically separating Thai text into words has been a long-standing
challenge in the field of computational linguistics. A further challenge is to identify sentence boundaries in Thai text, because, as
Bryan's article points out, Thai also uses space for various functions within a sentence, so a given space may or may not indicate
the end of a sentence. For more information on these interesting
problems, you may be interested in the following academic papers I've (co-)authored on the subject:
Glenn Slayden, Mei-Yuh Hwang, and Lee Schwartz. 2010. Thai Sentence-Breaking for Large-Scale SMT.
Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing,
Beijing, China, August 2010, pp. 8-16. COLING 2010 Organizing Committee.
This article is by Glenn Slayden, based on material by Richard Wordingham.
Last updated: November 5, 2010