Thai Letters and Unicode Extended Grapheme Clusters

Vowel & consonant graphemes (letters), syllables, and orthography

Moderator: acloudmovingby

Postby Jim Monty » Mon Jul 28, 2014 5:08 pm

The following sequence of characters (four Unicode code points)…

    เพื่

    U+0E40 THAI CHARACTER SARA E
    U+0E1E THAI CHARACTER PHO PHAN
    U+0E37 THAI CHARACTER SARA UEE
    U+0E48 THAI CHARACTER MAI EK

…constitutes one Unicode extended grapheme cluster. Is it just one Thai letter? In general, are Unicode extended grapheme clusters equivalent to Thai letters and vice versa?
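
For readers who want to inspect the sequence themselves, here is a minimal Python sketch (not part of the original post) that lists the four code points with the standard `unicodedata` module. The variable name `cluster` is only illustrative.

```python
import unicodedata

# The four code points from the post, written as escapes so the order is explicit.
cluster = "\u0e40\u0e1e\u0e37\u0e48"

# Print each code point with its official Unicode character name.
for ch in cluster:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# len() counts code points, not grapheme clusters or letters.
print(len(cluster))
```

Note that `len()` reports 4 here regardless of how the text is later segmented; the grapheme question is separate from the code point count.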

I'm a computer programmer. I don't speak or read Thai.

Thank you very much for your kind response to my inquiry.

Jim Monty
Jim Monty
 
Posts: 5
Joined: Mon Jul 28, 2014 4:44 pm

Re: Thai Letters and Unicode Extended Grapheme Clusters

Postby DonSena » Mon Jul 28, 2014 9:59 pm

Jim Monty wrote:The following sequence of characters (four Unicode code points)…

    เพื่

    U+0E40 THAI CHARACTER SARA E
    U+0E1E THAI CHARACTER PHO PHAN
    U+0E37 THAI CHARACTER SARA UEE
    U+0E48 THAI CHARACTER MAI EK

…constitutes one Unicode extended grapheme cluster. Is it just one Thai letter? In general, are Unicode extended grapheme clusters equivalent to Thai letters and vice versa?

I'm a computer programmer. I don't speak or read Thai.

Thank you very much for your kind response to my inquiry.

Jim Monty


U+0E1E => พ is a standalone in the sense that it need not occur with other characters to define a component of sound in Thai. In syllable-initial position พ is /ph/ (voiceless aspirated bilabial stop), and in final position simply /p/.

Thus, พา /phaa/ ‘to take (along), to guide’; พูด /phûut/ ‘to speak, to say (as a literal utterance)’; พิมพ์ /phim/ ‘to print, type’; สะพาน /saphaan/ ‘bridge’

U+0E40 => เ /ee/ (as in the girl’s name ‘Renee’), a non-standalone that is written before the consonant it actually *follows* in pronunciation.

Thus, เท /thee/ ‘to pour out; slanting’; เขน /khěen/ ‘shield, buckler’; เสน่ห์ /sanèe/ ‘charm(s), spell’

U+0E37 => อื (the upper character only), a non-standalone that is written above the consonant it actually *follows* in pronunciation. Use of this character อื /yy/ (lower-high central, like the ‘i’ in US English ‘robin’) requires a final consonant.

Thus, มืด /mŷyt/ ‘dark, obscure’; หรือ /ry̌y/ ‘or’; คือ /khyy/ ‘that is to say, (literally) is’; ดื้อ /dŷy/ ‘headstrong’

Not shown in your listing is อ, which, in addition to being a mute consonant that makes possible syllables beginning with a vowel, is also a vowel symbol used in specific combinations with other symbols to denote long and short vowels and vowel diphthongs.

เพื่อ /phŷa/ ‘for the purpose of, in order to’

Here, the เ, อื (upper character only) and อ *together*, as one larger unit, denote the diphthong /ya/.

Likewise, เนื้อ /nýa/ ‘meat, flesh’; เลือด /lỳad/ ‘blood’; เสื่อม /sỳam/ ‘to decline, deteriorate, wear out’
In each of these, the เ, อื and อ => /ya/

U+0E48 => อ่ (the upper mark only), a non-standalone that denotes the tone (pitch contour of the voice). With *certain* consonants like พ, it denotes a falling tone:
เพื่อ /phŷa/ ‘for the purpose of, in order to’, in which /^/ is a fairly common transcription for the falling tone.
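
As a side note for programmers, Unicode's own classification of these four characters can be read with Python's `unicodedata` module. This is only a sketch: the General_Category values are a different classification from the standalone/non-standalone distinction used above, and the dictionary name `categories` is hypothetical.

```python
import unicodedata

# พ, เ, and the two marks above the consonant, in the order discussed above.
chars = "\u0e1e\u0e40\u0e37\u0e48"

# Map each character to its Unicode General_Category (Lo = other letter, Mn = nonspacing mark).
categories = {ch: unicodedata.category(ch) for ch in chars}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

The two marks written above the consonant are nonspacing marks (Mn), while both พ and the preposed vowel เ are classed as letters (Lo), so Unicode's letter category does not line up exactly with "standalone".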
DonSena
 
Posts: 1152
Joined: Sun Sep 12, 2010 2:47 am
Location: รัฐ อาริโซน่า

Re: Thai Letters and Unicode Extended Grapheme Clusters

Postby Jim Monty » Mon Jul 28, 2014 11:26 pm

Thank you for your detailed explanation, Don. I appreciate it.

You indicated U+0E1E (พ) is a standalone character and U+0E40 (เ), U+0E37, and U+0E48 are not standalone characters. So is it correct, then, to say that the Unicode grapheme cluster เพื่ (U+0E40 U+0E1E U+0E37 U+0E48) is a single Thai letter?

I want to understand the relationship of Unicode grapheme clusters to what a Thai reader would think of as a single Thai letter. I used เพื่ as an example because I happen to know it comprises a single Unicode grapheme cluster, and because it has multiple Unicode code points in it.

In the Thai script, are Thai letters essentially equivalent to Unicode grapheme clusters? When Thai text is segmented into Unicode grapheme clusters, are the results individual Thai letters?

Jim Monty
Jim Monty
 
Posts: 5
Joined: Mon Jul 28, 2014 4:44 pm

Re: Thai Letters and Unicode Extended Grapheme Clusters

Postby DonSena » Tue Jul 29, 2014 2:46 am

Jim Monty wrote:Thank you for your detailed explanation, Don. I appreciate it.

You indicated U+0E1E (พ) is a standalone character and U+0E40 (เ), U+0E37, and U+0E48 are not standalone characters. So is it correct, then, to say that the Unicode grapheme cluster เพื่ (U+0E40 U+0E1E U+0E37 U+0E48) is a single Thai letter?

I want to understand the relationship of Unicode grapheme clusters to what a Thai reader would think of as a single Thai letter. I used เพื่ as an example because I happen to know it comprises a single Unicode grapheme cluster, and because it has multiple Unicode code points in it.

In the Thai script, are Thai letters essentially equivalent to Unicode grapheme clusters? When Thai text is segmented into Unicode grapheme clusters, are the results individual Thai letters? Jim Monty


เพื่ is not correct. This combination is not valid.

เพื่อ is valid. It combines the three characters เ, อื and อ to express the diphthong /ya/. Thus, เอือ /ya/, เพือ /phya/ and เพื่อ /phŷa/. The last of these is an actual word in Thai.

The three characters that combine as a single larger unit are placed around a consonant -- before it, over it and after it -- as shown. The consonant is sounded, then the /ya/, which is indicated by these same three vowel characters.

อ can be used as a mute consonant to make possible a syllable that begins with the vowel. Thus, เอือ /ya/.
If, instead, the written consonant is พ, then we get เพือ /phya/. Notice that the พ replaces the (first) อ in เอือ, making เพือ. With the tone mark, เพื่อ /phŷa/ results.

You can consider the three characters before, above and after the consonant in เพือ to be a grapheme cluster, but เพือ is not such a cluster. The พ can be replaced by other consonants. Look again at the examples I gave in my first post.
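
The "vowel wrapped around a consonant" idea can be sketched in a few lines of Python. This is an illustration only: the helper name `write_ya` is hypothetical, and it handles just the open-syllable pattern described above (เ before, อื above, อ after, with an optional tone mark).

```python
SARA_E = "\u0e40"    # เ, written before the consonant
SARA_UEE = "\u0e37"  # the mark written above the consonant
O_ANG = "\u0e2d"     # อ, written after the consonant

def write_ya(consonant: str, tone_mark: str = "") -> str:
    """Place the three /ya/ vowel characters around a consonant, in stored order."""
    return SARA_E + consonant + SARA_UEE + tone_mark + O_ANG

print(write_ya("\u0e2d"))            # เอือ (mute consonant อ)
print(write_ya("\u0e1e"))            # เพือ
print(write_ya("\u0e1e", "\u0e48"))  # เพื่อ (with MAI EK)
```

Swapping the consonant argument leaves the three vowel characters untouched, which is the sense in which they form one unit that is independent of the consonant.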
DonSena
 
Posts: 1152
Joined: Sun Sep 12, 2010 2:47 am
Location: รัฐ อาริโซน่า

Re: Thai Letters and Unicode Extended Grapheme Clusters

Postby Jim Monty » Tue Jul 29, 2014 3:42 am

Thank you again for your detailed response, Don.

Here's the original Thai phrase from which I excerpted the Unicode grapheme cluster เพื่:

    ทั้งนี้เพื่อเป็นการปกป้องการรักษาความลับและข้อ

(N.B. เพื่อ is two successive Unicode grapheme clusters.)

This phrase consists of 40 Unicode code points. Here are the results of segmenting this string of 40 Unicode code points into 24 grapheme clusters according to Unicode Standard Annex #29: Unicode Text Segmentation:

    ทั้ 3 U+0E17 THAI CHARACTER THO THAHAN, U+0E31 THAI CHARACTER MAI HAN-AKAT, U+0E49 THAI CHARACTER MAI THO
    ง 1 U+0E07 THAI CHARACTER NGO NGU
    นี้ 3 U+0E19 THAI CHARACTER NO NU, U+0E35 THAI CHARACTER SARA II, U+0E49 THAI CHARACTER MAI THO
    เพื่ 4 U+0E40 THAI CHARACTER SARA E, U+0E1E THAI CHARACTER PHO PHAN, U+0E37 THAI CHARACTER SARA UEE, U+0E48 THAI CHARACTER MAI EK
    อ 1 U+0E2D THAI CHARACTER O ANG
    เป็ 3 U+0E40 THAI CHARACTER SARA E, U+0E1B THAI CHARACTER PO PLA, U+0E47 THAI CHARACTER MAITAIKHU
    น 1 U+0E19 THAI CHARACTER NO NU
    กา 2 U+0E01 THAI CHARACTER KO KAI, U+0E32 THAI CHARACTER SARA AA
    ร 1 U+0E23 THAI CHARACTER RO RUA
    ป 1 U+0E1B THAI CHARACTER PO PLA
    ก 1 U+0E01 THAI CHARACTER KO KAI
    ป้ 2 U+0E1B THAI CHARACTER PO PLA, U+0E49 THAI CHARACTER MAI THO
    อ 1 U+0E2D THAI CHARACTER O ANG
    ง 1 U+0E07 THAI CHARACTER NGO NGU
    กา 2 U+0E01 THAI CHARACTER KO KAI, U+0E32 THAI CHARACTER SARA AA
    ร 1 U+0E23 THAI CHARACTER RO RUA
    รั 2 U+0E23 THAI CHARACTER RO RUA, U+0E31 THAI CHARACTER MAI HAN-AKAT
    ก 1 U+0E01 THAI CHARACTER KO KAI
    ษา 2 U+0E29 THAI CHARACTER SO RUSI, U+0E32 THAI CHARACTER SARA AA
    ค 1 U+0E04 THAI CHARACTER KHO KHWAI
    วา 2 U+0E27 THAI CHARACTER WO WAEN, U+0E32 THAI CHARACTER SARA AA
    ม 1 U+0E21 THAI CHARACTER MO MA
    ลั 2 U+0E25 THAI CHARACTER LO LING, U+0E31 THAI CHARACTER MAI HAN-AKAT
    บ 1 U+0E1A THAI CHARACTER BO BAIMAI
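
A rough way to reproduce this list without Perl is sketched below in Python, using only the standard library. It is not an implementation of UAX #29. It hard-codes three assumptions that happen to match the segmentation shown here: nonspacing marks attach to the preceding character, SARA AA (U+0E32) attaches to the preceding character, and a leading vowel attaches to the following character. The function name `legacy_thai_clusters` and the choice of U+0E40 to U+0E44 as the leading vowels are hypothetical; only SARA E occurs in this sample. `TEXT` is the first 40 code points of the phrase (through ลับ), which is the portion the list covers.

```python
import unicodedata

# First 40 code points of the phrase, i.e. the part that the list above segments.
TEXT = "ทั้งนี้เพื่อเป็นการปกป้องการรักษาความลับ"

# Assumed set of preposed vowels (U+0E40..U+0E44); only U+0E40 appears in TEXT.
LEADING_VOWELS = {chr(cp) for cp in range(0x0E40, 0x0E45)}
SARA_AA = "\u0e32"

def legacy_thai_clusters(text):
    """Group code points the way the list above does (simplified sketch)."""
    clusters = []
    join_next = False  # True immediately after a leading vowel
    for ch in text:
        is_mark = unicodedata.category(ch) == "Mn"
        if clusters and (join_next or is_mark or ch == SARA_AA):
            clusters[-1] += ch  # attach to the current cluster
        else:
            clusters.append(ch)  # start a new cluster
        join_next = ch in LEADING_VOWELS
    return clusters

clusters = legacy_thai_clusters(TEXT)
print(len(TEXT), len(clusters))
for c in clusters:
    print(c, len(c), " ".join(f"U+{ord(ch):04X}" for ch in c))
```

Under those assumptions the sketch should yield the same 24 groups for the 40 code points as the list above; it makes no claim about other Thai text.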


My question is simply this: Are these 24 things Thai letters? If they're not letters, then what are they, and why does Unicode segment the text this way rather than into Thai letters?

I'm not trying to learn the Thai language or anything about Thai linguistics. I just want to know if these things that Unicode calls grapheme clusters are Thai letters or not.

Jim Monty
Jim Monty
 
Posts: 5
Joined: Mon Jul 28, 2014 4:44 pm

Re: Thai Letters and Unicode Extended Grapheme Clusters

Postby DonSena » Tue Jul 29, 2014 5:24 am

Jim Monty wrote:Thank you again for your detailed response, Don.

Here's the original Thai phrase from which I excerpted the Unicode grapheme cluster เพื่:

    ทั้งนี้เพื่อเป็นการปกป้องการรักษาความลับและข้อ

(N.B. เพื่อ is two successive Unicode grapheme clusters.)

This phrase consists of 40 Unicode code points. Here are the results of segmenting this string of 40 Unicode code points into 24 grapheme clusters according to Unicode Standard Annex #29: Unicode Text Segmentation:

    ทั้ 3 U+0E17 THAI CHARACTER THO THAHAN, U+0E31 THAI CHARACTER MAI HAN-AKAT, U+0E49 THAI CHARACTER MAI THO
    ง 1 U+0E07 THAI CHARACTER NGO NGU
    นี้ 3 U+0E19 THAI CHARACTER NO NU, U+0E35 THAI CHARACTER SARA II, U+0E49 THAI CHARACTER MAI THO
    เพื่ 4 U+0E40 THAI CHARACTER SARA E, U+0E1E THAI CHARACTER PHO PHAN, U+0E37 THAI CHARACTER SARA UEE, U+0E48 THAI CHARACTER MAI EK
    อ 1 U+0E2D THAI CHARACTER O ANG
    เป็ 3 U+0E40 THAI CHARACTER SARA E, U+0E1B THAI CHARACTER PO PLA, U+0E47 THAI CHARACTER MAITAIKHU
    น 1 U+0E19 THAI CHARACTER NO NU
    กา 2 U+0E01 THAI CHARACTER KO KAI, U+0E32 THAI CHARACTER SARA AA
    ร 1 U+0E23 THAI CHARACTER RO RUA
    ป 1 U+0E1B THAI CHARACTER PO PLA
    ก 1 U+0E01 THAI CHARACTER KO KAI
    ป้ 2 U+0E1B THAI CHARACTER PO PLA, U+0E49 THAI CHARACTER MAI THO
    อ 1 U+0E2D THAI CHARACTER O ANG
    ง 1 U+0E07 THAI CHARACTER NGO NGU
    กา 2 U+0E01 THAI CHARACTER KO KAI, U+0E32 THAI CHARACTER SARA AA
    ร 1 U+0E23 THAI CHARACTER RO RUA
    รั 2 U+0E23 THAI CHARACTER RO RUA, U+0E31 THAI CHARACTER MAI HAN-AKAT
    ก 1 U+0E01 THAI CHARACTER KO KAI
    ษา 2 U+0E29 THAI CHARACTER SO RUSI, U+0E32 THAI CHARACTER SARA AA
    ค 1 U+0E04 THAI CHARACTER KHO KHWAI
    วา 2 U+0E27 THAI CHARACTER WO WAEN, U+0E32 THAI CHARACTER SARA AA
    ม 1 U+0E21 THAI CHARACTER MO MA
    ลั 2 U+0E25 THAI CHARACTER LO LING, U+0E31 THAI CHARACTER MAI HAN-AKAT
    บ 1 U+0E1A THAI CHARACTER BO BAIMAI


My question is simply this: Are these 24 things Thai letters? If they're not letters, then what are they, and why does Unicode segment the text this way rather than into Thai letters?

I'm not trying to learn the Thai language or anything about Thai linguistics. I just want to know if these things that Unicode calls grapheme clusters are Thai letters or not.

Jim Monty

They are combinations of one or more "letters." Where, in the list, you see just one character, it is a standalone, a letter that does not need to be associated with another letter. Unicode apparently provides all possible combinations of non-standalones with one standalone that correspond to the formation of Thai syllables.
Even though เพื่ is not a legitimate Thai syllable, it does combine three non-standalones (one to the left of and two above the standalone พ) with the one standalone. The addition of another standalone, อ, makes a complete Thai syllable: เพื่อ.
Each of the 24 contains one or more letters.
The 24 combinations apply to the sentence you copied. Each one of the twenty-four contains exactly one standalone character.
DonSena
 
Posts: 1152
Joined: Sun Sep 12, 2010 2:47 am
Location: รัฐ อาริโซน่า

Re: Thai Letters and Unicode Extended Grapheme Clusters

Postby Jim Monty » Tue Jul 29, 2014 6:07 pm

Thank you once again, Don, for your thoughtful response. I appreciate it.

DonSena wrote:They are combinations of one or more "letters." Where, in the list, you see just one character, it is a standalone, a letter that does not need to be associated with another letter. The unicode apparently provides all possible combinations of non-standalones with one standalone that correspond to the formation of Thai syllables.
Even though เพื่ is not a legitimate Thai syllable, it does combine three non-standalones (one to the left of and two above the standalone พ) with the one standalone. The addition of another standalone, อ, makes a complete Thai syllable: เพื่อ.
Each of the 24 contains one or more letters.
The 24 combinations apply to the sentence you copied. Each one of the twenty-four contains exactly one standalone character.


So I think we've converged on a satisfactory answer. What Unicode calls a grapheme cluster is a combination of one standalone Thai letter and zero or more Thai letters or marks that do not stand alone. For Thai text, it's an incorrect oversimplification to say that a Unicode grapheme cluster is a letter.

Thai syllables are composed of one or more Unicode grapheme clusters.
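
The "one standalone plus zero or more non-standalones" summary can be checked mechanically against the 24 clusters listed earlier in the thread. The Python sketch below is only a sanity check under one assumption: "standalone" is approximated as any code point in the Thai consonant range U+0E01 to U+0E2E (the helper `is_consonant` is hypothetical, and that range also includes the vowel-like letters ฤ and ฦ, which do not occur here).

```python
# The 24 clusters from the segmentation list earlier in the thread.
CLUSTERS = [
    "ทั้", "ง", "นี้", "เพื่", "อ", "เป็", "น", "กา",
    "ร", "ป", "ก", "ป้", "อ", "ง", "กา", "ร",
    "รั", "ก", "ษา", "ค", "วา", "ม", "ลั", "บ",
]

def is_consonant(ch: str) -> bool:
    """Approximate 'standalone' as the consonant range U+0E01..U+0E2E (assumption)."""
    return "\u0e01" <= ch <= "\u0e2e"

# Number of consonants found in each cluster.
counts = [sum(is_consonant(ch) for ch in c) for c in CLUSTERS]

print(len(CLUSTERS), sum(len(c) for c in CLUSTERS))
print(all(n == 1 for n in counts))
```

If every count is 1, each cluster in this particular sample has exactly one consonant, which is consistent with the summary above but does not prove it for Thai text in general.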

Jim Monty
Jim Monty
 
Posts: 5
Joined: Mon Jul 28, 2014 4:44 pm

Re: Thai Letters and Unicode Extended Grapheme Clusters

Postby DonSena » Tue Jul 29, 2014 7:38 pm

Jim Monty wrote:Thank you once again, Don, for your thoughtful response. I appreciate it.

DonSena wrote:They are combinations of one or more "letters." Where, in the list, you see just one character, it is a standalone, a letter that does not need to be associated with another letter. The unicode apparently provides all possible combinations of non-standalones with one standalone that correspond to the formation of Thai syllables.
Even though เพื่ is not a legitimate Thai syllable, it does combine three non-standalones (one to the left of and two above the standalone พ) with the one standalone. The addition of another standalone, อ, makes a complete Thai syllable: เพื่อ.
Each of the 24 contains one or more letters.
The 24 combinations apply to the sentence you copied. Each one of the twenty-four contains exactly one standalone character.


So I think we've converged on a satisfactory answer. What Unicode calls a grapheme cluster is a combination of one standalone Thai letter and zero or more Thai letters or marks that do not stand alone. For Thai text, it's an incorrect oversimplification to say that a Unicode grapheme cluster is a letter.

Thai syllables are composed of one or more Unicode grapheme clusters.

Jim Monty


Precisely so.
DonSena
 
Posts: 1152
Joined: Sun Sep 12, 2010 2:47 am
Location: รัฐ อาริโซน่า

Re: Thai Letters and Unicode Extended Grapheme Clusters

Postby Richard Wordingham » Fri Aug 01, 2014 11:20 pm

เพื่ <U+0E40 THAI CHARACTER SARA E, U+0E1E THAI CHARACTER PHO PHAN, U+0E37 THAI CHARACTER SARA UEE, U+0E48 THAI CHARACTER MAI EK> is no longer an extended grapheme cluster. The kick in the teeth of giving U+0E40 THAI CHARACTER SARA E the value Prepend for the property Grapheme_Cluster_Break in Unicode 5.1 raised such howls of protest that it was withdrawn in Unicode 6.1. The same applies to giving U+0E32 THAI CHARACTER SARA AA the value Extend for this property. It is now two extended grapheme clusters, <U+0E40> and <U+0E1E, U+0E37, U+0E48>. Likewise, กา <U+0E01 THAI CHARACTER KO KAI, U+0E32 THAI CHARACTER SARA AA> is now two extended grapheme clusters.
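
The effect of the property change can be illustrated with a tiny table-driven sketch in Python. The two dictionaries hold Grapheme_Cluster_Break values only for the characters discussed here, as this post describes them; they are not read from the Unicode data files, and the names `OLD`, `NEW` and `split_clusters` are hypothetical. The function applies just two simplified rules: do not break before an Extend character, and do not break after a Prepend character.

```python
# Property values as described in the post (older versions vs. Unicode 6.1 and later).
OLD = {"\u0e40": "Prepend", "\u0e32": "Extend", "\u0e37": "Extend", "\u0e48": "Extend"}
NEW = {"\u0e37": "Extend", "\u0e48": "Extend"}

def split_clusters(text, gcb):
    """Split text using two simplified rules: no break before Extend, none after Prepend."""
    clusters = []
    for ch in text:
        prev = clusters[-1][-1] if clusters else None
        no_break = prev is not None and (
            gcb.get(ch) == "Extend" or gcb.get(prev) == "Prepend"
        )
        if no_break:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

pheua = "\u0e40\u0e1e\u0e37\u0e48"  # SARA E, PHO PHAN, SARA UEE, MAI EK
kaa = "\u0e01\u0e32"                # KO KAI, SARA AA

print(split_clusters(pheua, OLD), split_clusters(pheua, NEW))
print(split_clusters(kaa, OLD), split_clusters(kaa, NEW))
```

With the older values each sequence stays in one piece; with the newer values each splits in two, matching the description above.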

The problem is that grapheme clusters do not in general correspond to what users consider characters, let alone letters. While this may be so for Swedish (less sure for French, and even less so for German) and Tamil, it clearly did not work for Thai. Indeed, U+0E40 has been a 'letter' in Unicode since Unicode 1.1. Moreover, Thai justification treats it as such. Thai justification inserts spaces between letters, and the preposed and postposed vowel symbols are treated in the same way as consonants. They are also treated the same way as consonants in crosswords, though Thai crosswords are very much a minority interest. Note also that Indic scripts make a distinction between consonants and marks; the prototypical marks are the vowel symbols.

I'm not convinced that U+0E33 THAI CHARACTER SARA AM should have the value 'Extend' for the property. Unfortunately I haven't been able to do any field research, looking at stretched names in Thai streets. U+0E33 is a bit of a problem area.

There is a current trend for editing software to prevent users editing grapheme clusters. This is an anti-social nuisance, which is why I refer to the introduction of 'Prepend' as a kick in the teeth. While retyping the whole of a acute when one should have typed e acute may be tolerable, it is far less pleasant to have to retype all of พื่ <U+0E1E THAI CHARACTER PHO PHAN, U+0E37 THAI CHARACTER SARA UEE, U+0E48 THAI CHARACTER MAI EK> if one accidentally mistypes it as ปื่. It's even worse if one has five combining marks, as sometimes happens in the Lanna script (Unicode: TAI THAM). Thais were infuriated when they lost the ability to edit the preposed vowel symbols - I am genuinely surprised they apparently don't fume at being unable to edit clusters like พื่. The whole business makes me so angry that I suspect I will wind up making a post that gets me expelled from the Unicode Consortium. (I'm an individual member, so I don't get to vote.)
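
To make the editing complaint concrete, here is a small Python sketch contrasting the two kinds of correction. It is an illustration only and does not model any particular editor; the variable names are hypothetical.

```python
typed = "\u0e1b\u0e37\u0e48"     # PO PLA + SARA UEE + MAI EK (the mistyped cluster)
intended = "\u0e1e\u0e37\u0e48"  # PHO PHAN + SARA UEE + MAI EK

# Code-point-level edit: replace only the base consonant and keep both marks.
fixed = "\u0e1e" + typed[1:]

# Cluster-level edit: the whole cluster goes, so all three code points are retyped.
retyped = typed[:-3] + intended

print(fixed == intended, retyped == intended)
print(len(typed[1:]), "marks kept by the code-point-level edit")
```

Both routes reach the same text; the difference is how many keystrokes the user has to repeat.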
Richard Wordingham
 
Posts: 1294
Joined: Mon Feb 14, 2005 12:00 am
Location: Stevenage, England

Re: Thai Letters and Unicode Extended Grapheme Clusters

Postby Jim Monty » Mon Aug 04, 2014 3:23 am

Thank you very much for your thorough explanation, Richard. I appreciate it.

Richard Wordingham wrote:เพื่ <U+0E40 THAI CHARACTER SARA E, U+0E1E THAI CHARACTER PHO PHAN, U+0E37 THAI CHARACTER SARA UEE, U+0E48 THAI CHARACTER MAI EK> is no longer an extended grapheme cluster. The kick in the teeth of giving U+0E40 THAI CHARACTER SARA E the value Prepend for the property Grapheme_Cluster_Break in Unicode 5.1 raised such howls of protest that it was withdrawn in Unicode 6.1.


I parsed the Thai text into these purported extended grapheme clusters using Perl 5.16 and its EGC regular expression construct '\X'. Perl 5.16 purports to support Unicode 6.1. The perl5160delta perldoc page states: "Supports (almost) Unicode 6.1. ... and because of other changes in 6.1, the Perl regular expression construct '\X' now works differently for some characters in Thai and Lao." Hmm, interesting. It's just not interesting enough for me to want to research it.

I got the answer to my question thanks to you and Don Sena: Don't use Thai text in an example to demonstrate to a layperson why the length of text measured in graphemes is sometimes shorter than the length of the same text measured in code points. Use Swedish text instead.
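
A Swedish example of that kind might look like the Python sketch below. It assumes the text is in decomposed form (NFD), and it approximates the grapheme count by counting code points that are not combining marks, which is adequate for this word but is not a general grapheme segmenter.

```python
import unicodedata

# Decompose so that ö and å each become a base letter plus a combining mark.
word = unicodedata.normalize("NFD", "smörgåsbord")

code_points = len(word)
# Approximate graphemes as the code points that are not combining marks.
graphemes = sum(1 for ch in word if unicodedata.combining(ch) == 0)

print(code_points, graphemes)  # 13 11
```

Here the 11 graphemes correspond to what a reader would call the 11 letters of the word, which is the correspondence that does not hold for the Thai sample.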

Richard Wordingham wrote:The whole business makes me so angry that I suspect I will wind up making a post that gets me expelled from the Unicode Consortium. (I'm an individual member, so I don't get to vote.)


Good luck with that! (I'm also an individual member of the Unicode Consortium. If I could, I'd vote against your expulsion.)

Jim Monty
Jim Monty
 
Posts: 5
Joined: Mon Jul 28, 2014 4:44 pm



Copyright © 2024 thai-language.com. Portions copyright © by original authors, rights reserved, used by permission; Portions 17 USC §107.