Kanjis to Romajis first letter
I'm fond of Unicode and of attempts at romanization. My goal is to get a first Latin letter (for sorting purposes) for a non-Latin character. So far I have succeeded with Cyrillic, Greek, Hebrew, Katakana, Hiragana, Hangul (including all its syllables), Berber, Thai and Arabic letters by assigning the most appropriate starting letter to each case.
I also know that multiple systems for transliteration, transcription and romanization exist. So far their differences are almost irrelevant for my needs. I'm not fond of Japanese itself; at most I might be able to recognize English terms written in katakana.
My problem is: how can I assign a letter to the Unicode code points U+4E00 thru U+9FFF by algorithm? For Hangul syllables this is quite easy: U+AC00 thru U+B097 => K (as all of them start with that); U+B098 thru U+B2E3 => N. I've looked at JS solutions like https://github.com/hexenq/kuroshiro/ and https://github.com/WaniKani/WanaKana/, but I only find the code for processing hiragana and katakana (which I already have), never kanji (although all of their demos succeed in processing them).
Is there a table or dictionary? If romanization of kanji is achieved by first converting each kanji into katakana, then how is that done?
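For reference, the Hangul ranges above follow from the layout of the precomposed syllable block: U+AC00 thru U+D7A3 holds 19 initial consonants x 21 vowels x 28 finals, so each initial covers 588 consecutive code points. Below is a minimal Python sketch of that arithmetic. The function name hangul_first_letter and both letter tables are my own hypothetical choices (loosely McCune-Reischauer, picked to match the K and N buckets above), not something taken from a library.

```python
HANGUL_BASE = 0xAC00      # first precomposed Hangul syllable
HANGUL_LAST = 0xD7A3      # last precomposed Hangul syllable
PER_INITIAL = 21 * 28     # 588 syllables share each initial consonant
PER_VOWEL = 28            # 28 syllables share each initial + vowel pair

# Assumed first letters for the 19 initials, in Unicode order.
# "?" marks the silent initial, where the vowel decides the letter.
INITIAL_LETTERS = "KKNTTRMPPSS?CTCKTPH"
# Assumed first letters for the 21 vowels, in Unicode order.
VOWEL_LETTERS = "AAYYOEYYOWWOYUWWWYUUI"


def hangul_first_letter(ch):
    """Return a Latin sorting letter for one Hangul syllable, else None."""
    cp = ord(ch)
    if not HANGUL_BASE <= cp <= HANGUL_LAST:
        return None
    offset = cp - HANGUL_BASE
    letter = INITIAL_LETTERS[offset // PER_INITIAL]
    if letter == "?":
        # Silent initial: fall back to the vowel's first letter.
        letter = VOWEL_LETTERS[(offset % PER_INITIAL) // PER_VOWEL]
    return letter
```

Nothing comparable works for U+4E00 thru U+9FFF, because that block is not ordered by pronunciation; it needs a lookup table.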
kanji rōmaji unicode
This question is probably off topic at this forum but for what it's worth, kanji in Japanese have multiple pronunciations and you will not be able to assign definitive pronunciation to most of them. The best you could do is find a downloadable kanji database and choose a pronunciation (it will probably be in hiragana) arbitrarily. Perhaps the database will also include some form of probability that you could use to your advantage.
– G-Cam
1 hour ago
I'm curious, why? Granted, you can assign a Latin letter to each kanji, but as others have noted, that is mostly an arbitrary process. I'm left scratching my head as to what use this would be. I suppose as a general coding project, it might be a fun puzzle, but as to final utility, I'm baffled.
– Eiríkr Útlendi
1 hour ago
The goal is to have both "Takkyu" and "卓球" listed under "T", instead of having all non-Latin words/names listed under "#", which appeals to users who would rather deal with Latin letters. It may not make sense for kanji/CJK, but it does for e.g. Hangul and Cyrillic ("Yulia" and "Юлия" both under "Y"). Most probably I'll realize it makes too little sense for all CJK ideographs.
– AmigoJack
38 mins ago
@AmigoJack Wait, so you're trying to deal with words (卓球) rather than characters (卓)? Then the character-based approach described in my answer will make almost no sense in Japanese. One kanji can have many readings, and Unihan_Readings.txt is almost useless for determining the reading of an individual word. For example, 生命 is Seimei, 生地 is Kiji, 生卵 is Namatamago and 生霊 is Ikiryo. What you may need is a morphological analyzer introduced here.
– naruto
18 mins ago
asked 3 hours ago by AmigoJack
1 Answer
CJK unified ideographs are sorted by radical, not by reading, because there is no single way to determine how these characters are read. The Unicode Consortium provides the Unihan Database, which lists representative readings of CJK ideographs in plain Latin letters. For example, here is the result for a very basic ideograph 日 (U+65E5; "day", "date", "sun", etc):
The table says the "first roman letter of 日" is J in Cantonese, R in Mandarin, H, K, N or J in Japanese, and I in Korean. To understand what's going on here, please keep in mind that the 'CJK Unified Ideographs' block has characters used in Chinese, Korean and Japanese jumbled together. Each character is read differently in different languages. In Japanese especially, one character can have many readings depending on the context. To make matters worse, there are some characters whose readings are totally unknown. If you can accept all those limitations and still want the Latin readings anyway, go ahead and use the database according to your needs. If you're only interested in Japanese kanji, a reasonable method would be to pick the first letter of the kJapaneseOn field (or the first letter of kJapaneseKun if there is no kJapaneseOn).
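As a rough illustration of that last suggestion, here is a minimal Python sketch. It assumes the Unihan readings are available as tab-separated lines of the form "U+XXXX, field name, value" with "#" comment lines, and that readings within a value are separated by spaces. The function name load_first_letters and the sample lines are hypothetical and only there to make the example self-contained; check real values against the downloaded data.

```python
def load_first_letters(lines):
    """Map each kanji to the first letter of its first kJapaneseOn reading,
    falling back to kJapaneseKun when no kJapaneseOn entry exists."""
    on, kun = {}, {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a "code point, field, value" record
        code, field, value = parts
        if field not in ("kJapaneseOn", "kJapaneseKun") or not value:
            continue
        char = chr(int(code[2:], 16))  # "U+65E5" -> "日"
        target = on if field == "kJapaneseOn" else kun
        target[char] = value.split()[0][0].upper()
    # Entries from `on` override entries from `kun`.
    return {**kun, **on}


# Illustrative sample in the assumed layout, not a copy of the real file.
SAMPLE = [
    "# comment line",
    "U+65E5\tkJapaneseKun\tHI KA",
    "U+65E5\tkJapaneseOn\tNICHI JITSU",
    "U+65E5\tkMandarin\trì",
    "U+7551\tkJapaneseKun\tHATA HATAKE",
]

letters = load_first_letters(SAMPLE)
```

With a real data file you would pass an open file object (UTF-8) instead of SAMPLE. Keep in mind this picks one reading per character arbitrarily, so it says nothing reliable about how a whole word is read.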
This is a huge step forward to me - thru Unihan_Readings.txt in unicode.org/Public/UCD/latest/ucd/Unihan.zip I now have something to start automation with, despite being CJK (and more). Great!
– AmigoJack
1 hour ago
Oh I didn't know it's available for download!
– naruto
1 hour ago
I recently released the Unihan database equivalent data in JSON format, either as a single file (unihan-data-json.zip) or as a set of files for each property (unihan-data.zip); they are available for download here. HTH...
– Mikaeru
37 mins ago
answered 2 hours ago by naruto (edited 2 hours ago)