Convert Unicode surrogate pair to literal string
I am trying to read a high Unicode character from one string into another. For brevity, I will simplify my code as shown below:
    public static void UnicodeTest()
    {
        var highUnicodeChar = "𝐀"; // Not the standard A
        var result1 = highUnicodeChar; // this works
        var result2 = highUnicodeChar[0].ToString(); // returns \ud835
    }
When I assign highUnicodeChar to result1 directly, it retains its literal value of 𝐀. When I try to access it by index, it returns \ud835. As I understand it, this is a surrogate pair of UTF-16 characters used to represent a UTF-32 character. I am pretty sure this problem has to do with trying to implicitly convert a char to a string.
In the end, I want result2 to yield the same value as result1. How can I do this?
c# .net unicode unicode-escapes
asked 6 hours ago
hargle
2 Answers
Accepted answer
In Unicode, you have code points. These range from U+0000 to U+10FFFF, so a code point needs up to 21 bits.
In Unicode encodings, you have code units. These are the natural unit of the encoding: 8-bit for UTF-8, 16-bit for UTF-16, and so on. One or more code units encode a single code point.
In UTF-16, two code units that form a single code point are called a surrogate pair.
This gets a little tricky in .NET, as a .NET Char represents a single UTF-16 code unit, and a .NET String is a collection of code units.
So your code point 𝐀 (U+1D400) is encoded as a surrogate pair, meaning your string has two code units in it:
    var highUnicodeChar = "𝐀";
    char a = highUnicodeChar[0]; // code unit 0xD835
    char b = highUnicodeChar[1]; // code unit 0xDC00
So when you index into the string like that, you're only getting half of the surrogate pair.
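To make the two code units visible, here is a minimal sketch. The class name SurrogateDemo and the helper CodePointAt are made up for illustration; the string is written with a \U escape so the result does not depend on the source file's encoding.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: returns the code point starting at idx,
    // combining a surrogate pair into one value when one is present.
    public static int CodePointAt(string s, int idx) => char.ConvertToUtf32(s, idx);

    public static void Main()
    {
        string s = "\U0001D400"; // the same character as in the question

        Console.WriteLine(s.Length);                          // 2 code units
        Console.WriteLine(((int)s[0]).ToString("X4"));        // D835 (high surrogate)
        Console.WriteLine(((int)s[1]).ToString("X4"));        // DC00 (low surrogate)
        Console.WriteLine(char.IsHighSurrogate(s[0]));        // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));         // True
        Console.WriteLine(CodePointAt(s, 0).ToString("X"));   // 1D400
        Console.WriteLine(char.ConvertFromUtf32(0x1D400) == s); // True
    }
}
```

char.ConvertToUtf32 and char.ConvertFromUtf32 go between a code point and its UTF-16 form, which is handy when you need the numeric value rather than a substring.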
You can use IsSurrogatePair to test for a surrogate pair. For instance:
    string GetFullCodePointAtIndex(string s, int idx) =>
        s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);
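The same check can drive a loop over a whole string. This is only a sketch: CodePointSplitter and SplitCodePoints are hypothetical names, and it assumes the input has no unpaired surrogates (an unpaired one would simply come out as its own one-unit piece).

```csharp
using System;
using System.Collections.Generic;

public static class CodePointSplitter
{
    // Hypothetical helper: splits s into one substring per code point,
    // advancing by two code units whenever a surrogate pair starts at i.
    public static List<string> SplitCodePoints(string s)
    {
        var parts = new List<string>();
        for (int i = 0; i < s.Length; )
        {
            int len = char.IsSurrogatePair(s, i) ? 2 : 1;
            parts.Add(s.Substring(i, len));
            i += len;
        }
        return parts;
    }

    public static void Main()
    {
        var parts = SplitCodePoints("a\U0001D400b");
        Console.WriteLine(parts.Count);     // 3 code points
        Console.WriteLine(parts[1].Length); // 2 code units for the middle one
    }
}
```

Stepping by the length of each piece is what keeps the index from landing in the middle of a pair.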
Note that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point. A grapheme cluster is the thing most people would ultimately call a "character". A grapheme cluster is made from one or more code points: a base character, and zero or more combining characters. An example of a combining character is the combining diaeresis (U+0308), which renders as an umlaut.
To test for a combining character, you can use CharUnicodeInfo.GetUnicodeCategory to check for an enclosing mark, non-spacing mark, or spacing combining mark (UnicodeCategory.EnclosingMark, NonSpacingMark, or SpacingCombiningMark).
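A rough sketch of that category check follows. CombiningMarks and IsCombiningMark are illustrative names, and the check only looks at a single UTF-16 code unit, so it is not a full grapheme cluster segmenter.

```csharp
using System;
using System.Globalization;

public static class CombiningMarks
{
    // Hypothetical helper: true when c falls in one of the three Unicode mark categories.
    public static bool IsCombiningMark(char c)
    {
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
        return cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark
            || cat == UnicodeCategory.EnclosingMark;
    }

    public static void Main()
    {
        string s = "u\u0308"; // 'u' followed by COMBINING DIAERESIS

        Console.WriteLine(s.Length);              // 2
        Console.WriteLine(IsCombiningMark(s[0])); // False
        Console.WriteLine(IsCombiningMark(s[1])); // True
    }
}
```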
edited 5 hours ago
answered 6 hours ago
Cory Nelson
Perfect! This solution is exactly what I was looking for, and great explanation as well.
– hargle
4 hours ago
It appears that you want to extract the first "atomic" character from the user's point of view (i.e. the first Unicode grapheme cluster) from the highUnicodeChar string, where an "atomic" character includes both halves of a surrogate pair.
You can use StringInfo.GetTextElementEnumerator() to do just this, breaking a string down into atomic chunks and then taking the first.
First, define the following extension method:
    using System.Collections.Generic;
    using System.Globalization;

    public static class TextExtensions
    {
        public static IEnumerable<string> TextElements(this string s)
        {
            // StringInfo.GetTextElementEnumerator returns a TextElementEnumerator, a .NET 1.1 class that doesn't implement IEnumerable<string>, so convert
            if (s == null)
                yield break;
            var enumerator = StringInfo.GetTextElementEnumerator(s);
            while (enumerator.MoveNext())
                yield return enumerator.GetTextElement();
        }
    }
Now, you can do:
    var result2 = highUnicodeChar.TextElements().FirstOrDefault() ?? ""; // FirstOrDefault needs using System.Linq;
Note that StringInfo.GetTextElementEnumerator() will also group Unicode combining characters, so that the first grapheme cluster of the string Ĥ=T̂+V̂ will be Ĥ, not H.
Sample fiddle here.
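If only the first text element is needed, StringInfo.GetNextTextElement may be a shorter route than enumerating. The sketch below is an alternative under that assumption; FirstTextElement and First are made-up names, and the test strings use escapes (H followed by U+0302 and so on) to stand in for Ĥ=T̂+V̂.

```csharp
using System;
using System.Globalization;

public static class FirstTextElement
{
    // Hypothetical helper: returns the first text element of s, or "" for null or empty input.
    public static string First(string s) =>
        string.IsNullOrEmpty(s) ? "" : StringInfo.GetNextTextElement(s, 0);

    public static void Main()
    {
        string bold = "\U0001D400bc";                 // surrogate pair, then "bc"
        string marks = "H\u0302=T\u0302+V\u0302";     // base letters with combining circumflexes

        Console.WriteLine(First(bold).Length);        // 2: both halves of the pair
        Console.WriteLine(First(marks).Length);       // 2: 'H' plus its combining mark
        Console.WriteLine(new StringInfo(marks).LengthInTextElements); // 5
    }
}
```

Exact grouping of more complex sequences (for example emoji joined with zero-width joiners) can differ between .NET versions, so it is worth testing against the runtime you target.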
edited 23 mins ago
answered 6 hours ago
dbc