Convert Unicode surrogate pair to literal string
























I am trying to read a high Unicode character from one string into another. For brevity, I will simplify my code as shown below:



public static void UnicodeTest()
{
    var highUnicodeChar = "𝐀"; // Not the standard A

    var result1 = highUnicodeChar; // this works
    var result2 = highUnicodeChar[0].ToString(); // returns \ud835
}



When I assign highUnicodeChar to result1 directly, it retains its literal value of 𝐀. When I try to access it by index, it returns \ud835. As I understand it, this is part of a surrogate pair of UTF-16 characters used to represent a UTF-32 character. I am pretty sure this problem has to do with trying to implicitly convert a char to a string.



In the end, I want result2 to yield the same value as result1. How can I do this?










Tags: c#, .net, unicode, unicode-escapes






asked 6 hours ago by hargle

          2 Answers

















          Accepted answer (score: 6)










          In Unicode, you have code points. They range from U+0000 to U+10FFFF, so a code point needs at most 21 bits.



          In Unicode encodings, you have code units. These are the natural unit of the encoding: 8 bits for UTF-8, 16 bits for UTF-16, and 32 bits for UTF-32. One or more code units encode a single code point.
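
To make the code point versus code unit distinction concrete, here is a minimal sketch that counts how many code units a string needs in each encoding. The class and method names (CodeUnitDemo, Utf8Units, and so on) are made up for illustration; only the System.Text.Encoding calls come from the framework.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class CodeUnitDemo
{
    // Number of 8-bit code units (bytes) in the UTF-8 encoding of s.
    public static int Utf8Units(string s) => Encoding.UTF8.GetByteCount(s);

    // Number of 16-bit code units; a .NET string is already stored as UTF-16.
    public static int Utf16Units(string s) => s.Length;

    // Number of 32-bit code units, which equals the number of code points
    // for well-formed input (GetByteCount does not include a byte order mark).
    public static int Utf32Units(string s) => Encoding.UTF32.GetByteCount(s) / 4;
}
```

For the string "\U0001D400" (the 𝐀 from the question) these should give 4, 2 and 1 respectively, while for a plain "A" all three give 1.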



          In UTF-16, two code units that form a single code point are called a surrogate pair.



          This gets a little tricky in .NET, as a .NET Char represents a single UTF-16 code unit, and a .NET String is a collection of code units.



          So your code point 𝐀 (U+1D400) is encoded as a surrogate pair, meaning your string has two code units in it:



          var highUnicodeChar = "𝐀";
          char a = highUnicodeChar[0]; // code unit 0xD835
          char b = highUnicodeChar[1]; // code unit 0xDC00


          So when you index into the string like that, you get only one half of the surrogate pair.
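
As an illustration of how the two halves relate to the code point, the sketch below combines a high and a low surrogate by hand and compares that with the framework's own char.ConvertToUtf32 and char.ConvertFromUtf32. The SurrogateDemo class and its method names are hypothetical, chosen only for this example.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class SurrogateDemo
{
    // Combine a high surrogate (0xD800..0xDBFF) and a low surrogate
    // (0xDC00..0xDFFF) into a code point using the UTF-16 formula.
    // Assumes the caller passes a valid pair; no validation is done.
    public static int Combine(char high, char low) =>
        0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

    // The same result via the framework: reads the code point that starts
    // at the given index (throws if index points at a lone low surrogate).
    public static int CodePointAt(string s, int index) =>
        char.ConvertToUtf32(s, index);

    // Going the other way: a code point becomes a one- or two-char string.
    public static string FromCodePoint(int codePoint) =>
        char.ConvertFromUtf32(codePoint);
}
```

With the values from the snippet above, Combine('\uD835', '\uDC00') works out to 0x1D400, and FromCodePoint(0x1D400) gives back a string of length 2.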



          You can use char.IsSurrogatePair to test for a surrogate pair. For instance:



          string GetFullCodePointAtIndex(string s, int idx) =>
              s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);
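
The same check can be used to walk a whole string one code point at a time. The following is a sketch only: the CodePointExtensions class and the CodePoints method are names invented here, and the method assumes a non-null string.

```csharp
using System.Collections.Generic;

// Hypothetical extension class, not part of any library.
public static class CodePointExtensions
{
    // Yields each code point of s as its own string, keeping the two
    // halves of a surrogate pair together. A lone (unpaired) surrogate
    // is yielded as a single-char string. Assumes s is not null.
    public static IEnumerable<string> CodePoints(this string s)
    {
        for (int i = 0; i < s.Length; )
        {
            // char.IsSurrogatePair(s, i) returns false at the last index,
            // so this never reads past the end of the string.
            int length = char.IsSurrogatePair(s, i) ? 2 : 1;
            yield return s.Substring(i, length);
            i += length;
        }
    }
}
```

For example, "a\U0001D400b".CodePoints() yields three strings, the middle one being the full two-char 𝐀.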


          Note that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point. A grapheme cluster is the thing most people would ultimately call a "character". A grapheme cluster is made from one or more code points: a base character and zero or more combining characters. An example of a combining character is an umlaut (the combining diaeresis, U+0308).



          To test for a combining character, you can use CharUnicodeInfo.GetUnicodeCategory to check for an enclosing mark, non-spacing mark, or spacing combining mark (UnicodeCategory.EnclosingMark, NonSpacingMark, or SpacingCombiningMark).
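
A category check along those lines might look like the sketch below. CombiningMarkDemo and IsCombiningMark are hypothetical names. This version takes a single char, so it only covers code points in the Basic Multilingual Plane; for anything encoded as a surrogate pair, the CharUnicodeInfo.GetUnicodeCategory(string, int) overload would be needed instead.

```csharp
using System.Globalization;

// Hypothetical helper, not part of any library.
public static class CombiningMarkDemo
{
    // True if c belongs to one of the three Unicode mark categories
    // (Mn, Mc, Me), i.e. it combines with a preceding base character.
    public static bool IsCombiningMark(char c)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.SpacingCombiningMark:
            case UnicodeCategory.EnclosingMark:
                return true;
            default:
                return false;
        }
    }
}
```

Here IsCombiningMark('\u0308') (the combining diaeresis) is true, while IsCombiningMark('u') is false.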






          • Perfect! This solution is exactly what I was looking for, and great explanation as well.
            – hargle
            4 hours ago

















          Answer (score: 2)













          It appears that you want to extract the first "atomic" character from the user's point of view (i.e. the first Unicode grapheme cluster) from the highUnicodeChar string, where an "atomic" character includes both halves of a surrogate pair.



          You can use StringInfo.GetTextElementEnumerator() to do just this, breaking a string down into atomic chunks then taking the first.



          First, define the following extension method:



          using System.Collections.Generic;
          using System.Globalization;

          public static class TextExtensions
          {
              public static IEnumerable<string> TextElements(this string s)
              {
                  // StringInfo.GetTextElementEnumerator dates from .NET 1.1 and returns a
                  // non-generic enumerator, so wrap it in an IEnumerable<string>
                  if (s == null)
                      yield break;
                  var enumerator = StringInfo.GetTextElementEnumerator(s);
                  while (enumerator.MoveNext())
                      yield return enumerator.GetTextElement();
              }
          }




          Now, you can do:



          var result2 = highUnicodeChar.TextElements().FirstOrDefault() ?? "";


          Note that StringInfo.GetTextElementEnumerator() will also group Unicode combining characters, so that the first grapheme cluster of the string Ĥ=T̂+V̂ will be Ĥ not H.
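
If only the first text element is needed, a shorter variant is possible with StringInfo.GetNextTextElement, which returns the text element starting at a given index. This is a sketch under the assumption that an empty string is an acceptable result for null or empty input; TextElementDemo and its method names are invented for the example.

```csharp
using System.Globalization;

// Hypothetical helper, not part of any library.
public static class TextElementDemo
{
    // First text element of s, or "" when s is null or empty.
    public static string FirstTextElement(string s) =>
        string.IsNullOrEmpty(s) ? "" : StringInfo.GetNextTextElement(s, 0);

    // Number of text elements in s (0 for null).
    public static int TextElementCount(string s) =>
        new StringInfo(s ?? "").LengthInTextElements;
}
```

For "\U0001D400" the first text element is the whole two-char string, and for "H\u0302=T\u0302+V\u0302" it is "H\u0302", base letter and combining circumflex together.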



          Sample fiddle here.






          Accepted answer: answered 6 hours ago by Cory Nelson, edited 5 hours ago











          Second answer: answered 6 hours ago by dbc, edited 23 mins ago



















