Using uniq on Unicode text

I want to remove duplicate lines from a file containing words in Syriac script. The source file has three lines; the first and third are identical.

$ cat file.txt
ܐܒܘܢ
ܢܗܘܐ
ܐܒܘܢ

When I use sort and uniq, the output treats all three lines as identical, which is wrong:

$ cat file.txt | sort | uniq -c
3 ܐܒܘܢ

Explicitly setting the locale to Syriac doesn't help either:

$ LC_COLLATE=syr_SY.utf8 cat file.txt | sort | uniq -c
3 ܐܒܘܢ

Why does this happen?
I'm using Kubuntu 18 and bash, if that matters.

Tags: sort, unicode, uniq

edited 5 hours ago by 神秘德里克
asked 6 hours ago by evb

  • Possible duplicate of Why is uniq ignoring Unicode and lines with a single letter?
    – Michael Homer
    5 hours ago

  • Note that both the sort and the uniq need to have the right collation to work here, so you'd want LC_COLLATE=syr_SY.utf8 sort file.txt | LC_COLLATE=syr_SY.utf8 uniq -c (or perhaps better yet in the regular environment).
    – Michael Homer
    5 hours ago
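
To illustrate the point in the comment above: a variable assignment written in front of a command is passed to that one command only, not to the rest of the pipeline. The sketch below uses a made-up variable, DEMO_VAR (an assumption for illustration, not something from the question), in place of LC_COLLATE so that the effect is visible without any particular locale installed.

```shell
#!/bin/sh
# DEMO_VAR is a hypothetical variable used purely for illustration.
unset DEMO_VAR

# The assignment prefixes the command that reads the variable: it is seen.
first=$(DEMO_VAR=set sh -c 'printf "%s" "${DEMO_VAR:-unset}"')

# The assignment prefixes only the first command of the pipeline,
# so the second command never receives it.
second=$(DEMO_VAR=set true | sh -c 'printf "%s" "${DEMO_VAR:-unset}"')

echo "command with the prefix sees:   $first"
echo "later command in pipeline sees: $second"
```

By the same rule, in `LC_COLLATE=syr_SY.utf8 cat file.txt | sort | uniq -c` only cat receives the setting; sort and uniq run with whatever the environment already had.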












2 Answers

Accepted answer

First set LC_CTYPE:

$ export LC_CTYPE=syr_SY.utf8
$ cat file.txt | sort | uniq -c
2 ܐܒܘܢ
1 ܢܗܘܐ

answered 5 hours ago by Ipor Sircer

  • Thanks! The only problem is that I get a warning: bash: warning: setlocale: LC_CTYPE: cannot change locale (syr_SY.utf8)
    – evb
    5 hours ago

  • There is no syr_SY locale defined yet. Since it doesn't exist, the locale in effect falls back to the default, C. That is why the commands worked. @evb
    – Isaac
    14 mins ago

  • Also, there is no need to use cat.
    – Isaac
    14 mins ago
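
One way to check for the situation described in the comment above is to look the name up in the output of `locale -a`, which lists the locales available on the system. This is a minimal sketch: `locale_available` is a hypothetical helper name, and it assumes the name is spelled exactly as the list prints it (for example `en_US.utf8` on glibc systems).

```shell
#!/bin/sh
# locale_available NAME: succeed only if NAME appears verbatim in `locale -a`.
# (Hypothetical helper; the spelling must match the list exactly.)
locale_available() {
    locale -a 2>/dev/null | grep -Fxq -- "$1"
}

wanted=syr_SY.utf8
if locale_available "$wanted"; then
    echo "$wanted is installed"
else
    echo "$wanted is not installed"
fi
```

If the name is missing from the list, the setlocale warning quoted above is what to expect when trying to use it.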
















A (simplistic) portable solution:

$ ( LC_ALL=C sort syriac.txt | LC_ALL=C uniq -c )
2 ܐܒܘܢ
1 ܢܗܘܐ
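
For anyone who wants to reproduce this, here is a self-contained sketch of the same pipeline. It rebuilds the three-line sample from the question in a temporary file (the `sample` and `counts` variable names are assumptions for illustration) and runs both tools under LC_ALL=C, where lines are compared as raw bytes.

```shell
#!/bin/sh
# Recreate the three-line sample from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' 'ܐܒܘܢ' 'ܢܗܘܐ' 'ܐܒܘܢ' > "$sample"

# LC_ALL=C makes both sort and uniq compare raw bytes, so two different
# UTF-8 strings can never be treated as equal.
counts=$(LC_ALL=C sort "$sample" | LC_ALL=C uniq -c)
printf '%s\n' "$counts"

rm -f "$sample"
```

The trade-off is that byte order is not a linguistic order for Syriac, so this is suited to counting and deduplicating rather than to producing a properly collated list.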





answered 14 mins ago by Isaac