Using uniq on unicode text

I want to remove duplicate lines from a file containing words in Syriac script. The source file has 3 lines; the 1st and 3rd are identical.
$ cat file.txt
ÃÂÃÂÃÂâ
âÃÂÃÂÃÂ
ÃÂÃÂÃÂâ
When I use sort and uniq, the output treats all 3 lines as identical, which is wrong:
$ cat file.txt | sort | uniq -c
3 ÃÂÃÂÃÂâ
Explicitly setting the locale to Syriac doesn't help either.
$ LC_COLLATE=syr_SY.utf8 cat file.txt | sort | uniq -c
3 ÃÂÃÂÃÂâ
Why would that happen?
I'm using Kubuntu 18 and bash, if that matters.
sort unicode uniq
edited 5 hours ago
ç¥Âç§Âå¾·éÂÂå Â
asked 6 hours ago
evb
Possible duplicate of Why is uniq ignoring Unicode and lines with a single letter?
– Michael Homer
5 hours ago
Note that both the sort and the uniq need to have the right collation to work here, so you'd want LC_COLLATE=syr_SY.utf8 sort file.txt | LC_COLLATE=syr_SY.utf8 uniq -c (or perhaps better yet in the regular environment).
– Michael Homer
5 hours ago
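To illustrate the point in the comment above: a variable assignment written in front of a command is passed only to that one command, not to the rest of the pipeline. The sketch below is only an illustration. DEMO_VAR and sample.txt are made-up names, Greek letters stand in for the Syriac sample, and the C locale is used because it should be available everywhere.

```shell
# Make sure the illustrative variable is not already set in this shell.
unset DEMO_VAR

# A prefix assignment reaches only the command it is attached to.
# The left side reports on stderr, the right side on stdout.
DEMO_VAR=1 sh -c 'echo "left:  ${DEMO_VAR-unset}" >&2' |
    sh -c 'echo "right: ${DEMO_VAR-unset}"'
# left sees 1, right sees unset

# Sample input: three non-ASCII lines, the 1st and 3rd identical.
printf '%s\n' 'αβγ' 'γβα' 'αβγ' > sample.txt

# Option 1: repeat the assignment on every command that compares lines.
LC_ALL=C sort sample.txt | LC_ALL=C uniq -c

# Option 2: export it once inside a subshell so the whole pipeline inherits it
# and the calling shell is left untouched.
( export LC_ALL=C; sort sample.txt | uniq -c )
```

Both options should report a count of 2 for the repeated line and 1 for the other one.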
2 Answers
accepted
First set LC_CTYPE:
$ export LC_CTYPE=syr_SY.utf8
$ cat file.txt | sort | uniq -c
2 ÃÂÃÂÃÂâ
1 âÃÂÃÂÃÂ
answered 5 hours ago
Ipor Sircer
Thanks! The only problem is that I get a warning: bash: warning: setlocale: LC_CTYPE: cannot change locale (syr_SY.utf8)
– evb
5 hours ago
There is no syr_SY locale defined yet. As it doesn't exist, the locale in effect drops to default: C. That is why the commands did work. @evb
– Isaac
14 mins ago
Also, there is no need to use cat.
– Isaac
14 mins ago
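One way to check the point about the missing locale is to look for the name in the output of locale -a before relying on it. This is a rough sketch, not a definitive test: locale_available is a hypothetical helper name, and it assumes the locale command is present (it usually is on glibc systems, but may be missing on minimal installs).

```shell
# Hypothetical helper: succeed if the named locale appears in `locale -a`.
locale_available() {
    # Normalize case and hyphens so "utf8" and "UTF-8" spellings compare equal.
    want=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')
    locale -a 2>/dev/null |
        tr '[:upper:]' '[:lower:]' |
        sed 's/-//g' |
        grep -qxF -- "$want"
}

if locale_available syr_SY.utf8; then
    echo "syr_SY.utf8 is installed"
else
    echo "syr_SY.utf8 is not installed; expect the setlocale warning"
fi
```

If the locale is reported as missing, the warning quoted above is expected, and the byte-wise C locale is the safer thing to request explicitly.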
A (simplistic) portable solution:
$ ( LC_ALL=C sort syriac.txt | LC_ALL=C uniq -c )
2 ÃÂÃÂÃÂâ
1 âÃÂÃÂÃÂ
answered 14 mins ago
Isaac
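A few variations on the LC_ALL=C approach, as a sketch only. With LC_ALL=C the lines are compared byte by byte, so two lines count as duplicates only if their encoded bytes match exactly; differently normalized spellings of the same text would still be kept apart. The file names sample.txt and deduped.txt are made up, and Greek letters stand in for the Syriac sample.

```shell
# Sample input: three non-ASCII lines, the 1st and 3rd identical.
printf '%s\n' 'αβγ' 'γβα' 'αβγ' > sample.txt

# Drop duplicates in one step when the counts are not needed.
LC_ALL=C sort -u sample.txt > deduped.txt
cat deduped.txt

# Keep the original line order instead of sorting:
# awk prints a line only the first time it is seen.
LC_ALL=C awk '!seen[$0]++' sample.txt
```

The awk form can be handy when the order of the input matters, at the cost of holding every distinct line in memory.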