Using uniq on unicode text
I want to remove duplicate lines from a file containing words in Syriac script. The source file has 3 lines; the 1st and 3rd are identical.
$ cat file.txt
ܐܒܘܢ
ܢܗܘܐ
ܐܒܘܢ
When I use sort and uniq, the result treats all 3 lines as identical, which is wrong:
$ cat file.txt | sort | uniq -c
3 ܐܒܘܢ
Explicitly setting the locale to Syriac doesn't help either.
$ LC_COLLATE=syr_SY.utf8 cat file.txt | sort | uniq -c
3 ܐܒܘܢ
Why would that happen?
I'm using Kubuntu 18 and bash, if that matters.
Tags: sort, unicode, uniq
asked 6 hours ago by evb; edited 5 hours ago by 神秘德里克
Comments:
Possible duplicate of "Why is uniq ignoring Unicode and lines with a single letter?" – Michael Homer, 5 hours ago
Note that both the sort and the uniq need to have the right collation to work here, so you'd want LC_COLLATE=syr_SY.utf8 sort file.txt | LC_COLLATE=syr_SY.utf8 uniq -c (or perhaps better yet in the regular environment). – Michael Homer, 5 hours ago
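The command in the question attaches LC_COLLATE to cat only, so sort and uniq never see it. A minimal sketch of that scoping rule follows. It uses a made-up variable, DEMO_VAR (hypothetical, not from the original post), so that the result does not depend on which locales are installed:

```shell
# DEMO_VAR is a hypothetical variable used only for illustration.
unset DEMO_VAR

# The assignment prefix reaches the command it is attached to...
first=$(DEMO_VAR=set sh -c 'echo "${DEMO_VAR:-unset}"')

# ...but not the next stage of the pipeline.
second=$(DEMO_VAR=set true | sh -c 'echo "${DEMO_VAR:-unset}"')

echo "prefixed command sees:    $first"
echo "next pipeline stage sees: $second"
```

The first line reports "set" and the second "unset", which is why the prefix has to be repeated on each command, or exported once in the shell.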
2 Answers
Accepted answer (2 votes)
First set LC_CTYPE:
$ export LC_CTYPE=syr_SY.utf8
$ cat file.txt | sort | uniq -c
2 ܐܒܘܢ
1 ܢܗܘܐ
answered 5 hours ago by Ipor Sircer
Comments:
Thanks! The only problem is that I get a warning: bash: warning: setlocale: LC_CTYPE: cannot change locale (syr_SY.utf8) – evb, 5 hours ago
@evb There is no syr_SY locale defined yet. As it doesn't exist, the locale in effect falls back to the default, C. That is why the commands worked. – Isaac, 14 mins ago
Also, there is no need to use cat. – Isaac, 14 mins ago
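The setlocale warning can be anticipated by checking whether the locale is installed before relying on it. A rough sketch, assuming the system provides the standard `locale -a` listing; the variable names `wanted` and `status` are only illustrative, and the spelling of the locale name (utf8 vs. UTF-8) may differ between systems:

```shell
# Locale name taken from the question; it may not exist on your system.
wanted=syr_SY.utf8

# -F: match literally (the dot is not a wildcard), -x: whole line,
# -i: ignore case, -q: no output, only an exit status.
if locale -a 2>/dev/null | grep -qixF "$wanted"; then
    status=available
else
    status=missing
fi

echo "$wanted is $status"
```

If the result is "missing", any LC_* setting that names this locale is not applied, which matches the fallback described in the comment above.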
Answer (2 votes)
A (simplistic) portable solution:
$ ( LC_ALL=C sort syriac.txt | LC_ALL=C uniq -c )
2 ܐܒܘܢ
1 ܢܗܘܐ
answered 14 mins ago by Isaac
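Why the C locale helps: there, sort and uniq compare lines byte by byte, so two lines count as equal only when their encoded bytes are identical. Below is a self-contained sketch of the approach from this answer. It assumes the sample lines are UTF-8 encoded as in the question, and uses a temporary file in place of syriac.txt; the variable names are illustrative:

```shell
# Sample data from the question: lines 1 and 3 are identical (UTF-8).
tmp=$(mktemp)
printf '%s\n' 'ܐܒܘܢ' 'ܢܗܘܐ' 'ܐܒܘܢ' > "$tmp"

# Both commands get LC_ALL=C, so both compare raw bytes.
counts=$(LC_ALL=C sort "$tmp" | LC_ALL=C uniq -c)
printf '%s\n' "$counts"

# If the counts are not needed, sort -u sorts and deduplicates in one step.
unique=$(LC_ALL=C sort -u "$tmp")
printf '%s\n' "$unique"

rm -f "$tmp"
```

The trade-off is that byte order is not a linguistically meaningful sort order for Syriac; it is enough for finding duplicates, but not for producing a dictionary-style ordering.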