Removing characters with sed

I am working on AIX and trying to remove non-printable characters from a file. The data looks like Caucasian male lives in Arizona w/ fiancÃÃÃÃÃÃÃÃÃÃ when I view the file in Notepad++ using UTF-8 encoding. When I view the file on Unix, I get ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒ instead of the special characters.

I want to replace all those special characters with space.

I tried sed 's/[^[:print:]]/ /g' file but it does not remove those characters. The locales available on my system, as listed by locale -a, are:

C
POSIX
en_US.8859-15
en_US.ISO8859-1
en_US

I even tried sed -e 's/[^ -~]/ /g' file and it did not remove the characters.

I see that other Stack Overflow answers used a UTF-8 locale with GNU sed, and that worked, but I do not have that locale.

Also I am using ksh.

edited 5 hours ago

asked 5 hours ago

Auguster

133

New contributor

Ã and ▒ look pretty printable to me. A UTF-8 Ã is encoded as 0xc3 0x83. 0xc3 in iso8859-1 or 15 is also Ã, as it happens, which is printable; 0x83 would be a control character in both, though.
– Stéphane Chazelas
5 hours ago

Possible duplicate: unix.stackexchange.com/questions/201751/…
– Goro
4 hours ago

1

@Goro Yes, at this point it is possibly a duplicate, now that I understand I should use the C locale.
– Auguster
4 hours ago

To actually show what the characters are, it is useful to show their hex values. Something like: echo "fiancÃÂÃÂÃÂÃÂÃÂ" | od -tx1, or, maybe if your sed supports it: echo "fiancÃÂÃÂÃÂÃÂÃÂ" | sed -n l.
– Isaac
3 hours ago
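
A minimal sketch of the hex-dump idea in the comment above. The sample bytes are an assumption chosen for illustration: they are the UTF-8 encoding of Ã (0xc3 0x83), written as printf octal escapes so the example does not depend on the terminal's encoding. The helper name sample is hypothetical.

```shell
# Emit "fianc" followed by the bytes 0xc3 0x83 (octal 303 203) and a newline.
# These byte values are illustrative; pipe a line of the real file instead.
sample() { printf 'fianc\303\203\n'; }

# od -An drops the offset column; -tx1 prints each byte as two hex digits.
sample | od -An -tx1
```

Any byte shown as 80 or above has its 8th bit set and is therefore outside ASCII.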

1

Possible duplicate of Match language range in shell, sed or awk
– Isaac
3 hours ago


text-processing sed ksh aix

2 Answers

up vote
3
down vote

accepted

You can use the command tr as follows:

tr -cd '[:print:]\t\r\n'

Explanation:

`[:print:]'
Printable characters: any character in the `[:graph:]' class, plus the space character
\r -- carriage return
\t -- horizontal tab
\n -- newline

Examples on CentOS 7, where tr is GNU tr and the locale uses UTF-8 encoding:

$ echo "fiancÃÃÃÃÃÃÃÃÃÃ" | tr -cd '[:print:]\t\r\n'
fianc

$ echo "get ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒ " | tr -cd '[:print:]\t\r\n'
get ^^^^^^

$ echo " Caucasian male lives in Arizona w/ fianc▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒" | tr -cd '[:print:]\t\r\n'
 Caucasian male lives in Arizona w/ fianc^^^^^^^^^^^^

edited 2 hours ago

answered 5 hours ago

Goro


That did not work for me. I tried echo " Caucasian male lives in Arizona w/ fianc▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒" | tr -d '[:print:]' and got some unreadable text as output.
– Auguster
5 hours ago

1

LC_ALL=C tr ...
– Jeff Schaller
5 hours ago

1

LC_ALL=C tr -cd '[:print:]' < input works here
– Jeff Schaller
5 hours ago

1

echo "fiancÃÂÃÂÃÂÃÂÃÂ" | tr -cd '[:print:]\t\r\n' should return fiancÃÂÃÂÃÂÃÂÃÂ, as Â is a printable character. GNU tr doesn't in UTF-8, as it doesn't support multi-byte characters yet, but it does in iso8859-1. In the C locale, on systems where the C locale charset is ASCII, that does remove Â (or whatever bytes those are made of), as ASCII has no such character in the first place.
– Stéphane Chazelas
2 hours ago

1

Because CentOS tr is GNU tr, and you probably tried it in a UTF-8 locale, where Ã is made of 2 bytes and GNU tr doesn't support multibyte characters. If you use LC_ALL=C as suggested by Auguster, it will work (at removing those Ã, however they're encoded) regardless of whether tr supports multibyte characters or not. In the C locale, all characters are single bytes, and on most systems, including AIX, the C locale charset is ASCII, which has no character with the 8th bit set (each byte of the UTF-8 encoding of Ã has that bit set, as does its single-byte iso8859-1 encoding).
– Stéphane Chazelas
2 hours ago
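
A small sketch of the point made in the comment above, assuming a system whose C locale charset is ASCII. The helper name strip_non_ascii and the sample bytes are illustrative, not taken from the thread. The input is written with printf octal escapes so that both encodings of Ã can be fed to the same command.

```shell
# Delete every byte that is not printable ASCII, tab, CR or LF.
# LC_ALL=C makes tr treat each byte as one character, so the result
# does not depend on how the high-bit characters were encoded.
strip_non_ascii() { LC_ALL=C tr -cd '[:print:]\t\r\n'; }

# UTF-8 encoding of Ã: two bytes, 0xc3 0x83 (octal 303 203).
printf 'fianc\303\203 ok\n' | strip_non_ascii

# iso8859-1 encoding of Ã: one byte, 0xc3 (octal 303).
printf 'fianc\303 ok\n' | strip_non_ascii
```

Both commands should print the same line, because every byte with the 8th bit set falls outside the ASCII [:print:] class.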


up vote
1
down vote

If the current locale already uses UTF-8 as the charset (and file is written using that charset):

<file LC_ALL=C sed 's/[^ -~]//g'

Or, to include control characters in AIX sed:

<file LC_ALL=C sed "$(printf "s/[^[:print:]\t\r]//g")"
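
The commands above delete the unwanted bytes, while the question asked for them to be replaced with a space. A minimal sketch of that variant follows; it assumes the C locale charset is ASCII, and the helper name blank_non_ascii and the sample bytes are hypothetical.

```shell
# Replace (rather than delete) each byte outside printable ASCII with a space.
# Tab is kept: printf turns \t into a literal tab inside the bracket expression.
blank_non_ascii() { LC_ALL=C sed "$(printf 's/[^[:print:]\t]/ /g')"; }

# Sample input: "fianc" followed by the bytes 0xc3 0x83 0xc2 0x82.
printf 'w/ fianc\303\203\302\202 end\n' | blank_non_ascii
```

Because the substitution is done byte by byte in the C locale, a multi-byte character is replaced by as many spaces as it has bytes.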

edited 2 hours ago

Stéphane Chazelas

answered 3 hours ago

Isaac

