Cleaning a genes database polluted by non-alphanumeric characters, keeping plus and minus signs
I have a genes database that has been completely messed up by extra non-alphanumeric characters. This happened because some sort of encryption was applied to the data incorrectly, and I don't know how to clean it up. I tried sed and awk, but failed. This is a sample of the data, which spans a very large number of documents:
chr2#@!!~//=^%$74711&&*&127472363@Pos1%%0^^+
chr3#@!!~//=^%$74723&&*&127473530@Pos2%%0^^+
chr1#@!!~//=^%$73530&&*&127474697@Pos3%%0^^+
chr2#@!!~//=^%$17469&&*&127475864@Pos4%%0^^+
chr3#@!!~//=^%$12747&&*&127477031@Neg1%%0^^-
chr5#@!!~//=^%$17477&&*&127478198@Neg2%%0^^-
chr7#@!!~//=^%$74781&&*&127479365@Neg3%%0^^-
chr7#@!!~//=^%$74795&&*&127480532@Pos5%%0^^+
chr1#@!!~//=^%$12748&&*&127481699@Neg4%%0^^-
The cleaned data must look like this:
chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -
Please help, this is a very big problem!
text-processing awk sed bioinformatics
asked 51 mins ago by marco (new contributor)
edited 14 mins ago by Jeff Schaller
try sed -E 's/[^a-zA-Z0-9+-]+/ /g' file
- mosvy
46 mins ago
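The + after the bracket expression in this comment matters: with -E, it makes sed treat each run of unwanted characters as one match, so the run collapses to a single space. A minimal sketch on one line from the question's sample, comparing both forms (the helper names per_char and per_run are made up for the illustration):

```shell
# One corrupted line taken from the question's sample data.
line='chr2#@!!~//=^%$74711&&*&127472363@Pos1%%0^^+'

# Hypothetical helper names, used only for this illustration.
# Without "+": every unwanted character becomes its own space.
per_char() { sed 's/[^a-zA-Z0-9+-]/ /g'; }
# With -E and "+": each run of unwanted characters becomes one space.
per_run() { sed -E 's/[^a-zA-Z0-9+-]+/ /g'; }

printf '%s\n' "$line" | per_char
printf '%s\n' "$line" | per_run
```

The first command leaves wide gaps between the fields (eleven spaces after chr2, for instance), while the second should print the fields separated by single spaces.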
2 Answers
3 votes, accepted
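You can do it with sed by replacing every character that is not a letter, a digit, + or - with a space, then letting column -t realign the fields: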
sed 's/[^a-zA-Z0-9+-]/ /g' file | column -tc2
chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -
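Since the question mentions a very large amount of data, it may be worth writing the cleaned result to a new file and checking it before trusting it. The sketch below is only one way to do that: the function names and file names are hypothetical, and the assumption that every valid record has exactly six fields comes from the sample shown in the question, not from any format specification.

```shell
# Hypothetical helpers; the names and file paths are placeholders.

# clean_genes INPUT OUTPUT: collapse each run of unwanted characters to one space.
# LC_ALL=C keeps the a-z, A-Z and 0-9 ranges limited to ASCII.
clean_genes() {
    LC_ALL=C sed -E 's/[^a-zA-Z0-9+-]+/ /g' "$1" > "$2"
}

# check_fields FILE: report any line that does not have exactly six fields
# (assumption: six fields per record, as in the question's sample).
# Returns non-zero if at least one such line exists.
check_fields() {
    awk 'NF != 6 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
         END { exit bad }' "$1"
}

# Example use (file names are assumptions):
# clean_genes genes.txt genes.clean.txt && check_fields genes.clean.txt
```

Keeping the original file untouched means a wrong pattern can be corrected and rerun without losing data.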
answered 47 mins ago
Goro
THIS IS AMAZING!!!!!!!!!!!!! thank you @goro i didn't know how to keep - and +
- marco
35 mins ago
2 votes
With tr, transliterating characters from the complement of the wanted set to spaces, and squeezing repeats:
$ tr -sc '[:alnum:][:space:]+-' ' ' < data
chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -
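To unpack the flags used here: -c makes tr act on the complement of the first set, and -s squeezes repeated occurrences of the replacement character into one. Including [:space:] in the kept set is what preserves the newlines; without it, line breaks would be turned into spaces as well. The sketch below wraps the command in two illustrative functions (the names to_spaces and to_tabs are made up), the second being a variant that emits tabs, in case a tab-separated file is preferred downstream:

```shell
# Two corrupted lines taken from the question's sample data.
sample='chr2#@!!~//=^%$74711&&*&127472363@Pos1%%0^^+
chr3#@!!~//=^%$12747&&*&127477031@Neg1%%0^^-'

# Hypothetical helper names, used only for this illustration.
# -c: operate on the complement of the listed set (everything NOT kept).
# -s: squeeze repeated output characters from the second set into one.
to_spaces() { tr -sc '[:alnum:][:space:]+-' ' '; }
# Same idea with a tab as the separator.
to_tabs() { tr -sc '[:alnum:][:space:]+-' '\t'; }

printf '%s\n' "$sample" | to_spaces
printf '%s\n' "$sample" | to_tabs
```

Because tr reads only standard input, a file is passed with a redirection, as in the answer's "< data".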
edited 16 mins ago
answered 26 mins ago
steeldriver