Cleaning a genes database polluted by non-numeric characters except plus and minus signs
3 votes
I have this genes database which is completely messed up by extra non-alphanumeric characters. This happened because a sort of encryption was applied to the data incorrectly, and I don't know how to clean it up. I tried sed and awk, but failed. This is a sample of the data, which spans a very large number of documents:
chr2#@!!~//=^%$74711&&*&127472363@Pos1%%0^^+
chr3#@!!~//=^%$74723&&*&127473530@Pos2%%0^^+
chr1#@!!~//=^%$73530&&*&127474697@Pos3%%0^^+
chr2#@!!~//=^%$17469&&*&127475864@Pos4%%0^^+
chr3#@!!~//=^%$12747&&*&127477031@Neg1%%0^^-
chr5#@!!~//=^%$17477&&*&127478198@Neg2%%0^^-
chr7#@!!~//=^%$74781&&*&127479365@Neg3%%0^^-
chr7#@!!~//=^%$74795&&*&127480532@Pos5%%0^^+
chr1#@!!~//=^%$12748&&*&127481699@Neg4%%0^^-
The cleaned data must be like this:
chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -
How can I do this?
text-processing awk sed bioinformatics
asked 21 hours ago by marco (new contributor)
edited 20 mins ago by Peter Mortensen
try sed -E 's/[^a-zA-Z0-9+-]+/ /g' file
– mosvy, 20 hours ago
What have you tried? I don't see any code showing effort to solve the problem.
– Pedro Lobito, 12 hours ago
3 Answers
12 votes, accepted
You can do it with sed. Something like the following:
sed 's/[^a-zA-Z0-9+-]/ /g' file | column -tc2
chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -
answered 21 hours ago by Goro
THIS IS AMAZING!!!!!!!!!!!!! thank you @goro i didn't know how to keep - and +
– marco, 20 hours ago
Providing answers to users who didn't show any effort to solve the problem isn't encouraged.
– Pedro Lobito, 12 hours ago
@PedroLobito Says who? This isn't a teaching website, it's a website dedicated to collecting answers to problems.
– pipe, 10 hours ago
Says the SO community rules.
– Pedro Lobito, 2 hours ago
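As a small sketch related to the sed answer above: mosvy's comment on the question adds -E and a + quantifier, so each run of unwanted characters collapses into a single space inside sed itself and column is not needed to tidy the spacing. The variable names line and cleaned are made up for this illustration, and the input is one sample line from the question fed on stdin rather than a real file.

```shell
# One sample line from the question (single quotes keep $ and ! literal).
line='chr2#@!!~//=^%$74711&&*&127472363@Pos1%%0^^+'

# [^a-zA-Z0-9+-] matches anything that is not a letter, digit, plus or minus;
# the trailing - inside the brackets is a literal minus, not a range.
# With -E, the + after the bracket turns each run of junk into one space.
cleaned=$(printf '%s\n' "$line" | sed -E 's/[^a-zA-Z0-9+-]+/ /g')

printf '%s\n' "$cleaned"
# prints: chr2 74711 127472363 Pos1 0 +
```

On real data the same expression would be applied to the file name instead of stdin, as in the comment.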
8 votes
With tr, transliterating characters from the complement of the wanted set to spaces, and squeezing repeats:
$ tr -sc '[:alnum:][:space:]+-' ' ' < data
chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -
answered 20 hours ago by steeldriver (edited 20 hours ago)
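To illustrate why [:space:] is part of the kept set in the tr command above, here is a minimal sketch on two sample lines from the question. Because newlines belong to [:space:], they are not translated, so each record stays on its own line. The variable names data and cleaned are hypothetical and exist only for this example.

```shell
# Two sample records from the question, separated by a newline.
data='chr2#@!!~//=^%$74711&&*&127472363@Pos1%%0^^+
chr3#@!!~//=^%$12747&&*&127477031@Neg1%%0^^-'

# -c takes the complement of the first set (everything NOT alnum, space, + or -),
# each such character is translated to a space,
# and -s squeezes the resulting runs of spaces into one.
cleaned=$(printf '%s\n' "$data" | tr -sc '[:alnum:][:space:]+-' ' ')

printf '%s\n' "$cleaned"
# prints:
# chr2 74711 127472363 Pos1 0 +
# chr3 12747 127477031 Neg1 0 -
```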
0 votes
An awk solution:
awk -F '[^[:alnum:]+-]+' '{$1=$1;print}' file
answered 6 hours ago by iruvar
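A short sketch of how the awk approach above works, run on one sample line from the question. The field separator is a regular expression matching any run of unwanted characters, and assigning $1 to itself makes awk rebuild the record with the output field separator, a single space by default. This sketch spells the class out as a-zA-Z0-9 instead of [:alnum:]; that swap is an assumption made here for portability, since some older awk implementations may not support POSIX character classes. The variable names line and cleaned are hypothetical.

```shell
# One sample line from the question.
line='chr7#@!!~//=^%$74781&&*&127479365@Neg3%%0^^-'

# -F: split fields on runs of characters that are not letters, digits, + or -.
# $1 = $1: touching a field forces awk to rejoin all fields with OFS (a space).
# print: output the rebuilt record.
cleaned=$(printf '%s\n' "$line" | awk -F '[^a-zA-Z0-9+-]+' '{ $1 = $1; print }')

printf '%s\n' "$cleaned"
# prints: chr7 74781 127479365 Neg3 0 -
```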