Cleaning a genes database polluted by non-alphanumeric characters, except plus and minus signs

I have a genes database that has been completely messed up by extra non-alphanumeric characters. This happened because a sort of encryption was applied to the data incorrectly, and I don't know how to clean it up. I tried sed and awk, but failed. This is a sample of the data, which spans a very large number of documents:



chr2#@!!~//=^%$74711&&*&127472363@Pos1%%0^^+
chr3#@!!~//=^%$74723&&*&127473530@Pos2%%0^^+
chr1#@!!~//=^%$73530&&*&127474697@Pos3%%0^^+
chr2#@!!~//=^%$17469&&*&127475864@Pos4%%0^^+
chr3#@!!~//=^%$12747&&*&127477031@Neg1%%0^^-
chr5#@!!~//=^%$17477&&*&127478198@Neg2%%0^^-
chr7#@!!~//=^%$74781&&*&127479365@Neg3%%0^^-
chr7#@!!~//=^%$74795&&*&127480532@Pos5%%0^^+
chr1#@!!~//=^%$12748&&*&127481699@Neg4%%0^^-


The cleaned data should look like this:



chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -


Please help, this is a very big problem for me!

Tags: text-processing awk sed bioinformatics
asked 51 mins ago by marco
edited 14 mins ago by Jeff Schaller
try sed -E 's/[^a-zA-Z0-9+-]+/ /g' file
– mosvy
46 mins ago
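
To see what the + in this command changes, here is a small comparison on the first sample record. It is only an illustrative sketch; the variable names line, per_char and per_run are made up for the example:

```shell
# One record copied from the question's sample data.
line='chr2#@!!~//=^%$74711&&*&127472363@Pos1%%0^^+'

# Without '+': every unwanted character becomes its own space,
# so the fields end up separated by runs of spaces.
per_char=$(printf '%s\n' "$line" | sed 's/[^a-zA-Z0-9+-]/ /g')

# With -E and '+': each run of unwanted characters becomes one space.
per_run=$(printf '%s\n' "$line" | sed -E 's/[^a-zA-Z0-9+-]+/ /g')

# Brackets make the spacing visible.
printf '[%s]\n[%s]\n' "$per_char" "$per_run"
```

The per-character form keeps the line length unchanged and leaves runs of spaces between the fields, so a second step is needed to tidy them up; the per-run form yields single-space separators directly.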




2 Answers
You can do it with sed: replace every character that is not a letter, a digit, a plus sign, or a minus sign with a space, then pipe the result through column -t to collapse the runs of spaces into aligned columns:

sed 's/[^a-zA-Z0-9+-]/ /g' file | column -tc2

chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -
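
Since the question mentions a very large number of documents, the same substitution can be wrapped in a loop over files. The following is a rough sketch rather than part of the original answer: the function name clean_all and the .clean suffix are invented for illustration, and it swaps in the run-matching form of the pattern (sed -E with a trailing +) so that it does not depend on column being installed:

```shell
# clean_all: hypothetical helper, not from the original answer.
# For every file given as an argument, write a cleaned copy next to it
# with ".clean" appended to the name. The originals are left untouched
# so the result can be inspected before anything is replaced.
clean_all() {
    for f in "$@"; do
        # Each run of characters other than letters, digits, '+' and '-'
        # is replaced by a single space.
        sed -E 's/[^a-zA-Z0-9+-]+/ /g' "$f" > "$f.clean"
    done
}
```

It could be called as clean_all genes1.txt genes2.txt, where the file names are placeholders for the real documents.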






answered 47 mins ago by Goro (accepted answer)
This is amazing! Thank you @goro, I didn't know how to keep - and +.
– marco
35 mins ago












With tr, you can transliterate every character in the complement of the wanted set to a space and squeeze repeated spaces into one:

$ tr -sc '[:alnum:][:space:]+-' ' ' < data
chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -
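
With a large data set it may be worth checking the cleaned output before trusting it. The sketch below assumes that every valid record has exactly six whitespace-separated fields, as in the sample; that assumption and the helper name check_fields are not from the original answer:

```shell
# check_fields: hypothetical helper. Reads cleaned records on stdin and
# reports every line that does not have exactly six whitespace-separated
# fields. Exit status is 1 if at least one such line was found, else 0.
check_fields() {
    awk 'NF != 6 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
         END { exit (bad ? 1 : 0) }'
}

# Clean two records from the question with tr, then validate the result.
printf '%s\n' 'chr7#@!!~//=^%$74781&&*&127479365@Neg3%%0^^-' \
              'chr1#@!!~//=^%$12748&&*&127481699@Neg4%%0^^-' |
    tr -sc '[:alnum:][:space:]+-' ' ' |
    check_fields && echo "all lines have 6 fields"
```

Any line reported by the check would point to a record where the stray characters also split or merged a field, which the character-level cleanup alone cannot repair.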






answered 26 mins ago by steeldriver (edited 16 mins ago)