Cleaning a genes database polluted by non-numeric characters except plus and minus signs

I have a gene database that has been polluted with extra non-alphanumeric characters. This happened when a sort of encryption was applied to the data incorrectly, and I don't know how to clean it up. I tried sed and awk, but failed. This is a sample of the data, which spans a very large number of documents:



chr2#@!!~//=^%$74711&&*&127472363@Pos1%%0^^+
chr3#@!!~//=^%$74723&&*&127473530@Pos2%%0^^+
chr1#@!!~//=^%$73530&&*&127474697@Pos3%%0^^+
chr2#@!!~//=^%$17469&&*&127475864@Pos4%%0^^+
chr3#@!!~//=^%$12747&&*&127477031@Neg1%%0^^-
chr5#@!!~//=^%$17477&&*&127478198@Neg2%%0^^-
chr7#@!!~//=^%$74781&&*&127479365@Neg3%%0^^-
chr7#@!!~//=^%$74795&&*&127480532@Pos5%%0^^+
chr1#@!!~//=^%$12748&&*&127481699@Neg4%%0^^-


The cleaned data must be like this:



chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -


How can I do this?







text-processing awk sed bioinformatics






edited 20 mins ago by Peter Mortensen
asked 21 hours ago by marco
  • try sed -E 's/[^a-zA-Z0-9+-]+/ /g' file
    – mosvy
    20 hours ago






  • What have you tried? I don't see any code showing effort to solve the problem.
    – Pedro Lobito
    12 hours ago
















3 Answers
accepted

You can do it with sed. Replace every character other than letters, digits, + and - with a space, then pipe the result through column -t to collapse the resulting runs of spaces into columns:

sed 's/[^a-zA-Z0-9+-]/ /g' file | column -tc2

chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -
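
If column is not available, the same cleanup can probably be done in a single sed pass. The following is only a sketch, assuming a sed that accepts -E (GNU and BSD sed both do); clean_genes is a hypothetical helper name, not part of the answer above.

```shell
# clean_genes is a hypothetical helper name used only for this sketch.
clean_genes() {
    # LC_ALL=C limits the a-z and A-Z ranges to ASCII letters.
    # The + after the bracket expression turns each run of unwanted
    # characters into a single space, so no column pass is needed.
    LC_ALL=C sed -E 's/[^a-zA-Z0-9+-]+/ /g' "$@"
}

# Example: clean one polluted line taken from the question.
printf '%s\n' 'chr2#@!!~//=^%$74711&&*&127472363@Pos1%%0^^+' | clean_genes
# prints: chr2 74711 127472363 Pos1 0 +
```

Called with file names, as in clean_genes file, it reads those files instead of standard input.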





answered 21 hours ago by Goro

  • THIS IS AMAZING!!!!!!!!!!!!! Thank you @goro, I didn't know how to keep - and +
    – marco
    20 hours ago

  • Providing answers to users who didn't show any effort to solve the problem isn't encouraged.
    – Pedro Lobito
    12 hours ago

  • @PedroLobito Says who? This isn't a teaching website, it's a website dedicated to collecting answers to problems.
    – pipe
    10 hours ago

  • Says the SO community rules.
    – Pedro Lobito
    2 hours ago












With tr, translate every character in the complement (-c) of the wanted set to a space and squeeze (-s) the repeated spaces. Including [:space:] in the set keeps the newlines intact:

$ tr -sc '[:alnum:][:space:]+-' ' ' < data
chr2 74711 127472363 Pos1 0 +
chr3 74723 127473530 Pos2 0 +
chr1 73530 127474697 Pos3 0 +
chr2 17469 127475864 Pos4 0 +
chr3 12747 127477031 Neg1 0 -
chr5 17477 127478198 Neg2 0 -
chr7 74781 127479365 Neg3 0 -
chr7 74795 127480532 Pos5 0 +
chr1 12748 127481699 Neg4 0 -
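
Since the data set is described as very large, it may be worth checking that every cleaned line really has the six fields shown in the expected output. This is a minimal sketch under that assumption; count_bad_lines is a hypothetical helper name, and the second input line in the example is made up to show a failing case.

```shell
# count_bad_lines is a hypothetical helper name used only for this sketch.
count_bad_lines() {
    # Clean standard input with the tr command from the answer, then
    # count the lines whose number of fields is not exactly 6.
    LC_ALL=C tr -sc '[:alnum:][:space:]+-' ' ' |
        awk 'NF != 6 { bad++ } END { print bad + 0 }'
}

# Example: one well-formed line and one made-up truncated line.
printf '%s\n' 'chr2#@!!~//=^%$74711&&*&127472363@Pos1%%0^^+' 'chr3#@!!12747' |
    count_bad_lines
# prints: 1
```

A result of 0 suggests that every line was split into the expected six columns.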






edited 20 hours ago
answered 20 hours ago by steeldriver

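An awk solution: split each line on runs of unwanted characters, then assign $1 to itself so that awk rebuilds the record with a single space between fields:

awk -F '[^[:alnum:]+-]+' '{ $1 = $1; print }' file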
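
If downstream tools expect tab-separated columns, the same idea can likely be adapted by setting the output field separator. The sketch below makes that assumption; clean_genes_tsv is a hypothetical helper name, and the explicit A-Za-z0-9 ranges are used in place of [:alnum:] only because some older awk implementations do not support POSIX character classes.

```shell
# clean_genes_tsv is a hypothetical helper name used only for this sketch.
clean_genes_tsv() {
    # -F splits on runs of characters that are not letters, digits, + or -.
    # Assigning $1 to itself makes awk rebuild the record using OFS (a tab).
    LC_ALL=C awk -F '[^A-Za-z0-9+-]+' -v OFS='\t' '{ $1 = $1; print }' "$@"
}

# Example: clean one polluted line taken from the question.
printf '%s\n' 'chr7#@!!~//=^%$74795&&*&127480532@Pos5%%0^^+' | clean_genes_tsv
```

The output has the same six fields as the space-separated version, joined by tabs.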






answered 6 hours ago by iruvar