Find how many times a certain DNA base sequence occurs in a file

The assignment is to write a bash script named “countmatches” that displays the number of times a given sequence, such as aac, appears in a specified file. The script should expect at least two arguments. The first argument must be the pathname of a file containing a valid DNA string, which we are given. The remaining argument(s) are strings containing only the bases a, c, g, and t in any order.
For each valid argument string, the script searches the DNA string in the file and counts the non-overlapping occurrences of that argument string.



For example, if the string aaccgtttgtaaccggaac is in a file named dnafile, the script should work as follows:



$ countmatches dnafile ttt
ttt 1


Here the command is countmatches dnafile ttt and the output is ttt 1, showing that ttt appears once.



This is my script:



#!/bin/bash
for /data/biocs/b/student.accounts/cs132/data/dna_textfiles
do
count=$grep -o '[acgt][acgt][acgt]' /data/biocs/b/student.accounts/cs132/data/dna_textfiles | wc -w
echo $/data/biocs/b/student.accounts/cs132/data/dna_textfiles $count
done


And this is the error I get:



[Osama.Chaudry07@cslab5 assignment3]$ ./countmatches /data/biocs/b/student.accounts/cs132/data/dna_textfiles aac
./countmatches: line 6: '/data/biocs/b/student.accounts/cs132/data/dna_textfiles': not a valid identifier









  • We're not a script-writing service, but people here will be happy to help you when you hit specific issues with a script you've written.
    – roaima
    4 hours ago

  • Oh. That screenshot is a script, is it? Please don't post pictures of text. They're harder to read, impossible for people who need screenreaders, and not good for search engines.
    – roaima
    3 hours ago

  • dna_textfiles is a file with nothing but a sequence of the letters a, c, g, and t. This is the file for which we have to write a script that will show how many times a certain sequence such as aac comes up.
    – Chaudry Osama
    3 hours ago

  • @Goro yes, that is the goal: to enter any sequence that is present in the large dna_textfiles and have the output be how many times that sequence appears. I wrote a script for it but it isn't achieving what I want it to and I don't understand where I went wrong.
    – Chaudry Osama
    2 hours ago

  • @Goro all of the repeats in a sequence. For example, in the sequence aaccgtttgtaaccggaac, if I were to input the sequence aac, it would show that it comes up 3 times.
    – Chaudry Osama
    2 hours ago














text-processing scripting bioinformatics






asked 4 hours ago – Chaudry Osama
edited 37 mins ago – G-Man

1 Answer
cat dna_textfile 
aaccgtttgtaaccggaac

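#!/bin/bash
# Path to the file holding the DNA string
dna_file=/autofs/cluster/atassigp/garbage/dna_textfiles
# Prompt in red
printf '\e[31mnucleotide sequence?:'
# Read at most 3 characters; -r keeps backslashes literal
read -r -e -n 3 userInput
# Keep asking until something is entered
while [[ -z "$userInput" ]]
do
    read -r -e -n 3 userInput
done

# grep -o prints each non-overlapping match on its own line; wc -l counts them
count=$(grep -o -- "$userInput" "$dna_file" | wc -l)

echo "$userInput, $count"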


output:



 ttt, 1
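
A note on why this counts non-overlapping occurrences, which is what the assignment asks for: grep -o prints every match on its own line and resumes searching after the end of the previous match, so matches never overlap. A minimal sketch to check this, where count_nonoverlapping is a made-up helper name and -F (treat the pattern as a fixed string) is my addition rather than part of the script above:

```shell
#!/bin/bash
# count_nonoverlapping is a hypothetical helper name, not part of the answer.
# It reads the DNA string from standard input and prints the number of
# non-overlapping occurrences of the fixed string given as $1.
count_nonoverlapping() {
    # -o: print each match on its own line
    # -F: treat the pattern as a literal string, not a regular expression
    # grep exits non-zero when nothing matches, but wc -l still prints 0
    grep -o -F -- "$1" | wc -l
}

# "aa" could start at positions 0, 1 and 2 of "aaaa" if overlaps were allowed
printf 'aaaa\n' | count_nonoverlapping aa

# The example string from the question
printf 'aaccgtttgtaaccggaac\n' | count_nonoverlapping aac
```

The first call should report 2 rather than 3, since aa is matched at positions 0 and 2 only. The second should report 3 for the example string from the question.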



#!/bin/bash
# First argument: the DNA file; second argument: the sequence to count

dir=$1
base=$2

# Count the non-overlapping matches of the sequence in the file
count=$(grep -o -- "$base" "$dir" | wc -l)

echo "$base, $count"


output:



$ ./countmatches dnafile ttt
ttt, 1
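
The assignment also asks for more than one sequence per call and for output like ttt 1, which the two-argument script above doesn't cover. A possible extension, as a sketch only: the function layout, the usage and error messages, and the choice to skip invalid sequences with a warning are my assumptions, not requirements from the question. It also shows the for NAME in LIST form; the "not a valid identifier" error in the question comes from giving for a path where it expects a variable name.

```shell
#!/bin/bash
# Sketch: countmatches FILE SEQ [SEQ ...]
# Prints "SEQ COUNT" for each valid sequence, as in the question's example.
countmatches() {
    if (( $# < 2 )); then
        echo "usage: countmatches file sequence [sequence ...]" >&2
        return 1
    fi

    local file=$1 seq count
    shift    # the remaining arguments are the sequences

    if [[ ! -r $file ]]; then
        echo "countmatches: cannot read $file" >&2
        return 1
    fi

    # for needs a variable name (seq) before "in"; the list is the remaining arguments
    for seq in "$@"; do
        # Assumed handling: skip anything that is not made up of a, c, g, t only
        if [[ ! $seq =~ ^[acgt]+$ ]]; then
            echo "countmatches: skipping invalid sequence: $seq" >&2
            continue
        fi
        # grep -o prints one line per non-overlapping match; wc -l counts the lines
        count=$(grep -o -F -- "$seq" "$file" | wc -l)
        # $((count)) drops the padding that some wc implementations add
        echo "$seq $((count))"
    done
}

# Demo on a temporary file holding the question's example string
tmp=$(mktemp)
printf 'aaccgtttgtaaccggaac\n' > "$tmp"
countmatches "$tmp" ttt aac xyz
rm -f "$tmp"
```

On the example string this should print ttt 1 and aac 3, with a warning on stderr for xyz. In a real countmatches file, the demo lines at the bottom would be replaced by countmatches "$@".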





answered 2 hours ago – Goro
edited 30 mins ago