Find how many times a certain DNA base sequence occurs in a file

The assignment is to write a bash script named “countmatches” that will display the number of times a certain sequence, such as aac, appears in a specified file. The script should expect at least two arguments, the first of which must be the pathname of a file containing a valid DNA string, which we are given. The remaining argument(s) are strings containing only the bases a, c, g, and t in any order.
For each valid argument string, it will search the DNA string in the file and count how many non-overlapping occurrences of that argument string are in the DNA string (i.e., the file).

For example, if the string aaccgtttgtaaccggaac is in a file named dnafile, then the script should work as follows:

$ countmatches dnafile ttt
ttt 1

Here the command is countmatches dnafile ttt and the output is ttt 1, showing that ttt appears once.

This is my script:

#!/bin/bash
for /data/biocs/b/student.accounts/cs132/data/dna_textfiles
do
 count=$grep -o '[acgt][acgt][acgt]' /data/biocs/b/student.accounts/cs132/data/dna_textfiles | wc -w
 echo $/data/biocs/b/student.accounts/cs132/data/dna_textfiles $count
done

and this is the error I get

[Osama.Chaudry07@cslab5 assignment3]$ ./countmatches /data/biocs/b/student.accounts/cs132/data/dna_textfiles aac
./countmatches: line 6: '/data/biocs/b/student.accounts/cs132/data/dna_textfiles': not a valid identifier

edited 37 mins ago by G-Man
asked 4 hours ago by Chaudry Osama (new contributor)

We're not a script-writing service, but people here will be happy to help you when you hit specific issues with a script you've written.
– roaima
4 hours ago

Oh. That screenshot is a script, is it? Please don't post pictures of text. They're harder to read, impossible for people who need screenreaders, and not good for search engines.
– roaima
3 hours ago

dna_textfiles is a file with nothing but a sequence of the letters a, c, g, and t. This is the file for which we have to write a script that will show how many times a certain sequence, such as aac, comes up.
– Chaudry Osama
3 hours ago

@Goro yes, that is the goal: to enter any sequence that is present in the large dna_textfiles and have the output be how many times that sequence appears. I wrote a script for it, but it isn't achieving what I want and I don't understand where I went wrong.
– Chaudry Osama
2 hours ago

@Goro all of the repeats in a sequence. For example, in the sequence aaccgtttgtaaccggaac, if I were to input the sequence aac, it would show that it comes up 3 times.
– Chaudry Osama
2 hours ago

text-processing scripting bioinformatics

1 Answer

$ cat dna_textfile
aaccgtttgtaaccggaac

#!/bin/bash
# Interactive version: prompt for a three-base sequence and count its matches.
dna_file=/autofs/cluster/atassigp/garbage/dna_textfiles
printf '\e[31mnucleotide sequence?: '
read -en 3 userInput
while [[ -z "$userInput" ]]
do
    read -en 3 userInput
done

# grep -o prints each non-overlapping match on its own line; wc -l counts them.
count=$(grep -o "$userInput" "$dna_file" | wc -l)

echo "$userInput", $count

output:

ttt, 1

#!/bin/bash
# First argument: path to the DNA file; second argument: sequence to count.

dna_file=$1
base=$2

count=$(grep -o "$base" "$dna_file" | wc -l)

echo "$base", $count

output:

$ ./countmatches dnafile ttt
ttt, 1
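
The second script above handles exactly one sequence, while the assignment allows several and asks that only valid ones be counted. The following is a minimal sketch of one way to extend it, not a definitive solution. It makes several assumptions: the function name countmatches is taken from the assignment, the variable names are illustrative, invalid arguments are skipped with a warning (the assignment does not say what to do with them), the installed grep supports -o (a GNU/BSD extension, not POSIX), and no match spans a line break in the DNA file.

```shell
#!/bin/bash
# Sketch: count non-overlapping occurrences of each sequence in a DNA file.
# Intended call: countmatches FILE SEQUENCE [SEQUENCE...]

countmatches() {
    if (( $# < 2 )); then
        echo "usage: countmatches file sequence..." >&2
        return 1
    fi

    local file=$1
    shift

    if [[ ! -r $file ]]; then
        echo "countmatches: cannot read $file" >&2
        return 1
    fi

    local pattern count
    for pattern in "$@"; do
        # Assumption: arguments with anything other than a, c, g, t are skipped.
        if [[ ! $pattern =~ ^[acgt]+$ ]]; then
            echo "countmatches: skipping invalid sequence: $pattern" >&2
            continue
        fi
        # grep -o prints each non-overlapping match on its own line;
        # -F treats the pattern as a fixed string; wc -l counts the lines.
        count=$(grep -oF -- "$pattern" "$file" | wc -l)
        # Arithmetic expansion drops the padding some wc versions add.
        echo "$pattern $((count))"
    done
}

# Run only when enough arguments are given, so the file can also be sourced.
if (( $# >= 2 )); then
    countmatches "$@"
fi
```

Saved as countmatches and made executable, ./countmatches dnafile ttt aac should print ttt 1 and aac 3 for the example string aaccgtttgtaaccggaac, in the space-separated format the assignment shows.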

edited 30 mins ago
answered 2 hours ago by Goro
