Sequence alignment using Markov Model

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP











up vote
6
down vote

favorite












I am learning about applying Markov model to sequence alignment. The prof says that the transition probabilities from a gap-residue alignment to a residue-gap alignment and vice versa are both 0. Is there any biological/mathematical reason behind this statement? Why are the (X,Y) and (Y,X) cell 0? This is a lecture slide of lecture 1, week 4 of the "Bioinformatics: Introdcution and Methods" course on coursera.
enter image description here










share|improve this question







New contributor




Zeyuan is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.























    up vote
    6
    down vote

    favorite












    I am learning about applying Markov model to sequence alignment. The prof says that the transition probabilities from a gap-residue alignment to a residue-gap alignment and vice versa are both 0. Is there any biological/mathematical reason behind this statement? Why are the (X,Y) and (Y,X) cell 0? This is a lecture slide of lecture 1, week 4 of the "Bioinformatics: Introdcution and Methods" course on coursera.
    enter image description here










    share|improve this question







    New contributor




    Zeyuan is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.





















      up vote
      6
      down vote

      favorite









      up vote
      6
      down vote

      favorite











      I am learning about applying Markov model to sequence alignment. The prof says that the transition probabilities from a gap-residue alignment to a residue-gap alignment and vice versa are both 0. Is there any biological/mathematical reason behind this statement? Why are the (X,Y) and (Y,X) cell 0? This is a lecture slide of lecture 1, week 4 of the "Bioinformatics: Introdcution and Methods" course on coursera.
      enter image description here










      share|improve this question







      New contributor




      Zeyuan is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.











      I am learning about applying Markov model to sequence alignment. The prof says that the transition probabilities from a gap-residue alignment to a residue-gap alignment and vice versa are both 0. Is there any biological/mathematical reason behind this statement? Why are the (X,Y) and (Y,X) cell 0? This is a lecture slide of lecture 1, week 4 of the "Bioinformatics: Introdcution and Methods" course on coursera.
      enter image description here







      alignment hidden-markov-models






      share|improve this question







      New contributor




      Zeyuan is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.











      share|improve this question







      New contributor




      Zeyuan is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      share|improve this question




      share|improve this question






      New contributor




      Zeyuan is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      asked yesterday









      Zeyuan

      333




      333




      New contributor




      Zeyuan is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.





      New contributor





      Zeyuan is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






      Zeyuan is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.




















          2 Answers
          2






          active

          oldest

          votes

















          up vote
          6
          down vote



          accepted










          If I understand your question correctly, then I think for case of pairwise alignment, there is a simple explanation.



          I believe the key insight is that: a mismatch should always score better than a gap.*



          This follows biologically since the insertion/deletion (indel) rate is roughly 1/10th that of the substitution rate (i.e. the occurrence of single nucleotide changes), at least in vertebrates. (This varies across the tree of life but I think the substitution rate virtually always exceeds the indel rate.)



          To understand why this matters, consider an example:



          ATG-AG
          ATGT-G


          This is an 'impossible alignment' under the probabilities you gave since here we have a transition from a gap-residue alignment to a residue-gap.



          However, under our assumption that mismatches are more likely biologically than indels, the correct alignment should be:



          ATGAG
          ATGTG


          Indeed, the latter does look like a better alignment.



          This also follows for more complex examples, so this:



          ATG--AAG
          ATGTT-AG


          Becomes this:



          ATG-AAG
          ATGTTAG


          (Or this:



          ATGA-AG
          ATGTTAG


          )



          * Strictly, I mean a substitution should score better than an indel (with the associated gap opening and extension penalties). In fact, for the assumption to always be true, a run of mismatches should still score worse than a single indel. This may not always be a correct assumption, consider this example below, is the true alignment case 1) or 2) or something else? Or is in fact a global alignment bad here and this should be split into 2 local alignments? Is there a likely biological mutational event that could explain this? I ask these questions just to point out it is not black-and-white, I don't have clear answers



          1)



          CGTACGTAGAGGAATGCCCCCCCCC--------AGCAACGTAGCAT 
          CGTACGTAGAGGAATG---------TTTTTTTTAGCAACGTAGCAT


          2)



          CGTACGTAGAGGAATGCCCCCCCCCAGCAACGTAGCAT 
          CGTACGTAGAGGAATGTTTTTTTT-AGCAACGTAGCAT





          share|improve this answer





























            up vote
            1
            down vote













            All Chris_Rands said is correct: you set the probability of $Xto Y$ and $Yto X$ to 0 to forbid adjacent insertions/deletions in the alignment. A lot of textbooks including some classical ones use this rule, but in fact, the rule is questionable. It is easier to see this from Smith-Waterman alignment under the affine gap penalty, which is largely the non-probabilistic view of paired HMM.



            With the affine gap penalty, a gap of length $k>0$ is scored as
            $$
            g(k)=-(d+kcdot e)
            $$
            where $dge0$ is the gap open penalty and $e>0$ is the gap extension penalty. Suppose we are using a simple scoring matrix where a mismatch gets $-b$, $b>0$. We may see an insertion immediately followed by a deletion (and vice versa) if $b>2e$. It is actually not so difficult for this to happen. For example, for the human-mouse alignment (see the blastz paper), $e=30$ and $b$ is ranged from 31 to 125. It is possible that an $Xto Y$ transition is preferred in the alignment.



            Theoretically speaking, it makes more sense to consider immediate transitions between insertions and deletions. In practice, though, the difference between allowing/disallowing such transitions is probably minor most of time.






            share|improve this answer




















              Your Answer




              StackExchange.ifUsing("editor", function ()
              return StackExchange.using("mathjaxEditing", function ()
              StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
              StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
              );
              );
              , "mathjax-editing");

              StackExchange.ready(function()
              var channelOptions =
              tags: "".split(" "),
              id: "676"
              ;
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function()
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled)
              StackExchange.using("snippets", function()
              createEditor();
              );

              else
              createEditor();

              );

              function createEditor()
              StackExchange.prepareEditor(
              heartbeatType: 'answer',
              convertImagesToLinks: false,
              noModals: false,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: null,
              bindNavPrevention: true,
              postfix: "",
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              );



              );






              Zeyuan is a new contributor. Be nice, and check out our Code of Conduct.









               

              draft saved


              draft discarded


















              StackExchange.ready(
              function ()
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f5017%2fsequence-alignment-using-markov-model%23new-answer', 'question_page');

              );

              Post as a guest






























              2 Answers
              2






              active

              oldest

              votes








              2 Answers
              2






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes








              up vote
              6
              down vote



              accepted










              If I understand your question correctly, then I think for case of pairwise alignment, there is a simple explanation.



              I believe the key insight is that: a mismatch should always score better than a gap.*



              This follows biologically since the insertion/deletion (indel) rate is roughly 1/10th that of the substitution rate (i.e. the occurrence of single nucleotide changes), at least in vertebrates. (This varies across the tree of life but I think the substitution rate virtually always exceeds the indel rate.)



              To understand why this matters, consider an example:



              ATG-AG
              ATGT-G


              This is an 'impossible alignment' under the probabilities you gave since here we have a transition from a gap-residue alignment to a residue-gap.



              However, under our assumption that mismatches are more likely biologically than indels, the correct alignment should be:



              ATGAG
              ATGTG


              Indeed, the latter does look like a better alignment.



              This also follows for more complex examples, so this:



              ATG--AAG
              ATGTT-AG


              Becomes this:



              ATG-AAG
              ATGTTAG


              (Or this:



              ATGA-AG
              ATGTTAG


              )



              * Strictly, I mean a substitution should score better than an indel (with the associated gap opening and extension penalties). In fact, for the assumption to always be true, a run of mismatches should still score worse than a single indel. This may not always be a correct assumption, consider this example below, is the true alignment case 1) or 2) or something else? Or is in fact a global alignment bad here and this should be split into 2 local alignments? Is there a likely biological mutational event that could explain this? I ask these questions just to point out it is not black-and-white, I don't have clear answers



              1)



              CGTACGTAGAGGAATGCCCCCCCCC--------AGCAACGTAGCAT 
              CGTACGTAGAGGAATG---------TTTTTTTTAGCAACGTAGCAT


              2)



              CGTACGTAGAGGAATGCCCCCCCCCAGCAACGTAGCAT 
              CGTACGTAGAGGAATGTTTTTTTT-AGCAACGTAGCAT





              share|improve this answer


























                up vote
                6
                down vote



                accepted










                If I understand your question correctly, then I think for case of pairwise alignment, there is a simple explanation.



                I believe the key insight is that: a mismatch should always score better than a gap.*



                This follows biologically since the insertion/deletion (indel) rate is roughly 1/10th that of the substitution rate (i.e. the occurrence of single nucleotide changes), at least in vertebrates. (This varies across the tree of life but I think the substitution rate virtually always exceeds the indel rate.)



                To understand why this matters, consider an example:



                ATG-AG
                ATGT-G


                This is an 'impossible alignment' under the probabilities you gave since here we have a transition from a gap-residue alignment to a residue-gap.



                However, under our assumption that mismatches are more likely biologically than indels, the correct alignment should be:



                ATGAG
                ATGTG


                Indeed, the latter does look like a better alignment.



                This also follows for more complex examples, so this:



                ATG--AAG
                ATGTT-AG


                Becomes this:



                ATG-AAG
                ATGTTAG


                (Or this:



                ATGA-AG
                ATGTTAG


                )



                * Strictly, I mean a substitution should score better than an indel (with the associated gap opening and extension penalties). In fact, for the assumption to always be true, a run of mismatches should still score worse than a single indel. This may not always be a correct assumption, consider this example below, is the true alignment case 1) or 2) or something else? Or is in fact a global alignment bad here and this should be split into 2 local alignments? Is there a likely biological mutational event that could explain this? I ask these questions just to point out it is not black-and-white, I don't have clear answers



                1)



                CGTACGTAGAGGAATGCCCCCCCCC--------AGCAACGTAGCAT 
                CGTACGTAGAGGAATG---------TTTTTTTTAGCAACGTAGCAT


                2)



                CGTACGTAGAGGAATGCCCCCCCCCAGCAACGTAGCAT 
                CGTACGTAGAGGAATGTTTTTTTT-AGCAACGTAGCAT





                share|improve this answer
























                  up vote
                  6
                  down vote



                  accepted







                  up vote
                  6
                  down vote



                  accepted






                  If I understand your question correctly, then I think for case of pairwise alignment, there is a simple explanation.



                  I believe the key insight is that: a mismatch should always score better than a gap.*



                  This follows biologically since the insertion/deletion (indel) rate is roughly 1/10th that of the substitution rate (i.e. the occurrence of single nucleotide changes), at least in vertebrates. (This varies across the tree of life but I think the substitution rate virtually always exceeds the indel rate.)



                  To understand why this matters, consider an example:



                  ATG-AG
                  ATGT-G


                  This is an 'impossible alignment' under the probabilities you gave since here we have a transition from a gap-residue alignment to a residue-gap.



                  However, under our assumption that mismatches are more likely biologically than indels, the correct alignment should be:



                  ATGAG
                  ATGTG


                  Indeed, the latter does look like a better alignment.



                  This also follows for more complex examples, so this:



                  ATG--AAG
                  ATGTT-AG


                  Becomes this:



                  ATG-AAG
                  ATGTTAG


                  (Or this:



                  ATGA-AG
                  ATGTTAG


                  )



                  * Strictly, I mean a substitution should score better than an indel (with the associated gap opening and extension penalties). In fact, for the assumption to always be true, a run of mismatches should still score worse than a single indel. This may not always be a correct assumption, consider this example below, is the true alignment case 1) or 2) or something else? Or is in fact a global alignment bad here and this should be split into 2 local alignments? Is there a likely biological mutational event that could explain this? I ask these questions just to point out it is not black-and-white, I don't have clear answers



                  1)



                  CGTACGTAGAGGAATGCCCCCCCCC--------AGCAACGTAGCAT 
                  CGTACGTAGAGGAATG---------TTTTTTTTAGCAACGTAGCAT


                  2)



                  CGTACGTAGAGGAATGCCCCCCCCCAGCAACGTAGCAT 
                  CGTACGTAGAGGAATGTTTTTTTT-AGCAACGTAGCAT





                  share|improve this answer














                  If I understand your question correctly, then I think for case of pairwise alignment, there is a simple explanation.



                  I believe the key insight is that: a mismatch should always score better than a gap.*



                  This follows biologically since the insertion/deletion (indel) rate is roughly 1/10th that of the substitution rate (i.e. the occurrence of single nucleotide changes), at least in vertebrates. (This varies across the tree of life but I think the substitution rate virtually always exceeds the indel rate.)



                  To understand why this matters, consider an example:



                  ATG-AG
                  ATGT-G


                  This is an 'impossible alignment' under the probabilities you gave since here we have a transition from a gap-residue alignment to a residue-gap.



                  However, under our assumption that mismatches are more likely biologically than indels, the correct alignment should be:



                  ATGAG
                  ATGTG


                  Indeed, the latter does look like a better alignment.



                  This also follows for more complex examples, so this:



                  ATG--AAG
                  ATGTT-AG


                  Becomes this:



                  ATG-AAG
                  ATGTTAG


                  (Or this:



                  ATGA-AG
                  ATGTTAG


                  )



                  * Strictly, I mean a substitution should score better than an indel (with the associated gap opening and extension penalties). In fact, for the assumption to always be true, a run of mismatches should still score worse than a single indel. This may not always be a correct assumption, consider this example below, is the true alignment case 1) or 2) or something else? Or is in fact a global alignment bad here and this should be split into 2 local alignments? Is there a likely biological mutational event that could explain this? I ask these questions just to point out it is not black-and-white, I don't have clear answers



                  1)



                  CGTACGTAGAGGAATGCCCCCCCCC--------AGCAACGTAGCAT 
                  CGTACGTAGAGGAATG---------TTTTTTTTAGCAACGTAGCAT


                  2)



                  CGTACGTAGAGGAATGCCCCCCCCCAGCAACGTAGCAT 
                  CGTACGTAGAGGAATGTTTTTTTT-AGCAACGTAGCAT






                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited 22 hours ago

























                  answered yesterday









                  Chris_Rands

                  1,100319




                  1,100319




















                      up vote
                      1
                      down vote













                      All Chris_Rands said is correct: you set the probability of $Xto Y$ and $Yto X$ to 0 to forbid adjacent insertions/deletions in the alignment. A lot of textbooks including some classical ones use this rule, but in fact, the rule is questionable. It is easier to see this from Smith-Waterman alignment under the affine gap penalty, which is largely the non-probabilistic view of paired HMM.



                      With the affine gap penalty, a gap of length $k>0$ is scored as
                      $$
                      g(k)=-(d+kcdot e)
                      $$
                      where $dge0$ is the gap open penalty and $e>0$ is the gap extension penalty. Suppose we are using a simple scoring matrix where a mismatch gets $-b$, $b>0$. We may see an insertion immediately followed by a deletion (and vice versa) if $b>2e$. It is actually not so difficult for this to happen. For example, for the human-mouse alignment (see the blastz paper), $e=30$ and $b$ is ranged from 31 to 125. It is possible that an $Xto Y$ transition is preferred in the alignment.



                      Theoretically speaking, it makes more sense to consider immediate transitions between insertions and deletions. In practice, though, the difference between allowing/disallowing such transitions is probably minor most of time.






                      share|improve this answer
























                        up vote
                        1
                        down vote













                        All Chris_Rands said is correct: you set the probability of $Xto Y$ and $Yto X$ to 0 to forbid adjacent insertions/deletions in the alignment. A lot of textbooks including some classical ones use this rule, but in fact, the rule is questionable. It is easier to see this from Smith-Waterman alignment under the affine gap penalty, which is largely the non-probabilistic view of paired HMM.



                        With the affine gap penalty, a gap of length $k>0$ is scored as
                        $$
                        g(k)=-(d+kcdot e)
                        $$
                        where $dge0$ is the gap open penalty and $e>0$ is the gap extension penalty. Suppose we are using a simple scoring matrix where a mismatch gets $-b$, $b>0$. We may see an insertion immediately followed by a deletion (and vice versa) if $b>2e$. It is actually not so difficult for this to happen. For example, for the human-mouse alignment (see the blastz paper), $e=30$ and $b$ is ranged from 31 to 125. It is possible that an $Xto Y$ transition is preferred in the alignment.



                        Theoretically speaking, it makes more sense to consider immediate transitions between insertions and deletions. In practice, though, the difference between allowing/disallowing such transitions is probably minor most of time.






                        share|improve this answer






















                          up vote
                          1
                          down vote










                          up vote
                          1
                          down vote









                          All Chris_Rands said is correct: you set the probability of $Xto Y$ and $Yto X$ to 0 to forbid adjacent insertions/deletions in the alignment. A lot of textbooks including some classical ones use this rule, but in fact, the rule is questionable. It is easier to see this from Smith-Waterman alignment under the affine gap penalty, which is largely the non-probabilistic view of paired HMM.



                          With the affine gap penalty, a gap of length $k>0$ is scored as
                          $$
                          g(k)=-(d+kcdot e)
                          $$
                          where $dge0$ is the gap open penalty and $e>0$ is the gap extension penalty. Suppose we are using a simple scoring matrix where a mismatch gets $-b$, $b>0$. We may see an insertion immediately followed by a deletion (and vice versa) if $b>2e$. It is actually not so difficult for this to happen. For example, for the human-mouse alignment (see the blastz paper), $e=30$ and $b$ is ranged from 31 to 125. It is possible that an $Xto Y$ transition is preferred in the alignment.



                          Theoretically speaking, it makes more sense to consider immediate transitions between insertions and deletions. In practice, though, the difference between allowing/disallowing such transitions is probably minor most of time.






                          share|improve this answer












                          All Chris_Rands said is correct: you set the probability of $Xto Y$ and $Yto X$ to 0 to forbid adjacent insertions/deletions in the alignment. A lot of textbooks including some classical ones use this rule, but in fact, the rule is questionable. It is easier to see this from Smith-Waterman alignment under the affine gap penalty, which is largely the non-probabilistic view of paired HMM.



                          With the affine gap penalty, a gap of length $k>0$ is scored as
                          $$
                          g(k)=-(d+kcdot e)
                          $$
                          where $dge0$ is the gap open penalty and $e>0$ is the gap extension penalty. Suppose we are using a simple scoring matrix where a mismatch gets $-b$, $b>0$. We may see an insertion immediately followed by a deletion (and vice versa) if $b>2e$. It is actually not so difficult for this to happen. For example, for the human-mouse alignment (see the blastz paper), $e=30$ and $b$ is ranged from 31 to 125. It is possible that an $Xto Y$ transition is preferred in the alignment.



                          Theoretically speaking, it makes more sense to consider immediate transitions between insertions and deletions. In practice, though, the difference between allowing/disallowing such transitions is probably minor most of time.







                          share|improve this answer












                          share|improve this answer



                          share|improve this answer










                          answered 12 hours ago









                          user172818♦

                          3,6241421




                          3,6241421




















                              Zeyuan is a new contributor. Be nice, and check out our Code of Conduct.









                               

                              draft saved


                              draft discarded


















                              Zeyuan is a new contributor. Be nice, and check out our Code of Conduct.












                              Zeyuan is a new contributor. Be nice, and check out our Code of Conduct.











                              Zeyuan is a new contributor. Be nice, and check out our Code of Conduct.













                               


                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function ()
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f5017%2fsequence-alignment-using-markov-model%23new-answer', 'question_page');

                              );

                              Post as a guest













































































                              Comments

                              Popular posts from this blog

                              White Anglo-Saxon Protestant

                              Is the Concept of Multiple Fantasy Races Scientifically Flawed? [closed]

                              One-line joke