MuPDF: finding hyphenated words in a PDF file

When I search for a word in a PDF file using MuPDF, it only finds the word if it appears unbroken. For example, searching for “meaningless” finds the word in this sentence:

This is a short, staggeringly meaningless sentence.

However, when a word is wrapped at the end of a line and hyphenated, it is not found. I cannot know in advance whether a word is broken across two lines, and searching for each possible hyphenation explicitly would be too cumbersome. For example, searching for “meaningless” does not find the word here:

This is a short, staggeringly meaning-
less sentence.

The PDF viewer Evince behaves the same way. Is there a (simple) way to make MuPDF find hyphenated terms?

Tags: pdf, evince

edited 3 hours ago by Goro
asked 4 hours ago by Philipp

2 Answers

Accepted answer (score 2):

Note that a PDF doesn't contain the original text, only a description of which glyphs to put where. Searching for text in a PDF depends on (1) the PDF having tables that describe which glyphs correspond to which Unicode characters, (2) a way to reassemble those translated characters into words, and (3) assumptions about how the generating application worked, e.g. that it put down glyphs in text order (which will fail horrendously when, for example, two-column text is rendered in both columns simultaneously).
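
To make point (3) concrete, here is a toy sketch in Python. It is not how MuPDF or any other viewer extracts text. It assumes the glyphs have already been mapped to Unicode characters and are available as hypothetical `(x, y, char)` tuples; `assemble_lines` and `place` are names invented for this illustration.

```python
from collections import defaultdict


def assemble_lines(glyphs):
    """Group (x, y, char) glyphs into text lines: one line per y, ordered by x."""
    rows = defaultdict(list)
    for x, y, ch in glyphs:
        rows[y].append((x, ch))
    # Toy convention: a smaller y is higher up on the page.
    return ["".join(ch for _, ch in sorted(rows[y])) for y in sorted(rows)]


def place(text, x0, y):
    """Hypothetical helper: lay out text one glyph per unit of x, starting at x0."""
    return [(x0 + i, y, ch) for i, ch in enumerate(text)]


# Left column holds "mean-" / "ingless"; the right column starts at x = 10.
glyphs = (
    place("mean-", 0, 0) + place("other ", 10, 0)
    + place("ingless", 0, 1) + place("column", 10, 1)
)
lines = assemble_lines(glyphs)
print(lines)
```

With this two-column layout the naive reassembly yields `mean-other ` and `inglesscolumn`: the hyphen is no longer at the end of a line, so the two halves of the word cannot easily be rejoined without first detecting the columns.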



To take hyphenation into account, you'd have to implement an algorithm that detects dashes at the end of a line (different glyphs could be used for that) and then merges the word, taking special hyphenation rules into account (e.g. for German ck).
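
A minimal sketch of such a merge step, in Python, might look like the following. It assumes the page text has already been extracted into a list of lines in reading order. The function name `dehyphenate` and the set of hyphen characters are choices made for this illustration, not part of MuPDF.

```python
# Characters that may serve as a line-end hyphen: ASCII hyphen-minus,
# U+00AD soft hyphen, U+2010 hyphen, U+2011 non-breaking hyphen.
HYPHENS = "-\u00ad\u2010\u2011"


def dehyphenate(lines):
    """Join lines into one string, merging words split by a line-end hyphen."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ends_with_split_word = (
            len(out) > 1
            and out[-1] in HYPHENS
            and out[-2].isalpha()
            and line[0].islower()
        )
        if ends_with_split_word:
            # Drop the hyphen and glue the continuation onto the word.
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out


text = dehyphenate([
    "This is a short, staggeringly meaning-",
    "less sentence.",
])
print("meaningless" in text)
```

This heuristic also merges genuine compounds that happen to break at their hyphen (for example "well-" followed by "known" becomes "wellknown"), and it does not handle language-specific rules such as the German ck case.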



So yes, it can be done, but not easily, and even then it would work only for some languages/scripts.

answered 4 hours ago by dirkt

Answer (score 4):

Searching for a word in a PDF is really a function of the viewer, and each viewer takes a different approach to what it can match. In practice, I found Okular to be the best of all the PDF viewers I tested. To the best of my knowledge, MuPDF can't handle hyphenated words.

answered 4 hours ago by Goro