Mupdf: finding hyphenated words in PDF file
When I search for a word in a PDF file using mupdf, it only finds whole words. For example, searching for the word "meaningless" will find it in this sentence:
This is a short, staggeringly meaningless sentence.
However, when a word is wrapped at the end of a line, it will not be found. There is no way I can know in advance whether a word is broken over two lines, and therefore hyphenated, or not, and searching for the hyphenated form explicitly would be too cumbersome. Searching for "meaningless" won't find the word in this example:
This is a short, staggeringly meaning-
less sentence.
The PDF viewer "Evince" behaves in the same way. Is there a (simple) way to make "Mupdf" find hyphenated terms?
pdf evince
asked 4 hours ago by Philipp
edited 3 hours ago by Goro
2 Answers
Accepted answer (2 votes)
Note that a PDF doesn't contain the original text, but a description of which glyphs to put where. Searching text in a PDF depends on (1) the PDF having table(s) that describe which glyphs correspond to which Unicode characters, (2) a way to reassemble those translated characters into words, and (3) assumptions about how the generating application worked, e.g. that it put down glyphs in text order (which will fail horrendously when, for example, two-column text is rendered in both columns simultaneously).
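To make point (2) concrete, here is a minimal Python sketch of reassembling positioned characters into lines and words. It is only an illustration under stated assumptions, not how mupdf or any other viewer actually does it: the `Glyph` record, the `glyphs_to_lines` helper, and the tolerance values are all hypothetical, and the page's y coordinate is assumed to grow downward.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one already-translated glyph."""
    char: str     # Unicode character the glyph maps to
    x: float      # left edge on the page
    y: float      # baseline position (assumed to grow downward)
    width: float  # advance width of the glyph


def glyphs_to_lines(glyphs, line_tol=2.0, gap_factor=0.3):
    """Group glyphs into lines by baseline, then insert spaces at wide gaps."""
    rows = []  # list of (baseline_y, [glyphs on that line])
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        # A glyph joins the current line if its baseline is close enough.
        if rows and abs(rows[-1][0] - g.y) <= line_tol:
            rows[-1][1].append(g)
        else:
            rows.append((g.y, [g]))

    lines = []
    for _, row in rows:
        row.sort(key=lambda g: g.x)  # left-to-right within the line
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            gap = cur.x - (prev.x + prev.width)
            # Assumption: a gap wider than a fraction of the previous
            # glyph's width marks a word boundary.
            if gap > gap_factor * prev.width:
                text += " "
            text += cur.char
        lines.append(text)
    return lines
```

Note that this purely geometric grouping has its own weakness: on a two-column page it would merge both columns into each line, so it trades one set of assumptions for another.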
To take hyphenation into account, you'd have to implement an algorithm that detects dashes at the end of a line (different glyphs could be used for that) and then merges the word, taking special hyphenation rules into account (e.g. German "ck", which traditional orthography splits as "k-k").
So yes, it can be done, but not easily, and then it would work only for some languages/scripts anyway.
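The line-end merging step described above could be sketched roughly as follows, in Python. This assumes the page text has already been extracted as a list of lines. The `HYPHENS` set and the `join_hyphenated` helper are illustrative names, not part of mupdf, and the set of hyphen characters is a guess at what a generating application might emit.

```python
# Assumption: characters a generator might place as a line-end hyphen:
# hyphen-minus, soft hyphen, Unicode hyphen, non-breaking hyphen.
HYPHENS = "-\u00ad\u2010\u2011"


def join_hyphenated(lines):
    """Join lines into one string, merging words split by a line-end hyphen."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Only treat the trailing character as a hyphenation mark if it
        # directly follows a letter; a free-standing dash is left alone.
        if len(line) > 1 and line[-1] in HYPHENS and line[-2].isalpha():
            out += line[:-1]   # drop the hyphen, continue the word
        else:
            out += line + " "  # ordinary line break becomes a space
    return out.strip()


text = join_hyphenated([
    "This is a short, staggeringly meaning-",
    "less sentence.",
])
print("meaningless" in text)
```

The sketch deliberately ignores the hard cases: a compound that really contains a hyphen at the line break (such as "well-" followed by "known") loses it, and language-specific rules such as the German "ck" case are not handled at all.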
answered 4 hours ago by dirkt
Answer (4 votes)
Searching for a word in a PDF is really a function of the viewer, so each viewer takes a different approach to what it can match. In practice, I found Okular to be the best choice among all the PDF viewers I tested. To the best of my knowledge, Mupdf can't handle hyphenated words.
answered 4 hours ago by Goro