Mupdf: finding hyphenated words in PDF file

When I search for a word in a PDF file using mupdf, it only finds the word when it appears whole on one line. For example, searching for the word “meaningless” will find it in this sentence:

This is a short, staggeringly meaningless sentence.

However, when a word is wrapped at the end of a line, it will not be found. I cannot know in advance whether a word is broken over two lines (and therefore hyphenated), and searching for the hyphenated form explicitly would be too cumbersome. Searching for “meaningless” won't find the word in this example:

This is a short, staggeringly meaning-
less sentence.

The PDF viewer "Evince" behaves in the same way. Is there a (simple) way to make "Mupdf" find hyphenated terms?

edited 3 hours ago by Goro

asked 4 hours ago by Philipp

pdf evince


2 Answers

accepted

Note that a PDF doesn't contain the original text, but a description of which glyphs to put where. Searching for text in a PDF depends on (1) the PDF having table(s) that describe which glyphs correspond to which Unicode characters, (2) a way to reassemble those translated characters into words, and (3) assumptions about how the generating application worked, e.g. that it put down glyphs in text order (which will fail horrendously when, for example, two-column text is rendered in both columns simultaneously).
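To make those three steps concrete, here is a minimal Python sketch. It is not how mupdf or any other viewer actually does it: the `Glyph` record, the `to_unicode` dictionary and the two thresholds are all made up for illustration, and the page is assumed to be a single column with y growing downwards.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    code: int     # glyph code inside the font (hypothetical)
    x: float      # left edge on the page
    y: float      # baseline position, growing downwards in this sketch
    width: float  # advance width

def glyphs_to_lines(glyphs, to_unicode, line_tol=2.0, gap_factor=0.3):
    """Rebuild text lines from positioned glyphs (single-column assumption)."""
    # Bucket glyphs into lines by (roughly) equal baseline.
    rows = {}
    for g in glyphs:
        rows.setdefault(round(g.y / line_tol), []).append(g)

    lines = []
    for key in sorted(rows):                       # top to bottom
        text, prev = "", None
        for g in sorted(rows[key], key=lambda g: g.x):   # left to right
            # A horizontal gap wider than a fraction of the previous
            # glyph is taken to be a word break.
            if prev is not None and g.x - (prev.x + prev.width) > gap_factor * prev.width:
                text += " "
            # (1) translate the glyph code; unknown codes become U+FFFD.
            text += to_unicode.get(g.code, "\ufffd")
            prev = g
        lines.append(text)
    return lines

# Glyphs deliberately given out of order, as a generator might emit them.
to_unicode = {1: "a", 2: "b", 3: "c", 4: "d"}
glyphs = [
    Glyph(4, 0, 20, 5),
    Glyph(3, 13, 10, 5),
    Glyph(1, 0, 10, 5),
    Glyph(2, 5, 10, 5),
]
print(glyphs_to_lines(glyphs, to_unicode))
```

With two columns, the bucketing by baseline would interleave text from both columns on the same line, which is exactly the failure described in (3).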

To take hyphenation into account, you'd have to implement an algorithm that detects dashes at the end of a line (different glyphs could be used for that) and then merges the word, taking special hyphenation rules into account (e.g. for German ck).
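As a rough illustration of such an algorithm, the Python sketch below works on lines of text that have already been extracted from the PDF by some other tool. The function names and the set of hyphen characters are my own choices, not part of mupdf, and the heuristic ignores language-specific rules such as the German ck case.

```python
import re

# Characters treated as a possible line-end hyphen: hyphen-minus,
# soft hyphen, Unicode hyphen, non-breaking hyphen (an assumption;
# a real PDF may use other glyphs).
HYPHENS = "-\u00ad\u2010\u2011"

def dehyphenate(lines):
    """Join lines into one string, merging words split by a trailing hyphen."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ends_with_hyphen = (
            len(out) > 1 and out[-1] in HYPHENS and out[-2].isalpha()
        )
        # Merge only if the next line continues in lower case; this is a
        # heuristic and will also (wrongly) merge compounds like "well-" / "known".
        if ends_with_hyphen and line[0].islower():
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out

def find_word(lines, word):
    """Return True if `word` occurs as a whole word after dehyphenation."""
    text = dehyphenate(lines)
    return re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE) is not None

page = ["This is a short, staggeringly meaning-", "less sentence."]
print(dehyphenate(page))
print(find_word(page, "meaningless"))
```

Because a hyphen at the end of a line can also belong to a genuine compound, a more careful version would check the merged result against a dictionary before dropping the hyphen.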

So yes, it can be done, but not easily, and then it would work only for some languages/scripts anyway.

answered 4 hours ago by dirkt


Searching for a word in a PDF is really a function of the viewer, and each viewer takes a different approach to what it can handle. In practice, I found Okular to be the best choice among all the PDF viewers I tested. To the best of my knowledge, Mupdf can't handle hyphenated words.

answered 4 hours ago by Goro
