Iyfjky

Question

Problem definition

Find the longest word that doesn't contain any of the following characters:

gkmqvwxz

Input comes from a file named words.txt and contains one word per line. A word is anything that is not a whitespace.

The twist

My C++ implementation posted below is surprisingly slow, taking 1.7 seconds to run, whereas my javascript implementation (link, will obviously be slower because it's online) takes only 55ms. Is it possible to somehow optimize my C++ implementation?

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdint>

#include "TimeUtil.h"

int main()

 uint64_t begin_time = currentTimeNanoseconds();

 std::string longestAcceptableWord; // Variable to contain the longest word

 std::string currentLine;
 const char badLetters = 'g', 'k', 'm', 'q', 'v', 'w', 'x', 'z' ;

 std::ifstream in;
 in.open("words.txt");
 while (!in.eof()) // Loop through the entire file (466 544 lines, 4 749 kB).
 
 getline(in, currentLine);

 if (currentLine.length() <= longestAcceptableWord.length())
 
 continue;
 

 // If the word contains one of the bad letters, don't accept it.
 for (unsigned int i = 0; i < sizeof(badLetters); i++)
 
 if (currentLine.find(badLetters[i]) != std::string::npos)
 
 continue;
 
 

 // If the word is longer than the current longest word found, 
 // and doesn't have any bad letters, make it the new longest 
 // word found.
 longestAcceptableWord = currentLine;
 
 in.close();

 std::cout << "[" << (double) ((currentTimeNanoseconds() - begin_time) / 1000) / 1000.0f << "ms] " << longestAcceptableWord << std::endl;

 return 0;

@Xydez, can you post the input file somewhere? Just out of curiosity. — 1 hour ago
@Incomputable Sure, here: mega.nz/#!V75TWKCT!R41D0y0pgnFXER5jU-Jb7-daAglZN1FNQuiQ-FQgHTw — 59 mins ago
are you sure it is 55ms and not 550ms? I'm getting 90-100ms on my computer with my C++ implementation. — 25 mins ago
Is that link supposed to be a blank page? Not very useful... — 22 mins ago
Are you sure you enabled all optimizations in your compiler? Because something like a debug mode, or even no optimization could jeopardize performance. — 18 mins ago

Edward 45k375204 · Answer 1 · 2018-10-10 16:26:04Z

I see a number of things that may help you improve your program.

Don't loop on `eof()`

It's almost always incorrect to loop on eof() while reading a file. The reason is that the eof indication is only set when an attempt is made to read something from the file when we're already at the end. So instead of this:

while (!in.eof()) 
 getline(in, currentLine); 
 // ...

write this:

while (getline(in, currentLine)) 
 // ...

See this question for more details on why using eof is usually wrong.

Initialize variables on declaration

The best practice is to initialize varibles as soon as they're created. In C++, this most often means initializing them in the same line as the declaration. So instead of this:

std::ifstream in;
in.open("words.txt");

write this:

std::ifstream file2wordsFile;

Note, too, that I use the C++11 uniform initializer syntax (with the ) to make it clear to both the compiler and the reader that this is an initialization and not a function call. See Stroustrup's description for more details on that.

Allow the user to specify input files

The words file name is currently hardcoded which certainly greatly restricts the usefulness of the program. Consider using argc and argv to allow the user to specify file names on the command line.

Prefer newer structures to plain arrays

Instead of a plain array as with badLetters, one could instead use an std::array or std::string. If your compiler has C++17 support, std::string_vew might be an even better choice.

Use standard algorithms

Instead of searching with a for loop for all badLetters, we can use std::find_first_of() instead. The result would look like this:

while (getline(in, currentLine)) 
 if (currentLine.length() > longestAcceptableWord.length()
 && (std::find_first_of(currentLine.begin(),
 currentLine.end(),
 badLetters.begin();
 badLetters.end()) == std::string::npos)
 ) 
 longestAcceptableWord = currentLine;

Note that I prefer to avoid continue and instead rely on the short-circuit evaluation of the if clauses to cause the same effect.

Don't use `std::endl` if you don't really need it

The difference betweeen std::endl and 'n' is that 'n' just emits a newline character, while std::endl actually flushes the stream. This can be time-consuming in a program with a lot of I/O and is rarely actually needed. It's best to only use std::endl when you have some good reason to flush the stream and it's not very often needed for simple programs such as this one. Avoiding the habit of using std::endl when 'n' will do will pay dividends in the future as you write more complex programs with more I/O and where performance needs to be maximized.

score 2 · Answer 2 · 2018-10-10 16:29:32Z

The code is reasonably good. There are some inefficiencies and small bug that doesn't affect the result.

Bug

!in.eof(), .eof() is never true until its read. It should be the getline instead.

Inefficiencies

reading line by line

It is in general better to just read into string variable, it is clearer and probably more efficient.

Manually closing the file

Standard file streams close themselves when they get out of scope, which end of main in this case.

Alternative implementation

Now, lets eliminate the main culprit, string validity check:

bool is_valid_word(std::string_view word) 
 for (char letter: word) 
 switch (letter) 
 case 'g':
 case 'k':
 case 'm':
 case 'q':
 case 'v':
 case 'w':
 case 'x':
 case 'z':
 return false;
 default:
 ; //silence warnings
 
 

 return true;

The main change is to use jump table (switch statement) with fallthrough. The other change is to use std::string_view, immutable non owning view into string.

The next thing to care about is iteration through the file. My personal preference is to use a pair of iterators:

std::ifstream file("words.txt");
auto first = std::istream_iterator<std::string>(file);
auto last = std::istream_iterator<std::string>();

And the algorithm itself:

template <typename InputIterator>
std::string longest_valid_word(InputIterator first, InputIterator last) 
 std::string longest_word;
 while (first != last) 
 auto word = *first;
 if (word.size() > longest_word.size() && is_valid_word(word)) 
 longest_word = std::move(word);
 
 ++first;
 

 return longest_word;

Note that the above algorithm can run on anything that provides a pair of iterators, e.g. std::vector, std::list, etc.

Full code

No demo for this post, as I don't want to put the weight onto wandbox:

#include <string_view>
#include <utility>
#include <string>

bool is_valid_word(std::string_view word) 
 for (char letter: word) 
 switch (letter) 
 case 'g':
 case 'k':
 case 'm':
 case 'q':
 case 'v':
 case 'w':
 case 'x':
 case 'z':
 return false;
 default:
 ; //silence warnings
 
 

 return true;


template <typename InputIterator>
std::string longest_valid_word(InputIterator first, InputIterator last) 
 std::string longest_word;
 while (first != last) 
 auto word = *first;
 if (word.size() > longest_word.size() && is_valid_word(word)) 
 longest_word = std::move(word);
 
 ++first;
 

 return longest_word;


#include <fstream>
#include <iterator>
#include <chrono>
#include <atomic>
#include <iostream>

int main() 
 std::ios_base::sync_with_stdio(false);
 std::ifstream file("words.txt");
 auto first = std::istream_iterator<std::string>(file);
 auto last = std::istream_iterator<std::string>();

 using namespace std::chrono;
 auto start_time = system_clock::now();
 //std::atomic_thread_fence(std::memory_order_seq_cst);
 auto found_word = longest_valid_word(first, last);
 //std::atomic_thread_fence(std::memory_order_seq_cst);
 auto end_time = system_clock::now();

 std::cout << "longest word is " << found_word << " and it took "
 << duration_cast<microseconds>(end_time - start_time).count() << " microsecondsn";

The atomic thread fences can be uncommented, but they don't really change much in this case. It is probably bottlenecked by performance of my SSD and of std::ifstream.

Conclusion

Well, I got around 90-100 milliseconds on my machine. I've got no idea why it runs so slowly, but I've exhausted my sane ideas. The other one would be to do some manual parsing of the file, which is much harder.

Thanks - but that just seems to be a list of line numbers (and a "save as text" link that doesn't - it just presents a button that when activated, says "Invalid OpenID transaction"). Never mind - my code compiles; that should be good enough for anyone... — 7 mins ago
Just a tiny question: Why don't you put return true; in default:? — 6 mins ago
@Xydez, that would return on the first character of string, not on the whole string. — 6 mins ago

score 1 · Answer 3 · 2018-10-10 16:31:38Z

I'm not sure what "TimeUtil.h" contains, but the standard facilities (in std::chrono) are to be preferred for portable code:

#include <chrono>

int main()

 auto begin_time = std::chrono::high_resolution_clock::now();

 // ...

 auto end_time = std::chrono::high_resolution_clock::now();
 auto duration
 = std::chrono::duration_cast<std::chrono::milliseconds>(end_time-begin_time);

 std::cout << "[" << duration.count() << "ms] "
 << longestAcceptableWord << std::endl;

It's better to separate the actual logic into its own function, separate from the timing and file opening housework.

badLetters can have static duration.

We don't need to separately construct and open the input file (and we'd be more flexible if we didn't hard-code that input - why not just read from standard input?). Also, no need to explicitly close it if we're not using the return value of close() - just let the constructor do its job!

while (!eof()) is an anti-pattern: we should while (getline()) instead. The latter actually attempts a read, whereas the former only determines whether the previous read hit end of file.

std::string has find_first_of which tests for all the characters in a set concurrently. The set does need to be passed as a null-terminated string (or as a std::string or std::string_view), but that's not hard to arrange.

Modified code

Applying the advice above, we get:

#include <iostream>
#include <fstream>
#include <string>

#include <chrono>

std::string findLongestLineNotContaining(std::istream& in,
 const char *badLetters)

 std::string longestAcceptableWord;
 std::string currentLine;
 while (getline(in, currentLine))
 
 if (currentLine.length() > longestAcceptableWord.length() &&
 currentLine.find_first_of(badLetters) == currentLine.npos)
 
 // Word is longer than the current longest word found,
 // and doesn't have any bad letters, so make it the new
 // longest word found.
 longestAcceptableWord = currentLine;
 
 
 return longestAcceptableWord;



int main()

 namespace chr = std::chrono;
 auto begin_time = chr::high_resolution_clock::now();

 static const char *badLetters = "gkmqvwxz";
 std::ifstream in("words.txt");
 std::string longestAcceptableWord
 = findLongestLineNotContaining(in, badLetters);

 auto end_time = chr::high_resolution_clock::now();
 auto duration = chr::duration_cast<chr::milliseconds>(end_time - begin_time);

 std::cout << "[" << duration.count() << "ms] "
 << longestAcceptableWord << std::endl;

Edward 45k375204 · Answer 4 · 2018-10-10 16:26:04Z

I see a number of things that may help you improve your program.

Don't loop on `eof()`

It's almost always incorrect to loop on eof() while reading a file. The reason is that the eof indication is only set when an attempt is made to read something from the file when we're already at the end. So instead of this:

while (!in.eof()) 
 getline(in, currentLine); 
 // ...

write this:

while (getline(in, currentLine)) 
 // ...

See this question for more details on why using eof is usually wrong.

Initialize variables on declaration

The best practice is to initialize varibles as soon as they're created. In C++, this most often means initializing them in the same line as the declaration. So instead of this:

std::ifstream in;
in.open("words.txt");

write this:

std::ifstream file2wordsFile;

Note, too, that I use the C++11 uniform initializer syntax (with the ) to make it clear to both the compiler and the reader that this is an initialization and not a function call. See Stroustrup's description for more details on that.

Allow the user to specify input files

The words file name is currently hardcoded which certainly greatly restricts the usefulness of the program. Consider using argc and argv to allow the user to specify file names on the command line.

Prefer newer structures to plain arrays

Instead of a plain array as with badLetters, one could instead use an std::array or std::string. If your compiler has C++17 support, std::string_vew might be an even better choice.

Use standard algorithms

Instead of searching with a for loop for all badLetters, we can use std::find_first_of() instead. The result would look like this:

while (getline(in, currentLine)) 
 if (currentLine.length() > longestAcceptableWord.length()
 && (std::find_first_of(currentLine.begin(),
 currentLine.end(),
 badLetters.begin();
 badLetters.end()) == std::string::npos)
 ) 
 longestAcceptableWord = currentLine;

Note that I prefer to avoid continue and instead rely on the short-circuit evaluation of the if clauses to cause the same effect.

Don't use `std::endl` if you don't really need it

The difference betweeen std::endl and 'n' is that 'n' just emits a newline character, while std::endl actually flushes the stream. This can be time-consuming in a program with a lot of I/O and is rarely actually needed. It's best to only use std::endl when you have some good reason to flush the stream and it's not very often needed for simple programs such as this one. Avoiding the habit of using std::endl when 'n' will do will pay dividends in the future as you write more complex programs with more I/O and where performance needs to be maximized.

score 2 · Answer 5 · 2018-10-10 16:29:32Z

The code is reasonably good. There are some inefficiencies and small bug that doesn't affect the result.

Bug

!in.eof(), .eof() is never true until its read. It should be the getline instead.

Inefficiencies

reading line by line

It is in general better to just read into string variable, it is clearer and probably more efficient.

Manually closing the file

Standard file streams close themselves when they get out of scope, which end of main in this case.

Alternative implementation

Now, lets eliminate the main culprit, string validity check:

bool is_valid_word(std::string_view word) 
 for (char letter: word) 
 switch (letter) 
 case 'g':
 case 'k':
 case 'm':
 case 'q':
 case 'v':
 case 'w':
 case 'x':
 case 'z':
 return false;
 default:
 ; //silence warnings
 
 

 return true;

The main change is to use jump table (switch statement) with fallthrough. The other change is to use std::string_view, immutable non owning view into string.

The next thing to care about is iteration through the file. My personal preference is to use a pair of iterators:

std::ifstream file("words.txt");
auto first = std::istream_iterator<std::string>(file);
auto last = std::istream_iterator<std::string>();

And the algorithm itself:

template <typename InputIterator>
std::string longest_valid_word(InputIterator first, InputIterator last) 
 std::string longest_word;
 while (first != last) 
 auto word = *first;
 if (word.size() > longest_word.size() && is_valid_word(word)) 
 longest_word = std::move(word);
 
 ++first;
 

 return longest_word;

Note that the above algorithm can run on anything that provides a pair of iterators, e.g. std::vector, std::list, etc.

Full code

No demo for this post, as I don't want to put the weight onto wandbox:

#include <string_view>
#include <utility>
#include <string>

bool is_valid_word(std::string_view word) 
 for (char letter: word) 
 switch (letter) 
 case 'g':
 case 'k':
 case 'm':
 case 'q':
 case 'v':
 case 'w':
 case 'x':
 case 'z':
 return false;
 default:
 ; //silence warnings
 
 

 return true;


template <typename InputIterator>
std::string longest_valid_word(InputIterator first, InputIterator last) 
 std::string longest_word;
 while (first != last) 
 auto word = *first;
 if (word.size() > longest_word.size() && is_valid_word(word)) 
 longest_word = std::move(word);
 
 ++first;
 

 return longest_word;


#include <fstream>
#include <iterator>
#include <chrono>
#include <atomic>
#include <iostream>

int main() 
 std::ios_base::sync_with_stdio(false);
 std::ifstream file("words.txt");
 auto first = std::istream_iterator<std::string>(file);
 auto last = std::istream_iterator<std::string>();

 using namespace std::chrono;
 auto start_time = system_clock::now();
 //std::atomic_thread_fence(std::memory_order_seq_cst);
 auto found_word = longest_valid_word(first, last);
 //std::atomic_thread_fence(std::memory_order_seq_cst);
 auto end_time = system_clock::now();

 std::cout << "longest word is " << found_word << " and it took "
 << duration_cast<microseconds>(end_time - start_time).count() << " microsecondsn";

The atomic thread fences can be uncommented, but they don't really change much in this case. It is probably bottlenecked by performance of my SSD and of std::ifstream.

Conclusion

Well, I got around 90-100 milliseconds on my machine. I've got no idea why it runs so slowly, but I've exhausted my sane ideas. The other one would be to do some manual parsing of the file, which is much harder.

Thanks - but that just seems to be a list of line numbers (and a "save as text" link that doesn't - it just presents a button that when activated, says "Invalid OpenID transaction"). Never mind - my code compiles; that should be good enough for anyone... — 7 mins ago
Just a tiny question: Why don't you put return true; in default:? — 6 mins ago
@Xydez, that would return on the first character of string, not on the whole string. — 6 mins ago

score 1 · Answer 6 · 2018-10-10 16:31:38Z

I'm not sure what "TimeUtil.h" contains, but the standard facilities (in std::chrono) are to be preferred for portable code:

#include <chrono>

int main()

 auto begin_time = std::chrono::high_resolution_clock::now();

 // ...

 auto end_time = std::chrono::high_resolution_clock::now();
 auto duration
 = std::chrono::duration_cast<std::chrono::milliseconds>(end_time-begin_time);

 std::cout << "[" << duration.count() << "ms] "
 << longestAcceptableWord << std::endl;

It's better to separate the actual logic into its own function, separate from the timing and file opening housework.

badLetters can have static duration.

We don't need to separately construct and open the input file (and we'd be more flexible if we didn't hard-code that input - why not just read from standard input?). Also, no need to explicitly close it if we're not using the return value of close() - just let the constructor do its job!

while (!eof()) is an anti-pattern: we should while (getline()) instead. The latter actually attempts a read, whereas the former only determines whether the previous read hit end of file.

std::string has find_first_of which tests for all the characters in a set concurrently. The set does need to be passed as a null-terminated string (or as a std::string or std::string_view), but that's not hard to arrange.

Modified code

Applying the advice above, we get:

#include <iostream>
#include <fstream>
#include <string>

#include <chrono>

std::string findLongestLineNotContaining(std::istream& in,
 const char *badLetters)

 std::string longestAcceptableWord;
 std::string currentLine;
 while (getline(in, currentLine))
 
 if (currentLine.length() > longestAcceptableWord.length() &&
 currentLine.find_first_of(badLetters) == currentLine.npos)
 
 // Word is longer than the current longest word found,
 // and doesn't have any bad letters, so make it the new
 // longest word found.
 longestAcceptableWord = currentLine;
 
 
 return longestAcceptableWord;



int main()

 namespace chr = std::chrono;
 auto begin_time = chr::high_resolution_clock::now();

 static const char *badLetters = "gkmqvwxz";
 std::ifstream in("words.txt");
 std::string longestAcceptableWord
 = findLongestLineNotContaining(in, badLetters);

 auto end_time = chr::high_resolution_clock::now();
 auto duration = chr::duration_cast<chr::milliseconds>(end_time - begin_time);

 std::cout << "[" << duration.count() << "ms] "
 << longestAcceptableWord << std::endl;

Category

Random preview

Program that finds the longest word not containing one of the disallowed character

Problem definition

The twist

Problem definition

The twist

Problem definition

The twist

Problem definition

The twist

3 Answers 3

Don't loop on eof()

Initialize variables on declaration

Allow the user to specify input files

Prefer newer structures to plain arrays

Use standard algorithms

Don't use std::endl if you don't really need it

Bug

Inefficiencies

Alternative implementation

Full code

Conclusion

Modified code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

3 Answers 3

3 Answers 3

Don't loop on eof()

Initialize variables on declaration

Allow the user to specify input files

Prefer newer structures to plain arrays

Use standard algorithms

Don't use std::endl if you don't really need it

Don't loop on eof()

Initialize variables on declaration

Allow the user to specify input files

Prefer newer structures to plain arrays

Use standard algorithms

Don't use std::endl if you don't really need it

Don't loop on eof()

Initialize variables on declaration

Allow the user to specify input files

Prefer newer structures to plain arrays

Use standard algorithms

Don't use std::endl if you don't really need it

Don't loop on eof()

Initialize variables on declaration

Allow the user to specify input files

Prefer newer structures to plain arrays

Use standard algorithms

Don't use std::endl if you don't really need it

Bug

Inefficiencies

Alternative implementation

Full code

Conclusion

Bug

Inefficiencies

Alternative implementation

Full code

Conclusion

Bug

Inefficiencies

Alternative implementation

Full code

Conclusion

Bug

Inefficiencies

Alternative implementation

Full code

Conclusion

Modified code

Modified code

Modified code

Modified code

Sign up or log in

Post as a guest

3 Answers
3

Don't loop on `eof()`

Don't use `std::endl` if you don't really need it

3 Answers
3

3 Answers
3

Don't loop on `eof()`

Don't use `std::endl` if you don't really need it

Don't loop on `eof()`

Don't use `std::endl` if you don't really need it

Don't loop on `eof()`

Don't use `std::endl` if you don't really need it

Don't loop on `eof()`

Don't use `std::endl` if you don't really need it