Generic way to remove all duplicates from a not-sorted container

Here's an interesting article called "How to Remove Elements from a Sequence Container in C++". At some point, the author also explains how to remove duplicates from a container, but only under the very restrictive condition that those duplicates are adjacent. I've given some thought to how to write a generic algorithm that would work on any duplicate in the container, and it's a bit more complicated than one might think at first.



The naive, obvious solution is:



template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
{
    auto it = std::next(first);
    while (first != last) {
        it = std::find(it, last, *first);
        if (it == last) it = std::next(++first);
        else std::rotate(it, std::next(it), last--);
    }
    return last;
}


and it looks a lot like the implementation of an STL algorithm, but it is also a polynomial-time (quadratic) algorithm, which is only acceptable if there is no alternative. The usual alternative is to keep track of already encountered values in a separate container in order to attain linear or almost-linear time. Sets are good candidates, but they come in two flavors, and std::unordered_set, being more efficient than the plain std::set, should be used whenever possible.



I've used concepts (gcc > 6.3, compiled with -fconcepts) to dispatch to the most efficient overload:



template <typename T>
concept bool Hashable = requires
{
    std::hash<T>();
};

template <typename T>
concept bool Comparable = requires (T a, T b)
{
    { a == b } -> bool;
};

template <typename T>
concept bool Orderable = requires (T a, T b)
{
    { a < b } -> bool;
};
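
As an aside: this Hashable concept only checks that std::hash<T> can be default-constructed, and whether that rejects every non-hashable type depends on the standard library. A stricter variant, sketched below in the same concepts TS syntax (the name StrictlyHashable and the exact requirement are assumptions on my part), would also demand that the hash object is callable on a T:

    template <typename T>
    concept bool StrictlyHashable = requires (T t)
    {
        { std::hash<T>{}(t) } -> std::size_t;   // callable on T, yields something size_t-convertible
    };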


std::unordered_set requires its key to be hashable and equality-comparable, and std::set requires it to be orderable. After factoring out the common code, it gives:



template <typename Iterator, typename Set>
Iterator remove_duplicates_impl(Iterator first, Iterator last, Set& known_values)
{
    while (first != last) {
        if (known_values.find(*first) != known_values.end()) std::rotate(first, std::next(first), last--);
        else known_values.insert(*first++);
    }
    return last;
}

template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
    requires Orderable<Value_type<Iterator>> && !Hashable<Value_type<Iterator>>
{
    std::set<Value_type<Iterator>> known_values;
    return remove_duplicates_impl(first, last, known_values);
}

template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
    requires Hashable<Value_type<Iterator>> && Comparable<Value_type<Iterator>>
{
    std::unordered_set<Value_type<Iterator>> known_values;
    return remove_duplicates_impl(first, last, known_values);
}


So what's missing here, be it optimizations or corner-case handling?



A minimal working example:



#include <iostream>
#include <algorithm>
#include <vector>
#include <unordered_set>
#include <set>
#include <iterator>
#include <type_traits>

template <typename Iterator>
using Value_type = typename std::iterator_traits<Iterator>::value_type;

template <typename T>
concept bool Hashable = requires
{
    std::hash<T>();
};

template <typename T>
concept bool Comparable = requires (T a, T b)
{
    { a == b } -> bool;
};

template <typename T>
concept bool Orderable = requires (T a, T b)
{
    { a < b } -> bool;
};

template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
{
    auto it = std::next(first);
    while (first != last) {
        it = std::find(it, last, *first);
        if (it == last) it = std::next(++first);
        else std::rotate(it, std::next(it), last--);
    }
    return last;
}

template <typename Iterator, typename Set>
Iterator remove_duplicates_impl(Iterator first, Iterator last, Set& known_values)
{
    while (first != last) {
        if (known_values.find(*first) != known_values.end()) std::rotate(first, std::next(first), last--);
        else known_values.insert(*first++);
    }
    return last;
}

template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
    requires Orderable<Value_type<Iterator>> && !Hashable<Value_type<Iterator>>
{
    std::set<Value_type<Iterator>> known_values;
    return remove_duplicates_impl(first, last, known_values);
}

template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
    requires Hashable<Value_type<Iterator>>
{
    std::unordered_set<Value_type<Iterator>> known_values;
    return remove_duplicates_impl(first, last, known_values);
}

struct Foo {};
bool operator==(Foo, Foo) { return true; }

int main()
{
    std::vector<int> data { 1,2,3,4,5,6,7,8,9,8,7,6,5,2,4,3,1,8 };
    std::vector<int> unik(data.begin(), remove_duplicates(data.begin(), data.end()));
    for (auto i : unik) std::cout << i << ' ';
    std::cout << std::endl;

    std::vector<std::pair<int, int>> test { {1,2}, {3,4}, {5,6}, {7,8}, {7,8} };
    [[maybe_unused]]
    auto it = remove_duplicates(test.begin(), test.end());

    std::vector<Foo> test2 { Foo(), Foo(), Foo() };
    [[maybe_unused]]
    auto it2 = remove_duplicates(test2.begin(), test2.end());
}
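
For reference, the first test case keeps the first occurrence of each value in its original order, so it should print:

    1 2 3 4 5 6 7 8 9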











c++ c++20

asked 3 hours ago by papagaga

  • You don't say what kind of performance measurement you've done - is there a clear improvement over the naive algorithm that requires no extra storage? I suspect that if duplicates abound, then std::rotate is likely to dominate the work. Two obvious test cases suggest themselves - one with all elements different (e.g. using std::iota() and std::shuffle()) and one with all elements identical. Obviously, the timings will depend on the element type, and the cost of its move assignment, so that immediately doubles the test cases to try.
    – Toby Speight
    2 hours ago
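
    A minimal sketch of those two suggested inputs (the container size and the seed are arbitrary choices for illustration):

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // Case 1: all elements different, in random order.
    std::vector<int> make_all_different()
    {
        std::vector<int> v(100000);
        std::iota(v.begin(), v.end(), 0);                   // 0, 1, 2, ...
        std::shuffle(v.begin(), v.end(), std::mt19937{42}); // fixed seed for repeatability
        return v;
    }

    // Case 2: all elements identical, the worst case for the rotate-based loops.
    std::vector<int> make_all_identical()
    {
        return std::vector<int>(100000, 7);
    }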














2 Answers
Your naive solution is quadratic time, O(n²), with O(1) additional memory. Note that "polynomial time" doesn't say much about the actual time complexity, other than that it's not exponential or factorial.



Your improved version is O(n log n) time for the std::set implementation and O(n) expected time for the std::unordered_set one, and both come with an additional O(n) memory cost.



However, the hash version is not necessarily faster than the set version; it depends on the cost of the hash function and on the value of n. You should give the user the option of choosing a specific implementation if they want, as sketched below.
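
A minimal sketch of such an escape hatch, reusing the factored-out remove_duplicates_impl (the name remove_duplicates_using is made up):

    // Let the caller pick the tracking-set type explicitly.
    template <typename Set, typename Iterator>
    Iterator remove_duplicates_using(Iterator first, Iterator last)
    {
        Set known_values;
        return remove_duplicates_impl(first, last, known_values);
    }

    // Usage: force the tree-based set even for a hashable value type.
    // auto end = remove_duplicates_using<std::set<int>>(v.begin(), v.end());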



I would probably call the functions something like remove_duplicates_stable, to convey the property you are designing for: preserving the original order and keeping the first observed value. This, by the way, differs between the naive implementation and the others: the naive one keeps the last of the dupes.



If keeping the order of the elements is not important, I would probably sort the source array and use std::unique or something similar (see the sketch below). Sorting can make use of move operations to avoid the possibly expensive copy construction of all non-duplicate elements that your implementation requires, it needs no additional space, and it is O(n log n). But again, it depends on whether copy construction and the hash function are expensive.
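
A minimal sketch of that alternative, assuming order really doesn't matter and the value type is orderable (the function name is made up):

    #include <algorithm>
    #include <vector>

    // Order-destroying deduplication: O(n log n), no extra allocation.
    template <typename T>
    void remove_duplicates_unstable(std::vector<T>& v)
    {
        std::sort(v.begin(), v.end());                      // groups equal values together
        v.erase(std::unique(v.begin(), v.end()), v.end());  // erase-remove the now-adjacent dupes
    }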



I would also put the implementation in a separate namespace, like detail, but this is just personal style.






answered 3 hours ago by Emily L. (edited 3 hours ago)


    Regardless of which kind of set you use to keep track, you'll be spending a lot of time in std::rotate when given an input like this:



    std::array<int, 10000>()


    Yes, all the elements compare equal, and we dutifully rotate each one individually, resulting in (I think) ¼n² move operations. This is a problem with the "naive method", too, but not with std::unique(). If we're going to use extra storage, it might be more productive to use at least some of it to selectively copy input values to a new collection, and then swap() that new collection back into place.
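
A minimal sketch of that copy-then-swap idea, assuming a std::vector with a hashable value type (the function name is made up):

    #include <unordered_set>
    #include <utility>
    #include <vector>

    // Keep the first occurrence of each value in a fresh vector, then swap it
    // back: O(n) expected time, no quadratic rotating when duplicates abound.
    template <typename T>
    void remove_duplicates_by_copy(std::vector<T>& v)
    {
        std::unordered_set<T> seen;
        std::vector<T> kept;
        kept.reserve(v.size());
        for (auto& x : v)
            if (seen.insert(x).second)        // true only for a first occurrence
                kept.push_back(std::move(x)); // safe: the set holds its own copy
        v.swap(kept);                         // constant-time swap back into place
    }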




    I know it's not really the code for review, but I can't resist pointing out that you could have written the naive version more clearly with std::remove(), like this:



    template <typename Iterator>
    Iterator remove_duplicates(Iterator first, Iterator last)
    {
        while (first != last) {
            auto const& value = *first;
            last = std::remove(first = std::next(first), last, value);
        }
        return last;
    }
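
    (Each pass of that loop removes every later element equal to *first, so it runs once per surviving element: still quadratic when most elements are distinct, but it avoids the rotate-per-duplicate cost on duplicate-heavy inputs.)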






    answered 2 hours ago by Toby Speight (edited 1 hour ago)