Generic way to remove all duplicates from a not-sorted container
Here's an interesting article called "How to Remove Elements from a Sequence Container in C++". At some point, the author also explains how to remove duplicates from a container, but only under the very restrictive condition that those duplicates are adjacent. I've given some thought to how to make a generic algorithm that would work on any duplicates in the container, and it's a bit more complicated than one might think at first.
The naively obvious solution is:
template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
{
    auto it = std::next(first);
    while (first != last) {
        it = std::find(it, last, *first);
        if (it == last) it = std::next(++first);
        else std::rotate(it, std::next(it), last--);
    }
    return last;
}
and it looks a lot like an implementation of an STL algorithm, but it is also a polynomial-time algorithm, which is only acceptable if there is no alternative. You generally keep track of already-encountered values in a separate container in order to attain linear or almost-linear time. Sets are good candidates, but they come in two flavors, and std::unordered_set, being more efficient than the simple std::set, should be used when possible.
I've used concepts (gcc > 6.3 && -fconcepts) to dispatch to the most efficient overload:
template <typename T>
concept bool Hashable = requires {
    std::hash<T>();
};

template <typename T>
concept bool Comparable = requires (T a, T b) {
    { a == b } -> bool;
};

template <typename T>
concept bool Orderable = requires (T a, T b) {
    { a < b } -> bool;
};
unordered_sets require the key to be hashable and comparable, and sets require that it be orderable. After factoring out the common code, this gives:
template <typename Iterator, typename Set>
Iterator remove_duplicates_impl(Iterator first, Iterator last, Set& known_values)
{
    while (first != last) {
        if (known_values.find(*first) != known_values.end()) std::rotate(first, std::next(first), last--);
        else known_values.insert(*first++);
    }
    return last;
}
template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
    requires Orderable<Value_type<Iterator>> && !Hashable<Value_type<Iterator>>
{
    std::set<Value_type<Iterator>> known_values;
    return remove_duplicates_impl(first, last, known_values);
}

template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
    requires Hashable<Value_type<Iterator>> && Comparable<Value_type<Iterator>>
{
    std::unordered_set<Value_type<Iterator>> known_values;
    return remove_duplicates_impl(first, last, known_values);
}
So what's missing here, be it optimizations or corner-case handling?
A minimal working example:
#include <iostream>
#include <algorithm>
#include <vector>
#include <unordered_set>
#include <set>
#include <iterator>
#include <type_traits>

template <typename Iterator>
using Value_type = typename std::iterator_traits<Iterator>::value_type;

template <typename T>
concept bool Hashable = requires {
    std::hash<T>();
};

template <typename T>
concept bool Comparable = requires (T a, T b) {
    { a == b } -> bool;
};

template <typename T>
concept bool Orderable = requires (T a, T b) {
    { a < b } -> bool;
};

template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
{
    auto it = std::next(first);
    while (first != last) {
        it = std::find(it, last, *first);
        if (it == last) it = std::next(++first);
        else std::rotate(it, std::next(it), last--);
    }
    return last;
}

template <typename Iterator, typename Set>
Iterator remove_duplicates_impl(Iterator first, Iterator last, Set& known_values)
{
    while (first != last) {
        if (known_values.find(*first) != known_values.end()) std::rotate(first, std::next(first), last--);
        else known_values.insert(*first++);
    }
    return last;
}

template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
    requires Orderable<Value_type<Iterator>> && !Hashable<Value_type<Iterator>>
{
    std::set<Value_type<Iterator>> known_values;
    return remove_duplicates_impl(first, last, known_values);
}

template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
    requires Hashable<Value_type<Iterator>>
{
    std::unordered_set<Value_type<Iterator>> known_values;
    return remove_duplicates_impl(first, last, known_values);
}

struct Foo {};
bool operator==(Foo, Foo) { return true; }

int main()
{
    std::vector<int> data {1,2,3,4,5,6,7,8,9,8,7,6,5,2,4,3,1,8};
    std::vector<int> unik(data.begin(), remove_duplicates(data.begin(), data.end()));
    for (auto i : unik) std::cout << i << ' ';
    std::cout << std::endl;
    std::vector<std::pair<int, int>> test {{1,2}, {3,4}, {5,6}, {7,8}, {7,8}};
    [[maybe_unused]] auto it = remove_duplicates(test.begin(), test.end());
    std::vector<Foo> test2 {Foo(), Foo(), Foo()};
    [[maybe_unused]] auto it2 = remove_duplicates(test2.begin(), test2.end());
}
c++ c++20
asked 3 hours ago by papagaga
You don't say what kind of performance measurement you've done - is there a clear improvement over the naive algorithm that requires no extra storage? I suspect that if duplicates abound, then std::rotate is likely to dominate the work. Two obvious test cases suggest themselves - one with all elements different (e.g. using std::iota() and std::shuffle()) and one with all elements identical. Obviously, the timings will depend on the element type, and the cost of its move assignment, so that immediately doubles the test cases to try. - Toby Speight, 2 hours ago
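(A minimal sketch of the two inputs that comment suggests - the size and the surrounding benchmark harness are made up for illustration; it needs <numeric>, <algorithm>, and <random>:)

std::vector<int> all_different(100000);
std::iota(all_different.begin(), all_different.end(), 0);   // 0, 1, 2, ...
std::shuffle(all_different.begin(), all_different.end(),
             std::mt19937{std::random_device{}()});         // random order, no duplicates

std::vector<int> all_identical(100000, 42);                 // every element past the first is a duplicate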
2 Answers
Your naive solution is quadratic time, O(n^2), with O(1) additional memory. Note that "polynomial time" doesn't say much about the actual time complexity, other than that it's not exponential or factorial.
Your improved version is O(n log n) time for the implementation with set and O(n) for the one with unordered_set, and both come with an additional cost of O(n) memory.
However, the hash version is not necessarily faster than the set version; it depends on the cost of the hash function and on the value of n. You should allow the user the option of choosing a specific implementation if they want.
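(A minimal sketch of such a customization point, reusing the question's remove_duplicates_impl; the name remove_duplicates_with is made up for illustration:)

template <typename Set, typename Iterator>
Iterator remove_duplicates_with(Iterator first, Iterator last)
{
    Set known_values;   // the caller picks the tracking container explicitly
    return remove_duplicates_impl(first, last, known_values);
}

// e.g. force the tree-based set even for a hashable value type:
// auto end = remove_duplicates_with<std::set<int>>(v.begin(), v.end());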
I would probably call the functions something like remove_duplicates_stable, to imply the property you are designing for, namely preserving the original order and keeping the first observed value. That property, by the way, differs between the naive implementation and the others: the naive one keeps the last of the dupes.
If keeping the order of the elements is not important, I would probably sort the source range and use std::unique or something similar. Sorting can make use of move construction, avoiding the possibly expensive copy construction of all non-duplicate elements that your implementation requires; it needs no additional space; and it is O(n log n). But again, it depends on whether copy construction and the hash function are expensive.
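(A sketch of that alternative for a vector; remove_duplicates_unstable is a name invented here, and it assumes the value type is movable and orderable:)

#include <algorithm>
#include <vector>

template <typename T>
void remove_duplicates_unstable(std::vector<T>& v)
{
    std::sort(v.begin(), v.end());                        // O(n log n), moves elements
    v.erase(std::unique(v.begin(), v.end()), v.end());    // drop the now-adjacent duplicates
}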
I would also put the implementation in a separate namespace, like detail, but this is just personal style.
- Emily L., answered 3 hours ago
Regardless of which kind of set you use to keep track, you'll be spending a lot of time in std::rotate when given an input like this:
std::array<int, 10000>()
Yes, all the elements compare equal, and we dutifully rotate each one individually, resulting in (I think) ½n² move operations. This is a problem with the "naive method", too, but not with std::unique(). If we're going to use extra storage, it might be more productive to use at least some of it to selectively copy input values to a new collection, and then swap() that new collection back into place.
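(A sketch of that copy-and-swap idea for a vector of hashable values; the name and the choice of unordered_set for tracking are illustrative, not the answer's prescription:)

template <typename T>
void remove_duplicates_copying(std::vector<T>& v)
{
    std::unordered_set<T> seen;
    std::vector<T> result;
    result.reserve(v.size());
    for (auto const& x : v)
        if (seen.insert(x).second)   // insert() reports whether x was new
            result.push_back(x);     // each survivor is copied exactly once
    v.swap(result);                  // constant-time buffer exchange
}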
I know it's not really the code for review, but I can't resist pointing out that you could have written the naive version more clearly with std::remove(), like this:
template <typename Iterator>
Iterator remove_duplicates(Iterator first, Iterator last)
{
    while (first != last) {
        auto const& value = *first;
        last = std::remove(first = std::next(first), last, value);
    }
    return last;
}
- Toby Speight, answered 2 hours ago, edited 1 hour ago