How to cluster parts of broken line made of points?

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
3
down vote

favorite












I am studying clustering techniques and i am pretty new at this topic. Here is my problem: I created a 5 lines which are made of points. This lines are supposed to be continuous and they look like this:



# Create data
n0, x0, y0 = 17, np.linspace(0, 11, n0), np.linspace(10, 3, n0)
n1, x1, y1 = 35, np.linspace(11, 27, n1), np.linspace(3, 3, n1)
n2, x2, y2 = 15, np.linspace(27, 35, n2), np.linspace(3, 15, n2)
n3, x3, y3 = 4, np.linspace(-1, 0, n3), np.linspace(10, 10, n3)
n4, x4, y4 = 10, np.linspace(35, 46, n4), np.linspace(15, 15.5,n4)

# Plot data
plt.figure()
plt.plot(x0, y0, 'o', color='grey', markersize=1)
plt.plot(x1, y1, 'o', color='grey', markersize=1)
plt.plot(x2, y2, 'o', color='grey', markersize=1)
plt.plot(x3, y3, 'o', color='grey', markersize=1)
plt.plot(x4, y4, 'o', color='grey', markersize=1)
plt.show()


enter image description here



My goal is to use a clustering technique in order to be able to cluster each line that i have created in order to recognize the 5 different lines presented in the plot.



To do so i have opted for the GaussianMixture clustering algorithm which i thought it could be suitable for this sort of data distribution (a line could be seen as a very skewed distribution maybe). Here is what i wrote:



# Prepare data for clustering
X = np.concatenate((x0,x1,x2,x3,x4))
Y = np.concatenate((y0,y1,y2,y3,y4))
data = np.column_stack((X,Y))

# Cluster
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=5, covariance_type='full')
X_ = gmm.fit(data)
y_ = X_.predict(data)
print(set(y_))


And the output that i got is something like this (it changes every time i run the code though...):



# Plot clusters
plt.figure()
plt.plot(data[y_==0][:,0], data[y_==0][:,1], 'o', color='red')
plt.plot(data[y_==1][:,0], data[y_==1][:,1], 'o', color='blue')
plt.plot(data[y_==2][:,0], data[y_==2][:,1], 'o', color='lime')
plt.plot(data[y_==3][:,0], data[y_==3][:,1], 'o', color='orange')
plt.plot(data[y_==4][:,0], data[y_==4][:,1], 'o', color='cyan')
plt.show()


enter image description here



As you can see the colors should represent the 5 clusters (aka lines) that i have originally created but apparently the output is not what i want.



Could you please provide a better way to approach this problem? In case what i am doing is partially correct, what am i mistaken?










share|cite|improve this question









New contributor




Federico Gentile is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.















  • 1




    This appears to be a minor generalization of the problem discussed at stats.stackexchange.com/questions/33078.
    – whuber♦
    4 hours ago










  • @whuber yeah maybe i should highlight the fact that i would really need to use a clustering algorithm such as the ones offered by scikit-learn or tensorflow
    – Federico Gentile
    3 hours ago










  • One problem with your data is that in different lines constituting your "zigzag" density of points is different. Must it be so in your data?
    – ttnphns
    3 hours ago










  • ...in other words, your data consists of clusters-to-be of very different densities.
    – ttnphns
    3 hours ago










  • Please clarify what you mean by "supposed to be continuous:" does that imply you are given a sequence of points rather than an unordered set of them? How accurately are these points known--is there perhaps some random error in them?
    – whuber♦
    2 hours ago
















up vote
3
down vote

favorite












I am studying clustering techniques and i am pretty new at this topic. Here is my problem: I created a 5 lines which are made of points. This lines are supposed to be continuous and they look like this:



# Create data
n0, x0, y0 = 17, np.linspace(0, 11, n0), np.linspace(10, 3, n0)
n1, x1, y1 = 35, np.linspace(11, 27, n1), np.linspace(3, 3, n1)
n2, x2, y2 = 15, np.linspace(27, 35, n2), np.linspace(3, 15, n2)
n3, x3, y3 = 4, np.linspace(-1, 0, n3), np.linspace(10, 10, n3)
n4, x4, y4 = 10, np.linspace(35, 46, n4), np.linspace(15, 15.5,n4)

# Plot data
plt.figure()
plt.plot(x0, y0, 'o', color='grey', markersize=1)
plt.plot(x1, y1, 'o', color='grey', markersize=1)
plt.plot(x2, y2, 'o', color='grey', markersize=1)
plt.plot(x3, y3, 'o', color='grey', markersize=1)
plt.plot(x4, y4, 'o', color='grey', markersize=1)
plt.show()


enter image description here



My goal is to use a clustering technique in order to be able to cluster each line that i have created in order to recognize the 5 different lines presented in the plot.



To do so i have opted for the GaussianMixture clustering algorithm which i thought it could be suitable for this sort of data distribution (a line could be seen as a very skewed distribution maybe). Here is what i wrote:



# Prepare data for clustering
X = np.concatenate((x0,x1,x2,x3,x4))
Y = np.concatenate((y0,y1,y2,y3,y4))
data = np.column_stack((X,Y))

# Cluster
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=5, covariance_type='full')
X_ = gmm.fit(data)
y_ = X_.predict(data)
print(set(y_))


And the output that i got is something like this (it changes every time i run the code though...):



# Plot clusters
plt.figure()
plt.plot(data[y_==0][:,0], data[y_==0][:,1], 'o', color='red')
plt.plot(data[y_==1][:,0], data[y_==1][:,1], 'o', color='blue')
plt.plot(data[y_==2][:,0], data[y_==2][:,1], 'o', color='lime')
plt.plot(data[y_==3][:,0], data[y_==3][:,1], 'o', color='orange')
plt.plot(data[y_==4][:,0], data[y_==4][:,1], 'o', color='cyan')
plt.show()


enter image description here



As you can see the colors should represent the 5 clusters (aka lines) that i have originally created but apparently the output is not what i want.



Could you please provide a better way to approach this problem? In case what i am doing is partially correct, what am i mistaken?










share|cite|improve this question









New contributor




Federico Gentile is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.















  • 1




    This appears to be a minor generalization of the problem discussed at stats.stackexchange.com/questions/33078.
    – whuber♦
    4 hours ago










  • @whuber yeah maybe i should highlight the fact that i would really need to use a clustering algorithm such as the ones offered by scikit-learn or tensorflow
    – Federico Gentile
    3 hours ago










  • One problem with your data is that in different lines constituting your "zigzag" density of points is different. Must it be so in your data?
    – ttnphns
    3 hours ago










  • ...in other words, your data consists of clusters-to-be of very different densities.
    – ttnphns
    3 hours ago










  • Please clarify what you mean by "supposed to be continuous:" does that imply you are given a sequence of points rather than an unordered set of them? How accurately are these points known--is there perhaps some random error in them?
    – whuber♦
    2 hours ago












up vote
3
down vote

favorite









up vote
3
down vote

favorite











I am studying clustering techniques and i am pretty new at this topic. Here is my problem: I created a 5 lines which are made of points. This lines are supposed to be continuous and they look like this:



# Create data
n0, x0, y0 = 17, np.linspace(0, 11, n0), np.linspace(10, 3, n0)
n1, x1, y1 = 35, np.linspace(11, 27, n1), np.linspace(3, 3, n1)
n2, x2, y2 = 15, np.linspace(27, 35, n2), np.linspace(3, 15, n2)
n3, x3, y3 = 4, np.linspace(-1, 0, n3), np.linspace(10, 10, n3)
n4, x4, y4 = 10, np.linspace(35, 46, n4), np.linspace(15, 15.5,n4)

# Plot data
plt.figure()
plt.plot(x0, y0, 'o', color='grey', markersize=1)
plt.plot(x1, y1, 'o', color='grey', markersize=1)
plt.plot(x2, y2, 'o', color='grey', markersize=1)
plt.plot(x3, y3, 'o', color='grey', markersize=1)
plt.plot(x4, y4, 'o', color='grey', markersize=1)
plt.show()


enter image description here



My goal is to use a clustering technique in order to be able to cluster each line that i have created in order to recognize the 5 different lines presented in the plot.



To do so i have opted for the GaussianMixture clustering algorithm which i thought it could be suitable for this sort of data distribution (a line could be seen as a very skewed distribution maybe). Here is what i wrote:



# Prepare data for clustering
X = np.concatenate((x0,x1,x2,x3,x4))
Y = np.concatenate((y0,y1,y2,y3,y4))
data = np.column_stack((X,Y))

# Cluster
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=5, covariance_type='full')
X_ = gmm.fit(data)
y_ = X_.predict(data)
print(set(y_))


And the output that i got is something like this (it changes every time i run the code though...):



# Plot clusters
plt.figure()
plt.plot(data[y_==0][:,0], data[y_==0][:,1], 'o', color='red')
plt.plot(data[y_==1][:,0], data[y_==1][:,1], 'o', color='blue')
plt.plot(data[y_==2][:,0], data[y_==2][:,1], 'o', color='lime')
plt.plot(data[y_==3][:,0], data[y_==3][:,1], 'o', color='orange')
plt.plot(data[y_==4][:,0], data[y_==4][:,1], 'o', color='cyan')
plt.show()


enter image description here



As you can see the colors should represent the 5 clusters (aka lines) that i have originally created but apparently the output is not what i want.



Could you please provide a better way to approach this problem? In case what i am doing is partially correct, what am i mistaken?










share|cite|improve this question









New contributor




Federico Gentile is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











I am studying clustering techniques and i am pretty new at this topic. Here is my problem: I created a 5 lines which are made of points. This lines are supposed to be continuous and they look like this:



# Create data
n0, x0, y0 = 17, np.linspace(0, 11, n0), np.linspace(10, 3, n0)
n1, x1, y1 = 35, np.linspace(11, 27, n1), np.linspace(3, 3, n1)
n2, x2, y2 = 15, np.linspace(27, 35, n2), np.linspace(3, 15, n2)
n3, x3, y3 = 4, np.linspace(-1, 0, n3), np.linspace(10, 10, n3)
n4, x4, y4 = 10, np.linspace(35, 46, n4), np.linspace(15, 15.5,n4)

# Plot data
plt.figure()
plt.plot(x0, y0, 'o', color='grey', markersize=1)
plt.plot(x1, y1, 'o', color='grey', markersize=1)
plt.plot(x2, y2, 'o', color='grey', markersize=1)
plt.plot(x3, y3, 'o', color='grey', markersize=1)
plt.plot(x4, y4, 'o', color='grey', markersize=1)
plt.show()


enter image description here



My goal is to use a clustering technique in order to be able to cluster each line that i have created in order to recognize the 5 different lines presented in the plot.



To do so i have opted for the GaussianMixture clustering algorithm which i thought it could be suitable for this sort of data distribution (a line could be seen as a very skewed distribution maybe). Here is what i wrote:



# Prepare data for clustering
X = np.concatenate((x0,x1,x2,x3,x4))
Y = np.concatenate((y0,y1,y2,y3,y4))
data = np.column_stack((X,Y))

# Cluster
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=5, covariance_type='full')
X_ = gmm.fit(data)
y_ = X_.predict(data)
print(set(y_))


And the output that i got is something like this (it changes every time i run the code though...):



# Plot clusters
plt.figure()
plt.plot(data[y_==0][:,0], data[y_==0][:,1], 'o', color='red')
plt.plot(data[y_==1][:,0], data[y_==1][:,1], 'o', color='blue')
plt.plot(data[y_==2][:,0], data[y_==2][:,1], 'o', color='lime')
plt.plot(data[y_==3][:,0], data[y_==3][:,1], 'o', color='orange')
plt.plot(data[y_==4][:,0], data[y_==4][:,1], 'o', color='cyan')
plt.show()


enter image description here



As you can see the colors should represent the 5 clusters (aka lines) that i have originally created but apparently the output is not what i want.



Could you please provide a better way to approach this problem? In case what i am doing is partially correct, what am i mistaken?







time-series clustering scikit-learn gaussian-mixture tensorflow






share|cite|improve this question









New contributor




Federico Gentile is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|cite|improve this question









New contributor




Federico Gentile is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|cite|improve this question




share|cite|improve this question








edited 3 hours ago









ttnphns

37.4k12132311




37.4k12132311






New contributor




Federico Gentile is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked 5 hours ago









Federico Gentile

1184




1184




New contributor




Federico Gentile is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





Federico Gentile is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






Federico Gentile is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







  • 1




    This appears to be a minor generalization of the problem discussed at stats.stackexchange.com/questions/33078.
    – whuber♦
    4 hours ago










  • @whuber yeah maybe i should highlight the fact that i would really need to use a clustering algorithm such as the ones offered by scikit-learn or tensorflow
    – Federico Gentile
    3 hours ago










  • One problem with your data is that in different lines constituting your "zigzag" density of points is different. Must it be so in your data?
    – ttnphns
    3 hours ago










  • ...in other words, your data consists of clusters-to-be of very different densities.
    – ttnphns
    3 hours ago










  • Please clarify what you mean by "supposed to be continuous:" does that imply you are given a sequence of points rather than an unordered set of them? How accurately are these points known--is there perhaps some random error in them?
    – whuber♦
    2 hours ago












  • 1




    This appears to be a minor generalization of the problem discussed at stats.stackexchange.com/questions/33078.
    – whuber♦
    4 hours ago










  • @whuber yeah maybe i should highlight the fact that i would really need to use a clustering algorithm such as the ones offered by scikit-learn or tensorflow
    – Federico Gentile
    3 hours ago










  • One problem with your data is that in different lines constituting your "zigzag" density of points is different. Must it be so in your data?
    – ttnphns
    3 hours ago










  • ...in other words, your data consists of clusters-to-be of very different densities.
    – ttnphns
    3 hours ago










  • Please clarify what you mean by "supposed to be continuous:" does that imply you are given a sequence of points rather than an unordered set of them? How accurately are these points known--is there perhaps some random error in them?
    – whuber♦
    2 hours ago







1




1




This appears to be a minor generalization of the problem discussed at stats.stackexchange.com/questions/33078.
– whuber♦
4 hours ago




This appears to be a minor generalization of the problem discussed at stats.stackexchange.com/questions/33078.
– whuber♦
4 hours ago












@whuber yeah maybe i should highlight the fact that i would really need to use a clustering algorithm such as the ones offered by scikit-learn or tensorflow
– Federico Gentile
3 hours ago




@whuber yeah maybe i should highlight the fact that i would really need to use a clustering algorithm such as the ones offered by scikit-learn or tensorflow
– Federico Gentile
3 hours ago












One problem with your data is that in different lines constituting your "zigzag" density of points is different. Must it be so in your data?
– ttnphns
3 hours ago




One problem with your data is that in different lines constituting your "zigzag" density of points is different. Must it be so in your data?
– ttnphns
3 hours ago












...in other words, your data consists of clusters-to-be of very different densities.
– ttnphns
3 hours ago




...in other words, your data consists of clusters-to-be of very different densities.
– ttnphns
3 hours ago












Please clarify what you mean by "supposed to be continuous:" does that imply you are given a sequence of points rather than an unordered set of them? How accurately are these points known--is there perhaps some random error in them?
– whuber♦
2 hours ago




Please clarify what you mean by "supposed to be continuous:" does that imply you are given a sequence of points rather than an unordered set of them? How accurately are these points known--is there perhaps some random error in them?
– whuber♦
2 hours ago










1 Answer
1






active

oldest

votes

















up vote
3
down vote



accepted










The data appear to be a sequence of $(x,y)$ coordinates and the question concerns how to cluster them into linear segments.



Solution



A simple method uses a brute-force (but efficient) version of the Hough Transform.



The basic idea is that every line in the plane can be parameterized by its distance to a fixed origin and the direction perpendicular to it. Each ordered pair of distinct points, $(x_i,y_i)$ and $(x_i+1,y_i+1),$ determines a line and therefore may be encoded as a (distance, angle) ordered pair. Sequences of points lying on the same line can then be easily recognized as stationary sequences of (distance, angle) pairs. Simply cluster these.



For numerical stability, it is wise to use an origin adapted to the data--perhaps their barycenter.



There are many ways to proceed from this point. For instance, a kernel density estimate of the (distance, angle) pairs is tantamount to the Hough transform itself. After removing outliers, any reasonable clustering method for 2D Euclidean points should readily identify all the segments.



Remark



If the data do not have a definite sequence, compute the Hough transform for all possible pairs $(x_i,y_i)$ and $(x_j,y_j).$ This is more computationally demanding and will produce a much messier transform, but nevertheless any lines formed by more than a few of the points will correspond to the highest-density regions of the transform.




Example



The question presents a sequence of 81 points (not all distinct). Here are portions of the derived information described above:



 x y id Distance Angle
1 0.0000000 10.000000 1 7.999634 -2.1375255
2 0.6875000 9.562500 1 7.999634 -2.1375255
3 1.3750000 9.125000 1 7.999634 -2.1375255
...
16 10.3125000 3.437500 1 7.999634 -2.1375255
17 11.0000000 3.000000 1 NaN 0.0000000
18 11.0000000 3.000000 2 3.703704 -1.5707963
19 11.4705882 3.000000 2 3.703704 -1.5707963
...
51 26.5294118 3.000000 2 3.703704 -1.5707963
52 27.0000000 3.000000 2 NaN 0.0000000
53 27.0000000 3.000000 3 7.812028 -0.5880026
54 27.5714286 3.857143 3 7.812028 -0.5880026
...
66 34.4285714 14.142857 3 7.812028 -0.5880026
67 35.0000000 15.000000 3 6.164931 1.7088024
68 -1.0000000 10.000000 4 -3.296296 -1.5707963
69 -0.6666667 10.000000 4 -3.296296 -1.5707963
70 -0.3333333 10.000000 4 -3.296296 -1.5707963
71 0.0000000 10.000000 4 -6.102943 -1.4288993
72 35.0000000 15.000000 5 -7.610268 -1.5253730
...
80 44.7777778 15.444444 5 -7.610268 -1.5253730


The regions of stationary (distance, angle) values are clear and the breaks between them are obvious.



Here is a plot of the (distance, angle) pairs, colored according to the original segments created in the question:



Figure 1



I applied a hierarchical clustering solution (using the hclust function in R with the default complete linkage) and used that to classify the original data (except for the last point). Here is the classification, shown with colors:



Figure 2






share|cite|improve this answer




















    Your Answer





    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "65"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );






    Federico Gentile is a new contributor. Be nice, and check out our Code of Conduct.









     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f375787%2fhow-to-cluster-parts-of-broken-line-made-of-points%23new-answer', 'question_page');

    );

    Post as a guest






























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    3
    down vote



    accepted










    The data appear to be a sequence of $(x,y)$ coordinates and the question concerns how to cluster them into linear segments.



    Solution



    A simple method uses a brute-force (but efficient) version of the Hough Transform.



    The basic idea is that every line in the plane can be parameterized by its distance to a fixed origin and the direction perpendicular to it. Each ordered pair of distinct points, $(x_i,y_i)$ and $(x_i+1,y_i+1),$ determines a line and therefore may be encoded as a (distance, angle) ordered pair. Sequences of points lying on the same line can then be easily recognized as stationary sequences of (distance, angle) pairs. Simply cluster these.



    For numerical stability, it is wise to use an origin adapted to the data--perhaps their barycenter.



    There are many ways to proceed from this point. For instance, a kernel density estimate of the (distance, angle) pairs is tantamount to the Hough transform itself. After removing outliers, any reasonable clustering method for 2D Euclidean points should readily identify all the segments.



    Remark



    If the data do not have a definite sequence, compute the Hough transform for all possible pairs $(x_i,y_i)$ and $(x_j,y_j).$ This is more computationally demanding and will produce a much messier transform, but nevertheless any lines formed by more than a few of the points will correspond to the highest-density regions of the transform.




    Example



    The question presents a sequence of 81 points (not all distinct). Here are portions of the derived information described above:



     x y id Distance Angle
    1 0.0000000 10.000000 1 7.999634 -2.1375255
    2 0.6875000 9.562500 1 7.999634 -2.1375255
    3 1.3750000 9.125000 1 7.999634 -2.1375255
    ...
    16 10.3125000 3.437500 1 7.999634 -2.1375255
    17 11.0000000 3.000000 1 NaN 0.0000000
    18 11.0000000 3.000000 2 3.703704 -1.5707963
    19 11.4705882 3.000000 2 3.703704 -1.5707963
    ...
    51 26.5294118 3.000000 2 3.703704 -1.5707963
    52 27.0000000 3.000000 2 NaN 0.0000000
    53 27.0000000 3.000000 3 7.812028 -0.5880026
    54 27.5714286 3.857143 3 7.812028 -0.5880026
    ...
    66 34.4285714 14.142857 3 7.812028 -0.5880026
    67 35.0000000 15.000000 3 6.164931 1.7088024
    68 -1.0000000 10.000000 4 -3.296296 -1.5707963
    69 -0.6666667 10.000000 4 -3.296296 -1.5707963
    70 -0.3333333 10.000000 4 -3.296296 -1.5707963
    71 0.0000000 10.000000 4 -6.102943 -1.4288993
    72 35.0000000 15.000000 5 -7.610268 -1.5253730
    ...
    80 44.7777778 15.444444 5 -7.610268 -1.5253730


    The regions of stationary (distance, angle) values are clear and the breaks between them are obvious.



    Here is a plot of the (distance, angle) pairs, colored according to the original segments created in the question:



    Figure 1



    I applied a hierarchical clustering solution (using the hclust function in R with the default complete linkage) and used that to classify the original data (except for the last point). Here is the classification, shown with colors:



    Figure 2






    share|cite|improve this answer
























      up vote
      3
      down vote



      accepted










      The data appear to be a sequence of $(x,y)$ coordinates and the question concerns how to cluster them into linear segments.



      Solution



      A simple method uses a brute-force (but efficient) version of the Hough Transform.



      The basic idea is that every line in the plane can be parameterized by its distance to a fixed origin and the direction perpendicular to it. Each ordered pair of distinct points, $(x_i,y_i)$ and $(x_i+1,y_i+1),$ determines a line and therefore may be encoded as a (distance, angle) ordered pair. Sequences of points lying on the same line can then be easily recognized as stationary sequences of (distance, angle) pairs. Simply cluster these.



      For numerical stability, it is wise to use an origin adapted to the data--perhaps their barycenter.



      There are many ways to proceed from this point. For instance, a kernel density estimate of the (distance, angle) pairs is tantamount to the Hough transform itself. After removing outliers, any reasonable clustering method for 2D Euclidean points should readily identify all the segments.



      Remark



      If the data do not have a definite sequence, compute the Hough transform for all possible pairs $(x_i,y_i)$ and $(x_j,y_j).$ This is more computationally demanding and will produce a much messier transform, but nevertheless any lines formed by more than a few of the points will correspond to the highest-density regions of the transform.




      Example



      The question presents a sequence of 81 points (not all distinct). Here are portions of the derived information described above:



       x y id Distance Angle
      1 0.0000000 10.000000 1 7.999634 -2.1375255
      2 0.6875000 9.562500 1 7.999634 -2.1375255
      3 1.3750000 9.125000 1 7.999634 -2.1375255
      ...
      16 10.3125000 3.437500 1 7.999634 -2.1375255
      17 11.0000000 3.000000 1 NaN 0.0000000
      18 11.0000000 3.000000 2 3.703704 -1.5707963
      19 11.4705882 3.000000 2 3.703704 -1.5707963
      ...
      51 26.5294118 3.000000 2 3.703704 -1.5707963
      52 27.0000000 3.000000 2 NaN 0.0000000
      53 27.0000000 3.000000 3 7.812028 -0.5880026
      54 27.5714286 3.857143 3 7.812028 -0.5880026
      ...
      66 34.4285714 14.142857 3 7.812028 -0.5880026
      67 35.0000000 15.000000 3 6.164931 1.7088024
      68 -1.0000000 10.000000 4 -3.296296 -1.5707963
      69 -0.6666667 10.000000 4 -3.296296 -1.5707963
      70 -0.3333333 10.000000 4 -3.296296 -1.5707963
      71 0.0000000 10.000000 4 -6.102943 -1.4288993
      72 35.0000000 15.000000 5 -7.610268 -1.5253730
      ...
      80 44.7777778 15.444444 5 -7.610268 -1.5253730


      The regions of stationary (distance, angle) values are clear and the breaks between them are obvious.



      Here is a plot of the (distance, angle) pairs, colored according to the original segments created in the question:



      Figure 1



      I applied a hierarchical clustering solution (using the hclust function in R with the default complete linkage) and used that to classify the original data (except for the last point). Here is the classification, shown with colors:



      Figure 2






      share|cite|improve this answer






















        up vote
        3
        down vote



        accepted







        up vote
        3
        down vote



        accepted






        The data appear to be a sequence of $(x,y)$ coordinates and the question concerns how to cluster them into linear segments.



        Solution



        A simple method uses a brute-force (but efficient) version of the Hough Transform.



        The basic idea is that every line in the plane can be parameterized by its distance to a fixed origin and the direction perpendicular to it. Each ordered pair of distinct points, $(x_i,y_i)$ and $(x_i+1,y_i+1),$ determines a line and therefore may be encoded as a (distance, angle) ordered pair. Sequences of points lying on the same line can then be easily recognized as stationary sequences of (distance, angle) pairs. Simply cluster these.



        For numerical stability, it is wise to use an origin adapted to the data--perhaps their barycenter.



        There are many ways to proceed from this point. For instance, a kernel density estimate of the (distance, angle) pairs is tantamount to the Hough transform itself. After removing outliers, any reasonable clustering method for 2D Euclidean points should readily identify all the segments.



        Remark



        If the data do not have a definite sequence, compute the Hough transform for all possible pairs $(x_i,y_i)$ and $(x_j,y_j).$ This is more computationally demanding and will produce a much messier transform, but nevertheless any lines formed by more than a few of the points will correspond to the highest-density regions of the transform.




        Example



        The question presents a sequence of 81 points (not all distinct). Here are portions of the derived information described above:



         x y id Distance Angle
        1 0.0000000 10.000000 1 7.999634 -2.1375255
        2 0.6875000 9.562500 1 7.999634 -2.1375255
        3 1.3750000 9.125000 1 7.999634 -2.1375255
        ...
        16 10.3125000 3.437500 1 7.999634 -2.1375255
        17 11.0000000 3.000000 1 NaN 0.0000000
        18 11.0000000 3.000000 2 3.703704 -1.5707963
        19 11.4705882 3.000000 2 3.703704 -1.5707963
        ...
        51 26.5294118 3.000000 2 3.703704 -1.5707963
        52 27.0000000 3.000000 2 NaN 0.0000000
        53 27.0000000 3.000000 3 7.812028 -0.5880026
        54 27.5714286 3.857143 3 7.812028 -0.5880026
        ...
        66 34.4285714 14.142857 3 7.812028 -0.5880026
        67 35.0000000 15.000000 3 6.164931 1.7088024
        68 -1.0000000 10.000000 4 -3.296296 -1.5707963
        69 -0.6666667 10.000000 4 -3.296296 -1.5707963
        70 -0.3333333 10.000000 4 -3.296296 -1.5707963
        71 0.0000000 10.000000 4 -6.102943 -1.4288993
        72 35.0000000 15.000000 5 -7.610268 -1.5253730
        ...
        80 44.7777778 15.444444 5 -7.610268 -1.5253730


        The regions of stationary (distance, angle) values are clear and the breaks between them are obvious.



        Here is a plot of the (distance, angle) pairs, colored according to the original segments created in the question:



        Figure 1



        I applied a hierarchical clustering solution (using the hclust function in R with the default complete linkage) and used that to classify the original data (except for the last point). Here is the classification, shown with colors:



        Figure 2






        share|cite|improve this answer












        The data appear to be a sequence of $(x,y)$ coordinates and the question concerns how to cluster them into linear segments.



        Solution



        A simple method uses a brute-force (but efficient) version of the Hough Transform.



        The basic idea is that every line in the plane can be parameterized by its distance to a fixed origin and the direction perpendicular to it. Each ordered pair of distinct points, $(x_i,y_i)$ and $(x_i+1,y_i+1),$ determines a line and therefore may be encoded as a (distance, angle) ordered pair. Sequences of points lying on the same line can then be easily recognized as stationary sequences of (distance, angle) pairs. Simply cluster these.



        For numerical stability, it is wise to use an origin adapted to the data--perhaps their barycenter.



        There are many ways to proceed from this point. For instance, a kernel density estimate of the (distance, angle) pairs is tantamount to the Hough transform itself. After removing outliers, any reasonable clustering method for 2D Euclidean points should readily identify all the segments.



        Remark



        If the data do not have a definite sequence, compute the Hough transform for all possible pairs $(x_i,y_i)$ and $(x_j,y_j).$ This is more computationally demanding and will produce a much messier transform, but nevertheless any lines formed by more than a few of the points will correspond to the highest-density regions of the transform.




        Example



        The question presents a sequence of 81 points (not all distinct). Here are portions of the derived information described above:



         x y id Distance Angle
        1 0.0000000 10.000000 1 7.999634 -2.1375255
        2 0.6875000 9.562500 1 7.999634 -2.1375255
        3 1.3750000 9.125000 1 7.999634 -2.1375255
        ...
        16 10.3125000 3.437500 1 7.999634 -2.1375255
        17 11.0000000 3.000000 1 NaN 0.0000000
        18 11.0000000 3.000000 2 3.703704 -1.5707963
        19 11.4705882 3.000000 2 3.703704 -1.5707963
        ...
        51 26.5294118 3.000000 2 3.703704 -1.5707963
        52 27.0000000 3.000000 2 NaN 0.0000000
        53 27.0000000 3.000000 3 7.812028 -0.5880026
        54 27.5714286 3.857143 3 7.812028 -0.5880026
        ...
        66 34.4285714 14.142857 3 7.812028 -0.5880026
        67 35.0000000 15.000000 3 6.164931 1.7088024
        68 -1.0000000 10.000000 4 -3.296296 -1.5707963
        69 -0.6666667 10.000000 4 -3.296296 -1.5707963
        70 -0.3333333 10.000000 4 -3.296296 -1.5707963
        71 0.0000000 10.000000 4 -6.102943 -1.4288993
        72 35.0000000 15.000000 5 -7.610268 -1.5253730
        ...
        80 44.7777778 15.444444 5 -7.610268 -1.5253730


        The regions of stationary (distance, angle) values are clear and the breaks between them are obvious.



        Here is a plot of the (distance, angle) pairs, colored according to the original segments created in the question:



        Figure 1



        I applied a hierarchical clustering solution (using the hclust function in R with the default complete linkage) and used that to classify the original data (except for the last point). Here is the classification, shown with colors:



        Figure 2







        share|cite|improve this answer












        share|cite|improve this answer



        share|cite|improve this answer










        answered 1 hour ago









        whuber♦

        199k32430795




        199k32430795




















            Federico Gentile is a new contributor. Be nice, and check out our Code of Conduct.









             

            draft saved


            draft discarded


















            Federico Gentile is a new contributor. Be nice, and check out our Code of Conduct.












            Federico Gentile is a new contributor. Be nice, and check out our Code of Conduct.











            Federico Gentile is a new contributor. Be nice, and check out our Code of Conduct.













             


            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f375787%2fhow-to-cluster-parts-of-broken-line-made-of-points%23new-answer', 'question_page');

            );

            Post as a guest













































































            Comments

            Popular posts from this blog

            Long meetings (6-7 hours a day): Being “babysat” by supervisor

            Is the Concept of Multiple Fantasy Races Scientifically Flawed? [closed]

            Confectionery