Exercise

K-means: determine the k

K-means takes the number of clusters as an argument. There are many ways to choose the optimal number of clusters, and which one works best may depend on your data.

One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes (the calculation of total WCSS was explained in the previous video). When you plot the total WCSS against the number of clusters, a good candidate for the optimal number is the point where the total WCSS drops sharply.
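To make the total WCSS concrete, here is a minimal sketch that recomputes it by hand and compares it with the value reported by kmeans(). The dataset is an assumption: the scaled numeric columns of the built-in iris data stand in for the course data, and the names data_scaled, sq_dist and total_wcss are illustrative only.

```r
# Stand-in data (assumption): scaled numeric columns of the built-in iris data
data_scaled <- scale(iris[, 1:4])

set.seed(123)  # arbitrary seed, only for reproducibility
km <- kmeans(data_scaled, centers = 3)

# Squared distance from each observation to the center of its assigned cluster
assigned_centers <- km$centers[km$cluster, ]
sq_dist <- rowSums((data_scaled - assigned_centers)^2)

# Summing within each cluster gives the per-cluster WCSS;
# summing those over all clusters gives the total WCSS
wcss_by_cluster <- tapply(sq_dist, km$cluster, sum)
total_wcss <- sum(wcss_by_cluster)

# Both comparisons are expected to agree up to floating-point tolerance
all.equal(total_wcss, km$tot.withinss)
all.equal(as.numeric(wcss_by_cluster), km$withinss)
```

The tot.withinss component of the kmeans() result is the quantity plotted against the number of clusters.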

K-means might produce different results on each run, because it picks the initial cluster centers at random. Calling set.seed() before kmeans() fixes the state of the random number generator, which makes the results reproducible.
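A small sketch of the effect of set.seed(), again using the scaled iris columns as stand-in data (an assumption, not the course dataset). The seed value 123 is arbitrary.

```r
# Stand-in data (assumption): scaled numeric columns of the built-in iris data
data_scaled <- scale(iris[, 1:4])

# Setting the same seed before each call means the same random
# starting centers are drawn, so the two runs give the same clustering
set.seed(123)
km_a <- kmeans(data_scaled, centers = 3)

set.seed(123)
km_b <- kmeans(data_scaled, centers = 3)

# Compare the cluster assignments of the two runs
identical(km_a$cluster, km_b$cluster)
```

Without the set.seed() calls, the two runs may start from different centers and can end up with different cluster labels or even different clusters.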

Instructions

  • Set the max number of clusters (k_max) to be 10
  • Execute the code to calculate total WCSS. This might take a while.
  • Visualize the total WCSS as the number of clusters goes from 1 to 10. The optimal number of clusters is where the value of total WCSS changes sharply. In this case, two clusters seem optimal.
  • Run kmeans() again with two clusters and visualize the results
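
The steps above could look roughly like the following sketch. It is not the exercise's own solution code: the course dataset is not shown on this page, so the scaled iris columns are used as a stand-in (an assumption), and base R graphics are used for the plots. The names data_scaled and twcss are illustrative.

```r
# Stand-in data (assumption): the exercise's own dataset is not shown here
data_scaled <- scale(iris[, 1:4])

set.seed(123)  # arbitrary seed, so the random starts are reproducible

# Maximum number of clusters to try
k_max <- 10

# Total WCSS for k = 1, ..., k_max
twcss <- sapply(1:k_max, function(k) {
  kmeans(data_scaled, centers = k)$tot.withinss
})

# Plot total WCSS against the number of clusters and look for the sharp drop
plot(1:k_max, twcss, type = "b",
     xlab = "Number of clusters", ylab = "Total WCSS")

# Run k-means again with the chosen number of clusters (two, as in the exercise)
km <- kmeans(data_scaled, centers = 2)

# Scatterplot matrix with points colored by cluster assignment
pairs(data_scaled, col = km$cluster)
```

With a different dataset the sharp drop may appear at a different number of clusters, so the value passed to centers in the final kmeans() call should follow what the plot shows.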