Density-Based And Graph-Based Clustering

Arun Jagota
March 1, 2021 AI & Machine Learning

What they are. When they are useful. How they relate to each other.

Consider the data set

[Figure: the data set]

What do you see? I see three definite clusters and four outliers. What makes the clusters ‘definite’? Each is dense, with many points in it. What makes the remaining points outliers? Each is a singleton, far from any of the dense clusters.

Density-based clustering is an approach that fits this example. Its aim is to find maximal clusters, each sufficiently dense. Maximal just means a cluster cannot be grown without decreasing the density significantly. For example, growing the rightmost dense cluster in the figure above by adding the point near the top right corner would decrease the density significantly, so we leave it out, preferring to think of it as an outlier instead.

Once the clusters have been found, clusters that have very few points in them can be deemed outliers.

Before diving into density-based clustering, let’s ask the question: why not use K-means clustering? We’ll assume the reader is familiar with this algorithm. If not, check out [1].

K-means clustering is ill-suited to this problem. For one, what K would you use? Think of it this way. Imagine that as we sampled from the data’s population, the three dense clusters remained largely the same but the outliers changed, including the number of them. This is a reasonable model of a population that inherently forms a small fixed number of clusters but is prone to large random errors, i.e. has an unknown number of outliers.

Using K-means clustering in the above setting would be very sensitive to the choice of K. This is because the total number of clusters can vary significantly from sample to sample. However, the number and characteristics of the dense clusters largely remain the same. So in principle, a density-based approach would be more robust.

Okay, at an intuitive level, the notion of dense clusters is simple and clear. But how do we actually find the dense clusters?

Let’s view the situation as follows. For a given distance d we can define a graph whose edges capture pairs of points that are at most distance d apart. That is, the points are the nodes of this graph. Two points are connected by an edge if they are within distance d.
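To make this concrete, here is a minimal sketch of building such a graph in Python. The helper name build_graph and the use of NumPy, SciPy, and networkx are illustrative choices, not prescriptions:

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def build_graph(points, d):
    """Build a graph with one node per point and an edge between every
    pair of points that are at most distance d apart."""
    dists = squareform(pdist(points))   # pairwise Euclidean distances
    n = len(points)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if dists[i, j] <= d:
                G.add_edge(i, j)
    return G

# Example with made-up data:
# points = np.random.rand(100, 2)
# G = build_graph(points, d=0.1)
```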

How does this help us? If we can somehow just choose the right d, the graph’s connectivity structure will capture the cluster structure. This is depicted in the example below.

(a): data. (b): graph for a good d. (c): graph is fully-connected for very large d.

For just the right choice of d, we get a good graph. That is, a graph whose connectivity captures the clusters. When d is too large, the graph is fully-connected (i.e., a clique). This graph is useless for clustering. When d is too small, there may be no edges in the graph. Again, useless for clustering.

Okay, so how do we choose a good d? Not an easy question to answer. Still, we’ve made progress: we have transformed the problem into one of graph clustering. So now we just have to choose a good d. Sometimes we can, using domain knowledge or exploratory analysis of the data set. What if neither helps? We’ll defer that discussion to later.

In this particular example, for a good choice of d, the clusters are just the connected components of the graph. (A connected component is a maximal connected piece of the graph.) There are easy algorithms to find the connected components in a graph.
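With the graph in hand, connected components clustering takes only a few lines, for example with networkx (continuing the build_graph sketch above; the names are again illustrative):

```python
import networkx as nx

def connected_component_clusters(G):
    """Each connected component of the threshold graph is one cluster."""
    return [set(c) for c in nx.connected_components(G)]

# clusters = connected_component_clusters(G)
# outliers = [c for c in clusters if len(c) == 1]   # singleton clusters
```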

Below is another example on which the connected components algorithm works quite well. This data set has two clusters. Their shapes are such that 2-means clustering wouldn’t work well. (2-means clustering would work better with elliptical clusters.) For the right d, connected components will find these two clusters.

Two clusters, with non-elliptical shapes.

Recap

Okay, so this is what we have so far. If somehow we could choose a good distance threshold d, we can transform the clustering problem into one on graph clustering. Even the simplest graph clustering algorithm, connected components clustering, often works well for the right choice of d.

Viewed this way,

density-based clustering = find good d + graph-based clustering

Beyond Connected Components Clustering

With the right choice of d, connected components clustering can often work quite well. But not always.

Consider the following image.

Two dense clusters plus an outlier cluster

I see two dense clusters plus a singleton cluster (the top right point). (The singleton cluster may be interpreted as an outlier.) With just the right d, the graph’s connected components will mirror these three clusters. However, the range of “right” d’s is quite narrow. With a d even slightly larger, the outlier would be absorbed into its neighboring dense cluster.

Now comes the key point. Even for this d, there is information in the graph that allows us to keep the outlier out of the dense cluster. Specifically, the outlier has degree 0. (The degree of a node is the number of edges in the graph that touch it.) By contrast, each of the nodes in the dense cluster has a degree of at least three.

Okay, so why don’t we replace the “connected component” property with a more stringent “dense subgraph” property? There are different ways we could define “dense subgraph”. One is to require that the number of edges in the subgraph is at least a certain minimum percentage of the number of edge slots in the subgraph. (An edge slot is a pair of nodes that could be joined by an edge.) A different way to define “dense subgraph” is to require that every node in the subgraph has a certain minimum degree within the subgraph. The latter is the stricter definition, as it requires each individual node to be sufficiently connected; the former only requires the subgraph as a whole to have a certain minimum density.
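To make the two definitions concrete, here is a small sketch of how each test might be coded on the threshold graph G (the function names and default thresholds are illustrative):

```python
import networkx as nx

def dense_overall(G, S, min_fraction=0.5):
    """Definition 1: the subgraph on node set S fills at least
    min_fraction of its edge slots."""
    H = G.subgraph(S)
    n = H.number_of_nodes()
    slots = n * (n - 1) / 2
    return slots > 0 and H.number_of_edges() / slots >= min_fraction

def dense_per_node(G, S, min_degree=3):
    """Definition 2 (stricter): every node of S has at least
    min_degree neighbors inside S."""
    H = G.subgraph(S)
    return all(deg >= min_degree for _, deg in H.degree())
```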

In principle, we can partition the graph into maximal subgraphs each of which is sufficiently dense according to whatever definition of “dense” we have chosen. In practice, these types of partitioning problems are generally intractable. That is, very time consuming to solve for large graphs.

This is where heuristic algorithms come in.

DBSCAN

This is a widely-used density-based clustering method. It heuristically partitions the graph into subgraphs that are dense in a particular way.

It works as follows. It takes as input the graph derived using a suitable distance threshold d, chosen somehow. The algorithm takes a second parameter, D. Any node in the graph whose degree is at least D is termed a core node. A core node is called such because it will form the core of a cluster.

For a given choice of D, the approach works as follows. First, we find the highest-degree node u in the graph. We start a new cluster at this node. Next, we expand this cluster to contain the neighbors of u. Next, if any neighbor v of u is a core node (this is where D comes in) we add the neighbors of v to this same cluster. We keep going until the cluster stops growing. We then freeze the cluster, and repeat this process on the remainder of the graph.
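In code, this expansion might look like the rough sketch below. It follows the description above (on the threshold graph G, with the core-degree parameter D) rather than any particular reference implementation of DBSCAN:

```python
import networkx as nx

def density_clusters(G, D):
    """Grow clusters by expanding outward from core nodes (degree >= D)."""
    unassigned = set(G.nodes())
    clusters = []
    while unassigned:
        u = max(unassigned, key=G.degree)   # highest-degree unassigned node
        if G.degree(u) < D:
            break                           # no core nodes left
        cluster, frontier = {u}, [u]
        while frontier:
            v = frontier.pop()
            if G.degree(v) < D:
                continue                    # only core nodes spread the cluster
            for w in G.neighbors(v):
                if w in unassigned and w not in cluster:
                    cluster.add(w)
                    frontier.append(w)
        clusters.append(cluster)
        unassigned -= cluster
    # Whatever remains unassigned ends up outside the dense clusters.
    return clusters, unassigned
```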

This process is depicted below. In step 1, we find the first core, which becomes a new cluster. This core’s degree is 4, which exceeds our setting D = 3. In step 2, we expand this core to include its neighbors in the cluster. This is shown by the dashed circle. In step 3, we discover that one of these neighbors is itself a core. In step 4, we expand the cluster further to include the neighbors of the second core. So the entire graph shown is now in a single cluster.

(1) First core found. (2) First core’s neighbors found. (3) Second core detected in first core’s neighbors. (4) Cluster expanded to include second core’s neighbors.

Once we have found all the clusters, we deem the ones with sufficiently-many points in them to be dense and the rest to be outliers.

An Alternative Visualization

One way to visualize this process is to imagine that from the first graph (which was derived from the data for a given d) we derive a second graph. The nodes of the second graph are the core nodes in the first graph. Two nodes in the second graph are connected by an edge if there is an edge between the same two nodes in the first graph.

Now we find the connected components in the second graph. Each of the resulting clusters contains one or more (core) nodes. We then expand each of these clusters further using the edges in the original graph. Specifically, for every core node in a cluster, we also put its neighbors into the same cluster.

This process may be summarized as follows.

  1. Build G2 from G.
  2. Find connected components in G2.
  3. Expand each connected component using the edges in G.
  4. If some nodes in G are still unassigned, find the connected components in the subgraph of G induced by these nodes. These connected components become additional clusters. We also know that these are ‘secondary’ clusters, as they don’t have any core points in them.

This view opens the door to using connected component algorithms or implementations.
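Here is a sketch of the four steps in code, again assuming the threshold graph G and the core-degree parameter D (names are illustrative):

```python
import networkx as nx

def core_graph_clusters(G, D):
    cores = {v for v in G.nodes() if G.degree(v) >= D}
    G2 = G.subgraph(cores)                                     # step 1: build G2 from G
    clusters = [set(c) for c in nx.connected_components(G2)]   # step 2
    for cluster in clusters:                                   # step 3: expand via edges of G
        for core in list(cluster):
            cluster.update(G.neighbors(core))
    assigned = set().union(*clusters) if clusters else set()
    leftover = G.subgraph(set(G.nodes()) - assigned)           # step 4: secondary clusters
    secondary = [set(c) for c in nx.connected_components(leftover)]
    return clusters, secondary
```

(In this sketch, a non-core node adjacent to cores in two different components would be placed in both clusters; a real implementation would break such ties, e.g. by assigning it to the first cluster that reaches it.)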

We illustrate this process in the figure below.

Exhibit (a) shows the data set. (b) shows the graph G derived from it for a certain d (not shown). (c) shows the graph G2 derived from G for D = 3. (d) shows the connected components in G2, expanded using the additional edges in G. (e) depicts the secondary cluster composed of the nodes in G that remained unassigned after (d).

Choosing the Hyperparameters

So graph-based clustering has one hyperparameter: d. DBSCAN’s particular algorithm has a second: D. How might we choose these when we have no domain knowledge to guide us?

There is no magical algorithm to find the best values. That said, there are some heuristic ways that ease the burden of finding good candidates. We will focus on d, which is more important as it defines the graph.

The number of edges in the graph is a monotone function of d. When d is smaller than the minimum distance between any two points in the data set, there are no edges. As we increase d, edges start to appear. Eventually, when d is large enough, we get a fully-connected graph.

Consider this approach. Track the number of edges as a function of d, and pick the first d where there is a significant jump. This is depicted below.

[Plot: the number of edges as a function of d, with a jump at a good value of d]

The intuition behind this is that if there are dense clusters in the data, the graph will suddenly get a lot of edges as d approaches the right value.
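One way to implement this heuristic is sketched below: sweep a grid of candidate thresholds, count the edges each would produce, and look at the jumps. The grid size and the “largest jump” rule are illustrative; one could instead pick the first jump that exceeds some threshold.

```python
import numpy as np
from scipy.spatial.distance import pdist

def edge_counts(points, num_candidates=50):
    """Number of edges in the threshold graph for a grid of candidate d's."""
    dists = np.sort(pdist(points))          # all pairwise distances, ascending
    candidates = np.linspace(dists[0], dists[-1], num_candidates)
    counts = np.searchsorted(dists, candidates, side="right")
    return candidates, counts

# candidates, counts = edge_counts(points)
# i = np.argmax(np.diff(counts))            # position of the largest jump
# good_d = candidates[i + 1]                # a candidate just after that jump
```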

Consider the example below. It’s not hard to visualize the d at which suddenly lots of edges emerge.

[Figure: a data set with two dense clusters]

In fact, the graph of the number of edges as a function of d for this picture might look like this.

[Plot: the number of edges as a function of d for this data set]

This is because the first cluster is a bit denser than the second one.

HCS

This may be viewed as a conservative variant of connected components clustering (CCC). To see this, let’s rename CCC to connected subgraphs clustering. (This is just semantics.) HCS stands for highly connected subgraphs clustering.

Naming aside, how does HCS work? (Recall that we said this type of dense-subgraph partitioning is generally intractable, so HCS is a heuristic.) It uses the notion of min-cuts. There are efficient algorithms to find min-cuts, and heuristics or approximations for constrained variants.

HCS starts with the entire graph and checks if it is highly connected. We can define “highly connected” however we like, so long as we can test for the condition efficiently. If the graph is not highly connected, HCS partitions it into two non-empty subgraphs in such a way that the number of edges crossing the partition is minimized. Such a set of crossing edges is called a min-cut. We might add further constraints on the two subgraphs, such as requiring that neither is too small, i.e. that the partition is not too imbalanced. There are algorithms to find min-cuts under such additional constraints; we won’t discuss these.

We then recurse on the two subgraphs.
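Here is a rough sketch of that recursion using networkx’s Stoer–Wagner minimum-cut routine, with one common definition of “highly connected”: the minimum cut is larger than half the number of nodes. It assumes the input graph is connected (run it on each connected component), and it imposes no balance constraint on the cut:

```python
import networkx as nx

def hcs(G, min_size=3):
    """Recursively split along a global minimum cut until every piece
    is highly connected. Assumes G is connected."""
    n = G.number_of_nodes()
    if n < min_size:
        return [set(G.nodes())]             # tiny pieces become their own clusters
    cut_value, (part1, part2) = nx.stoer_wagner(G)   # global minimum edge cut
    if cut_value > n / 2:                   # highly connected: stop splitting
        return [set(G.nodes())]
    return (hcs(G.subgraph(part1).copy(), min_size) +
            hcs(G.subgraph(part2).copy(), min_size))
```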

CCC may be viewed as a special case in which the highly connected condition is just connected and the min-cut must be of size 0.

Example

Consider the graph depicted below. For an appropriate definition of “highly connected”, HCS should find the two clusters below. There is only one edge crossing the clusters. That is, the cut is small.

Two highly-connected subgraphs connected by a small cut

One could reasonably make the case that the top node in the first cluster — the one whose degree is 1 — should go into its own (singleton) cluster. For an appropriate definition of “highly connected,” this can be made to happen.

Summary

In this post, we covered density-based clustering and explained that it is a better fit than K-means clustering to scenarios in which (i) the number of clusters is not known in advance, (ii) there are outliers, and (iii) the clusters are not elliptical.

Along the way, we also saw that graph-based clustering more generally has these attractive properties. In fact, the most popular density-based clustering algorithm, DBSCAN, is graph-based.

Finally, we covered DBSCAN and a few other graph-based clustering algorithms (CCC, HCS) in some detail.

Further Reading

  1. K-means Clustering and Variants, by Arun Jagota, Dec 2020.
  2. https://en.wikipedia.org/wiki/DBSCAN
  3. https://en.wikipedia.org/wiki/HCS_clustering_algorithm
