When working with data, it is sometimes important to know the point at which the relative cost of increasing some tunable parameter is no longer worth the corresponding performance benefit. The "kneedle" algorithm detects those beneficial data points that best balance inherent tradeoffs, called "knees" (in curves with negative concavity) or sometimes "elbows" (in curves with positive concavity), in discrete data sets, based on the mathematical definition of curvature for continuous functions. With this article, I want to demonstrate the benefits of this algorithm and its applications with the Python package "kneed".
The "kneedle" algorithm was published by Satopää, Albrecht, Irwin, and Raghavan, who use the concept of curvature as a mathematical measure of how much a function differs from a straight line. Consequently, the point of maximum curvature captures the leveling-off effect that operators use to identify knees.
For continuous functions, curvature has a standard definition in differential geometry (see, e.g., the Wikipedia article on curvature):
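Concretely, for a twice-differentiable function f, the curvature at x is

K_f(x) = f''(x) / (1 + f'(x)^2)^(3/2)

The knee is then the point where this quantity is maximal, i.e., where the curve levels off fastest.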
Let's create, for instance, N = 10,000 standard normally distributed random data points:
Now we can sort the values in ascending order to get an empirical cumulative distribution function (CDF) showing each x-value on the x-axis and its probability of occurrence on the y-axis. The maximum curvature of this function is roughly at x = 1.16:
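A minimal sketch of these two steps, assuming NumPy (the variable names are my own):

```python
import numpy as np

# N = 10,000 standard normally distributed random data points
rng = np.random.default_rng(seed=42)
samples = rng.standard_normal(10_000)

# empirical CDF: sorted values on the x-axis,
# cumulative probability of occurrence on the y-axis
x = np.sort(samples)
cdf = np.arange(1, x.size + 1) / x.size
```

Plotting x against cdf (e.g., with Matplotlib) yields the S-shaped curve whose maximum curvature is referred to above.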
While curvature is well defined for continuous functions, there is no clear definition of curvature for discrete data. One option is to fit a continuous function to the discrete data and compute its curvature, but when the data is noisy, this fitting becomes difficult. This is where the "kneedle" algorithm comes into play.
Let's take a deeper look at how this algorithm works. For this, I have used the original discrete data set from Figure 2 of the conference paper by Satopää et al. It's provided by the Python package "kneed":
This is the original data being plotted:
Let's go through the steps the "kneedle" algorithm follows:
1. Smoothing spline
Using a spline to “smooth” the shape of the original data.
2. Normalizing the points of the smoothed curve to the unit square
The ranges of the x- and y-values are normalized to 0 to 1 (see the axes).
3. Calculating the difference curve
Calculating perpendicular distances (Euclidean distances from the data points to the line from the first to the last data point)…
… and rotating them by 45 degrees clockwise. The magnitudes of these rotated perpendicular lines are the difference values.
4. Calculating the local maxima of the difference curve
Finding the “knee” points at the local maxima of the difference curve in the normalized curve (for detecting elbow points, the graph will be inverted).
5. Calculating threshold for each local maximum in the difference curve
For each local maximum in the difference curve, there is a unique threshold value that is based on the average difference between consecutive x-values and a sensitivity parameter (please see the original conference paper for how this threshold is specifically computed). The sensitivity parameter defines how many "flat" points are expected in the original, unmodified data before a "knee" is declared, and it can be adjusted: a small sensitivity value detects knees more quickly, whereas a larger one is more conservative.
6. Comparing each difference value with the threshold
If a difference value drops below the threshold before the next local maximum is reached, the algorithm declares a "knee". Otherwise, the threshold is reset to zero.
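The steps above can be sketched in a few lines of NumPy. This is my own simplified toy implementation for a concave, increasing curve; the spline-smoothing step and several edge cases of the real algorithm are omitted:

```python
import numpy as np

def find_knee(x, y, S=1.0):
    """Toy kneedle sketch for a concave, increasing curve (no smoothing)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # step 2: normalize x and y to the unit square
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())
    # step 3: difference curve (equivalent to the 45-degree rotation)
    d = yn - xn
    # step 4: local maxima of the difference curve
    maxima = [i for i in range(1, len(d) - 1) if d[i - 1] < d[i] >= d[i + 1]]
    # step 5: threshold = local maximum minus S times the mean x-spacing
    T = S * np.mean(np.diff(xn))
    # step 6: declare a knee if the difference curve drops below the
    # threshold before the next local maximum is reached
    for i in maxima:
        for j in range(i + 1, len(d)):
            if d[j] < d[i] - T:
                return x[i]        # knee at this local maximum
            if j in maxima:        # reached the next local maximum first
                break
    return None
```

On the Figure 2 data (y = −1/(x + 0.1) + 5 on ten evenly spaced x-values between 0 and 1), this sketch reports the knee at x ≈ 0.22.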
Now let's take a deeper look into the Python package "kneed" by Kevin Arvai, which implements the "kneedle" algorithm:
With this, we can for instance directly plot the normalized difference curve including the “knee” point:
We can use the KneeLocator class in order to find a knee/elbow in our data:
# calculate and show knee/elbow
kneedle = kneed.KneeLocator(…)
knee_point = kneedle.knee  # elbow_point = kneedle.elbow
print('Knee:', knee_point)  # print('Elbow:', elbow_point)
These are the most important parameters:
- curve: "concave" for a "knee", "convex" for an "elbow", based on the negative/positive concavity of the function
- direction: “increasing” for a positive slope, “decreasing” for a negative slope of the function
- online: if False, the first knee detected is returned and kept; if True, "old" knee values are corrected if necessary as new data points arrive
- S: sensitivity for knee detection (S ≥ 0); Satopää et al. state that "kneedle" has perfect information in an offline setting when the sensitivity is 0. In online settings, a sensitivity of 1 shows the best overall performance (but this can vary with the data points received).
Detecting elbows with the “kneedle” algorithm becomes super handy for clustering algorithms:
With the elbow method, you can calculate the within-cluster sum of squares as a measure of the within-cluster variance and detect the best value for the number of clusters to form (k):
When calculating the distance metric (usually the Euclidean distance) for each data point and sorting the values in ascending order, they can be plotted in a k-distances graph to find a threshold value for defining a cluster:
Even for classification, "kneedle" becomes extremely useful when looking at the true positive rate (TPR) and the false positive rate (FPR). Their trade-off is represented in a receiver operating characteristic (ROC) curve. The "knee" of a ROC curve can provide a good threshold value, as the TPR's increase slows down and the FPR starts increasing faster at this point:
For "kneed" there is also an interactive app called "ikneed", built with Streamlit, where you can test (small) data sets for "knees" and "elbows". There you can also play around with the parameters and see their impact directly:
In my opinion, "kneed" is currently the most powerful Python package for knee detection, as it provides a lot of useful methods and parameters. Nevertheless, there are other Python packages for finding the knees of a curve, such as "kneebow":
V. Satopää, J. Albrecht, D. Irwin and B. Raghavan, "Finding a 'Kneedle' in a Haystack: Detecting Knee Points in System Behavior" (2011), 31st International Conference on Distributed Computing Systems Workshops, pp. 166–171, doi: 10.1109/ICDCSW.2011.20.