When building a histogram, how many bins should it have? There’s no definitive answer to the question, but the number of bins used for a histogram can drastically impact how visualized data is interpreted. Details such as data distribution patterns and sample quantity can impact how data is displayed in bins, and quite often the number of bins is arbitrarily chosen.

In a Business Intelligence solution, determining the number of bins is more difficult than with static reports since the volume and distribution of data can change with updates. This article intends to share one method for building bins that will automatically compensate for updates in total data quantity and range.

Numerous methods exist for determining the number of bins in a histogram. For this example, I’ve chosen Sturges’ formula. I wish that I had a mathematical basis for choosing it, but unfortunately I do not. I chose Sturges’ formula because it works and it can easily be coded into a BI solution. The equation is: *k = [log2N + 1]* where *k *represents the number of bins. The equation works well with almost any quantity of data in a typical BI solution. A count of 100 will have 7 bins and a count of 1 billion will have 30 bins. Either of those bin quantities is visually consumable on a histogram.

For this example, I’ll use the FAA Wildlife Strike Data model which I built for another blog and video series that can be found at this link. The following DAX equation (a custom column named [BINS]) will use Sturges’ formula to determine a recommended number of bins for the data set:

=ROUND(LOG((COUNTROWS(‘Fact StrikeReport’)),2),0) – What does this DAX expression accomplish? It calculates the log2 for the count of rows, and then rounds it to the closest integer. I should note an inherent limitation in this method because it calculates the number of bins based upon the total row count, and it does not recalculate when a data set is filtered in a solution. It also assumes that data is normally distributed. That being said, it provides a useful method to dynamically update the number of bins for a histogram.

The count of bins in [BINS] can then be used to determine the width of histogram bins. We’ll use the metric [SPEED] for our histogram, which was the speed at which an aircraft was moving when it hit a bird or other creature. The second custom DAX column (named [SPEED RANGE]), to determine bin width, is:

=CEILING((MAX([SPEED]) – MIN([SPEED])) / [BINS],1) – This calculation finds the difference in the range of the minimum and maximum ranges of Speed measurements, divides by the number of bins, and rounds up to the nearest integer.

Next, let’s create a column using DAX (named [BIN NUM]) to assign bin numbers that can also be used to order the bins:

=(FLOOR(([SPEED] / [SPEED RANGE]),1)+1) – This calculation finds the number of bins by dividing the [SPEED] on each row by the [SPEED RANGE] for the total data set, rounds down to the nearest integer, and adds 1. Thus the first bin (with all non-negative data) will be “1”, the second bin will be “2”, etc.

Finally, let’s give the bins names for the histogram. A calculated column (named [BIN NAME]) can be created with the following DAX code:

=CONCATENATE(CEILING((([BIN NUM] – 1) * [SPEED RANGE]),1),CONCATENATE(” to “,FLOOR([BIN NUM] * [SPEED RANGE],1))) – While it looks complicated, all that this DAX expression does is take the lowest integer value in the bin range, the highest value in the bin range, and creates a text description for the bin. Using the “Sort By Column” feature on the ribbon, set the column [BIN NAME] to be sorted by [BIN NUM].

The four new calculated columns should appear as follows in PowerPivot:

Finally, the solution can be deployed to a PivotChart in Excel. Filters, slicers, and interactive tools can then be used to analyze the chart:

Too many bins in the chart? Or is the data distributed in a manner that could use fewer bins? Instead of using log2 for Sturges’ formula, try a larger value. For example, if the demo solution uses log4 for the bin count the histogram will appear as follows:

[…] method I use is no different from what others have already blogged and wrote about. There is even a solution that calculates the number of bins in a histogram with a formula that is based on the total number […]