Part 4 - Data visualization

Posted on Jan 26, 2024

(Last updated: May 26, 2024)

Introduction

In this part we’ll discuss some principles of data visualization.

Five-point summary

The five-point summary is a very common and effective summary of data to get a quick overview and feeling for the data.

To understand what the five-point summary tells us, let’s first understand quantiles and quartiles.

A quantile is a cut-point of a distribution (or a dataset). Quantiles are usually obtained by sorting the data and selecting an appropriate value at an appropriate rank.

Quartiles are quantiles placed so that 25% of the data lies between consecutive cut-points. This means that the 1st quartile, $Q_1$, is a value such that at least 25% of the dataset is at most $Q_1$. The 2nd quartile, $Q_2$, is a value such that at least 50% of the dataset is at most $Q_2$; this is simply the median. The 3rd quartile, $Q_3$, is a value such that at least 75% of the dataset is at most $Q_3$.

In the same way, we could have deciles (10% apart) and centiles, also called percentiles (1% apart).
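As a minimal sketch of the "sort and pick a rank" idea, the function below returns the smallest data value such that at least a fraction $i/q$ of the dataset is at most that value. The name `q_quantile` and its parameters are made up for this illustration, and real libraries often interpolate between neighbouring values instead, so their results can differ slightly.

```python
def q_quantile(data, i, q):
    """Return the i-th q-quantile (0 < i < q) of a non-empty dataset.

    This is the smallest value v in the data such that at least i/q of
    the values are <= v. No interpolation is done.
    """
    ordered = sorted(data)
    # Integer version of ceil(i * n / q), which avoids floating-point error.
    rank = -(-i * len(ordered) // q)
    return ordered[rank - 1]


sample = [7, 1, 20, 4, 12, 8, 3, 15, 10]  # made-up values

quartiles = [q_quantile(sample, i, 4) for i in (1, 2, 3)]
ninth_decile = q_quantile(sample, 9, 10)
print(quartiles, ninth_decile)
```

Quartiles use `q=4`, deciles `q=10` and centiles `q=100`; only the spacing of the cut-points changes.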

The five-point summary (also known as the five-number summary) consists of:

The sample minimum
The 1st quartile $Q_1$
The median $Q_2$
The 3rd quartile $Q_3$
The sample maximum
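The five values can be computed with Python's standard library. This is only a sketch: `five_number_summary` is a name chosen for this example, and the `"inclusive"` method is one of several quartile conventions, so other tools may report slightly different $Q_1$ and $Q_3$ for the same data.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for at least two values."""
    # quantiles(n=4) returns the three quartile cut-points. "inclusive"
    # treats the smallest and largest values as the 0% and 100% quantiles.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


sample = [7, 1, 20, 4, 12, 8, 3, 15, 10]  # made-up values
print(five_number_summary(sample))  # (1, 4.0, 8.0, 12.0, 20)
```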

Creating good figures

Human beings are visual creatures, and a message is often better conveyed graphically than numerically.

Plots are good for summarizing data and showing trends, but only if they are well made and correct.

Be sparing: Whenever you choose to employ visualizations, they should support your message and aid in communication.
Keep It Simple Stupid (KISS): A simple plot is better than a cluttered one.

Colors are often used to distinguish between classes and to encode numerical values in charts. Different kinds of entities should be colored differently if present in the same plot. Use colors that have inherent meaning, if possible, e.g., losses with red, environmental causes with green, etc. If colors have a numerical meaning, take care to choose an appropriate color map, as color maps tend to be non-linear.

Types of charts

There are many different kinds of charts in common use. Let’s cover the main ones.

Line plots

A line plot is the familiar plot of $y = f(x)$. Line plots are appropriate for showing the change of one variable as a function of another.

Place the variable that is varied horizontally (on $x$-axis). Place the function of that variable on the $y$-axis.

Connecting the dots implies that there are values between; it may be appropriate to highlight the data points with markers (unless there is a very large number of points).

A dot plot is obtained when the points are not connected with lines and can be appropriate to avoid confusion.

Scatter plots

Scatter plots place all points of the dataset on a Cartesian plane. This can readily show the bivariate relationship between two variables, such as correlation.

Use appropriately sized dots so that the dots don’t mask the data. Density can be represented with color, as in a heatmap, or points can be shifted slightly at random (jittered) to show where the density is highest.
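Shifting points at random is often called jittering. The sketch below adds small uniform noise to each coordinate so that overlapping points spread out. The function name `jitter`, the `scale` parameter and the sample ratings are illustrative assumptions. The noise should stay small relative to the spacing of the data, otherwise it distorts the picture instead of clarifying it.

```python
import random


def jitter(values, scale, seed=None):
    """Return a copy of values with uniform noise in [-scale, scale] added."""
    rng = random.Random(seed)  # a fixed seed makes the plot reproducible
    return [v + rng.uniform(-scale, scale) for v in values]


# Many identical ratings would be drawn on top of each other in a scatter plot.
ratings = [1, 1, 1, 2, 2, 3, 3, 3, 3]
jittered = jitter(ratings, scale=0.2, seed=0)
print(jittered)
```

The jittered values are meant for plotting only; any statistics should still be computed from the original data.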

Bubble plots vary the dot sizes to match some third variable. This can be a useful variant in addition to color.

Barplot

Barplots can be used to describe the relative proportions of categorical variables.

In addition to grouping along an axis (e.g., year on the $x$-axis), different categories can be shown in different colors.
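To make "relative proportions" concrete without relying on a plotting library, here is a rough sketch that counts categories and draws one text bar per category. The helper names `proportions` and `text_barplot` and the `pets` data are invented for this example. A real barplot would normally be drawn with a plotting package.

```python
from collections import Counter


def proportions(labels):
    """Map each category to its share of all observations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def text_barplot(labels, width=40):
    """Render one text bar per category, largest share first."""
    shares = sorted(proportions(labels).items(), key=lambda item: -item[1])
    lines = []
    for category, share in shares:
        bar = "#" * round(share * width)  # bar length is proportional to share
        lines.append(f"{category:<8} {bar} {share:.1%}")
    return "\n".join(lines)


pets = ["cat", "dog", "cat", "bird", "dog", "cat", "cat", "dog"]
print(text_barplot(pets))
```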

Histograms

Histograms describe the frequency distribution of a variable.

A histogram divides the range of the data into equal-sized intervals, or bins. The surface area of a bar in the histogram is thus proportional to the fraction of elements falling into the bin.
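The binning step can be sketched in a few lines. The function `histogram_counts` below is a name chosen for this illustration. It assumes that the data contains at least two distinct values and that the last bin is closed on the right so the maximum is counted, which is a common convention but not the only one.

```python
def histogram_counts(data, k):
    """Split [min, max] into k equal-width bins and count values per bin.

    Assumes max(data) > min(data). Returns (edges, counts), where edges
    has k + 1 entries and counts has k entries.
    """
    lo, hi = min(data), max(data)
    h = (hi - lo) / k  # bin width
    counts = [0] * k
    for x in data:
        # Bins are [edge, next_edge), except the last one, which also
        # includes the maximum value.
        index = min(int((x - lo) / h), k - 1)
        counts[index] += 1
    edges = [lo + i * h for i in range(k + 1)]
    return edges, counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up values
edges, counts = histogram_counts(values, k=4)
print(edges)   # [1.0, 3.0, 5.0, 7.0, 9.0]
print(counts)  # [3, 5, 1, 1]
```

Dividing each count by `len(values)` gives the fraction of elements in each bin.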

Choosing the number of bins, and thus the bin width, is important. Too small a width makes random noise too pronounced. Too large a width masks the structure of the data.

There is no single correct choice, only better and worse ones.

Ways to choose bin width

Suppose the data is $x = \{x_1, x_2, \ldots, x_n\}$. The bin width $h$ and the number of bins $k$ are related via $$ k = \left\lceil \dfrac{x_{\max} - x_{\min}}{h} \right\rceil $$

There are some default choices:

Square root rule: $k = \lceil\sqrt{n}\rceil$
Sturges’ formula: $k = \lceil\log_2(n) + 1\rceil$, which implicitly assumes a normal distribution
Rice’s formula: $k = \lceil 2\sqrt[3]{n}\rceil$
Doane’s formula: $k = 1 + \log_2(n) + \log_2\left(1 + \frac{|g_1|}{\sigma_{g_1}}\right)$
- Where $g_1 = \frac{1}{n} \sum_{i = 1}^{n} \left(\frac{x_i - \hat{\mu}}{\hat{\sigma}}\right)^3$ is an estimator for sample skewness.
- $\hat{\mu} = \frac{1}{n} \sum_{i = 1}^{n} x_i$ is the sample mean.
- $\hat{\sigma} = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} (x_i - \hat{\mu})^2}$ is the sample standard deviation.
- $\sigma_{g_1} = \sqrt{\frac{6(n - 2)}{(n + 1)(n + 3)}}$
Scott’s rule: $h = \frac{3.49\hat{\sigma}}{\sqrt[3]{n}}$, where $\hat{\sigma}$ is the sample standard deviation. This is optimal for normally distributed data.
Freedman-Diaconis’ rule: $h = 2 \cdot \frac{IQR(x)}{\sqrt[3]{n}}$, where $IQR(x) = Q_3 - Q_1$ is the interquartile range.
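The rules above can be sketched with the standard library as follows. All function names are chosen for this illustration. The sketch makes three assumptions: quartiles use the `"inclusive"` convention of `statistics.quantiles`, the width-based rules are turned into a bin count with $k = \lceil (x_{\max} - x_{\min}) / h \rceil$, and Doane's value is rounded up to a whole number of bins. Rice's rule is taken with the factor 2, $k = \lceil 2\sqrt[3]{n} \rceil$.

```python
import math
import statistics


def bins_from_width(data, h):
    """Convert a bin width h into a bin count k = ceil((max - min) / h)."""
    return math.ceil((max(data) - min(data)) / h)


def square_root_rule(data):
    return math.ceil(math.sqrt(len(data)))


def sturges(data):
    return math.ceil(math.log2(len(data)) + 1)


def rice(data):
    return math.ceil(2 * len(data) ** (1 / 3))


def doane(data):
    n = len(data)  # needs n >= 3 so that sigma_g1 is positive
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)  # n - 1 denominator, as in the text
    g1 = sum(((x - mu) / sigma) ** 3 for x in data) / n
    sigma_g1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    k = 1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1)
    return math.ceil(k)  # rounding up is an assumption of this sketch


def scott(data):
    h = 3.49 * statistics.stdev(data) / len(data) ** (1 / 3)
    return bins_from_width(data, h)


def freedman_diaconis(data):
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    h = 2 * (q3 - q1) / len(data) ** (1 / 3)
    return bins_from_width(data, h)


data = list(range(1, 101))  # made-up data: the integers 1..100
rules = (square_root_rule, sturges, rice, doane, scott, freedman_diaconis)
for rule in rules:
    print(rule.__name__, rule(data))
```

The rules can disagree noticeably on the same data, which is why it is worth trying more than one before settling on a histogram.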

Boxplot

The boxplot, or box-and-whiskers plot, is a concise way to represent the quartiles of the data. The box spans the lower and upper quartiles: its bottom edge marks the 1st quartile $Q_1$ (25%) and its top edge marks the 3rd quartile $Q_3$ (75%), so about 50% of the data lies inside the box.

The median of the data is shown with a line.
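As a small check of the claim that about half of the data lies inside the box, the sketch below computes the quantities the box encodes. The names `box_stats` and `fraction_inside_box` are illustrative, and the quartiles again use the `"inclusive"` convention of `statistics.quantiles`, which is an assumption rather than the only possible choice.

```python
import statistics


def box_stats(data):
    """Return the quantities drawn by the box: Q1, median, Q3 and the IQR."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


def fraction_inside_box(data):
    """Fraction of values lying between Q1 and Q3, both ends included."""
    box = box_stats(data)
    inside = sum(box["q1"] <= x <= box["q3"] for x in data)
    return inside / len(data)


data = list(range(101))  # made-up data: 0, 1, ..., 100
print(box_stats(data))
print(fraction_inside_box(data))
```

For small datasets or data with many ties the fraction can deviate from one half, so "50%" should be read as approximate.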