Data Visualization using Box-Plot

Data visualization is the first step in data analysis. DataPandit allows you to visualize boxplots as soon as you segregate categorical data from the numerical data. However, the box plot does not appear until you uncheck  ‘Is this spectroscopic data?’ option in the sidebar layout as shown in Figure 1. 

Figure 1: Boxplot in DataPandit

The box plot is also known as ‘Box – Whisker Plot’. It provides 5-point information, including the minimum score, first (lower) quartile, median, third (upper) quartile, and maximum score. 

When Should You Avoid Boxplot?

The box plot itself provides 5-point information. Hence, you should never use a box plot to visualize data with less than five observations. In fact, I would recommend using a boxplot only if you have more than ten observations.

Why Visualize?

If you want a DataPandit user, you might just ask, ‘Why should I visualize my data in first place? Wouldn’t it be enough if I just analyze my model by segregating the response variable/categorical variable in the data?’ The answer is, ‘No’ because box plots often help you determine the distribution of your data.

Why is Distribution Important?

If your data is not normally distributed, you most likely might induce bias in your model. Additionally, your data may also have some outliers that you might need to remove before proceeding to advanced data analytics approaches. Also, depending on the data distribution, you might want to apply some data pre-treatments to build better models.

Now the question is how to visualize these abnormalities in the data? Don’t worry, we will help you here. Following are the key aspects that you must evaluate while data visualization.

The spread of the data

You can determine the spread of the data by looking at the lowest and highest measurement for a particular variable.  In statistics,  the spread of the data is also known as the range of the data. For example, in the following box plot, the spread of the variable ‘petal.length’ is from 1 to 6.9 units.

Figure 2: Iris raw data boxplot 

Mean and Median

The mean and median of normally distributed data coincide with each other. For example, we can see that the median petal.length is 4.35 units based on the boxplot. However, if you take a look at the data summary for the raw data, then the mean for petal length is  3.75 units as shown in Figure 3. In other words, the mean and median do not coincide which means the data is not normally distributed.

Figure 3: Data summary for Iris raw data

Left Skewed or Right Skewed?

The values for mean and median also help you find out if your data is skewed toward the right or left? If the mean is greater than the median, the data is skewed towards the right. Whereas if the mean is smaller than the median, the data is skewed towards the left. 

Alternatively, you can also observe the interquartile distances visually to see where most of your data lie. If the quartiles are uniformly divided, you most likely have normal data.

Understanding the skewness can help you know if the model will have a bias on the lower side or higher side. You can include more samples to achieve normal distribution depending on the skewness.

Is it an outlier?

Data visualization can help identify outliers. You can identify outliers by looking at the values far away from the plot. For example, the highlighted value (X1, max=100) in Figure 4 could be an outlier. However, in my opinion, you should never label an observation as an outlier unless you have a strong scientific or practical reason to do so.

Figure 4: Spotting outlier in boxplot

Do I need any data pre-treatments?

If the data spread is too different for different variables, or if you see outliers with no scientific or practical reasons, then you might need some data pre-treatments. For example, you can mean center and scale the data as shown in Figure 5 and Figure 6 before proceeding to the model analysis. You can see these dynamic changes in the boxplot only in the MagicPCA application.

Figure 5: Iris mean-centered data boxplot

Figure 5: Iris mean-centered data boxplot

x

Conclusion

Data visualization is crucial to building robust and unbiased models. Boxplots are one of the easiest and most informative ways of visualizing the data in DataPandit. Data visualization can be a very useful tool to spot outliers. It can also help to finalize the data pre-treatments for building robust models.

Need multivariate data analysis software? Apply here to obtain free access to our analytics solutions for research and training purposes!