Correlation Vs Causation

Correlation Vs Causation

The subtle difference between correlation and causation is very important for budding data analysts. Often we get so excited about the patterns in the data that we forget to evaluate whether we are looking at a mere correlation or a definite cause. It is very easy to get carried away with the idea of delivering a fascinating insight to our clients or cross-functional teams. In this blog post, let us talk briefly about the difference between correlation and causation.


Causation

The word causation means that there is a cause-and-effect relationship between the variables under investigation. A cause-and-effect relationship makes one variable change when the other variable changes. For example, if I don’t study, I will fail the exam; alternatively, if I study, I will pass the exam. In this simple example, the cause is ‘studying,’ whereas ‘success in the exam’ is the effect.

Correlation does not imply causation

Correlation

The word correlation means that a statistical relationship exists between the variables under investigation. The statistical relationship indicates that the change in one variable is mathematically related to the change in the other variables. Variables with no causal relationship can also show an excellent statistical correlation. For example, my friend found out that a candidate’s success in an exam is positively correlated with the fairness of the candidate’s skin—the fairer the candidate, the better the success.

I am sure you realize that this makes no sense in the real world. My friend got carried away for sure, right?

Takeaway

Always look for causation before you start analyzing the data. Remember, causation and correlation can coexist. However, correlation does not imply causation. It is easy to get carried away in the excitement of finding a breakthrough, but it is also essential to evaluate the scientific backing with more information.

So, how do you cross-check if the causation really exists? What are the approaches you take? Interested in sharing your data analysis skills for the benefit of our audience? Send us your blog post at info@letsexcel.in. We will surely accept your post if it resonates with our audience’s interest.

Need multivariate data analysis software? Apply here to obtain free access to our analytics solutions for research and training purposes!

PCA

Your Ultimate Guide for PCA with DataPandit

Principal component analysis (PCA) is, strictly speaking, an unsupervised method. However, the PCA workflow in DataPandit cannot be called a purely unsupervised data analysis technique, as the user interface is designed to make it semi-supervised. Therefore, let’s look at how to perform and analyze PCA in DataPandit with the help of the Iris dataset.

Arranging the data 

There are a few prerequisites for analyzing data in the magicPCA application:

  • First, the data should be in .csv format. 
  • The magicPCA application considers entries in the first row of the data set as column names by default.
  • The entries in the data set’s first column are considered row names by default.
  • Each row in the data set should have a unique name. I generally use serial numbers from 1 to n, where n equals the total number of samples in the data set. This simple technique helps me avoid the ‘duplicate row names’ error (see the preparation sketch after this list).
  • Each column in the data set should have a unique name.
  • As magicPCA is a semi-supervised approach, you need to have a label for each sample that defines its class.
  • There should be more rows than the number of columns in your data.
  • It is preferable not to have any special characters in the column names, as special characters can be interpreted as mathematical operators by the magicPCA algorithm.
  • The data should not contain variables with a constant value for all the samples.
  • The data should not contain too many zeros. 
  • The data must contain only one categorical variable.
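If you prepare your files in Python, the following minimal sketch (using pandas; the file and column names are hypothetical) shows one way to satisfy most of these prerequisites before uploading the CSV:

```python
import pandas as pd

df = pd.read_csv("iris_measurements.csv")   # hypothetical input file

# Give every row a unique name: serial numbers 1..n avoid the 'duplicate row names' error.
df.index = range(1, len(df) + 1)

# Remove special characters from column names so they are not read as mathematical operators.
df.columns = [c.replace("(", "").replace(")", "").replace("-", "_").strip() for c in df.columns]

# Basic sanity checks suggested by the list above.
assert df.shape[0] > df.shape[1], "There should be more rows than columns."
assert df.columns.is_unique, "Column names must be unique."

df.to_csv("iris_for_magicpca.csv")           # the first column of the CSV becomes the row names
```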

Importing the data set

The process of importing the data set is similar to the linear regression example. You can use the drag and drop option or the browse option based on your convenience. 

Steps in data analysis of PCA

Step 1: Understanding the data summary

After importing the data, it makes sense to look at the minimum, maximum, mean, median, and the first and third quartile values of the data to get a feel for the distribution pattern of each variable. This information can be seen by going to the ‘Data Summary’ tab beside the ‘Data’ tab in the main menu.
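Outside DataPandit, the same summary can be reproduced with pandas; a minimal sketch, assuming the hypothetical CSV prepared earlier:

```python
import pandas as pd

df = pd.read_csv("iris_for_magicpca.csv", index_col=0)  # hypothetical file from the previous step

# describe() reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.
print(df.describe())
```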

Step 2: Understanding the data structure 

You can view the data type for each variable by going to the ‘Data Structure’ tab beside the ‘Data Summary’ tab. Any empty cells in the data will be displayed in the form of NA values in the data structure and data summary. If NA values exist in your data, you may use data pre-treatment methods in the sidebar layout to get rid of them.

Step 3: Data Visualization with boxplot 

As soon as the data is imported, the boxplot for the data is automatically populated. A boxplot is another valuable tool for understanding the distribution pattern of the variables in your data set. You can refer to our earlier published article to learn how to use boxplots.

You can mean-center and scale the data set to normalize the distribution pattern of the variables.

The following picture shows the Iris data when it is mean-centered.

 The picture below shows the Iris data when it is scaled after mean centering.
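For reference, mean centering and scaling (autoscaling) boil down to a couple of lines of arithmetic; this is only a sketch of the idea, not DataPandit's internal code:

```python
import pandas as pd

df = pd.read_csv("iris_for_magicpca.csv", index_col=0)   # hypothetical file
numeric = df.select_dtypes("number")

centered = numeric - numeric.mean()            # mean centering: every column now has a mean of 0
autoscaled = centered / numeric.std(ddof=1)    # scaling after centering: every column has unit standard deviation
```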

Step 4: Understanding the multicollinearity

Multicollinearity of the variables is an essential prerequisite for building a good PCA model. To know more about multicollinearity and how to measure it, read our articles on Pearson’s correlation coefficient and on how to use the multicollinearity matrix.

Step 5: Divide the data in the training set and testing set

After importing the data, the training and testing sets are automatically selected based on the default settings in the application. You can change the proportion of data that goes into the training and testing sets by increasing or decreasing the value of ‘Training Set Probability’ in the sidebar layout, as shown in the picture below.

If the value of the training set probability is increased, a larger proportion of the data goes into the training set, whereas if the value is decreased, a smaller proportion of the data remains in the training set. For example, if the value is equal to 1, then 100% of the data goes into the training set, leaving the testing set empty.

As a general practice, it is recommended to use the training set to build the model and the testing set to evaluate the model. 
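The split itself is a simple random-sampling step; a minimal sketch of the idea (the value 0.75 is only an example, not necessarily DataPandit's default):

```python
import pandas as pd

df = pd.read_csv("iris_for_magicpca.csv", index_col=0)    # hypothetical file

train_probability = 0.75                                   # example value of 'Training Set Probability'
train = df.sample(frac=train_probability, random_state=42)
test = df.drop(train.index)                                # whatever is left becomes the testing set
```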

Step 6: Select the column with a categorical variable

This is the most crucial step for building the PCA model using DataPandit. First, you need to select the column that holds the categorical variable. As soon as you make this selection, the model summary, plots, and other calculations will automatically appear under the PCA section of the ‘Model Inputs’ tab in the Navigation Panel.

Step 7: Understanding PCA model Summary

The summary of the PCA can be found under the model summary tab.

The quickest way to grasp information from the model summary is to look at the cumulative explained variance, which is shown under ‘Cumexpvar,’ and the corresponding number of components shown as Comp 1, Comp 2, Comp 3, and so on. The cumulative explained variance describes the percentage of the variation in the data captured by the components up to and including a given component. In the case of the Iris data set, the first component describes 71.89% of the data (see Expvar), while the second component represents 24.33% of the data. Together, component one and component two describe 96.22% of the data. This means that we can replace the four variables that describe one sample in the Iris data set with these two components, which capture more than 95% of the information representing that sample. This is precisely why we call principal component analysis a dimensionality reduction technique.
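As a cross-check outside DataPandit, the explained-variance numbers can be reproduced with scikit-learn on the Iris data; a minimal sketch (the exact percentages depend on whether the data is scaled the same way as in the app):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                       # four numeric variables per flower
pca = PCA().fit(X)

explained = pca.explained_variance_ratio_ * 100
print(explained)                           # per-component percentage (Expvar)
print(explained.cumsum())                  # cumulative percentage (Cumexpvar)
```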

Step 8: Understanding PCA summary plot

The PCA summary plot shows the scores plot on the top left, the loadings plot on the top right, the distance plot on the bottom left, and the cumulative variance plot on the bottom right. The purpose of the PCA summary plot is to give, at a glance, an idea of the possibility of building a successful model. A close association between calibration and test samples in all the plots indicates the possibility of creating a good model. The scores plot shows the distribution of data points with respect to the first two components of the model.

In the following PCA summary plot, the scores plot shows two distinct data groups. We can use the loadings plot to understand the reason behind this grouping pattern. We can see in the loadings plot that ‘sepal width’ is located farther from the remaining variables. Also, it is the only variable located at the right end of the loadings plot. Therefore, we can say that the group of samples located on the right side of the scores plot is associated with the sepal width variable. These samples have a higher sepal width than the other samples. To reconfirm this relationship, we can navigate to the individual scores plot and loadings plot.

Step 9: Analyzing Scores Plot in PCA

The model summary plot only gives an overview of the model. It is essential to take a look at the individual scores plot to understand the grouping patterns in more detail. For example, in the model summary plot, we could only see two groups within the data. However, in the individual scores plot we can see three different groups within the data: setosa, versicolor, and virginica. The three groups can be identified with three different colors indicated by the legends at the top of the plot.

Step 10: Analyzing loadings plot in PCA

It is also possible to view the individual loadings plot. To view it, select the ‘Loadings Plot’ option under ‘Select the Type of Plot’ in the sidebar layout.

The loadings plot will appear as shown in the picture below. If we compare the individual scores plot and loadings plot, we can see that the setosa species samples are far away from the virginica and versicolor species. The location of the setosa species is close to the location of sepal width on the loadings plot, which means that the setosa species has a higher sepal width than the other two species.

Step 11: Analyzing distance plot in PCA

You can view the distance plot by selecting the ‘Distance Plot’ option under ‘Select the Type of Plot’ in the sidebar layout. The distance plot is used to identify outliers in the data set. If there is an outlier, it will be located far away from the remaining data points on this plot. However, the present data set does not have any outliers; hence, we could not spot any. Ideally, you should never label a sample as an outlier unless you know a scientific or practical reason that makes it one.

Step 12: Analyzing explained variance plot in PCA

You can view the explained variance plot by selecting the ‘Explained Variance Plot’ option under ‘Select the Type of Plot’ in the sidebar layout. It shows the contribution of each principal component in describing the data points. For example, in this case, the first principal component represents 71.9% of the data, whereas the second principal component describes 24.3% of the data. This plot is used to find out the number of principal components that can optimally describe the entire data. The optimal number of components should be lower than the total number of columns in your existing data set, because the very purpose of a PCA model is to reduce the dimensionality of the data. In the case of the Iris data, we can say that two principal components are good enough to describe more than 95% of the data, and adding more principal components does not add significant information (<5%). Pictorially, we can also arrive at this conclusion by identifying the elbow point on the plot. The elbow point, in this case, is at principal component number 2.

Step 13: Analyzing biplot in PCA 

The biplot for PCA shows scores and the loading information on the same plot. For example, in the following plot, we can see that loadings are shown in the form of lines that originate from a common point at the center of the plot. At the same time, scores are shown as scattered points. The direction of the loadings line indicates the root cause for the location of the samples on the plot. In this case, we can see that setosa samples are located in the same direction as that of the sepal width loading line, which means that setosa species have higher sepal width than the other two species. It reconfirms our earlier conclusion drawn based on individual scores and loadings plot. 

Step 14: Saving the PCA model

If you are satisfied with the grouping patterns in the PCA model, then you can go on to build an individual PCA model for each categorical level. Therefore, in the case of the Iris data, we need to build three PCA models, namely setosa, virginica, and versicolor. To do this, we need to navigate to the SIMCA option under the ‘Model Inputs’ option in the navigation panel.

After going to the SIMCA option, select the species for which you want to build the individual model using the drop-down menu under ‘Select One Group for SIMCA Model.’ As soon as you select one group, the SIMCA model summary appears under ‘SIMCA Summary’ in the main menu. Depending on the cumulative explained variance shown under the ‘SIMCA Summary’, select the number of components for SIMCA using the knob in the sidebar layout. Save individual model files for each categorical level using the ‘Save File’ button under the ‘Save Model’ option in the sidebar layout.

Step 16: Uploading PCA Models for SIMCA predictions

You can upload the saved individual model files using the ‘Upload Model’ feature in the sidebar layout.

To upload the model files, browse to the location of the files on your computer, select the ‘All files’ option in the browsing window, press ‘Ctrl’, and select all the model files for the individual models as shown in the picture below.

Step 17: Understanding the result for the train set

As soon as the individual model files are uploaded, the predictions for the train set and test set populate automatically. Go to the ‘Train SIMCA’ tab to view the predictions for the train set. You can see the prediction for each individual sample by looking at the table displayed on the screen under the ‘Train SIMCA’ tab. In the prediction table, 1 indicates the successful classification of the sample into the corresponding category represented by the column name. However, it may not be convenient to check the classification of each sample in the training data. Therefore, to avoid this manual work, you can scroll down to see the confusion matrix.

The confusion matrix for the train set of the Iris data is shown in the figure below. The rows of the confusion matrix represent the actual class of the sample, whereas the columns represent the predicted class of the sample. Therefore, the quickest way to analyze the confusion matrix is to look at its diagonal and non-diagonal elements. Every non-diagonal element in the confusion matrix represents a misclassification and contributes to the classification error. If there are more non-diagonal elements than diagonal elements in your confusion matrix, then the models cannot distinguish between the different classes in your data. For example, in the case of the Iris data, the following confusion matrix shows that four versicolor samples are misclassified as virginica, and one sample from each class could not be classified into any species. The model’s accuracy can be calculated by summing the correctly classified samples and dividing by the total number of samples in the training set. In this case, the accuracy equals the sum of the diagonal elements divided by the total number of samples in the train set. The misclassification error can be found by subtracting the accuracy from 1.
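The accuracy arithmetic described above is easy to reproduce; a minimal sketch with a hypothetical 3×3 confusion matrix (rows = actual class, columns = predicted class):

```python
import numpy as np

# Hypothetical confusion matrix for setosa, versicolor, and virginica.
confusion = np.array([
    [34,  0,  0],
    [ 0, 29,  4],
    [ 0,  0, 32],
])

accuracy = np.trace(confusion) / confusion.sum()   # sum of the diagonal divided by the total number of samples
misclassification_error = 1 - accuracy
print(accuracy, misclassification_error)
```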

The closer the accuracy value is to 1, the better the model’s predictability.

The confusion matrix can also be viewed pictorially by going to the ‘Train SIMCA Plot’ option in the main menu. The plot shows the three species in three different colors, represented by the legend at the top.

You can view Cooman’s plot by selecting the ‘Cooman’s Plot’ option, or the model distance plot by selecting the ‘Model Distance Plot’ option, under ‘Model Plot Inputs’ in the sidebar layout.

Cooman’s plot shows the squared orthogonal distance from the data points to the first two selected SIMCA (individual PCA) models. In the case of multiple result objects, the points are color-coded according to their respective classes.

The ‘Model Distance Plot’ is a generic tool for plotting the distance from the first class model to the other class models.

Step 18: Understanding the result for the test set

The process of analyzing the results for the test set is the same as that of the train set. The results for the test set can be found under the ‘Test SIMCA’ and the ‘Test SIMCA Plot.’

Step 19: Predict the unknown

If you are happy with the train set and test set results, you can go ahead and navigate to the ‘Predict’ option in the Navigation Panel. Here you need to upload the file with samples of unknown class, together with the individual models, using a process similar to step 16. The prediction plot will then populate automatically.

Conclusion 

Principal component analysis is a powerful tool for material characterization, root cause identification, and differentiating between the groups in the data. In addition, DataPandit’s magicPCA makes it possible to predict the unknown class of a sample with the help of a dataset with known classes.

Need multivariate data analysis software? Apply here for free access to our analytics solutions for research and training purposes!

Linear regression with examples

Linear Regression with Examples

Introduction to linear regression

Whenever you come across a few variables that seem to be dependent on each other, you might want to explore the linear regression relationship between the variables. A linear regression relationship can help you assess:

  • The strength of the relationship between the variables
  • The possibility of using predictive analytics to predict future outcomes

This article will discuss, with examples, how linear regression can help you.

Advantages of linear regression


Establishing the linear relationship can be incredibly advantageous if measuring the response variables is either time-consuming or too expensive. In such a scenario, linear regression can help you make soft savings by reducing the consumption of resources. 

Linear regression can also provide scientific evidence for establishing a relationship between cause and effect. Therefore, the method is helpful when submitting evidence to regulatory agencies to justify your process controls. In the life-science industry, linear regression can be used as a scientific rationale in the quality-by-design approach.

Types of linear regression

There are two major types of linear regression:

  1. Simple linear regression: Useful when there is one independent variable and one dependent variable
  2. Multiple linear regression: Useful when there are multiple independent variables and one dependent variable

Both types of linear regression mentioned above need to meet the assumptions of linear regression. You can find these assumptions in our previous article here.

In this article, we will see one example of simple linear regression and one example of multiple linear regression.

Simple linear regression

To understand how to model the relationship between one independent variable and one dependent variable, let’s take the simple example of the BMI dataset. We will explore whether there is any relationship between the height and weight of the individuals. Therefore, our null hypothesis is that ‘there is no relationship between the weight and height of the individuals’.


Step I


Let’s start by importing the data. To do this, drag and drop your data into the Data Input fields. You can also browse to upload data from your computer.

DataPandit divides your data into a train set and a test set using default settings, where about 59% of your data is randomly selected for the train set and the rest goes into the test set. You can change these settings in the sidebar layout. If your data is small, you may want to increase the value above 0.59 to include more samples in your train set.

Step II

The next step is to give the model inputs. Select the dependent variable as the response variable and the independent variable as the predictor. I wanted to use weight as the dependent variable and height as the independent variable; hence, I made the selections as shown in the figure below.


Step III

Next, refer to our articles on Pearson’s correlation matrix, box plots, and the model assumptions plots for pre-modeling data analysis. In this case, Pearson’s correlation matrix won’t display, as it is designed for more than two variables. However, if you still wish to see it, you can select both height and weight as independent variables and it will display. After you are done, just remove weight from the independent variables to proceed further.

Step IV

The ANOVA table displays automatically as soon as you select the variables for the model. Hence, after selecting the variables, you can simply check the ANOVA table by going to the ANOVA Table tab as shown below:

The p-value for Height in the above ANOVA table is greater than 0.05, which indicates that there are no significant differences in the weights of individuals with different heights. Therefore, we fail to reject the null hypothesis. The R-squared and adjusted R-squared values are also close to zero, indicating that the model may not have high prediction accuracy. The small F-statistic also supports the failure to reject the null hypothesis.
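Outside DataPandit, the same significance check can be sketched with statsmodels; the file name and the Height and Weight column names are assumptions based on the BMI example above:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bmi.csv")                        # hypothetical file with Height and Weight columns

model = smf.ols("Weight ~ Height", data=df).fit()  # simple linear regression with one predictor
print(model.summary())                             # p-value, R-squared, F-statistic, and coefficients
```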

Step V

If you have evidence of a significant relationship between the two variables, you can proceed with the train set and test set predictions. The picture below shows the train set predictions for the present case. You can consider this a model validation step, in which you evaluate the accuracy of the predictions. You can select confidence intervals or prediction intervals in the sidebar layout to understand the range in which future predictions may lie. If you are happy with the train and test predictions, you can save your model using the ‘Save File’ option in the sidebar layout.

Step VI

This is the final step, in which you use the model for future predictions. In this step, you need to upload the saved file using the ‘Upload Model’ option in the sidebar layout. Then you need to add the predictor data for which you want to predict the response. In this case, you need to upload a CSV file with the data for which the model should predict weights. While uploading this data, please ensure that the weight column is not present in it.

Select the response name as Weight and the predictions will populate along with upper and lower limits under the ‘Prediction Results’ tab.

Multiple linear regression

The steps for multiple regression are the same as those for simple linear regression, except that you can choose multiple independent variables. Let’s take the example of detecting the age of a carpet based on chemical levels. The data contains the age of 23 old carpet and wool samples, along with the corresponding levels of chemicals such as cysteic acid, cystine, methionine, and tyrosine.

Step I

Same as simple linear regression.

Step II

In this case, we wish to predict the age of the carpet; hence, select age as the response variable and all other factors as independent variables.

Step III

Same as simple linear regression.

Step IV

The cystine and tyrosine levels do not have significant p-values; hence, they can be eliminated from the selected independent variables to improve the model.

The ANOVA table updates automatically as soon as you make changes in the ‘Model Inputs’. Based on the p-value, F-statistic, multiple R-squared, and adjusted R-squared, the model shows good promise for making future predictions.
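A corresponding sketch for the multiple-regression case, again with statsmodels; the carpet file and column names are assumptions based on the description above:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("carpet_age.csv")   # hypothetical file: Age, CysteicAcid, Cystine, Methionine, Tyrosine

full = smf.ols("Age ~ CysteicAcid + Cystine + Methionine + Tyrosine", data=df).fit()
print(full.pvalues)                  # inspect which predictors are not significant

# Drop the non-significant predictors (here cystine and tyrosine) and refit the model.
reduced = smf.ols("Age ~ CysteicAcid + Methionine", data=df).fit()
print(reduced.summary())
```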


Step V

Same as simple linear regression.

Step VI

Same as simple linear regression.

Conclusion

Building linear regression models with DataPandit is a breeze. All you need is well-organized data with a strong scientific backing. Because correlation does not imply causation!

Need multivariate data analysis software? Apply here to obtain free access to our analytics solutions for research and training purposes!

Linear Regression Assumptions

Top 7 Linear Regression Assumptions You Must Know

The theory of linear regression is based on certain statistical assumptions. It is crucial to check these regression assumptions before modeling the data using the linear regression approach. In this blog post, we describe the top 7 assumptions you should check in DataPandit before analyzing your data using linear regression. Let’s take a look at these assumptions one by one.

#1 There is a Linear Model

The constant terms in a linear model are called parameters, whereas the independent variable terms are called predictors. A model is called linear when it is linear in its parameters. However, the predictors do not need to be linear for the model to be linear.

To understand the concept, let’s see how a general linear model can be written. The answer is as follows:

Response = constant + parameter * predictor + … + parameter * predictor
Or

Y = b0 + b1X1 + b2X2 + … + bkXk

In the above example, it is possible to obtain various curves by transforming the predictor variables (Xs) using power transformation, logarithmic transformation, square root transformation, inverse transformation, etc. However, the parameters must always remain linear. For example, the following equation represents a linear model because the parameters (b0, b1, and b2) are linear, even though X1 is raised to the power of 2.

Y = b0 + b1X1 + b2X1²

In DataPandit, the algorithm automatically picks up the linear model when you try to build a linear or a multiple linear regression relationship. Hence, you need not check this assumption separately.
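To make the 'linear in parameters' idea concrete, here is a minimal sketch that fits exactly the quadratic-in-X1 model above with statsmodels; the data is synthetic and only for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x1 + 0.3 * x1**2 + rng.normal(0, 1, 50)   # synthetic data following Y = b0 + b1*X1 + b2*X1^2

df = pd.DataFrame({"y": y, "x1": x1})

# The predictor is transformed (squared), but the model is still linear in the parameters b0, b1, and b2.
model = smf.ols("y ~ x1 + I(x1**2)", data=df).fit()
print(model.params)
```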

#2 There is no multicollinearity

If the predictor variables are correlated among themselves, then the data is said to have a multicollinearity problem. In other words, if the independent variable columns in your data set are correlated with each other, then multicollinearity exists within your data. In DataPandit, we use Pearson’s correlation coefficient to measure the multicollinearity within the data. The assumption of no multicollinearity in the data can be easily visualized with the help of the collinearity matrix.

Figure 1: High level of multicollinearity in the data

Figure 2: No multicollinearity in the data

#3 Homoscedasticity of Residuals or Equal Variances

The linear regression model assumes that there will always be some random error in every measurement. In other words, no two measurements are going to be exactly equal to each other. The constant parameter (b0) in the linear regression model represents this random error. However, the linear regression model does not account for systematic errors that may occur during a process.

Systematic error is an error with a non-zero mean. In other words, the effect of a systematic error is not reduced when the observations are averaged. For example, loosening of the upper and lower punches during the tablet compression process results in lower tablet hardness over a period of time. The absence of such an error can be determined by looking at the residuals versus fitted values plot in DataPandit. In the presence of systematic error, the residuals vs. fitted values plot will look like Figure 3.

If the residuals are equally distributed on both sides of the trend line in the residuals versus fitted values plot, as in Figure 4, then systematic error is absent. The idea is that equally distributed residuals, or equally distributed variances, will average out to zero. Therefore, one can safely assume that the measurements only contain random error that the linear model can account for, and that systematic error is absent.

Figure 3: Residuals Vs Fitted Exhibiting Heteroscedasticity

Figure 4: Residuals Vs Fitted Exhibiting Homoscedasticity

#4 Normality of Residuals

It is important to confirm the normality of the residuals to reaffirm the absence of systematic errors, as stated above. It is assumed that if the residuals are normally distributed, they are unlikely to have an external influence (systematic error) that would cause them to increase or decrease consistently over a period of time. In DataPandit, you can check the assumption of normality of residuals by looking at the normal Q-Q plot.

Figure 5 and Figure 6 demonstrate the cases when the assumption of normality is not met and when it is met, respectively.

Figure: 5 Residuals do not follow Normal Distribution

Figure 6: Residuals follow Normal Distribution
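For readers who want to reproduce these two diagnostic plots outside DataPandit, here is a minimal sketch using statsmodels and matplotlib; the data is synthetic, so substitute your own fitted model:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic example data; replace this with your own fitted model.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 80)
df = pd.DataFrame({"x": x, "y": 3 + 2 * x + rng.normal(0, 1, 80)})
model = smf.ols("y ~ x", data=df).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(model.fittedvalues, model.resid)         # residuals vs. fitted: look for an even band around zero
ax1.axhline(0, linestyle="--")
ax1.set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs. Fitted")

sm.qqplot(model.resid, line="45", fit=True, ax=ax2)  # normal Q-Q plot of the residuals
ax2.set_title("Normal Q-Q")
plt.tight_layout()
plt.show()
```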

#5 Number of observations > number of predictors

For a minimum viable model,

Number of observations = Number of predictors + 1

However, the greater the number of observations, the better the model performance. Therefore, to build a linear regression model, you must have more observations than independent variables (predictors) in the data set.

For example, if you are interested in predicting the density based on mass and volume, then you must have data from at least three observations because in this case, you have two predictors namely, mass and volume. 

#6 Each observation is unique

It is also important to ensure that each observation is independent of the other observations. That means each observation in the data set should be recorded or measured separately, on a unique occurrence of the event that caused the observation.

For example, if you want to include two observations to measure the density of a liquid with 2 kg mass and 2 L volume, then you must perform the experiment twice to measure the density for the two independent observations. Such observations are called replicates of each other. It would be wrong to use the same measurement for both observations, as you would disregard the random error.

#7 Predictors are distributed Normally

This assumption ensures that you have evenly distributed observations across the range of each predictor. For example, if you want to model the density of a liquid as a function of temperature, then it makes sense to measure the density at different temperature levels within your predefined temperature range. However, if you make more measurements at lower temperatures than at higher temperatures, then your model may perform poorly in predicting density at high temperatures. To avoid this problem, it makes sense to take a look at the boxplot to check the normality of the predictors. Read this article to know how boxplots can be used to evaluate the normality of variables. For example, in Figure 7, all predictors except ‘b-temp’ are normally distributed.

Figure 7: Checking Normality assumption for the predictors

Closing

So, this was all about assumptions for linear regression. I hope that this information will help you to better prepare yourself for your next linear regression model. 

Need multivariate data analysis software? Apply here to obtain free access to our analytics solutions for research and training purposes!

Data Visualization

Data Visualization using Box-Plot

Data visualization is the first step in data analysis. DataPandit allows you to visualize boxplots as soon as you segregate the categorical data from the numerical data. However, the box plot does not appear until you uncheck the ‘Is this spectroscopic data?’ option in the sidebar layout, as shown in Figure 1.

Figure 1: Boxplot in DataPandit

The box plot is also known as a ‘box-and-whisker plot’. It provides a five-point summary: the minimum score, first (lower) quartile, median, third (upper) quartile, and maximum score.

When Should You Avoid Boxplot?

The box plot itself provides a five-point summary. Hence, you should never use a box plot to visualize data with fewer than five observations. In fact, I would recommend using a boxplot only if you have more than ten observations.

Why Visualize?

If you are a DataPandit user, you might ask, ‘Why should I visualize my data in the first place? Wouldn’t it be enough if I just analyze my model by segregating the response variable or categorical variable in the data?’ The answer is ‘No’, because box plots often help you determine the distribution of your data.

Why is Distribution Important?

If your data is not normally distributed, you will most likely induce bias in your model. Additionally, your data may also have some outliers that you might need to remove before proceeding to advanced data analytics approaches. Also, depending on the data distribution, you might want to apply some data pre-treatments to build better models.

Now the question is, how do you visualize these abnormalities in the data? Don’t worry, we will help you here. The following are the key aspects that you must evaluate during data visualization.

The spread of the data

You can determine the spread of the data by looking at the lowest and highest measurements for a particular variable. In statistics, the spread of the data is also known as the range of the data. For example, in the following box plot, the spread of the variable ‘petal.length’ is from 1 to 6.9 units.

Figure 2: Iris raw data boxplot 

Mean and Median

The mean and median of normally distributed data coincide with each other. For example, we can see from the boxplot that the median petal.length is 4.35 units. However, if you take a look at the data summary for the raw data, the mean for petal length is 3.75 units, as shown in Figure 3. In other words, the mean and median do not coincide, which means the data is not normally distributed.

Figure 3: Data summary for Iris raw data
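A minimal sketch of the same check in Python, using the built-in Iris data; the column names follow scikit-learn's naming rather than the 'petal.length' style shown in the figures:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data                                   # sepal/petal length and width columns

df.boxplot()                                     # box-and-whisker plot for each variable
plt.show()

# If the mean and median differ noticeably, the variable is unlikely to be normally distributed.
print(df["petal length (cm)"].median(), df["petal length (cm)"].mean())
```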

Left Skewed or Right Skewed?

The values of the mean and median also help you find out whether your data is skewed toward the right or the left. If the mean is greater than the median, the data is skewed to the right, whereas if the mean is smaller than the median, the data is skewed to the left.

Alternatively, you can also observe the interquartile distances visually to see where most of your data lie. If the quartiles are uniformly divided, you most likely have normal data.

Understanding the skewness can help you know whether the model will be biased on the lower or the higher side. Depending on the skewness, you can include more samples to achieve a normal distribution.

Is it an outlier?

Data visualization can help identify outliers. You can identify outliers by looking at values located far away from the rest of the plot. For example, the highlighted value (X1, max=100) in Figure 4 could be an outlier. However, in my opinion, you should never label an observation as an outlier unless you have a strong scientific or practical reason to do so.

Figure 4: Spotting outlier in boxplot

Do I need any data pre-treatments?

If the data spread is too different for different variables, or if you see outliers with no scientific or practical reason, then you might need some data pre-treatments. For example, you can mean-center and scale the data, as shown in Figure 5 and Figure 6, before proceeding to the model analysis. You can see these dynamic changes in the boxplot only in the magicPCA application.

Figure 5: Iris mean-centered data boxplot

Figure 6: Iris mean-centered and scaled data boxplot

Conclusion

Data visualization is crucial to building robust and unbiased models. Boxplots are one of the easiest and most informative ways of visualizing data in DataPandit. They are a very useful tool for spotting outliers, and they can also help you finalize the data pre-treatments needed to build robust models.

Need multivariate data analysis software? Apply here to obtain free access to our analytics solutions for research and training purposes!

Correlation Matrix

How to use the Correlation Matrix?

The correlation matrix in DataPandit shows the relationship of each variable in the dataset with every other variable in the dataset. It is, basically, a heatmap of Pearson’s correlation values between the corresponding variables.

For example, in the correlation matrix above, the first element on the X-axis is high_blood_pressure, while that on the Y-axis is high_blood_pressure too. Therefore, it shows a perfect correlation with itself, with a Pearson’s correlation coefficient value of 1. If we refer to the legend at the top right side of the correlation matrix, we can see that red shows the highest value (1) in the heatmap, while blue shows the lowest value. Theoretically, the lowest possible value of Pearson’s correlation is -1; however, the lowest value in the heatmap may vary from data set to data set. Every heatmap will, however, show a highest value of 1 owing to the presence of the diagonal elements.

The diagonal elements of the correlation matrix are the relationship of each variable with itself and hence show a perfect relationship (Pearson’s correlation coefficient of 1).

However, it doesn’t make much sense to see the relationship of any variable with itself. Therefore, while analyzing the correlation matrix, treat these diagonal elements as points of reference.

You can hover over the matrix elements to see the X and Y variables along with the numerical value of Pearson’s correlation coefficient at those exact coordinates.

There are options to zoom in, zoom out, toggle spike lines, autoscale, and save the plot at the top right corner of the plot. Toggling spike lines draws perpendicular lines to the X and Y axes and shows the exact coordinates along with the value of Pearson’s correlation.

In the above correlation matrix, the toggled spike lines show that diabetes and serum_creatinine have a Pearson’s correlation coefficient of -0.05, indicating virtually no relationship between the two variables.
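If you want to reproduce such a heatmap outside DataPandit, here is a minimal sketch with pandas and matplotlib; the file and clinical column names are assumptions based on the example above:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("heart_failure.csv")        # hypothetical file with high_blood_pressure, diabetes, serum_creatinine, ...

corr = df.corr(numeric_only=True)            # Pearson's correlation matrix

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="RdYlBu_r")   # red = high, blue = low, as in DataPandit's legend
ax.set_xticks(range(len(corr)), corr.columns, rotation=90)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
plt.tight_layout()
plt.show()
```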

Read our blog post here to know more about Pearson’s correlation. Apply here if you are interested in obtaining free access to our analytics solutions for research and training purposes!

Pearson's correlation Matrix

What is Pearson’s Correlation Coefficient?

Introduction

Pearson’s correlation is a statistical measure of the linear relationship between two variables. Mathematically, it is the ratio of the covariance of the two variables to the product of their standard deviations. Therefore, the formula for Pearson’s correlation can be written as follows:

r = cov(X, Y) / (σX × σY)

Mathematical Expression for Pearson’s Correlation

The value of Pearson’s correlation always lies between -1 and +1. Pearson’s correlation can only measure linear relationships; it does not apply to higher-order relationships that are non-linear in nature.
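For reference, here is a minimal sketch of computing the coefficient in Python, both directly from the formula above and with scipy's built-in function; the numbers are toy data:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # toy data, roughly linear in x

# Directly from the formula: covariance divided by the product of the standard deviations.
r_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

r_scipy, p_value = pearsonr(x, y)                 # the same coefficient, plus a p-value
print(r_manual, r_scipy)
```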

Assumptions for Pearson’s correlation

Following are the assumptions for proceeding to data analysis using Pearson’s correlation:

  1. Independent of the case: Pearson’s correlation should be measured on cases that are independent of each other. For example, it does not make sense to measure Pearson’s correlation for the same variable measured in two different units, or for a variable with itself. If Pearson’s correlation is measured for a variable that is not independent of the other variable, there is a high chance that the correlation will be a perfect correlation of 1.
  2. Linear relationship: The relationship between two variables can be assessed for its linearity by plotting the values of variables on a scatter diagram and checking if the plot yields a relatively straight line. The picture below demonstrates the difference between the trend lines of linear relationships and nonlinear relationships.
Linear relationship Vs. Non-linear relationship

  3. Homoscedasticity: Two variables show homoscedasticity if the variances of the two variables are equally distributed. It can be evaluated by looking at the scatter plot of the residuals. The scatterplot of the residuals should be roughly rectangular-shaped, as shown in the picture below.
Homoscedasticity Vs. Heteroscedasticity

Properties of Pearson’s Correlation

  • Limit: Coefficient values can range from +1 to -1, where +1 indicates a perfect positive relationship, -1 indicates a perfect negative relationship, and 0 indicates that no relationship exists.
  • Pure number: Pearson’s correlation is a dimensionless number because of its formula. Hence, its value remains unchanged even when the unit of measurement changes. For example, if one variable’s unit of measurement is in grams and the second variable’s is in quintals, Pearson’s correlation coefficient value does not change.
  • Symmetric: Pearson’s correlation coefficient value remains unchanged whether the relationship is measured between X and Y or between Y and X; hence, it is called a symmetric measure of a relationship.

Positive correlation

Pearson’s correlation coefficient indicates a positive relationship between two variables if its value lies between 0 and 1. This means that when the value of one of the two variables increases, the value of the other variable increases too.

An example of a positive correlation is the relationship between the height and weight of individuals. Naturally, an increase in height is associated with an increase in the length of the bones, and the larger bones contribute to an increased weight. Therefore, if Pearson’s correlation is calculated for height and weight data, it would indicate a positive correlation.

Negative correlation

Pearson’s correlation coefficient indicates a negative relationship between two variables if its value lies between 0 and -1. This means that when the value of one of the two variables increases, the value of the other variable decreases.

An example of a negative correlation between two variables is the relationship between height above sea level and temperature. The temperature decreases as the height above sea level increases; therefore, there exists a negative relationship between these two variables.

Degree of correlation:

The strength of the relationship between two variables is measured by the value of the correlation coefficient. The statisticians use the following degrees of correlations to indicate the relationship:

  1. Perfect relationship: If the value is near ±1, then there is a perfect correlation between the two variables: as one variable increases, the other variable tends to also increase (if positive) or decrease (if negative).
  2. High degree relationship: If the correlation coefficient value lies between ±0.50 and ±1, then there is a strong correlation between the two variables.
  3. Moderate degree relationship: If the value of the correlation coefficient lies between ±0.30 and ±0.49, then there is a medium correlation between the two variables.
  4. Low degree relationship: When the value of the correlation coefficient lies below ±0.29, then there is a weak relationship between the two variables.
  5. No relationship: There is no relationship between two variables if the value of the correlation is 0.

Pearson’s Correlation in Multivariate Data Analysis

In addition to finding relationships between two variables, Pearson’s correlation is also used to understand the multicollinearity in the data for multivariate data analysis. This is because the suitability of a data analysis method depends on the multicollinearity within the data set. If there is high multicollinearity within the data, then multivariate data analysis techniques such as partial least squares regression, principal component analysis, and principal component regression are most suitable for modeling the data. Whereas, if the data doesn’t show a multicollinearity problem, it can be analyzed using multiple linear regression and linear discriminant analysis. That is why you should take a good look at your Pearson’s correlation matrix while choosing data analytics models on the DataPandit platform. Read this article to know more about how to use the correlation matrix in DataPandit.

Conclusion

Pearson’s correlation coefficient is an important measure of the strength of the relationship between two variables. Additionally, it can also be used to assess the multicollinearity within the data.

Did you know that Let’s Excel Analytics Solutions provides free access to its analytics SaaS applications for research and training purposes? All you have to do is fill up this form if you are interested.

Finding the Data Analytics Method that Works for You

Last week I met John, a process expert who works at a renowned cosmetic manufacturing company. John was pretty frustrated over a data scientist who could not give him a plot using the data analytics technique of his choice. He was interested in showing grouping patterns in his data using PCA plots.

When I got to know John was dealing with a data problem, I got curious. So I asked him, can I see the data? And he gladly shared the data with me, looking for a better outcome.

But it was in vain. Even I couldn’t create a PCA plot out of John’s data. The reason was that John was trying to make a PCA plot using a dataset that could be easily visualized without a dimensionality reduction method. In other words, it was data that could be easily visualized in a two-dimensional space without using any machine learning algorithm.

But then why was John after the PCA? After we talked for a few more minutes, John said that he saw this method in a research paper and believed it would solve his problem. This explanation helped me identify the root cause. At the same time, it prompted me to write this article. I am writing it for all the Johns who need a helping hand in selecting the most appropriate analytics approach to solve their problems.

Data Analytics Method for 2-Dimensional Data

Try the simplest approach first. If it can be done in Excel, then do it in Excel! Taking a lesson from John’s experience, always try the simplest step first. Ask yourself, ‘Can I plot this in Excel?’ If the answer is yes, just do it right away. You can either simply plot the data for exploratory analysis or build a simple linear regression model for quantitative modeling, depending on the use case.

Data Analytics Method for Slightly Dimensional Data

These are simple but tricky cases where the problem you are trying to solve may not need dimensionality reduction, but plotting the data wouldn’t be as simple as plotting an XY chart in Excel. In such cases, you can get help from data analysts, who can suggest statistical software like Minitab and JMP to select the appropriate data analytics technique. In case you can’t access them, you can hire your data analyst friend to write code for you to visualize the data. An example of such an exploratory data analytics method is shown below:

Pharma-Life Science Case Studies
This graphic helps in visualizing the particle size distribution of a material as it is processed in a similar manner for three different batches. It was simple yet slightly tricky data with 4 columns (Median Diameter-Batch 1, Median Diameter-Batch 2, Median Diameter-Batch 3, and TimePoint).

Data Analytics Method for Highly Dimensional Data with Grouping Patterns

Suppose your data is highly dimensional, with too many rows and columns to be plotted on an XY plot, even with the help of your data analyst friend. Then you need a data analytics method for dimensionality reduction. For example, methods like PCA or LDA can help you manage such data. The grouping pattern in the data can be visualized if you can assign a group to each observation in your data set. These methods not only give you the option of visualizing your data but also a chance to determine the group of an unknown sample.

PCA plot
It is a PCA plot that shows two groups in the data. The group labeled ‘Yes’ is miscible with the drug and the group labeled ‘No’ is immiscible with the drug. In the future, this model can predict if an unknown material is miscible with the drug or not.

For example, suppose you used data from four mango species by assigning them to four different groups corresponding to their species. In that case, you can train a PCA or LDA model to predict the species of a mango sample whose species is not yet determined.

Similar to the Mango problem, here the LDA model predicts the species of an Iris flower.

However, it should be noted that LDA models do better when the variables are not highly correlated with each other, whereas the PCA model works better with multicollinear data.

Multicollinearity, or correlation between variables, occurs when one variable increases or decreases along with other variables. For example, if the height and weight of individuals are collected as variables that describe an individual, then it is likely that an increase in height will be accompanied by an increase in weight. In such a case, we can say that the data has a multicollinearity problem.

The multicollinearity of the variables can be judged on the basis of this heatmap. The stronger the positive relationship between variables, the closer the color is to red; the stronger the negative relationship, the closer the color is to blue. If the color is closer to yellow, then there is no collinearity issue.

Data Analytics Method for Highly Dimensional Data with Numerical Response

When highly dimensional data has a response represented in the form of a number instead of a group, quantitative data analytics techniques such as PCR, PLS, and MLR come to your rescue. Of these, PCR and PLS work best on highly correlated data, whereas MLR works best for non-correlated data that follows the normality assumptions. That is the reason PCR and PLS (and even PCA) techniques work well with sensor data from spectroscopes.

Quantitative Analytics Techniques
PCR, PLS, and MLR methods can predict the quantitative value of the response. The model performance is judged based on the closeness of the predicted values to the reference values for the known samples. If the predicted and reference values align well, as shown in the above picture, then the model can be used for future predictions of unknown samples.

If you are using DataPandit’s smartMLR application, then you can even build a linear regression model using 2-dimensional data, as it can handle small as well as big data (widthwise).

All these quantitative data analytics methods help you predict future outcomes in numerical format. For example, suppose you have data on 10 different metals alloyed by mixing in varying proportions, along with the resultant tensile strength of each alloy. Then you can build a model to predict the tensile strength of a future alloy made by changing the proportions of the component metals.
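As a sketch of that alloy example, here is how a PLS model could be fit with scikit-learn; the data is synthetic and the number of components is arbitrary:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
proportions = rng.dirichlet(np.ones(10), size=40)      # 40 alloys, each a mix of 10 metals (rows sum to 1)
true_effect = rng.uniform(100, 500, size=10)           # synthetic contribution of each metal
tensile_strength = proportions @ true_effect + rng.normal(0, 5, size=40)

pls = PLSRegression(n_components=3).fit(proportions, tensile_strength)

new_alloy = np.full((1, 10), 0.1)                      # a hypothetical equal-proportion alloy
print(pls.predict(new_alloy))                          # predicted tensile strength
```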

To Summarize

More data analytics techniques could be mentioned here, but I am covering the ones available to DataPandit users. The key takeaway is to start searching for a solution only when you actually face a data analytics problem. Don’t be like John, who picked the solution first and then tried to fit his problem into it. My two cents: let the data analytics method work for you rather than you working for the data analytics method! Don’t stop here; share this with all the Johns who would like to know it!

What is Data Analytics as a Service?

Introduction

Data analytics is very diverse in the solutions it offers. It covers a range of activities that add value to businesses, and it has secured a foothold in every industry that ever existed, eventually carving a niche for itself known as Data Analytics as a Service (DAaaS).

DAaaS is an operating model where a service provider offers data analytics services that add value to a client’s business. Companies can use DAaaS platforms to analyze patterns within their data using a ready-to-use interface. Alternatively, companies can also outsource the whole data analytics task to DAaaS providers.

How does DAaaS Help Organizations?

Have you ever wondered how CEOs make big decisions, the potential game-changers that make large companies trade high on the NYSE, NASDAQ, and other exchanges? A surprising statistic shows that organizations rely on intuition-based decision-making. High-stakes business decisions are made solely based on gut feelings and speculative explanations. However, there is an element of uncertainty associated with such decisions unless that uncertainty is assessed. Data analytics offers solutions for how data can be used to mitigate the associated risks and enable well-grounded decision-making.

Organizations constantly collect data on competitors, customers, and other factors that contribute to a business’s competitive advantage. This data helps them in strategic planning and decision-making. But the million-dollar question is whether organizations should build data analytics capabilities or outsource to data scientists with deep technical expertise. The answer to this question lies in the digital maturity of the organization. Most organizations prefer focusing on their core business rather than donning multiple hats at the same time. More and more organizations are turning to outsourcing their data science work to make the most of their data. DAaaS furnishes the most relevant information extracted from data to help organizations make the best possible data-driven decisions.

Why Organizations Should Outsource Data Analytics

For many reasons, organizations, particularly start-ups, are turning to outsourced Data Analytics. Outsourcing has long been undertaken as a cost-cutting measure and is an integral part of advanced economies. Some of the main reasons why companies should opt for outsourcing Data Analytics include: 

  • Organizations can focus on core business.
  • Outsourcing offers flexibility as the service can be availed only when it is required. 
  • Organizations don’t have to maintain a large infrastructure for data management.
  • Organizations can benefit from high-end analytics services.
  • Outsourcing has lower operational costs.
  • It improves risk management.

What Can DAaaS Do for You?

 Data Import

Data import is the first step towards building actionable insights. It helps organizations import data from their systems into the DAaaS platform. Data is an asset for organizations as it influences their strategic decision-making. Managing data is vitally important to ensure data is accurate, readily available, and usable by the organization. 

Translate Data into Actionable Insights

Data is useful only when it is acted upon to derive useful insights that add value. Connecting and joining dots between data is important to put the facts and figures together. Data is nothing if the dots between them can’t be connected. The outcome of connecting and joining helps us answer one of the following bottom-line questions. 

  1. What happened? Descriptive Analysis
  2. Why did it happen? Diagnostic Analysis
  3. What is likely to happen? Predictive Analysis
  4. What should be done? Prescriptive Analysis

Testing of ‘Trained Models’

Testing the accuracy of a model is the primary step in the implementation of the model. To test the accuracy of the model, the data is divided into three subsets: training data, testing data, and validation data. A model is built on the training dataset, which comprises the larger proportion of the data. The trained model is subsequently run against the test data to evaluate how it will predict future outcomes. Validation data is used to check the accuracy and efficiency of the model; the validation dataset is usually one that was not used in the development of the model.

Prediction and forecasting using ‘Trained Models’

Future events can be predicted using analytical models, a practice that has come to be known as predictive analytics. The analytical models are fit (also known as trained) using historical data. Such models constantly add data and eventually improve the accuracy of their predictions. Predictive analytics has been using advanced techniques like machine learning and artificial intelligence to improve the reliability and automation of predictions.

Deploy Proven Analytical ‘Models’

Training a model is not quite as difficult as deploying one. Deploying a model is the process of utilizing the trained model for the purpose it was developed for. It involves how the end-user interacts with the predictions of the model. The end-user can interact with the model through web services, mobile applications, or software. This is the phase that reaps the benefits of predictive modeling, adding value to the business needs.

Conclusion

Data Analytics as a Service (DAaaS) companies enable access to high-tech resources without organizations actually owning them. Organizations can reach out to DAaaS providers for their services only when they are required, eventually cutting the huge costs of maintaining data analytics infrastructure and hard-to-find data scientists. This has ushered us into a new world of the gig economy.

Let’s Excel Analytics Solutions LLP is a DAaaS company that offers a solution to all your Data Analytics problems. 

Curious to know more?

Internet Of Things-Few Insightful Facts

Introduction

The internet has revolutionized our modern society. It has simplified everything that we do and has brought all the good things of the world to our fingertips. There has been a wave of internet transformation lately: the traditional internet has evolved into the Internet of Things (IoT) by converging with diversified technologies. This evolution has broadened its applications beyond general consumer usage and has driven dramatic changes on industrial platforms. This blog tries to explain the basic idea behind IoT and its applicability in diverse fields.

What is the Internet of Things?

IoT is defined as the network of objects (IoT devices) embedded with computing devices that enable them to exchange data over the internet. These objects range from general consumer items to industrial applications. The IoT for industrial applications is also known as the Industrial Internet of Things (IIoT).  

How does the IoT work?

An IoT device comprises three main components: a sensor, a microprocessor, and communication hardware. The sensor constantly collects data from the environment. The microprocessor analyzes the collected data using machine learning algorithms. The communication hardware is used to communicate with other IoT devices. Most IoT devices are controlled remotely through an app or software.

Applications of IoT

  • Home improvement devices

IoT has realized the concept of smart homes. Most home appliances can be programmed remotely using IoT features, which has enhanced the quality of human life significantly. These appliances include air conditioning and lighting systems, alarm and home security systems, refrigerators, robotic vacuum cleaners, TVs, etc., all of which can be remotely controlled by an app installed on a smartphone.

  • IoT in industrial manufacturing

The implementation of IoT has ushered the manufacturing industry into a new era of smart factories. It has numerous applications in manufacturing, right from supply chain management through core manufacturing operations to distribution of the finished product. IoT-enabled manufacturing employs advanced sensors that collect data across all the critical operations of the production flow. This data is fed into cloud computing systems to derive valuable insights that eliminate waste and unnecessary rework and encourage continuous process improvement. It also alerts operators to any potential breakdowns and triggers preventive maintenance to avert downtime.

  • IoT in healthcare

Many wearable devices are available that monitor vital signs like blood pressure and heart rate and keep track of calories. These devices are used by athletes to track the intensity of their workout sessions. These bands can also track the sleep patterns of individuals. Some of these devices have automatic fall detection systems that can predict the likelihood of fainting, particularly in the case of elderly people. In a potential fall situation, these devices can send SOS signals to family members or ambulance services.

Physicians have also been using IoT smart devices to track the health status of patients. These devices can alert physicians to any need for immediate medical attention. In addition, physicians can also track patients’ adherence to treatment regimens and monitor the prognosis of the treatment.

  • Smart cities

Smart cities employ advanced technologies to build highly efficient and sustainable infrastructure. For example, smart lighting can drastically reduce energy consumption by switching on and off as people walk past. Air quality tools continuously monitor air pollution data in real time and forecast emissions. Sensors installed on streets can give real-time updates for traffic management.

  • IoT in the automotive industry

Nowadays, autonomous cars are fitted with IoT sensors to eliminate human errors of judgment during driving. This can prevent car accidents and makes driving safer and more comfortable.

Advantages of IoT

  1. IoT automates processes and improves the quality of life.
  2. It enables access to information from anywhere in the world at any time.
  3. It enables communication between devices without any human intervention.
  4. It saves capital, resources, and time.
  5. It enhances efficiency and productivity.

Disadvantages of IoT

  • As IoT devices are connected over a network, it predisposes them to security attacks.
  • IoT devices continually share a substantial amount of data, which puts the personal information of the users at risk.
  • IoT systems are very complex and are vulnerable to failures.

Future of IoT

According to IoT Analytics, there were over 4.7 billion devices connected to the IoT in 2016. This figure was expected to grow to 11.6 billion by the end of 2021, and these numbers are anticipated to increase to 21 billion by 2025. The total market value of IoT was $389 billion in 2020, and it is forecast to rise to $1 trillion by 2030.

Conclusion

The Internet of Things has transformed and simplified everything we do, right from our household activities to commercial manufacturing operations. It has automated processes without human intervention. Owing to the vast applicability of IoT, almost all the devices we use are turning smart today.

Curious to know more?