Demystifying Data Science Terms

Data Science: Demystifying the Terminologies


Data Science related terminologies are buzzing around the internet. It marks the onset of the Industry 4.0 revolution. Data Science is a discipline that studies big data, uses modern tools and techniques for data mining and data analysis to find its applications across a wide variety of domains. For example, Google AI retinal scan collected retinal images from thousands of patients across South India. Finally, it analyzed the data to get information about the patients’ disposition to cardiovascular diseases in the next five years.

The statisticians, chemometricians and mathematicians have been breathing and living the data science concepts for years and may not be calling these terms with exactly the same buzz words. Perhaps the spread of the new terminologies is an outcome of massive online Data Science courses or the rebranding strategies of various companies that are trying to bank on the ‘Data Science’ capabilities. Whatever be the reason we need to prepare ourselves for the Industry 4.0 revolution, we should get familiar with these new terms. In this article, we broadly segregated the meaning of these terminologies based on interaction with various clients.

Big Data

Big data, put simply, refers to a collection of data from a wide variety of sources at a colossal scale. The data collected may be quantitative or qualitative, unknown or known, structured or unstructured and so on. As the scale of data collected is enormous, it is stored in specialized databases, known as big databases, that are developed using advanced computer programs such as SQL, MySQL etc. The collections of big data are also referred to as data warehouses. Many big databases are open source, e.g., Cassandra, HBase, MongoDB, Neo4j, CouchDB, OrientDB, Terrstore, etc. However, most of the popular databases are big-budgeted as well, e.g., Oracle, MySQL, Microsoft SQL, SAP HANA, etc. It is essential to state that the database choice is the fundamental and most critical step in the Data Science workflow. The storage requirements of the Big Data can range anywhere between MBs to TBs. Sometimes the data volume may be small, but the data complexity can be high. That is where data engineers pitch in to make things easy.

Data Engineering

The process of building a workflow to store the data in Big Data Warehouse and then extracting the relevant information is called Data Engineering.

Data Mining

Data mining is the process of extracting patterns from large datasets by combining methods from statistics and machine learning with database management. These techniques include association rule learning, cluster analysis, classification, and regression. Applications include mining customer data to determine segments most likely to respond to an offer, mining human resources data to identify characteristics of most successful employees or market basket analysis to model customers’ purchase behaviour.

Data Analysis

Data analysis is the exercise of analyzing, visualizing, and interpreting data to get relevant information that helps organizations make informed business decisions. It also involves data cleaning, outlier analysis, data preprocessing, and transformation to make data amenable to analysis. Data analysis is a very broad term that encompasses at least five different types of analyses. A data scientist chooses the most appropriate data analysis method based on the end goal of the analysis. Sometimes the same method of analysis can be used for the different end goal. Hence, another name may be used to call the technique despite involving the same mathematical and statistical concept. Therefore data analysis takes up various forms described briefly as below:

  • Descriptive statistical analysis is the fundamental step for performing any data analysis. It is also known as summary statistics and gives an idea of the basic structural features of the data like measures of central tendency, dispersion, skewness, etc.
  • Inferential statistical analysis is a type of statistical analysis that uses the information contained in a sampled data to make inferences about the corresponding larger population. It uses hypothesis testing of the data to draw statistically valid conclusions about the population. As the sampling process is always associated with an element of error, statistical analysis tools should also account for the sampling error so that a valid inference is drawn from the data.
  • Chemometrics is the science of extracting and analyzing Physico-chemical information by using spectroscopic sensors and other material characterization instruments. Chemometrics is interdisciplinary, using methods frequently employed in core data-analytic disciplines such as multivariate statisticsapplied mathematics, and computer science to address problems in chemistrybiochemistrymedicinebiology, food, agriculture and chemical engineering. Chemometrics generally utilizes information from spectrochemical measurements such as FTIR, NIR, Raman and other material characterization techniques to control product quality attributes. It is being used for building Process Analytical Technology tools.
  • Predictive analysis models patterns in the big data to predict the likelihood of the future outcome. The models built are less likely to have 100% accuracy and are always associated with an intrinsic prediction variance. However, the data’s accuracy is refined each time more and more data is taken into account. Predictive analysis can be performed using linear regression, multiple linear regression, principal component analysis, principal component regression, partial least square regression, and linear discriminant analysis.
  • Diagnostic analysis, as the name suggests, is used to investigate what caused something to happen. The diagnostic analysis uses the historical data to look for the answers that caused the same something in the past. It is more of an investigative type of data analysis. This involves four main steps: data discovery, drill down, data mining and correlations. Data discovery is the process of identifying similar sources of data that underwent the same sequence of events in the past. Data drill down is about focusing on a particular attribute of the data that interests us. This is followed by data mining activity that ends with looking for strong correlations in the data to lead us to the event’s cause. Diagnostic analysis can be performed using all techniques mentioned for predictive analysis. However, the end goal of the diagnostic analysis is only to identify the root cause to improve the process or product.
  • Prescriptive analysis is the sum of all the data analysis techniques discussed above, but this form of analysis is more oriented towards making and influencing business decisions. Specifically, prescriptive analytics factors information about possible situations or scenarios, available resources, past performance, current performance and suggests a course of action or strategy. It can be used to make decisions on any time horizon, from immediate to long term.

Data Analytics

Data analytics is the sum of all the mentioned activities, right from big data, data engineering, data mining to the analysis of the data.

Machine Learning

Machine Learning creates new programs that can predict future events with little supervision by humans. Machine learning analytics is an advanced and automated form of data analytics. A Machine Learning algorithm is called Artificial Intelligence when the prediction accuracy is improved each time new data is added to it.

Machine learning algorithms that uses layered networks capable of unsupervised learning from the data are called Deep Learning algorithms. Examples of Deep Learning algorithms include Deep neural networks or Artificial Neural Networks inspired by the brain’s structure and function. These type of algorithms are designed to be analogous to human intelligence. The major difference between machine learning and deep learning is human supervision, i.e., deep learning algorithms are a completely unsupervised form of learning in contrast to machine learning algorithms.


The field of Data Science is more oriented towards the application side of the modern AI/ML tools that employ advanced algorithms to build predictive models that can transform the future of what we do and how we do it.

Curious to know about our machine learning solutions ?

Predictive Analytics in Cancer Diagnosis

Predictive Analytics in Cancer Diagnosis


GLOBOCON 2020, one of the key cancer surveillance projects of the International Agency for Research on Cancer (IARC), published recent statistics of global cancer epidemiology. According to this report, 19,292,789 new cancer cases were reported in 2020 i.e., a two-fold increase in the number of cases as reported in 2018. For over 19 million cases of cancer reported, 9,958,133 cancer-related deaths were reported in the same year.  As per the estimates of the International Agency for Research on Cancer (IARC),  every 1 person in 5 persons is likely to develop cancer during their lifetime. In this article we are going to discuss how Predictive Analytics can play a major role in changing Cancer Statistics.

Cancer Statistics: 2020


Males    Females




Number of new cancer cases



Number of cancer deaths

5 year prevalent cases24,828,480


Top 5 most cancers excluding non-melanoma skin cancer Lung, Prostrate, Colorectum, Stomach, Liver

Breast, Lung, Colorectum, Prostrate, Stomach

Data taken from GLOBOCON 2020

Estimated Number of Cases Worldwide

These alarming and constantly rising figures have refocused the attention of medical scientists on the early screening and diagnosis of cancers using Predictive Analytics. Because cancer mortality and morbidity can be reduced by early detection and treatment of cancer.

American Cancer Society (ACS) issues updated guidelines and guidances related to an early screening of cancers to assist in making well informed decisions about the tests for early detection of some of the most prevalent cancers (breast cancer, colon and rectal cancer, cervical cancer, endometrial cancer, lung cancer, and prostate cancer). The early detection of precancerous lesions and cancers is broadly divided into three categories:

Early cancer diagnosis

Cancers respond very well to the treatment only if diagnosed early that, in turn, increases the chances of cancer survival. As per WHO guidance, early diagnosis is a three-step process that must be integrated and provided in a timely manner.

  1. Awareness of cancers and accessing care as early as possible
  2. Clinical evaluation of cancers, appropriate diagnosis and staging of cancers
  3. Access to the right treatment at the right stage.

Screening of cancers

Screening identifies specific markers of cancers that are suggestive of particular cancer. For example, visual inspection with Acetic Acid (VIA) test can be used for early screening of cervical cancers in women. Cervical lesions turn white for a few minutes after application of acetic acid.

However, the early diagnosis and screenings of cancer suffer from drawbacks like false positives, false negatives, and overdiagnosis which may lead to more invasive tests and procedures. To overcome this problem, scientists are using the power of Predictive Analytics based on Artificial Intelligence and Machine Learning.

Introduction to Artificial Intelligence (AI)

Artificial Intelligence (AI) is a great tool for Predictive Analytics and it is defined, in Webster’s dictionary, as a branch of computer science dealing with the simulation of intelligent behaviour in computers. In other words, it is the capability of machine to imitate intelligent human behaviour.

One of the early pioneers of Artificial Intelligence, Alan Turings, published an article in 1950 entitled “Computing Machinery and Intelligence.” It introduced the so-called, Turing test, to determine if a computer can exhibit the same level of intelligence as demonstrated by humans. The term “Artificial Intelligence” was coined by John McCarthy at the Artificial Intelligence (AI) conference at Dartmouth College in 1956. It was Allen Newell, J.C. Shaw, and Herbert Simon who introduced the first AI-based software program namely, The logic Theorist.

Majority of Artificial Intelligence (AI) applications use Machine Learning (ML) algorithms to find patterns in the datasets. These patterns are used to predict the future outcomes.

The basic framework of Artificial Intelligence (AI) consists of three main steps:

  1. Collecting input data
  2. Deciphering the relationship between input data
  3. Identifying unique features of sample data

Introduction to Machine Learning (ML)

Machine learning (ML) is also another tool for Predictive Analytics and is defined in Webster’s dictionary as the process by which a computer is able to improve its own performance by continuously incorporating new data into an existing statistical model. It allows the system to reprogram itself as more data is added and eventually increasing the accuracy of the task assigned.

In the case of Machine learning (ML), it’s an iterative process so that the predictability of the system is improved each time. Most Machine Learning (ML) algorithms are mathematical equations in which sample data is plotted to observed variables, termed as features, and the outcomes termed as labels. The labels and features are used for the classification of different ML tools and techniques. Based on the label type, Machine Learning (ML) algorithms can be categorised into:

Supervised Learning

Unsupervised Learning

Reinforcement Learning

In supervised learning, models are trained based on labelled datasets. For the purpose of prediction, the model needs to map the input variables with the output variables using a know mathematical function. Supervised learning can be used for understanding, Classification and Regression problems.

In unsupervised learning, data patterns are found in the un-labelled data and the endpoint of the unsupervised learning is to find characteristics patterns in the data. Unsupervised learning is used for identifying Clustering and Association in datasets.

Reinforcement learning is the ‘learning’ by interacting with the environment. A reinforcement learning algorithm makes decisions based on its past experiences and also by making new explorations.

The PCA part of MagicPCA 1.0.0. is an unsupervised Machine Learning Approach, whereas the SIMCA part of it is a Supervised Classification Technique.

Rising Interest in Biomedical Research

In the initial years, the journey of AI was not so easy, as can be seen in the period of 1974-1980, which is known as AI winter. During this period the field experienced its low in terms of researcher’s interests and government funding. Today, after decades of advances in data management and superfast computers, and renewed interests of government and corporate bodies, it is a practical reality and finds its applications in a wide variety of fields like e-commerce, medical sciences, cybersecurity, agriculture, space science, automobile industry, etc.

As the phrase “Data Science is everywhere” picked up, biomedical researchers started delving into Artificial Intelligence (AI) and Machine Learning (ML) to look for a better solution through Predictive Analytics. One inspiring story of Regina Barzilay, a renowned professor of Artificial Intelligence (AI) and a breast cancer survivor, portrays how her diagnosis of breast cancer reshaped her research interests. She hypothesized that AI and ML tools can extract more clinical information that helps clinicians make knowledgeable decisions. She collected data from medical reports and developed Machine Learning algorithms to interpret the radio diagnostic images for clinicians. One of the models developed by her has also been implemented in clinical practice that helps radiologists to read diagnostic images very well.

Current scenario: Predictive Analytics in Cancer Diagnosis

The concept of AI/ML has long been employed as Predictive Analytics tool in the radiodiagnosis of precancerous lesions and tumours.

The AI system reads the images generated by various radiological techniques like MRI, PET scan, etc., and processes the information contained in them to assist clinicians make conscious decisions on the diagnosis and progression of the cancers.

Breast Cancer Diagnosis with QuantX

The FDA’s Center for Devices and Radiological Health (CDRH) has approved the first AI-based breast cancer diagnosis system for Predictive Analytics. QuantX was developed by Qlarity Imaging (Paragon Biosciences LLC). QuantX is described as a computer-aided (CAD) diagnosis software system that assists radiologists in the assessment and characterization of breast anomalies using Magnetic Resonance Imaging (MRI) data. The software automatically registers images and segmentations (T1, T2, FLAIR, etc.), and analyses user-directed regions of interest (ROI). QuantX extracts this data from the ROI to provide computer-aided analytics based on morphological and contrast enhancement characteristics. These imaging analytics are then used by an artificial intelligence algorithm to get a single value, known as QI score, which is analysed relative to the reference database. The QI score is based on the machine learning algorithm that is generated from a training subset of features calculated on segmented lesions.

Cervical Cancer Diagnosis with CAD

National Cancer Institute (NCI) has also developed a computer aided program (CAD) that analyses digital images taken of women’s cervix and identify potentially precancerous changes that require immediate medical attention. This Artificial Intelligence-based approach is called Automated Visual Evaluation (AVE). A large set of data, around 60000 cervical images, was generated using the precancerous and cancerous lesions to develop a machine learning algorithm. This algorithm recognizes patterns in visual images that lead to precancerous lesions in cervical cancers. The algorithm-based visualization of images has been reported to provide better insight into precancerous lesions, with a reported accuracy of 0.9, than routine screening tests.

Lung Cancer Diagnosis with Deep Learning Technique

NCI funded researchers of New York University used Deep Learning (DL) algorithms to identify gene mutations from pathophysiological images of lung tumors using Predictive Analytics. The pathophysiological images of lung tumours were collected from the Cancer Genome Atlas and used to build an algorithm that can predict specific gene mutations by visual inspections of the pathophysiological images. This method can very accurately predict the different types of lung cancers and the corresponding gene mutations from the analyses of the images.

Thyroid Cancer with Deep Convoluted Neural Network

Deep Convoluted Neural Network (DCNN) models were used to develop an accurate diagnostic tool for thyroid cancers by analysing images from ultrasonography. 1,31,731 ultrasound images from 17,627 patients with thyroid cancer and 1,80,668 images from 25,325 controls were collected from the thyroid imaging database of Tianjin Cancer Hospital. Those ultrasound images were modelled into a DCNN algorithm. The DCNN model showed similar sensitivity and improved specificity in identifying patients with thyroid cancer compared with a group of skilled radiologists.

AI/ML for Personalized Medicines

Researchers at Aalto University, University of Helsinki and the University of Turku developed a machine learning algorithm that can accurately predict how combinations of different antineoplastic drugs can kill various types of cancerous cells. This algorithm was obtained from data collected from a study that investigated the association between different drugs and their effectiveness in treating cancers. The model developed was found to show associations between different combination of drugs and cancer cells with high accuracy; the correlation coefficient of the model fitted was reported to be 0.9. This AI model can help cancer researchers to prioritize which combination of drugs to choose from a plethora of options for further research investigation. This depicts how AI and ML can be used for the development of personalized medicines.

Future challenges of AI/ML in cancer diagnosis

Data Science is shaping the future of the health care industry like never before. There has been a spurt of growing interests in AI and ML for the diagnosis of precancerous lesions and surveillance of cancerous lesions. The researchers are exploring to develop AI algorithms that help in the diagnosis of many other cancers. However, each type of cancer behaves differently and the consequent changes would be a significant challenge for the algorithms. Machine learning tools can overcome these challenges by training algorithm of these subtle changes. This would drastically improve decision making for clinicians.

One of the biggest challenges of the Artificial Intelligence today is the acceptance of the technology in the real world, particularly related to medical diagnoses of terminally ill patients where decision making plays a critical role in the longevity of the patient. The AI black box problem augments this problem further. AI black box refers to the fact that programmers can see input and output data only but how does an algorithm work is not known.

Regulatory aspects of AI/ML in cancer diagnosis

In 2019, US FDA publishing a discussion paper entitled “Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) – Discussion Paper and Request for Feedback.” The intention of FDA was to develop a regulatory framework for the medical software by issuing draft guidance on the Predetermined Change Control Plan outlined in the discussion paper. The Predetermined Change Control Plan mapped out a regulatory premarket review for AI/ML-based SaMD modifications.

In 2021, FDA published a draft guidance document entitled Artificial Intelligence and Machine Learning (AI/ML) Software as a Medical Device Action Plan.” The FDA encouraged the development of the harmonized Good Machine Learning Practices of AI/ML-based SaMD through the participation of industrial and other stakeholders in consensus standards development efforts. This guidance was built upon the October 2020 Patient Engagement Advisory Committee (PEAC) meeting focused on patient trust in AI/ML technologies.

The FDA supports regulatory science efforts on the development of methodology for the evaluation and improvement of machine learning algorithms, including for the identification and elimination of bias, and on the robustness and resilience of these algorithms to withstand changing clinical inputs and conditions.


The employment of Predictive Analytics in cancer diagnosis has answered major challenges experienced in cancer diagnosis and treatment. It can help early screening of precancerous lesions and avert the mortality rate in cancer patients. AI/ ML provides accurate detection and prognosis of cancers, thereby reducing the incidents of false positives, false negatives and overdiagnosis. These techniques can also be used to track the prognosis of the cancers in the case of immunotherapies and radiotherapies. The AI/ ML has also potential applications in the development of personalized medicines by developing specific therapies for each specific cancers.

By detecting cancers early and accurately, prognosis of cancer treatment would be greatly improved. The early detections of cancers will have a huge impact on the cost-saving of the complicated cancer treatments. This could also have a huge impact on the cancer survival rates as the mortality rates could be drastically decreased on early detection of the cancers.

If you are struggling to make use of cancer data and need help to develop Machine Learning Models, then feel free to reach out to us. At Let’s Excel Analytics Solutions, we help our clients by developing cloud-based software solutions for predictive analytics using Machine Learning.