Data-sets for Machine Learning Practice

Looking for data-sets to build machine learning models? Here are the top 6 data-sets for classification and prediction models to end your search.

Iris Data-set

The Iris data-set contains four features (the length and width of the sepals and petals) for 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor), 150 samples in total. It is often used in data mining, classification, and clustering examples and to test algorithms, and it is perhaps the best-known data-set in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. Each of the 3 classes refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
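The linear separability of one class can be illustrated with a single threshold on one feature. The sketch below assumes the commonly cited observation that, in Fisher's data, Iris setosa petals are shorter than 2 cm while the petals of the other two species are 3 cm or longer. The 2.5 cm cut-off and the function name are illustrative choices, not part of the data-set.

```python
# Illustrative threshold (an assumption, not part of the data-set):
# setosa petals are shorter than 2 cm, the other species' are 3 cm or longer.
SETOSA_PETAL_LENGTH_CUTOFF_CM = 2.5


def is_setosa(petal_length_cm: float) -> bool:
    """Return True when the petal length falls on the setosa side of the cut-off."""
    return petal_length_cm < SETOSA_PETAL_LENGTH_CUTOFF_CM


# Three rows from the data-set, one per species:
# (sepal length, sepal width, petal length, petal width, species), lengths in cm
samples = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
]

predictions = [
    is_setosa(petal_length)
    for _sepal_len, _sepal_wid, petal_length, _petal_wid, _species in samples
]

for (_, _, petal_length, _, species), flagged in zip(samples, predictions):
    print(f"{species}: petal length {petal_length} cm, flagged as setosa: {flagged}")
```

No single threshold of this kind separates versicolor from virginica, which is why those two classes are the harder part of the problem.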


Perfume Fake-Original Data-set

The data is sourced from C.L. Gomes, A.C.A. de Lima, A.R. Loiola, A.B.R. da Silva, M.L. Candido, R.F. Nascimento (2016). “Multivariate Classification of Original and Fake Perfumes by Ion Analysis and Ethanol Content,” Journal of Forensic Sciences, Vol. 61, #4, pp. 1074-1079.

Analysis of chemical and ethanol contents for 25 fake and 25 original perfume samples. All measurements were done in duplicate.

  • Fake: samples 1-10 from unauthorized dealers, 11-25 from the police forensics department
  • Original: samples 1-10 from Authorized Dealer A, 11-25 from Authorized Dealer B

Analyses included PCA, comparison of means within types, and discriminant analysis.

Variable Names

  • perfID (Includes F and/or O and ID number)
  • Original (1 if Original, 0 if Fake)
  • ID_Type (ID within Type)
  • Src_Type (Source within Type)
  • ClMn (Chloride Mean Score from 2 measurements)
  • ClSD (Chloride SD from 2 measurements)
  • NaMn (Sodium Mean Score from 2 measurements)
  • NaSD (Sodium SD from 2 measurements)
  • KMn (Potassium Mean Score from 2 measurements)
  • KSD (Potassium SD from 2 measurements)
  • EthnolMn (Ethanol Mean Score from 2 measurements)
  • EthnolSD (Ethanol SD from 2 measurements)
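As a rough illustration of how a mean/SD pair such as ClMn and ClSD could be derived from a duplicate measurement, here is a minimal sketch. It assumes the SD is the sample standard deviation (n - 1 in the denominator); the source does not state which form was used. The function name and the input values are hypothetical.

```python
from statistics import mean, stdev


def summarize_duplicate(first: float, second: float) -> tuple[float, float]:
    """Return (mean, sample SD) of two replicate measurements.

    Assumption: sample SD (n - 1 denominator); the source does not say.
    """
    pair = [first, second]
    return mean(pair), stdev(pair)


# Hypothetical duplicate readings, not taken from the data-set.
duplicate_mean, duplicate_sd = summarize_duplicate(10.0, 12.0)
print(duplicate_mean, duplicate_sd)
```

With only two measurements, the sample SD reduces to the absolute difference between them divided by the square root of 2.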


Seeds Data-set

Data Set Information:

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High-quality visualization of the internal kernel structure was obtained using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13×18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.

Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured:
1. area A,
2. perimeter P,
3. compactness C = 4*pi*A/P^2,
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient,
7. length of kernel groove.
All of these parameters are real-valued and continuous.
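The compactness formula in the list above can be written out directly. This is a minimal sketch; the function name and the sanity-check shapes are illustrative and not part of the data-set.

```python
import math


def compactness(area: float, perimeter: float) -> float:
    """Compactness C = 4*pi*A / P^2 for a shape with area A and perimeter P."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return 4.0 * math.pi * area / perimeter ** 2


# Sanity checks on idealized shapes (not wheat kernels):
circle_c = compactness(area=math.pi, perimeter=2.0 * math.pi)  # unit circle
square_c = compactness(area=1.0, perimeter=4.0)                # unit square
print(circle_c, square_c)
```

A circle gives the maximum value of 1, so compactness measures how close a kernel outline is to circular; less round shapes score lower.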

Relevant Papers:

M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, ‘A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images’, in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24.


Spanish Grape Wine Data-set

Source

E. Revilla, M.M. Losada, E. Gutierrez (2016). “Phenolic Composition and Color of Single Cultivar Young Red Wines Made with Mencia and Alicante-Bouschet Grapes in AOC Valdeorras (Galicia, NW Spain),” Beverages, Vol. 2, #3, doi:10.3390/beverages2030018

Description

Measurements were made on 20 young red wines made with Mencia grapes and 10 red wines made with Alicante-Bouschet grapes. Measurements are the mean of 2 replicates. Analyses were conducted separately among variable groups.

Variable Groups

  • grapeType (M=Mencia, T=Alicante-Bouschet)
  • grapeID
  • *** Group 1 variables
    • TPI (Total Phenols Index)
    • CI_Sud (Color Intensity (420+520))
    • CI_Glo (Color Intensity (420+520+620))
    • Hue
    • ChemAge (Chemical Age Index)
    • anthcyan (Total Anthocyans)
    • t_anthcyanin (Total Anthocyanins)
    • c_anthcyanin (Colored Anthocyanins)
    • tannin (Total Tannins)
  • *** Group 2 variables – Anthocyanins
    • DpGl
    • CyGl
    • PtGl
    • PnGl
    • MvGl
    • DpGlAc
    • PtGlAc
    • PnGlAc
    • MvGlAc
    • DpGlCm
    • PtGlCm
    • PnGlCm
    • MvGlCm
    • MvGlCf
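The two color intensity variables in Group 1 differ only in which terms are summed. The sketch below assumes that the numbers in parentheses are wavelengths in nm and that each index is the sum of the absorbance readings at those wavelengths; the source lists only the wavelengths. The function name and the readings are hypothetical.

```python
def color_intensity(absorbance_by_nm: dict[int, float],
                    wavelengths_nm: tuple[int, ...]) -> float:
    """Sum the absorbance readings at the requested wavelengths (in nm).

    Assumption: the index is a plain sum of absorbances at these wavelengths.
    """
    return sum(absorbance_by_nm[w] for w in wavelengths_nm)


# Hypothetical absorbance readings, not taken from the data-set.
reading = {420: 0.5, 520: 0.75, 620: 0.25}

ci_sud = color_intensity(reading, (420, 520))       # two-wavelength index
ci_glo = color_intensity(reading, (420, 520, 620))  # three-wavelength index
print(ci_sud, ci_glo)
```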


Milk Data-set

Source

A. Bogomolov, A. Melenteva (2013). “Scatter-based quantitative spectroscopic analysis of milk fat and total protein in the region 400–1100 nm in the presence of fat globule size variability,” Chemometrics and Intelligent Laboratory Systems, Vol. 126, pp. 129–139, doi:10.1016/j.chemolab.2013.02.006

Description

The data contains visible and short-wave near-infrared spectra (400–1100 nm) of a designed sample set with systematically varied nutrient composition and homogenization degree. The data can be used for the prediction of Fat and Protein percentage in milk.


Gasoline Data-set

Source

Kalivas, John H. (1997). “Two Data Sets of Near Infrared Spectra,” Chemometrics and Intelligent Laboratory Systems, Vol. 37, pp. 255–259.

Description

A data set with NIR spectra and octane numbers of 60 gasoline samples. The NIR spectra were measured using diffuse reflectance as log(1/R) from 900 nm to 1700 nm in 2 nm intervals, giving 401 wavelengths.
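The wavelength count follows from the sampling grid, and log(1/R) is a simple transform of reflectance. A minimal sketch of both is given below; it assumes a base-10 logarithm, which the source does not specify, and the function name and example reflectance value are hypothetical.

```python
import math

# Wavelength grid: 900 nm to 1700 nm inclusive, in 2 nm steps.
wavelengths_nm = list(range(900, 1701, 2))
print(len(wavelengths_nm))  # (1700 - 900) / 2 + 1 = 401


def reflectance_to_log_inverse(reflectance: float) -> float:
    """Convert reflectance R to log(1/R).

    Assumption: base-10 logarithm; the source only writes log(1/R).
    """
    if reflectance <= 0:
        raise ValueError("reflectance must be positive")
    return math.log10(1.0 / reflectance)


# Hypothetical reflectance value, not taken from the data-set.
example_value = reflectance_to_log_inverse(0.1)
print(example_value)
```

Each gasoline sample is therefore a vector of 401 such values, paired with one octane number as the prediction target.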