In my previous post, I explained the differences between a traditional algorithm and a machine learning algorithm. If you haven't read it yet, I recommend doing so first, as it provides useful background for this one. Happy reading!

In this post, I share my understanding of datasets in the context of machine learning. A dataset is the entry point of any machine learning model and plays a critical role in its success. Join me on this exploration, and let's demystify some of the jargon around data and datasets in a simple way.

By the end of this tutorial, you will know:

  • What a dataset is in the machine learning context
  • The data types from statistics that are widely used in machine learning
  • Basic characteristics of a dataset
  • Real-world sources of data
  • How data is collected from different sources
  • How data is stored for future processing
  • Different storage formats for a dataset
  • An example dataset

What is a dataset?

Before getting into the formal definition of a dataset, let's grasp the concept through a simple example. Imagine you have a box filled with various toys, each differing in shape, size, and colour. These toys collectively represent our dataset. Now, our goal is to organise those toys in a manner that allows us to effortlessly pick any specific toy from the box.

Using this dataset, our objective is to train a smart helper (think of it as a machine learning model) to sort toys by shape, size, and colour. The smart helper needs to observe the similarities and differences among the toys accurately. We guide it by presenting a few toys and indicating which group they belong to. For example, we might pick three blue cubes and say, "These are all blue cubes". We also correct the helper whenever it places a toy in the wrong group. Over time, the smart helper observes these examples and learns from them. Once the learning process is complete, the helper can sort new toys on its own: we can present it with toys it has never encountered before, and it will categorise them into their respective groups. I hope this simple example conveys the essence of how a machine learning dataset is used.

Let's now establish a formal definition of a dataset. A dataset is a structured collection of data specially designed for machine learning and statistical analysis. It contains multiple data points or samples, with each sample representing a unique observation or instance. These data points are associated with specific features or attributes that provide important information about each observation.
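
To make the definition concrete, here is a minimal sketch in plain Python of the toy box from the earlier example. The field names (`shape`, `size_cm`, `colour`, `group`) and all values are made up for illustration; each dictionary is one sample and each key is a feature.

```python
# A tiny, made-up dataset: each dictionary is one sample (observation),
# and each key is a feature (attribute) of that sample.
toys = [
    {"shape": "cube", "size_cm": 5.0, "colour": "blue", "group": "blue cubes"},
    {"shape": "cube", "size_cm": 7.5, "colour": "blue", "group": "blue cubes"},
    {"shape": "ball", "size_cm": 6.0, "colour": "red", "group": "red balls"},
]

# Separate the input features from the label the smart helper should learn.
features = [{k: v for k, v in toy.items() if k != "group"} for toy in toys]
labels = [toy["group"] for toy in toys]

print(len(toys), "samples with", len(features[0]), "features each")
print(labels)
```

The `group` key plays the role of the answer we give the smart helper ("These are all blue cubes"); the remaining keys are what it observes.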

The following aspects of a dataset play a significant role in determining the quality and performance of a machine learning model.

  • Data Quality: High quality data leads to more accurate models.
    • Data from reliable and trustworthy sources.
    • Accurate and error-free data.
    • Complete data with no missing values.
    • Consistent data format and conventions.
  • Data Diversity: Diverse data covering various aspects and scenarios relevant to the problem domain improves model generalisation.
    • Example: A dataset with images of animals and birds, where each class covers different breeds and regions and includes images taken indoors, outdoors, in different seasons, and at different times of the day.
  • Data Balance: Balanced data prevents bias towards dominant classes.
    • Example: A dataset of credit card transactions with only 100 fraudulent transactions (class 1) and 9,900 non-fraudulent transactions (class 0) is heavily skewed towards class 0.
  • Data Relevance: Relevant features improve model performance.
    • Example: A dataset for predicting customer churn in a telecom company should leave out irrelevant data such as customers' preferences for food or entertainment.
  • Outliers: Data points that deviate significantly from the general trend.
    • Identifying and handling data points in a housing price dataset that are unusually high or low compared to similar houses.
  • Missing Data: Handling missing data ensures complete analysis.
    • A customer feedback dataset missing age and feedback score.
  • Noise in Data: Minimising data noise improves model accuracy.
    • In a dataset for recognising handwritten digits, removing images with low quality or distortion to improve model performance.
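
A few of the checks above can be sketched with the Python standard library. The class counts come from the credit card example in the list; the feedback rows and house prices are invented for illustration, and the 1.5 × IQR rule is one common rule of thumb for outliers, not the only option.

```python
from collections import Counter
from statistics import quantiles

# Data balance: the credit card example (100 fraud vs 9,900 non-fraud).
labels = [1] * 100 + [0] * 9900
counts = Counter(labels)
fraud_share = counts[1] / len(labels)
print(f"fraud share: {fraud_share:.1%}")

# Missing data: None marks a missing value in these made-up feedback rows.
feedback = [
    {"age": 34, "score": 4},
    {"age": None, "score": 5},
    {"age": 51, "score": None},
]
missing_per_column = {
    col: sum(1 for row in feedback if row[col] is None) for col in feedback[0]
}
print(missing_per_column)

# Outliers: flag prices further than 1.5 * IQR outside the quartiles.
prices = [210, 220, 215, 230, 225, 218, 950]
q1, _, q3 = quantiles(prices, n=4)
iqr = q3 - q1
outliers = [p for p in prices if p < q1 - 1.5 * iqr or p > q3 + 1.5 * iqr]
print(outliers)
```

Checks like these only surface problems; how to handle them (resampling, imputing, dropping rows) depends on the task.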

Different types of data used in machine learning

Data can be classified into various types, depending on their nature and level of measurement. These data types matter for statistical analysis and for machine learning algorithms: understanding and handling them correctly is critical for choosing appropriate preprocessing techniques and suitable models for a given task.

  • Numerical (Continuous): Represents numerical data that can take any value within a range.
    • Example: Age, Temperature
  • Numerical (Discrete): Represents numerical data with finite or countable values.
    • Example: Number of students in class
  • Categorical (Nominal): Represents data with distinct categories without order.
    • Example: Colours (Red/Blue/Green)
  • Categorical (Ordinal): Represents data with distinct categories with meaningful order.
    • Example: T-shirt size (S, M, L, XL)
  • Text (Textual Data): Represents unstructured text data, such as sentences.
    • Example: Tweets, Email content
  • Image (Image Data): Represents image data using pixel values.
    • Example: Photographs, Satellite Images, Digital Art
  • Audio (Audio Data): Represents audio data, such as sound waves.
    • Example: Speech, Music, Sound Effects
  • Time Series: Represents data collected over time with temporal order.
    • Example: Stock Prices, Temperature over a day, Website Traffic
  • Boolean: Represents data with two possible values (True or False).
    • Example: Whether an email is spam (True/False)
  • Geographic Coordinates: Represents location data using latitude and longitude values.
    • Example: GPS Coordinates of cities, Geotagged images
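
The distinction between nominal and ordinal data matters in practice because the two are usually encoded differently before training. The sketch below is one simple way to do it in plain Python; the numeric codes chosen for the T-shirt sizes are an assumption, and any increasing sequence would preserve the order equally well.

```python
# Ordinal data has a meaningful order, so an integer mapping preserves it.
size_order = {"S": 0, "M": 1, "L": 2, "XL": 3}
sizes = ["M", "XL", "S"]
encoded_sizes = [size_order[s] for s in sizes]

# Nominal data has no order, so one-hot encoding avoids implying one.
colours = ["Red", "Green", "Blue", "Green"]
categories = sorted(set(colours))  # ['Blue', 'Green', 'Red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colours]

print(encoded_sizes)
print(one_hot)
```

Encoding colours as 0, 1, 2 instead would suggest to a model that one colour is "greater" than another, which is why one-hot vectors are a common choice for nominal features.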

Basic characteristics of a dataset:

The basic characteristics of a dataset provide essential information about its structure and contents. These characteristics help data scientists and analysts understand the dataset's properties and determine appropriate analysis and modelling approaches. Here are the key characteristics of a dataset:

  • Number of Instances (Rows): Refers to the total number of data points or observations in the dataset. Each row typically represents a single instance or sample.
  • Number of Features (Columns): Indicates the number of variables or attributes present in the dataset. Each column represents a specific feature of the data.
  • Data Types: Describes the types of data stored in each column, such as numeric (integer, float), categorical (e.g., strings or labels), date/time, or boolean (True/False).
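
These three characteristics are easy to inspect programmatically. The following sketch uses only the standard library on a small made-up CSV; the column names and the `infer_type` helper are hypothetical, and the type guessing is deliberately simplistic (it looks at the first row only).

```python
import csv
import io

# Made-up CSV content standing in for a real file.
raw = """name,age,height_cm,is_member
Asha,29,162.5,True
Ben,41,180.0,False
Chen,35,171.2,True
"""

rows = list(csv.DictReader(io.StringIO(raw)))


def infer_type(value: str) -> str:
    """Guess a column type from one text value (a simple heuristic)."""
    if value in ("True", "False"):
        return "boolean"
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "categorical"


n_rows = len(rows)      # number of instances
n_cols = len(rows[0])   # number of features
col_types = {col: infer_type(val) for col, val in rows[0].items()}

print(n_rows, "rows,", n_cols, "columns")
print(col_types)
```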

Real-world data sources for a machine learning dataset:

We now have some understanding of what a dataset is in a machine learning context, but one link is still missing: where does the data come from? In a real machine learning system, data comes from various sources depending on the application and domain. Here are some common sources:
    • Databases:
      • Relational databases (MySQL, PostgreSQL)
      • NoSQL databases (MongoDB, Cassandra)
      • Example: Structured data for e-commerce transactions, User profiles and preferences in a social media platform.
    • Logs and Event Streams:
      • Log collection and aggregation tools
      • Event processing frameworks
      • Example: Capturing user interactions and server activities in a web application.
    • APIs and Web Services:
      • RESTful API integration libraries
      • Example: Collecting social media data (e.g., tweets) for sentiment analysis.
    • Sensor Data:
      • IoT data platforms
      • Sensor data collection tools
      • Example: Gathering temperature and humidity data for smart home automation, Monitoring machine data in industrial IoT applications.
    • Text and Documents:
      • Text processing libraries
      • Natural Language Processing (NLP) tools
      • Example: Analysing customer feedback from emails to improve product features, Sentiment analysis of social media posts for brand reputation.
    • Images and Videos:
      • Computer vision libraries
      • Deep learning frameworks
      • Example: Object detection in surveillance camera feeds for security applications, Facial recognition in video streams for access control.
    • Open Data Repositories:
      • Data repository websites (e.g., Kaggle, UCI ML Repository)
      • Example: Using publicly available datasets for research on healthcare prediction, Prototyping machine learning models with datasets from government portals.
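
As an illustration of the first source in the list, the sketch below pulls rows from a relational database and turns them into per-customer features. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for a production system such as MySQL or PostgreSQL; the `transactions` table, its columns, and the rows are all made up.

```python
import sqlite3

# In-memory SQLite database standing in for a production relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (customer, amount) VALUES (?, ?)",
    [("alice", 25.0), ("bob", 40.5), ("alice", 12.5)],
)

# Aggregate raw transactions into one feature row per customer.
cursor = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) "
    "FROM transactions GROUP BY customer ORDER BY customer"
)
dataset = [
    {"customer": name, "n_orders": n, "total_spent": total}
    for name, n, total in cursor
]
conn.close()

print(dataset)
```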

How is data collected from different sources?

This is another interesting aspect. We have practically unlimited sources of data, but how do we collect that data in a machine-consumable format? Here are a few mechanisms for collecting data from various sources effectively:
  • Data Collection Services: Custom data collection services may be built to gather data from various sources. These services can use APIs, web scraping, or direct database connections.
  • Message Brokers: For real-time data streams, message brokers like Apache Kafka or RabbitMQ are used to ingest and manage the flow of data.
  • Data Pipelines: Data pipelines are designed to move and process data from source to storage, ensuring data integrity and reliability.
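
The flow from broker to pipeline to storage can be sketched in a few lines. This is only a toy model: a `queue.Queue` stands in for a message broker topic, a list stands in for storage, and the sensor events are invented. A real system would use a Kafka or RabbitMQ client library instead.

```python
import queue

# In-process queue standing in for a message broker topic (an assumption).
broker = queue.Queue()
for event in [
    {"sensor": "t1", "temp_c": 21.5},
    {"sensor": "t2", "temp_c": None},  # incomplete reading
    {"sensor": "t1", "temp_c": 22.1},
]:
    broker.put(event)


def consume(q):
    """Yield events until the queue is empty."""
    while not q.empty():
        yield q.get()


def validate(events):
    """Drop events with a missing temperature to protect data integrity."""
    return (e for e in events if e["temp_c"] is not None)


# The "pipeline": consume from the broker, validate, then store.
storage = list(validate(consume(broker)))
print(storage)
```

Chaining generators keeps each stage small and lets events flow through one at a time, which mirrors how streaming pipelines are usually structured.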

How is data stored for future processing?

Just pulling data from various sources may not be sufficient for building a model. We need to process both historical and real-time data, so the collected data has to be stored. Here are some technologies that can help:
  • Distributed File Systems: For big data, distributed file systems like the Hadoop Distributed File System (HDFS) are used to store large datasets.
  • Data Warehouses: Structured data can be stored in data warehouses like Amazon Redshift or Google BigQuery for easy analytics and reporting.
  • Object Storage: Unstructured data like images or documents can be stored in object storage services like Amazon S3 or Google Cloud Storage.

Data storage format for a dataset:

The choice of data storage format depends on factors such as data size, complexity, ease of access, and compatibility with the machine learning framework being used. Proper data storage is essential for efficient data retrieval and processing during machine learning model training and inference.

  • CSV (Comma-Separated Values)
  • JSON (JavaScript Object Notation)
  • TFRecords (TensorFlow Records)
  • Parquet
  • HDF5 (Hierarchical Data Format)
  • Databases
  • Cloud Object Storage
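
To show how the choice of format affects what you get back, here is a small sketch that writes the same made-up records as CSV and as JSON using only the standard library, then reads them again. In-memory buffers are used instead of files so the example is self-contained.

```python
import csv
import io
import json

records = [
    {"city": "Pune", "temp_c": 31.5},
    {"city": "Oslo", "temp_c": 4.0},
]

# CSV: compact and tabular, but every value is read back as text.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["city", "temp_c"])
writer.writeheader()
writer.writerows(records)
csv_rows = list(csv.DictReader(io.StringIO(csv_buffer.getvalue())))

# JSON: more verbose, but keeps numbers as numbers and allows nesting.
json_rows = json.loads(json.dumps(records))

print(csv_rows[0])
print(json_rows[0])
```

Formats such as Parquet, HDF5, and TFRecords additionally store type information in a binary layout, which is one reason they are preferred for large datasets; they need third-party libraries and are not shown here.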

Example dataset:

Now is the right time to take a closer look at an example dataset.

Here are a few observations from a first-level analysis of this dataset, based on what we have covered in this post so far:

- This is a time series dataset: a collection of observations obtained through repeated measurements over a period of time.
- The dataset contains no null values.
- The data covers 2010 to 2018.
- The dataset has 8 features (columns): Date, Open, High, Low, Last, Close, Total Trade Quantity, Turnover (Lacs).
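
Observations like these can be checked in code. The sketch below uses the column names listed above, but the three rows are placeholders invented for illustration and are not taken from the real dataset; the ISO date format is also an assumption.

```python
import csv
import io
from datetime import date

# Placeholder rows with the column names listed above. The values are
# invented for illustration and are NOT taken from the real dataset.
raw = """Date,Open,High,Low,Last,Close,Total Trade Quantity,Turnover (Lacs)
2010-07-21,100.0,105.0,99.0,104.0,104.5,1000,1040.0
2014-03-10,150.0,152.0,148.0,151.0,151.5,2000,3020.0
2018-09-28,200.0,210.0,198.0,205.0,206.0,1500,3090.0
"""

rows = list(csv.DictReader(io.StringIO(raw)))
columns = list(rows[0].keys())

# Count empty cells as a simple null check.
null_count = sum(1 for row in rows for value in row.values() if value == "")

# Parse the dates to find the covered period and confirm temporal order.
dates = [date.fromisoformat(row["Date"]) for row in rows]
is_sorted = dates == sorted(dates)

print(len(columns), "columns:", columns)
print("nulls:", null_count)
print("period:", min(dates).year, "to", max(dates).year, "| sorted:", is_sorted)
```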

The observations above are not enough on their own to predict the future stock price for a particular date. We need to follow a step-by-step approach to prepare the data for a model that can help make those predictions, which calls for a separate tutorial series to cover in detail.

As we reach the end of this tutorial, let's have a quiz to assess your dataset knowledge. Below is a list of a few common problem statements and examples that can be addressed using machine learning. Your task is to identify the dataset type for each example. Good luck!
  • Time Series Forecasting - Example: Stock Price Prediction
  • Classification - Example: Disease Diagnosis
  • Clustering - Example: Customer Segmentation
  • Anomaly Detection - Example: Intrusion Detection
  • Recommender Systems - Example: Product Recommendations
  • Text Classification - Example: Sentiment Analysis
  • Image Classification - Example: Medical Image Analysis