In my previous post, I explained the differences between a traditional algorithm and a machine learning algorithm. If you haven't had the chance to read it yet, I highly recommend checking it out first. You can access the post through this link at your convenience. Happy reading!
In this post, I am sharing my understanding of datasets in the context of machine learning. A dataset is the entry point of any machine learning model and plays a critical role in its success. Please join me on this exploration, and let's demystify some of the jargon around data and datasets together, in a simple way.
By the end of this tutorial, you will have learned:
- What a dataset is in the machine learning context
- The different data types from statistics widely used in machine learning
- The basic characteristics of a dataset
- Real-world sources of data
- How data is collected from different sources
- How data is stored for future processing
- The different data storage formats for a dataset
- An example of a dataset
Before getting into the formal definition of a dataset, let's grasp the concept through a simple example. Imagine you have a box filled with various toys, each differing in shape, size, and colour. These toys collectively represent our dataset. Now, our goal is to organise those toys in a manner that allows us to effortlessly pick any specific toy from the box.
Using this dataset, our objective is to train a smart helper (which you can think of as a machine learning model) to effectively sort toys based on their shape, size, and colour. The smart helper needs to observe the similarities and differences among the toys accurately. We will guide the smart helper by presenting a few toys and indicating which group they belong to. For example, we might pick three blue cubes and say, "These are all blue cubes". We will also correct the outcome if the smart helper classifies something into the wrong bucket. Over time, our smart helper observes these examples and learns from them. Once the learning process is complete, the helper gains the ability to sort new toys on its own. We can present it with toys it has never encountered before, and it will confidently categorise them into their respective groups. I hope this simple example gives you the essence of how a machine learning dataset operates; the sketch below shows how such a helper could be trained in code.
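To make the toy story concrete, here is a minimal sketch of how such a "smart helper" could be trained, using scikit-learn and a made-up toy dataset (all feature values and labels here are illustrative, not from a real dataset):

```python
# A hypothetical "smart helper": a decision tree trained on made-up toy data.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Each row is one toy described by [shape, size, colour];
# the label says which group the toy belongs to.
toys = [
    ["cube",   "small", "blue"],
    ["cube",   "small", "blue"],
    ["sphere", "large", "red"],
    ["sphere", "small", "red"],
]
labels = ["blue cube", "blue cube", "red sphere", "red sphere"]

# Encode the categorical features as numbers so the tree can split on them.
encoder = OrdinalEncoder()
X = encoder.fit_transform(toys)

helper = DecisionTreeClassifier().fit(X, labels)

# The trained helper can now sort a toy it has never seen before.
new_toy = encoder.transform([["cube", "large", "blue"]])
print(helper.predict(new_toy))  # -> ['blue cube']
```

In practice you would use many more examples, but the flow is the same: observe labelled samples, learn the pattern, then categorise unseen data.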
Let's now establish a formal definition of a dataset. A dataset is a structured collection of data specially designed for machine learning and statistical analysis. It contains multiple data points or samples, with each sample representing a unique observation or instance. These data points are associated with specific features or attributes that provide important information about each observation.
The following aspects of a dataset play a significant role in determining the quality and performance of a machine learning model (a short code sketch after the list shows how a couple of these checks look in practice).
- Data Quality: High-quality data leads to more accurate models.
  - Data from reliable and trustworthy sources.
  - Accurate and error-free data.
  - Complete, with no missing values.
  - Consistent data formats and conventions.
- Data Diversity: Diverse data covering the various aspects and scenarios relevant to the problem domain improves model generalisation.
  - Example: a dataset of animal and bird images where each class includes different breeds from different regions, with images taken indoors, outdoors, in different seasons, and at different times of the day.
- Data Balance: Balanced data prevents bias towards dominant classes.
  - Example: a dataset of credit card transactions with only 100 fraudulent transactions (class 1) and 9,900 non-fraudulent transactions (class 0) is heavily skewed towards class 0.
- Data Relevance: Relevant features improve model performance.
  - Example: a dataset for predicting customer churn in a telecom company should exclude irrelevant data such as customer preferences for food or entertainment.
- Outliers: Data points that deviate significantly from the general trend.
  - Example: identifying and handling houses in a housing price dataset that are priced unusually high or low compared to similar houses.
- Missing Data: Handling missing data ensures complete analysis.
  - Example: a customer feedback dataset with missing age and feedback score values.
- Noise in Data: Minimising data noise improves model accuracy.
  - Example: in a dataset for recognising handwritten digits, removing low-quality or distorted images improves model performance.
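Here is a minimal sketch (using pandas, with a small made-up transactions table) of how the missing-data and data-balance checks above could look in code:

```python
import pandas as pd

# Hypothetical credit card transactions; 'is_fraud' is the class label.
df = pd.DataFrame({
    "amount":   [120.0, 15.5, None, 8800.0, 42.0],
    "country":  ["IN", "IN", "US", None, "IN"],
    "is_fraud": [0, 0, 0, 1, 0],
})

# Missing data: count the missing values in each feature.
print(df.isna().sum())

# Data balance: the label distribution shows how skewed the classes are.
print(df["is_fraud"].value_counts(normalize=True))
```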
Different types of data used in machine learning
- Numerical (Continuous): Represents continuous numerical data with infinite possible values.
  - Example: Age, Temperature
- Numerical (Discrete): Represents numerical data with finite or countable values.
  - Example: Number of students in a class
- Categorical (Nominal): Represents data with distinct categories and no inherent order.
  - Example: Colours (Red/Blue/Green)
- Categorical (Ordinal): Represents data with distinct categories that have a meaningful order.
  - Example: T-shirt size (S, M, L, XL)
- Text (Textual Data): Represents unstructured text data, such as sentences.
  - Example: Tweets, Email content
- Image (Image Data): Represents image data using pixel values.
  - Example: Photographs, Satellite Images, Digital Art
- Audio (Audio Data): Represents audio data, such as sound waves.
  - Example: Speech, Music, Sound Effects
- Time Series: Represents data collected over time with a temporal order.
  - Example: Stock Prices, Temperature over a day, Website Traffic
- Boolean: Represents data with two possible values (True or False).
  - Example: Whether an email is spam (True/False)
- Geographic Coordinates: Represents location data using latitude and longitude values.
  - Example: GPS Coordinates of cities, Geotagged images
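As a rough illustration, here is how several of these data types could map onto column types in a pandas dataframe (all column names and values below are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "temperature":  [36.6, 37.2, 38.1],                   # numerical (continuous)
    "num_students": [30, 25, 28],                         # numerical (discrete)
    "colour": pd.Categorical(["Red", "Blue", "Green"]),   # categorical (nominal)
    "tshirt_size": pd.Categorical(
        ["S", "M", "XL"],
        categories=["S", "M", "L", "XL"],
        ordered=True,                                     # categorical (ordinal)
    ),
    "review": ["great", "ok", "bad"],                     # text
    "is_spam": [True, False, True],                       # boolean
    "recorded_at": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-03"]        # time series timestamps
    ),
})

print(df.dtypes)  # one dtype per column
```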
Basic characteristics of a dataset
- Number of Instances (Rows): Refers to the total number of data points or observations in the dataset. Each row typically represents a single instance or sample.
- Number of Features (Columns): Indicates the number of variables or attributes present in the dataset. Each column represents a specific feature of the data.
- Data Types: Describes the types of data stored in each column, such as numeric (integer, float), categorical (e.g., strings or labels), date/time, or boolean (True/False).
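These characteristics are usually the first things to inspect. A minimal sketch (pandas, with a hypothetical file name) looks like this:

```python
import pandas as pd

# 'customers.csv' is a hypothetical file name; use your own dataset here.
df = pd.read_csv("customers.csv")

print(df.shape)   # (number of instances, number of features)
print(df.dtypes)  # data type stored in each column
```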
Real-world sources of data
- Databases:
  - Relational databases (MySQL, PostgreSQL)
  - NoSQL databases (MongoDB, Cassandra)
  - Example: Structured data for e-commerce transactions; user profiles and preferences in a social media platform.
- Logs and Event Streams:
  - Log collection and aggregation tools
  - Event processing frameworks
  - Example: Capturing user interactions and server activities in a web application.
- APIs and Web Services:
  - RESTful API integration libraries
  - Example: Collecting social media data (e.g., tweets) for sentiment analysis (see the sketch after this list).
- Sensor Data:
  - IoT data platforms
  - Sensor data collection tools
  - Example: Gathering temperature and humidity data for smart home automation; monitoring machine data in industrial IoT applications.
- Text and Documents:
  - Text processing libraries
  - Natural Language Processing (NLP) tools
  - Example: Analysing customer feedback from emails to improve product features; sentiment analysis of social media posts for brand reputation.
- Images and Videos:
  - Computer vision libraries
  - Deep learning frameworks
  - Example: Object detection in surveillance camera feeds for security applications; facial recognition in video streams for access control.
- Open Data Repositories:
  - Data repository websites (e.g., Kaggle, UCI ML Repository)
  - Example: Using publicly available datasets for research on healthcare prediction; prototyping machine learning models with datasets from government portals.
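As a simple illustration of the API source, here is a minimal sketch using the requests library; the endpoint, parameters, and response shape are all hypothetical:

```python
import requests

# Hypothetical REST endpoint; replace with a real service and its auth scheme.
response = requests.get(
    "https://api.example.com/v1/tweets",
    params={"query": "machine learning", "limit": 100},
    timeout=10,
)
response.raise_for_status()  # fail loudly on HTTP errors

records = response.json()   # assumes the service returns a JSON list of records
print(f"collected {len(records)} records")
```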
How data is collected from different sources
- Data Collection Services: Custom data collection services may be built to gather data from various sources. These services can use APIs, web scraping, or direct database connections.
- Message Brokers: For real-time data streams, message brokers like Apache Kafka or RabbitMQ are used to ingest and manage the flow of data.
- Data Pipelines: Data pipelines are designed to move and process data from source to storage, ensuring data integrity and reliability.
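For the real-time path, a minimal sketch (using the kafka-python client; the broker address, topic name, and event fields are all made up) could look like this:

```python
import json
from kafka import KafkaProducer

# Hypothetical local broker; in production this would be a cluster address.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A made-up user interaction event pushed onto a made-up topic.
event = {"user_id": 42, "action": "click", "page": "/pricing"}
producer.send("user-events", event)
producer.flush()  # block until the event is actually delivered
```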
How data is stored for future processing
- Distributed File Systems: For big data, distributed file systems like the Hadoop Distributed File System (HDFS) are used to store large datasets.
- Data Warehouses: Structured data can be stored in data warehouses like Amazon Redshift or Google BigQuery for easy analytics and reporting.
- Object Storage: Unstructured data like images or documents can be stored in object storage services like Amazon S3 or Google Cloud Storage.
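For instance, a raw data file could be pushed to object storage with a minimal boto3 sketch like the one below (the bucket and file names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file into an S3 bucket for later processing.
s3.upload_file(
    Filename="toys_dataset.csv",   # hypothetical local file
    Bucket="my-ml-datasets",       # hypothetical bucket name
    Key="raw/toys_dataset.csv",    # object key inside the bucket
)
```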
Different data storage formats for a dataset
The choice of data storage format depends on factors such as data size, complexity, ease of access, and compatibility with the machine learning framework being used. Proper data storage is essential for efficient data retrieval and processing during machine learning model training and inference. Common formats include the following; a short sketch after the list shows a few of them in action.
- CSV (Comma-Separated Values)
- JSON (JavaScript Object Notation)
- TFRecords (TensorFlow Records)
- Parquet
- HDF5 (Hierarchical Data Format)
- Databases
- Cloud Object Storage
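Here is a minimal sketch (pandas, illustrative file names) of saving the same dataframe in a few of these formats:

```python
import pandas as pd

df = pd.DataFrame({"age": [21, 34, 28], "city": ["Pune", "Delhi", "Goa"]})

df.to_csv("dataset.csv", index=False)          # CSV: simple, human-readable
df.to_json("dataset.json", orient="records")   # JSON: semi-structured records
df.to_parquet("dataset.parquet")               # Parquet: columnar; needs pyarrow
```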
Example of a dataset
Now is the right time to take a closer look at an example dataset.
A few observations from a first-level analysis of this dataset, based on the understanding we have built in this post so far (a loading sketch follows the list):
- This is a time series dataset: a collection of observations obtained through repeated measurements over a period of time.
- We have data from 2010 to 2018.
- We have 8 features (columns) in this dataset: Date, Open, High, Low, Last, Close, Total Trade Quantity, Turnover (Lacs).
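A first look at such a dataset could start with a minimal sketch like this (the file name is hypothetical; the column names follow the list above):

```python
import pandas as pd

# Hypothetical file containing the stock price dataset described above.
df = pd.read_csv("stock_prices.csv", parse_dates=["Date"])

print(df.shape)                             # instances x features
print(df.columns.tolist())                  # Date, Open, High, Low, Last, ...
print(df["Date"].min(), df["Date"].max())   # should span 2010 to 2018
print(df.head())                            # first few observations
```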
Finally, here are a few common machine learning tasks, each driven by a suitable dataset:
- Time Series Forecasting - Example: Stock Price Prediction
- Classification - Example: Disease Diagnosis
- Clustering - Example: Customer Segmentation
- Anomaly Detection - Example: Intrusion Detection
- Recommender Systems - Example: Product Recommendations
- Text Classification - Example: Sentiment Analysis
- Image Classification - Example: Medical Image Analysis