Complete End-to-End Data Science Pipeline – Part 1


In this blog we will break down the entire lifecycle of a data science project, from understanding the problem and collecting the data to finally deploying the model on edge devices.

1️⃣ Problem Statement

This is the foundation of any data science or machine learning project. The problem statement covers items such as the project goals, obstacles, and expected results.

Key Aspects of a Problem Statement:
✔️ Objective – What are we trying to achieve? (e.g., Predict customer churn, detect fraud, optimize supply chain)
✔️ Challenges – What challenges exist? (e.g., Imbalanced data, missing values, noisy data)
✔️ Impact – Why does solving this matter? (e.g., Reducing costs, improving efficiency, enhancing decision-making)

2️⃣ Domain Knowledge

Domain knowledge is an important skill in Data Science. It helps you understand the problem statement, select relevant features, and interpret results effectively.

For example, suppose you are a Data Scientist at a medical company. You must understand medical terminology, disease patterns, and diagnostic procedures to build predictive models for disease detection.

In short, whichever domain you are working in, you must have knowledge of that domain.

3️⃣ Data Management & Preparation

3.1 Data Collection

Data collection is an important stage in Data Science. These are the main ways in which you can collect data:

Publicly available Sources

There are many publicly available sources from which you can download datasets for free. Some of the most popular are:

  • Google Dataset Search (https://datasetsearch.research.google.com/)
  • Kaggle
  • Data.gov

If you want to know about all 50+ data sources from which you can download datasets for free, check out our blog on that topic.

API

You can also download datasets for free from APIs. However, each API comes with its own restrictions. Some of the most popular APIs are:

  • Yahoo Finance (for downloading stock market data)
  • Zillow (for downloading real estate data)
  • OpenWeatherMap (for downloading weather-related information)

If you want to know how to download or retrieve data from these APIs, you can check out this blog, which contains the Python code.
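As a quick illustration, here is a minimal sketch of pulling stock market data with the yfinance Python package (this library and the ticker/date values are my own illustrative choices, not necessarily what the linked blog uses):

```python
# A minimal sketch, assuming the yfinance package (pip install yfinance) is installed.
import yfinance as yf

# Download daily OHLCV data for a single ticker over a fixed date range.
# "AAPL" and the dates are placeholder values for illustration.
prices = yf.download("AAPL", start="2023-01-01", end="2023-12-31")

print(prices.head())       # first few rows of the price history
prices.to_csv("aapl.csv")  # persist the raw data for later stages
```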

Social media

You can collect data from social media platforms like Twitter, Reddit, Facebook, Instagram, and Quora by performing web scraping.

Surveys or feedback forms

Conducting surveys or collecting feedback is another way of gathering data. However, it is a time-consuming process and requires financial resources.

Web Scraping

You can also collect data through web scraping, and there are many tools you can use for it. One of the most useful libraries is FireCrawl, with which you can scrape data in minutes with less than 20 lines of code. I have written a blog on the FireCrawl library, which you can check out here.
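For a sense of what a basic scraper looks like without any dedicated service, here is a minimal sketch using requests and BeautifulSoup (not FireCrawl; the URL and the elements scraped are placeholders):

```python
# A minimal scraping sketch using requests and BeautifulSoup;
# the URL below is a placeholder and any real site's terms of service apply.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"      # hypothetical page to scrape
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <h2> heading on the page as a simple example.
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headings)
```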

3.2 Data Loading (ETL | ELT)

There are mainly three destinations into which you can load the dataset:

  • Data Warehouse
  • Data Lake
  • Data Lakehouse
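As a rough illustration of the ETL idea, here is a minimal sketch that extracts a CSV file, applies a small transformation, and loads the result into a local SQLite database standing in for a warehouse (all file and table names are hypothetical):

```python
# A minimal ETL sketch: extract a CSV, apply a small transformation, and load
# the result into a local SQLite database. File and table names are hypothetical.
import sqlite3
import pandas as pd

# Extract: read the raw data.
df = pd.read_csv("raw_sales.csv")

# Transform: basic cleanup before loading.
df.columns = [c.strip().lower() for c in df.columns]
df = df.drop_duplicates()

# Load: write into a warehouse-like store (SQLite here for simplicity).
conn = sqlite3.connect("warehouse.db")
df.to_sql("sales", conn, if_exists="replace", index=False)
conn.close()
```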

3.3 Data Quality Checking & Cleaning

The next step is to check the quality of the data and then perform data preprocessing steps to clean it. There are mainly three types of data:

  • Structured data (tabular format)
  • Unstructured data (text, images, audio, video)
  • Semi-Structured data

Steps to perform tabular data cleaning

  • Check the data manually.
  • Check for structural errors like spelling mistakes or negative values inside a column.
  • Check for incorrect data types.
  • Check for the presence of missing values in each column.
  • Check for duplicate values.
  • Check for outliers in the numerical columns.
  • Check for the data imbalance problem in the target column (applicable only if it is a classification problem).
  • Check for skewness in the numerical columns.
  • Check for the multicollinearity problem in the dataset.
  • Check for categorical columns which have low feature cardinality.
  • Check for timestamp columns.
  • Check for JSON-type columns and process them.

You can watch the above video to learn about complete data preprocessing in Python. If you want to perform automated data cleaning, you can download this Python library; with its help you can do data cleaning in just 3 lines of code.
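Here is a minimal pandas sketch covering a few of the checks listed above (the dataset and the column names income and churn are hypothetical):

```python
# A minimal sketch of a few tabular data-quality checks using pandas;
# the file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")

print(df.dtypes)              # incorrect data types
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # duplicate rows

# Simple IQR-based outlier check on a numerical column.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers in 'income'")

# Class balance in the target column (classification problems only).
print(df["churn"].value_counts(normalize=True))

# Basic fixes: drop duplicates, fill numeric gaps with the median.
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())
```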

Steps to perform text data cleaning

  • Convert your text into lowercase
  • Perform word tokenization
  • Replace contractions or slang words with their full forms
  • Hashtag & Mention Handling
  • Emoji to Text Conversion
  • Correcting Elongated Words
  • Remove URLs, HTML tags, punctuation, and extra whitespace, and remove stopwords
  • Disfluency Removal (Fillers, Repetitions, etc.)
  • Perform lemmatization to reduce words to their root forms
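A minimal sketch of some of these text-cleaning steps using regex and NLTK (the sample sentence is made up, and the NLTK resources are downloaded quietly on first run):

```python
# A minimal text-cleaning sketch using regex and NLTK.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def clean_text(text):
    text = text.lower()                        # lowercase
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"<.*?>", " ", text)         # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)      # remove punctuation and digits
    tokens = text.split()                      # simple word tokenization
    stops = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    # Drop stopwords and reduce the remaining words to their root forms.
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stops]

print(clean_text("Loving the new phone!! Details at https://example.com <br>"))
```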

Steps to perform image data Preprocessing

  • The first step is to download and store the images in a structured directory.
  • The second step is to read the images and handle any corrupt images.
  • The third step is to resize all images to a fixed resolution (e.g., 224×224) and perform standardization.
  • Convert all images to a uniform format and make sure they are either RGB or grayscale images.
  • The next step is to remove noise, detect blurry images, and fix them.
  • The next step is to check for duplicate images.
  • Perform data augmentation if needed in the case of a small image dataset. You can use Albumentations, a Python library, to perform image data augmentation.
  • Save the preprocessed images.
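A minimal sketch of these image steps using Pillow and NumPy (the directory layout and file extension are assumptions):

```python
# A minimal image-preprocessing sketch using Pillow and NumPy;
# the directory layout is hypothetical.
from pathlib import Path

import numpy as np
from PIL import Image, UnidentifiedImageError

processed = []
for path in Path("images/raw").glob("*.jpg"):
    try:
        img = Image.open(path).convert("RGB")  # uniform RGB format
    except UnidentifiedImageError:
        print(f"Skipping corrupt image: {path}")
        continue

    img = img.resize((224, 224))                     # fixed resolution
    arr = np.asarray(img, dtype=np.float32) / 255.0  # scale pixels to [0, 1]
    processed.append(arr)

    # Save the preprocessed image alongside the in-memory array.
    out = Path("images/processed") / path.name
    out.parent.mkdir(parents=True, exist_ok=True)
    img.save(out)

print(f"Processed {len(processed)} images")
```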

Steps to perform Audio data Preprocessing

  • The first step is to collect the raw audio dataset and load it using libraries like librosa, torchaudio, and scipy. Make sure the audio files are in MP3, WAV, or FLAC format.
  • The second step is to remove background noise from the audio using spectral gating or band-pass filtering.
  • The third step is to check the sampling rates of the audio. If they vary, resample them to a common rate.
  • The fourth step is to remove silent sections from the audio, which improves model training.
  • The fifth step is to perform audio normalization in order to avoid variation in volume.
  • The sixth step is to apply speech enhancement techniques like spectral subtraction or Wiener filtering in order to enhance the audio quality.
  • The seventh step is to extract numerical features from the audio using MFCCs, spectrograms, and chroma-related features.
  • The eighth step is to apply audio data augmentation techniques like time stretching and pitch shifting.
  • The last step is to save the processed audio.
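A minimal sketch of a few of these audio steps using librosa (the file name, target sampling rate, and augmentation parameters are placeholder values):

```python
# A minimal audio-preprocessing sketch using librosa.
import librosa

# Load the audio and resample everything to a common rate (16 kHz here).
y, sr = librosa.load("sample.wav", sr=16000)

# Trim leading/trailing silence.
y, _ = librosa.effects.trim(y, top_db=20)

# Peak-normalize the waveform to reduce volume variation.
y = librosa.util.normalize(y)

# Extract MFCC features for modeling.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)

# A simple augmentation: pitch shifting by two semitones.
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
```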

3.4 Data Version Control

Data version control helps us track different versions of the data by assigning unique hashes or timestamps. At any point in time you can restore a previous version of your data.

It also maintains logs of any changes happening inside the data, such as the data source, preprocessing steps, and feature modifications. There are various tools with which you can implement data version control, such as DVC and Git LFS.
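As a small illustration, here is a sketch using DVC's Python API to read back a specific version of a tracked file (it assumes the file was already tracked with `dvc add` and that a tag named v1.0 exists; both names are hypothetical):

```python
# A minimal sketch with DVC's Python API; the file path and tag are hypothetical
# and assume the file is already tracked in a DVC-enabled Git repository.
import dvc.api

# Read the version of the dataset that was committed under the tag "v1.0".
csv_text = dvc.api.read("data/raw.csv", repo=".", rev="v1.0")
print(csv_text[:200])
```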

4️⃣ Exploratory Data Analysis and Feature Engineering

4.1 Feature Store

A feature store is an important component which helps us efficiently manage, store, and serve features for machine learning models. Some popular feature stores are:

  • Feast (Google)
  • Vertex AI Feature Store
  • Databricks Feature Store
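For a flavor of how a feature store is used, here is a minimal sketch with Feast, assuming a feature repository already defines a feature view called driver_hourly_stats (all feature and entity names are hypothetical):

```python
# A minimal Feast sketch; it assumes a Feast feature repository exists in the
# current directory and defines the hypothetical feature view "driver_hourly_stats".
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Fetch the latest feature values for one entity from the online store.
features = store.get_online_features(
    features=["driver_hourly_stats:conv_rate"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

print(features)
```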

4.2 Exploratory Data Analysis (EDA)

Exploratory data analysis is a technique which helps find meaningful insights from the data with the help of graphical and statistical analysis methods.

Graphical analysis involves making various types of charts, such as histograms, bar charts, and pie charts, to find crucial insights from the data.

Statistical analysis involves conducting various descriptive and inferential statistical tests, like the t-test, chi-squared test, and analysis of variance, in order to draw meaningful conclusions from the data.

In the below video you will learn how to perform descriptive statistics in Python.

In the below video you will learn how to perform inferential statistics in Python.
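To make this concrete, here is a minimal sketch combining one graphical check and one inferential test (the dataset and the income/churn columns are hypothetical):

```python
# A minimal EDA sketch combining a graphical and a statistical check;
# the dataset and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("customers.csv")

# Descriptive statistics for every numerical column.
print(df.describe())

# Graphical analysis: distribution of a numerical feature.
df["income"].hist(bins=30)
plt.title("Income distribution")
plt.show()

# Inferential analysis: two-sample t-test comparing income across churn groups.
churned = df.loc[df["churn"] == 1, "income"].dropna()
retained = df.loc[df["churn"] == 0, "income"].dropna()
t_stat, p_value = stats.ttest_ind(churned, retained, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```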

4.3 Feature Engineering

There are two main types of feature engineering:

Feature selection

Suppose you are given a dataset with hundreds of columns and you want to find which features are the most important for predicting the target variable. This is called feature selection.

Feature extraction
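Feature extraction, in contrast, transforms the existing columns into a new, usually smaller set of features, for example with Principal Component Analysis (PCA). Here is a minimal sketch of both ideas using scikit-learn on a toy dataset:

```python
# A minimal sketch of feature selection and feature extraction with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Feature selection: keep the 10 columns most related to the target.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print("Selected shape:", X_selected.shape)

# Feature extraction: compress all columns into 5 new components with PCA.
pca = PCA(n_components=5)
X_extracted = pca.fit_transform(X)
print("Extracted shape:", X_extracted.shape)
```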

4.4 Train/Test Split & K-Fold Cross Validation

The next step is to split the dataset into training, testing, and validation sets. This is important because:

1️⃣ Training Set – It is used to train the model by learning patterns from the data.
2️⃣ Validation Set – It helps in tuning hyperparameters and preventing overfitting.
3️⃣ Testing Set – It evaluates the model’s performance on unseen data to ensure generalization.

Train/Test Split

K-Fold Cross Validation
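Here is a minimal sketch of a train/test split followed by 5-fold cross-validation with scikit-learn (the dataset and the 80/20 split ratio are illustrative choices):

```python
# A minimal sketch of a train/test split and 5-fold cross-validation with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data as an unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation on the training portion only.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print("CV accuracy per fold:", scores.round(3))

# Final check on the held-out test set.
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```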

5️⃣ Model Development & Training

5.1 Implementation of Machine Learning or Deep Learning Algorithms

The next step is to try out various machine learning or deep learning algorithms to find the best model for your task. The algorithms are divided into four types:

  • Supervised learning algorithms (regression and classification)
  • Unsupervised learning algorithms (clustering and association rule algorithms)
  • Semi-supervised learning algorithms
  • Reinforcement learning
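As a small illustration of this step, here is a sketch that tries two supervised algorithms and compares them with cross-validation (in a real project you would compare many more candidates and metrics):

```python
# A minimal sketch of comparing two supervised algorithms with cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Score each candidate with 5-fold cross-validation and report the mean accuracy.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```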

5.2 Hyperparameter Tuning

There are several techniques through which you can perform hyperparameter tuning.
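Common options include grid search, random search, and Bayesian optimization (my own examples, since the techniques are not listed here). A minimal grid-search sketch with scikit-learn, where the parameter grid is purely illustrative:

```python
# A minimal grid-search sketch with scikit-learn; the parameter grid is illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
}

# Exhaustively try every parameter combination with 5-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="f1",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV F1 score:", round(search.best_score_, 3))
```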

5.3 Evaluating the Model on Test Set

The next step is to evaluate the model's performance on the test set. These are the metrics we use to evaluate the performance of a model:

  • Regression metrics (Root Mean Squared Error, Mean Absolute Error, R-squared)
  • Classification metrics (Precision, Recall, F1 score, Accuracy)
  • Clustering metrics (Silhouette score, Davies-Bouldin index)
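Here is a minimal sketch of computing a few of these metrics with scikit-learn (the arrays below stand in for real model predictions):

```python
# A minimal sketch of computing evaluation metrics with scikit-learn;
# the small arrays below stand in for real test labels and predictions.
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Classification example.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))

# Regression example (RMSE taken as the square root of the MSE).
y_true_reg = [3.0, 5.5, 7.2]
y_pred_reg = [2.8, 6.0, 7.0]
print("MAE :", mean_absolute_error(y_true_reg, y_pred_reg))
print("RMSE:", mean_squared_error(y_true_reg, y_pred_reg) ** 0.5)
```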
