
Predictive modelling is a technique for forecasting future outcomes from historical data. It involves building models that anticipate trends, behaviours, or events by analysing patterns in existing data. Imagine having a crystal ball that uses past information to give you insights about what might happen next. That’s essentially what predictive modelling does, but instead of magic, it relies on data and mathematics.
The first step in predictive modelling is to gather relevant data. This data could come from various sources such as sales records, customer feedback, or sensor readings; the more accurate and comprehensive the data, the better the predictions will be. Before building a predictive model, it’s also important to understand the data. Exploratory data analysis (EDA) involves summarising its main characteristics, often using visual methods like charts and graphs, which helps in identifying patterns, trends, and anomalies.
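As a minimal sketch of that exploratory step, the snippet below runs pandas (an assumed dependency, as is matplotlib for the plot) over a small made-up sales table; a real project would load its own data and explore it far more thoroughly.

```python
import pandas as pd

# Hypothetical monthly sales figures; a real project would load its own data,
# e.g. with pd.read_csv("sales.csv").
df = pd.DataFrame({
    "month": range(1, 13),
    "sales": [120, 135, 150, 160, 155, 170, 180, 175, 190, 200, 210, 230],
})

# Summary statistics: count, mean, spread, quartiles.
print(df["sales"].describe())

# A quick line chart to spot trends and anomalies (requires matplotlib).
df.plot(x="month", y="sales", kind="line", title="Monthly sales")
```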
Once the data has been gathered and explored, the next step is to choose a modelling technique suited to the problem. The main families are listed below, with short code sketches for each family after the list.

- Regression Analysis
  - Linear Regression: Models the relationship between a dependent variable and one or more independent variables using a straight line.
  - Logistic Regression: Used for binary classification problems, modelling the probability of a binary outcome.
  - Polynomial Regression: Extends linear regression by fitting a polynomial equation to the data.
  - Ridge and Lasso Regression: Regularisation techniques that prevent overfitting by adding penalty terms to the regression model.
- Classification
  - Decision Trees: A tree-like model that makes predictions based on the input features.
  - Random Forest: An ensemble of decision trees that improves predictive accuracy and controls overfitting.
  - Support Vector Machines (SVM): A classification method that finds the hyperplane which best separates the classes.
  - Naive Bayes: A probabilistic classifier based on Bayes’ theorem, assuming independence between predictors.
  - K-Nearest Neighbours (KNN): A non-parametric method that classifies a point by the classes of its nearest neighbours, measured by distance between data points.
- Time Series Forecasting
  - ARIMA (Autoregressive Integrated Moving Average): A statistical method for time series forecasting.
  - Exponential Smoothing (ETS): Models time series data for forecasting by accounting for trend and seasonality.
  - Prophet: An open-source time series forecasting tool developed by Facebook.
  - Long Short-Term Memory Networks (LSTM): A type of recurrent neural network (RNN) used to forecast time series with long-term dependencies.
- Clustering
  - K-Means Clustering: Partitions data into K distinct clusters based on distance to each cluster’s centroid.
  - Hierarchical Clustering: Builds a hierarchy of clusters through either agglomerative (bottom-up) or divisive (top-down) approaches.
  - DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups points that are closely packed together and marks isolated points as outliers.
- Ensemble Methods
  - Bagging (Bootstrap Aggregating): Combines multiple models to reduce variance and improve stability and accuracy; commonly used with decision trees.
  - Boosting: Combines weak learners into a strong learner by focusing on the errors of prior models; includes methods like AdaBoost, Gradient Boosting, and XGBoost.
  - Stacking: Combines the predictions of several models through a meta-model, often improving predictive performance.
- Survival Analysis
  - Cox Proportional Hazards Model: Evaluates the time until an event occurs and the effect of explanatory variables on that time.
  - Kaplan-Meier Estimator: A non-parametric statistic used to estimate the survival function from lifetime data.
  - Accelerated Failure Time (AFT) Model: Models the relationship between survival time and covariates, assuming that covariates accelerate or decelerate the lifetime of the subject.
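To make the regression family concrete, here is a minimal sketch using scikit-learn (an assumed dependency; any similar library would do). It fits an ordinary linear regression and a ridge regression to a tiny made-up dataset of advertising spend versus sales.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Made-up data: advertising spend (independent variable) vs. sales (dependent variable).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Ordinary least squares fits a straight line through the points.
linear = LinearRegression().fit(X, y)
print("slope:", linear.coef_[0], "intercept:", linear.intercept_)
print("prediction for spend = 6:", linear.predict([[6.0]])[0])

# Ridge adds an L2 penalty term that shrinks the coefficients to curb overfitting.
ridge = Ridge(alpha=1.0).fit(X, y)
print("ridge slope:", ridge.coef_[0])
```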
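For classification, the sketch below trains a random forest, one of the tree-based methods listed above, on scikit-learn’s bundled iris dataset (again an assumed dependency).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small, well-known flower dataset with three classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# An ensemble of decision trees; each tree is trained on a bootstrap sample of the data.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```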
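For time series forecasting, the sketch below fits an ARIMA model with statsmodels (assumed to be available) to a synthetic trending series and forecasts the next six points; the (1, 1, 1) order is an arbitrary illustrative choice, not a recommendation.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series with a gentle upward drift plus noise.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=60))

# Fit ARIMA(p=1, d=1, q=1) and forecast six steps ahead.
fitted = ARIMA(series, order=(1, 1, 1)).fit()
print(fitted.forecast(steps=6))
```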
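Clustering is unsupervised, so there are no labels to predict; the sketch below runs K-Means (via scikit-learn, assumed installed) on two artificial blobs of points and recovers their centres.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two artificial blobs of 2-D points, far enough apart to be clearly separable.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2)),
])

# Partition the points into K = 2 clusters by distance to the centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:\n", kmeans.cluster_centers_)
print("cluster sizes:", np.bincount(kmeans.labels_))
```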
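The ensemble ideas of bagging and boosting can be compared directly; the sketch below cross-validates both on scikit-learn’s bundled breast-cancer dataset (assumed available), using the default tree-based learners in each ensemble.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Bagging: many trees trained on bootstrap samples; predictions averaged to reduce variance.
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Boosting: trees built sequentially, each concentrating on the errors of the previous ones.
    "boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```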
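Finally, a survival-analysis sketch: the Kaplan-Meier estimator from the lifelines package (an assumed dependency) applied to made-up follow-up times, where a 0 marks a censored observation, meaning the event was not seen before follow-up ended.

```python
from lifelines import KaplanMeierFitter

# Made-up follow-up times in months and whether the event was observed (1) or censored (0).
durations = [5, 8, 12, 12, 15, 20, 22, 30, 34, 40]
observed = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Estimate the survival function: the probability of surviving past each time point.
kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.survival_function_)
```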