Academic · Complete · December 2023

Data Science Portfolio

A collection of machine learning projects implementing classification, clustering, and association rule mining algorithms on real-world datasets.

A collection of data science projects built for CSCI 5523 (Introduction to Data Mining), each carrying a dataset through the full machine learning pipeline from exploratory analysis to model evaluation.

Project 1: Classification & Analysis

Exploratory Data Analysis

  • Telecom customer churn dataset analysis
  • Feature distribution visualization
  • Correlation analysis and data quality assessment
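
A minimal sketch of the kind of checks listed above, assuming pandas and NumPy from the stack. The table is synthetic and the column names (tenure_months, monthly_charges, support_calls, churned) are hypothetical stand-ins, not the actual churn dataset schema.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a churn table; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, size=n),
    "monthly_charges": rng.normal(65.0, 20.0, size=n).round(2),
    "support_calls": rng.poisson(2, size=n),
})
df["churned"] = (df["support_calls"] + rng.normal(0, 1, size=n) > 3).astype(int)

# Blank out a few values so the data quality check has something to find.
df.loc[df.sample(10, random_state=0).index, "monthly_charges"] = np.nan

# Data quality: missing values per column.
missing_per_column = df.isna().sum()

# Feature distributions: count, mean, std, quartiles for each column.
summary = df.describe()

# Correlation of every column with the target, strongest first.
correlations = df.corr()["churned"].sort_values(ascending=False)

print(missing_per_column)
print(correlations)
```

The same three calls (isna, describe, corr) apply unchanged to a real table loaded with pd.read_csv.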

Decision Trees & kNN

  • Implementation of decision tree and k-nearest-neighbor (kNN) classifiers
  • Hyperparameter tuning and cross-validation
  • Performance comparison across algorithms
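
The tuning and comparison steps above could look roughly like the following with scikit-learn. This is a sketch under assumptions: the data comes from make_classification rather than the churn dataset, and the parameter grids are illustrative choices, not the ones used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for the real dataset).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)

# One shared set of stratified folds so both models are scored on the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 10]},
    ),
    # kNN is distance-based, so features are standardized inside the pipeline.
    "knn": (
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": [3, 5, 11, 21]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, which keeps test-fold statistics out of the cross-validated score.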

Naive Bayes Spam Classification

  • Text preprocessing and feature extraction
  • Probabilistic classification for spam detection
  • Precision/recall analysis
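
As a small illustration of the steps above, the sketch below uses a bag-of-words model with multinomial Naive Bayes. The six training messages are made up for the example, and CountVectorizer plus MultinomialNB is one common choice in scikit-learn, assumed here rather than taken from the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (made up for illustration). 1 = spam, 0 = ham.
train_texts = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free prize today",
    "meeting moved to tuesday morning",
    "lunch at noon tomorrow",
    "please review the attached report",
]
train_labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer lowercases and tokenizes; MultinomialNB applies Laplace smoothing (alpha=1).
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_texts, train_labels)

test_texts = ["free prize offer", "report for the tuesday meeting"]
test_labels = [1, 0]
predicted = model.predict(test_texts)

# Precision: share of predicted spam that is spam. Recall: share of spam that was caught.
precision = precision_score(test_labels, predicted, zero_division=0)
recall = recall_score(test_labels, predicted, zero_division=0)
print(predicted, precision, recall)
```

Words that never appeared in training (such as "for" here) are ignored at prediction time, and smoothing keeps a word seen in only one class from zeroing out the other class's probability.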

Multi-Dataset ML Analysis

  • Applied ML pipelines to iris, diabetes, and thyroid datasets
  • Comparative model performance evaluation
  • ROC curve analysis and model selection
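
A rough sketch of ROC-based model selection on the iris data bundled with scikit-learn. Two assumptions are made to keep it short: the three-class problem is reduced to virginica versus the rest so that a single ROC curve applies, and the two candidate models are illustrative picks.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)  # assumption: binarize as virginica (1) vs. rest (0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

roc_points = {}
auc_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, scores)
    roc_points[name] = (fpr, tpr)               # ready to plot with matplotlib
    auc_by_model[name] = roc_auc_score(y_test, scores)

# Select the model with the largest area under the ROC curve.
best_model = max(auc_by_model, key=auc_by_model.get)
print(auc_by_model, best_model)
```

Each (fpr, tpr) pair can be passed to matplotlib's plot to draw the curves on shared axes.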

Project 2: Advanced Analytics

Apriori Algorithm

  • Market basket analysis implementation
  • Association rule mining with support/confidence metrics
  • Frequent itemset discovery
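
For reference, a compact pure-Python sketch of Apriori with support and confidence. The function names and the five example baskets are hypothetical and chosen for illustration; this is not the project's implementation.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join: merge pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        scored = {c: support(c) for c in candidates}
        current = {c: s for c, s in scored.items() if s >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # confidence(A -> B) = support(A and B) / support(A)
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, confidence))
    return rules


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.8)
for antecedent, consequent, sup, conf in rules:
    print(set(antecedent), "->", set(consequent), sup, conf)
```

On these baskets, four single items and four pairs reach 60% support, and the only rule at 80% confidence or above is beer -> diapers. The prune step relies on the fact that no itemset can be frequent unless all of its subsets are.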

Instacart Transaction Analysis

  • Large-scale retail transaction data
  • Customer purchase pattern identification
  • Product association recommendations
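
One way to get from a transaction table to product associations is sketched below, assuming a long-format table with one row per (order, product). The column names and the tiny table are hypothetical, and lift is used as the ranking score as an illustrative choice.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical long-format order table: one row per product in an order.
order_products = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2, 3, 3, 4],
    "product_name": [
        "bananas", "yogurt",
        "bananas", "yogurt", "granola",
        "bananas", "granola",
        "yogurt",
    ],
})

# Collapse rows into one basket (set of products) per order.
baskets = order_products.groupby("order_id")["product_name"].apply(set)
n_orders = len(baskets)

# Count how often each product and each unordered product pair appears.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def lift(a, b):
    """lift(a, b) = P(a and b) / (P(a) * P(b)); values above 1 mean positive association."""
    pair = tuple(sorted((a, b)))
    p_ab = pair_counts[pair] / n_orders
    return p_ab / ((item_counts[a] / n_orders) * (item_counts[b] / n_orders))


def recommend(product, top_n=3):
    """Rank products that co-occur with `product` by lift, highest first."""
    partners = [
        (other, lift(product, other))
        for other in item_counts
        if other != product and pair_counts[tuple(sorted((product, other)))] > 0
    ]
    return sorted(partners, key=lambda pair: pair[1], reverse=True)[:top_n]


print(recommend("granola"))
```

On a dataset with millions of orders, exhaustive pair counting like this becomes expensive, which is where support thresholds and Apriori-style pruning help.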

Cluster Analysis

  • K-means and hierarchical clustering
  • Cluster quality evaluation (silhouette scores)
  • Dendrogram visualization
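
The three items above can be sketched together as follows. The data is three synthetic blobs with hand-picked centers (an assumption for the example), and SciPy's hierarchy module is used for the dendrogram; SciPy is installed alongside scikit-learn but is not named in the stack.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (assumption, for illustration only).
X, _ = make_blobs(
    n_samples=150,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.8,
    random_state=0,
)

# K-means: score each candidate k by mean silhouette (closer to 1 is better).
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, labels)
best_k = max(silhouette_by_k, key=silhouette_by_k.get)

# Hierarchical (agglomerative, Ward linkage) clustering at the same k for comparison.
hier_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X)
hier_silhouette = silhouette_score(X, hier_labels)

# Linkage matrix for the dendrogram; no_plot=True returns the layout without drawing.
Z = linkage(X, method="ward")
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=12)

print(best_k, round(silhouette_by_k[best_k], 3), round(hier_silhouette, 3))
```

Dropping no_plot=True draws the dendrogram on the current matplotlib axes.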

COVID-19 Literature Clustering

  • CORD-19 research paper analysis
  • Text embedding and similarity measures
  • Research topic discovery through unsupervised learning

Techniques

Data preprocessing (missing-value handling, normalization, encoding), supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), association mining (Apriori, frequent-pattern discovery), and visualization.
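
The preprocessing steps named above (missing-value handling, normalization, encoding) can be combined into a single scikit-learn transformer. The sketch below uses a made-up four-row table with hypothetical column names, and the imputation strategies are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up table with gaps in both numeric and categorical columns.
raw = pd.DataFrame({
    "age": [34, np.nan, 51, 29],
    "income": [52000.0, 61000.0, np.nan, 48000.0],
    "plan": ["basic", "premium", "basic", np.nan],
})

# Numeric columns: fill gaps with the median, then standardize to zero mean, unit variance.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [("num", numeric, ["age", "income"]), ("cat", categorical, ["plan"])],
    sparse_threshold=0.0,  # always return a dense array
)

features = preprocess.fit_transform(raw)
print(features.shape)
```

The fitted preprocess object can be placed in front of any classifier in a Pipeline so the same imputation and scaling are applied at prediction time.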

Stack

Built in Python with Jupyter notebooks, pandas and NumPy for data manipulation, scikit-learn for the ML algorithms, and matplotlib and seaborn for visualization.