A collection of data science projects built for CSCI 5523 (Introduction to Data Mining), each carrying a dataset through the full machine learning pipeline from exploratory analysis to model evaluation.
Project 1: Classification & Analysis
Exploratory Data Analysis
- Telecom customer churn dataset analysis
- Feature distribution visualization
- Correlation analysis and data quality assessment
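
The exploratory steps above might look roughly like the following sketch. The toy frame and its column names (`tenure_months`, `monthly_charges`, `churned`) are hypothetical stand-ins, not the actual churn dataset schema.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the telecom churn dataset;
# the column names are assumptions made for illustration only.
df = pd.DataFrame({
    "tenure_months": [1, 24, 36, 5, 60, 12],
    "monthly_charges": [70.0, 55.5, np.nan, 80.0, 45.0, 65.0],
    "churned": [1, 0, 0, 1, 0, 1],
})

# Data quality: count missing values per column.
missing_per_column = df.isna().sum()

# Feature distributions: summary statistics for each numeric column.
summary = df.describe()

# Correlation of each feature with the churn label (Pearson by default).
correlations = df.corr()["churned"].drop("churned")

# Overall churn rate as a baseline for later models.
churn_rate = df["churned"].mean()
```

On real data, `df.hist()` or seaborn's `histplot` would typically follow to visualize the distributions summarized here.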
Decision Trees & kNN
- Decision tree and k-nearest neighbors (kNN) classifier implementations
- Hyperparameter tuning and cross-validation
- Performance comparison across algorithms
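
As a minimal sketch of tuning and comparing the two classifiers with cross-validation, assuming scikit-learn and using its bundled iris data as a stand-in (the parameter grids are illustrative, not the ones used in the project):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; the project's own data would be loaded here instead.
X, y = load_iris(return_X_y=True)

# Decision tree: tune the maximum depth with 5-fold cross-validation.
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5]},
    cv=5,
)

# kNN: scale features first (distances are scale-sensitive), then tune k.
knn_search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": [1, 5, 9]},
    cv=5,
)

tree_search.fit(X, y)
knn_search.fit(X, y)

# Mean cross-validated accuracy of the best configuration per algorithm.
results = {
    "decision_tree": tree_search.best_score_,
    "knn": knn_search.best_score_,
}
```

Putting the scaler inside the pipeline keeps each fold's scaling statistics limited to its own training split, which avoids leaking information from the validation split.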
Naive Bayes Spam Classification
- Text preprocessing and feature extraction
- Probabilistic classification for spam detection
- Precision/recall analysis
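
A small sketch of the spam pipeline, assuming a bag-of-words representation with scikit-learn's `CountVectorizer` and `MultinomialNB`. The messages are invented for illustration, and the project may use different features or a different Naive Bayes variant.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy messages; 1 = spam, 0 = ham.
train_texts = [
    "win free prize now",
    "free cash offer click now",
    "claim your free prize",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
    "project report due at noon",
]
train_labels = [1, 1, 1, 0, 0, 0]

test_texts = ["free prize offer", "team meeting tomorrow"]
test_labels = [1, 0]

# CountVectorizer lowercases and tokenizes; MultinomialNB applies
# Laplace smoothing (alpha=1.0 by default) to the word counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)

# Precision: share of predicted spam that is spam.
# Recall: share of actual spam that was caught.
precision = precision_score(test_labels, predictions)
recall = recall_score(test_labels, predictions)
```

For spam filtering, precision usually matters more than recall, since a legitimate message marked as spam costs more than a spam message reaching the inbox.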
Multi-Dataset ML Analysis
- ML pipelines applied to the iris, diabetes, and thyroid datasets
- Comparative model performance evaluation
- ROC curve analysis and model selection
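
ROC-based model selection can be sketched as follows. This assumes a binary task, so the example binarizes iris into "virginica vs. rest" purely for illustration; the two models compared are also assumptions rather than the project's actual candidates.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
y_binary = (y == 2).astype(int)  # illustrative binarization: virginica vs. rest

X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.3, stratify=y_binary, random_state=0
)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=2, random_state=0),
}

auc_by_model = {}
roc_points = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, scores)     # points for plotting the curve
    roc_points[name] = (fpr, tpr)
    auc_by_model[name] = roc_auc_score(y_test, scores)

# Select the model with the largest area under the ROC curve.
best_model = max(auc_by_model, key=auc_by_model.get)
```

The `(fpr, tpr)` pairs in `roc_points` can be passed directly to `matplotlib.pyplot.plot` to draw the curves side by side.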
Project 2: Advanced Analytics
Apriori Algorithm
- Market basket analysis implementation
- Association rule mining with support/confidence metrics
- Frequent itemset discovery
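
The level-wise idea behind Apriori can be sketched in plain Python. This is a compact illustrative version, not the project's implementation; the function names `apriori` and `association_rules` and the toy baskets are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count support for the current candidates and keep the frequent ones.
        level = {}
        for itemset in candidates:
            support = sum(itemset <= t for t in transactions) / n
            if support >= min_support:
                level[itemset] = support
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Yield (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # confidence(A -> B) = support(A ∪ B) / support(A)
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, support, confidence))
    return rules


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.9)
```

With these five baskets and a 0.6 support threshold, four single items and four pairs are frequent, and the only rule reaching 0.9 confidence is `{beer} -> {diapers}`.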
Instacart Transaction Analysis
- Analysis of large-scale retail transaction data
- Customer purchase pattern identification
- Product association recommendations
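
One way to turn transaction rows into product recommendations is to group rows into baskets and rank co-purchased products by lift. The sketch below uses an invented toy table; the column names `order_id` and `product_name` and the helper names `lift` and `recommend` are assumptions, not the project's actual schema or code.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Invented toy rows: one row per (order, product), as in a line-item table.
order_products = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2, 3, 3, 4],
    "product_name": [
        "banana", "yogurt",
        "banana", "yogurt", "granola",
        "banana", "granola",
        "yogurt",
    ],
})

# Collapse rows into one set of products per order.
baskets = order_products.groupby("order_id")["product_name"].apply(set)
n_orders = len(baskets)

# Count how many orders contain each product and each product pair.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def lift(a, b):
    """lift(a, b) = P(a and b) / (P(a) * P(b)); above 1 means positive association."""
    pair = tuple(sorted((a, b)))
    p_ab = pair_counts[pair] / n_orders
    return p_ab / ((item_counts[a] / n_orders) * (item_counts[b] / n_orders))


def recommend(product, top_n=2):
    """Rank the other products by lift with the given product."""
    scores = {other: lift(product, other) for other in item_counts if other != product}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


banana_recommendations = recommend("banana")
```

Pair counting like this grows quadratically with basket size, so on a dataset at Instacart's scale a support threshold (as in Apriori) is usually applied before computing lift.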
Cluster Analysis
- K-means and hierarchical clustering
- Cluster quality evaluation (silhouette scores)
- Dendrogram visualization
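
The clustering workflow above can be sketched with scikit-learn and SciPy. The synthetic blobs, the candidate range of `k`, and the Ward linkage are illustrative assumptions.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# K-means: pick k by the mean silhouette score (higher is better, max 1).
silhouette_by_k = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, labels)
best_k = max(silhouette_by_k, key=silhouette_by_k.get)

# Hierarchical clustering with Ward linkage at the selected k.
hier_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X)
hier_silhouette = silhouette_score(X, hier_labels)

# Linkage matrix for the dendrogram; no_plot=True returns the layout
# without drawing, so this runs without a display.
Z = linkage(X, method="ward")
dendro = dendrogram(Z, no_plot=True)
```

Dropping `no_plot=True` inside a notebook draws the dendrogram with matplotlib.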
COVID-19 Literature Clustering
- CORD-19 research paper analysis
- Text embedding and similarity measures
- Research topic discovery through unsupervised learning
Techniques
Data preprocessing (missing-value handling, normalization, encoding), supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), association mining (Apriori, frequent-pattern discovery), and visualization.
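
The preprocessing steps named above (missing-value handling, normalization, encoding) could be combined in a single scikit-learn transformer. The sketch below is one possible arrangement; the toy frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with gaps in the numeric columns.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [40000.0, 52000.0, np.nan, 61000.0],
    "plan": ["basic", "premium", "basic", "basic"],
})

numeric_columns = ["age", "income"]
categorical_columns = ["plan"]

preprocess = ColumnTransformer(
    transformers=[
        # Numeric: fill gaps with the median, then standardize to zero mean, unit variance.
        ("numeric", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric_columns),
        # Categorical: one-hot encode; unseen categories at predict time become all zeros.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(df)
```

The result has four columns: two scaled numeric features followed by one indicator column per `plan` category.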
Stack
Built in Python with Jupyter notebooks, pandas and NumPy for data manipulation, scikit-learn for the ML algorithms, and matplotlib and seaborn for visualization.
