MLOps & Distributed ML

Movie Review Sentiment Analysis (MapReduce + ML)

Overview

An MLOps project for sentiment analysis of movie reviews, built on Apache Spark MapReduce and machine learning. It includes Spark-based TF-IDF feature extraction, MLlib models (Naive Bayes, Logistic Regression, Random Forest), PyTorch deep learning variants (LSTM, Transformer, BERT), MLflow experiment tracking, Docker packaging, and Makefile-driven workflows.
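To make the TF-IDF step concrete, here is a minimal pure-Python sketch of how the computation splits into map and reduce phases. It is not the project's Spark code: the function names (`map_term_counts`, `reduce_by_key`, `tf_idf`) are hypothetical, tokenization is a plain whitespace split, and the smoothed IDF form log((N + 1) / (df + 1)) is an assumption chosen to mirror what Spark MLlib documents for its IDF estimator.

```python
import math
from collections import defaultdict


def map_term_counts(doc_id, text):
    """Map step: emit ((doc_id, term), 1) for every token in a review."""
    for term in text.lower().split():
        yield (doc_id, term), 1


def reduce_by_key(pairs):
    """Reduce step: sum the values that share a key (the role reduceByKey plays in Spark)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tf_idf(docs):
    """Compute TF-IDF scores for a dict of {doc_id: review text}."""
    # Phase 1: raw term frequency per (document, term).
    term_counts = reduce_by_key(
        pair
        for doc_id, text in docs.items()
        for pair in map_term_counts(doc_id, text)
    )
    # Phase 2: document frequency. Each distinct (document, term) key counts once.
    doc_freq = reduce_by_key((term, 1) for (_doc_id, term) in term_counts)
    n_docs = len(docs)
    # Phase 3: weight each term count by the smoothed inverse document frequency.
    return {
        (doc_id, term): count * math.log((n_docs + 1) / (doc_freq[term] + 1))
        for (doc_id, term), count in term_counts.items()
    }


reviews = {"r1": "great great movie", "r2": "bad movie"}
scores = tf_idf(reviews)
```

In this toy corpus "movie" appears in every review, so its weight drops to zero, while "great" and "bad" keep a positive weight. In a distributed setting the same three phases would run over partitions of the review set rather than over an in-memory dict.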

Challenges & Solutions

The main challenge was designing a unified pipeline that supports both Spark MLlib and PyTorch while keeping results reproducible and comparable. Further work went into implementing TF-IDF at scale, tracking experiments robustly with MLflow, and building containerized, repeatable training and inference flows. Throughout, performance had to be balanced against resource usage across distributed and GPU workloads.
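One way to keep two model families comparable is to route both through the same seeded data split and the same metric function, so the framework is the only thing that differs between runs. The sketch below illustrates that idea in plain Python. It assumes binary labels (1 = positive, 0 = negative), and the names `split_reviews` and `binary_metrics` are hypothetical, not taken from the project.

```python
import random


def split_reviews(reviews, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed so every model sees the same train/test split."""
    rng = random.Random(seed)
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The last n_test items form the test set; the rest is training data.
    return shuffled[: len(shuffled) - n_test], shuffled[len(shuffled) - n_test:]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


train, test = split_reviews(range(10))
metrics = binary_metrics([1, 1, 0, 0], [1, 0, 0, 1])
```

The resulting dictionary is flat, so the same keys could be logged per run in an experiment tracker regardless of which framework produced the predictions.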

Technical Achievements

  • Hybrid Pipeline: Spark MapReduce feature engineering unified with a parallel PyTorch deep learning path
  • Model Zoo: Implemented Naive Bayes, Logistic Regression, Random Forest, plus LSTM/Transformer/BERT variants
  • Experiment Tracking: Full MLflow integration with metrics, params, and artifact logging
  • Reproducibility: Makefile + Docker workflows for install, train, evaluate, and predict
  • Scalability: Distributed TF-IDF and training with Spark, plus GPU-ready PyTorch training scripts
  • CI-ready Structure: Config-driven scripts, tests, and modular code layout for extension
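
As an illustration of the config-driven, modular layout described in the list above, a model zoo can be exposed through a small registry so that a config value selects the trainer. This is a sketch under stated assumptions, not the project's code: `RunConfig`, `MODEL_REGISTRY`, `register`, and `run` are hypothetical names, and the trainers are placeholders that return a description string instead of fitting a Spark MLlib or PyTorch model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run configuration: which model to train and with which parameters."""
    model: str
    params: dict = field(default_factory=dict)


# Maps a model name from the config to the function that trains it.
MODEL_REGISTRY: Dict[str, Callable[[dict], str]] = {}


def register(name):
    """Decorator that adds a trainer function to the registry under `name`."""
    def decorator(fn):
        MODEL_REGISTRY[name] = fn
        return fn
    return decorator


@register("naive_bayes")
def train_naive_bayes(params):
    # Placeholder: a real trainer would fit a Spark MLlib model here.
    return f"naive_bayes(smoothing={params.get('smoothing', 1.0)})"


@register("lstm")
def train_lstm(params):
    # Placeholder: a real trainer would run a PyTorch training loop here.
    return f"lstm(hidden_size={params.get('hidden_size', 128)})"


def run(config):
    """Look up the trainer named in the config and call it with the config's parameters."""
    try:
        trainer = MODEL_REGISTRY[config.model]
    except KeyError:
        known = sorted(MODEL_REGISTRY)
        raise ValueError(f"unknown model {config.model!r}; choose from {known}") from None
    return trainer(config.params)


result = run(RunConfig(model="lstm", params={"hidden_size": 64}))
```

Adding a new model then means registering one more function, and the entry point that reads the config stays unchanged.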

Technologies Used

Apache Spark, Python, PyTorch, MLflow, Docker, Make

Let's Talk Engineering.

Always happy to trade notes on AI, ML, and distributed systems, or to talk through any of the work shown here. Reach out anytime.

enockmecheo@nyu.edu