Dec 26, 2022
GreenBuildings 2: Imputing Missing Values With Scikit-Learn
1. Introduction 2. Analyzing Distributions & Correlations 3. Imputing Missing Values With Scikit-Learn 4. Next Steps
Originally published in 2019. Introduction: This is the second post in a series of blog posts about building a predictive model of greenhouse gas emissions of buildings in NYC. …
Scikit Learn · 12 min read

Dec 26, 2022
GreenBuildings 1: Exploratory Analysis & Outlier Removal
1. Introduction 2. Exploratory Data Analysis 3. Connecting To BigQuery 4. Removing Visual Outliers 5. Removing Outliers With Isolation Forests 6. Recommendations & Next Steps
Introduction: Originally published in 2019. I started this project a while back with a goal of taking the 2016 NYC Benchmarking Law building energy usage data…
Data Science · 19 min read

Dec 6, 2022
An AI-Powered JFK Speech Writer: Part 1
Introduction: One of the most quintessential projects to complete when getting started with Deep Learning and Natural Language Processing is using text generation with Recurrent Neural Networks. The internet is littered with examples of people training a neural network on works of Shakespeare and using the network to generate new text…
Data Science · 4 min read

Dec 3, 2022
Writing A Scikit-Learn Compatible Clustering Algorithm
1. Introduction 2. The k-means clustering algorithm 3. Writing the k-means algorithm with NumPy 4. Writing a Scikit-Learn compatible estimator 5. Using the elbow method and Pipelines 6. Summary & References
Introduction: Clustering algorithms and unsupervised learning methods have been gaining popularity recently. This is partly because the amount of data…
Kmeans · 15 min read

Nov 30, 2022
Sentiment Analysis Part 3: Machine Learning With Spark On Google Cloud
1. Introduction 2. Creating A GCP Hadoop Cluster 3. Getting Data From An Atlas Cluster 4. Basic Models With Spark ML Pipelines 5. Stemming With Custom Transformers 6. N-Grams & Parameter Tuning Using A Grid Search 7. Conclusions
Originally published in 2019. Introduction: In the first part of this three-part…
Pyspark · 25 min read

Nov 12, 2022
Sentiment Analysis Part 2: Exploring Text Data with MongoDB Using PyMongo
Originally published in 2019. In the last post we covered how to perform ETL on text data using PySpark and MongoDB. In this post we learn how to use PyMongo to explore text in our NoSQL database. MongoDB is a document-based NoSQL database that is fast, easy to use…
Python · 8 min read

Nov 12, 2022
Sentiment Analysis Part 1: Extract-Transform-Load with PySpark and MongoDB
Originally published in 2019. Introduction: I’ve been itching to learn some more Natural Language Processing and thought I might try my hand at the classic problem of Twitter sentiment analysis. I found labeled Twitter data with 1.6 million tweets on the Kaggle website. While 1.6 million tweets is not a…
Pyspark · 12 min read