Data Science Projects with R

On this page, you'll find projects that I've done with R!
Please click on the project title to access my Github codes.

Table of Content

Employee Predictions with Naive-Bayes & Linear Regression in R

DDSAnalytics is an analytics company that specializes in talent management solutions for Fortune 100 companies. Talent management is defined as the iterative process of developing and retaining employees. It may include workforce planning, employee training programs, identifying high-potential employees and reducing/preventing voluntary employee turnover (attrition). To gain a competitive edge over its competition, DDSAnalytics is planning to leverage data science for talent management. The executive leadership has identified predicting employee turnover as its first application of data science for talent management.

PROJECT INFORMATION

In this project, I will identify the top three factors that contribute to turnover (backed up by evidence provided by analysis). The business is also interested in learning about any job role specific trends that may exist in the data set. You can also provide any other interesting trends and observations from your analysis. The analysis is backed up by robust experimentation and appropriate visualization.

There are two goals to this project. First, we want to find factors that lead to attrition and determine monthly income with EDA. Secondly , we want to build two prediction models with our chosen factors that can accurately predict both of the variables mentioned above. Lastly, we want to use our model to predict the value of these variables on 2 different test datasets. The two models that we found work best are the Naive-Bayes model for attrition and Linear Regression model for salary. We also tried the KNN model. Additionally, a video presentation was also made to address our findings.

Video Presentation R Markdown

Home Sale Price Prediction with SAS & R

Scatter Plot

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this Kaggle competition's dataset proves that much more influences price negotiations than the number of bedrooms or the presence of a white-picket fence! With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. The entire process will be done in R and SAS. SAS is used for the first analysis question and variable selection.

PROJECT INFORMATION

Analysis 1: Century 21 Ames only sells houses in the NAmes, Edwards and BrkSide neighborhoods and would like to simply get an estimate of how the SalePrice of the house is related to the square footage of the living area of the house (GrLIvArea) and if the SalesPrice (and its relationship to square footage) depends on which neighborhood the house is located in. Build and fit a model that will answer this question, keeping in mind that realtors prefer to talk about living area in increments of 100 sq. ft. Provide your client with the estimate (or estimates if it v aries by neighborhood) as well as confidence intervals for any estimate(s).
For this analysis, we also want to create to make an RShiny app that will display at least display a scatterplot of price of the home v. square footage (GrLivArea) and allow for the plot to be displayed for at least the NAmes, Edwards and BrkSide neighborhoods separately.

Analysis 2: Build the most predictive model for sales prices of homes in all of Ames Iowa. This includes all neighborhoods. Your group is limited to only the techniques we have learned in 6371 (no random forests or other methods we have not yet covered). Specifically, you should produce 4 models: one from forward selection, one from backwards elimination, one from stepwise selection, and one that you build custom. The custom model could be one of the three preceding models or one that you build by adding or subtracting variables at your will. Generate an adjusted R-Squared, CV Press and Kaggle Score for each of these models and clearly describe which model you feel is the best in terms of being able to predict future sale prices of homes in Ames, Iowa. We will use all 4 our models to predict the Saleprice on a test dataset(test.csv).
In the end, we want to write a 7 page report discussing our findings along with an appendix that includes all of our codes.

Rshiny App Report

Beer Analysis Project With R

Scatterplot: ABV vs. IBU

Assume that the audience is the CEO and CFO of Budweiser (your client) and that they only have had one class in statistics and have indicated that I cannot take more than 7 minutes of their time. They have hired you to address the 9 questions / items

Project Questions to Answer

  • 1.How many breweries are present in each state?
  • 2.Merge beer data with the breweries data. Print the first 6 observations and the last six observations to check the merged file.
  • 3.Address the missing values in each column.
  • 4.Compute the median alcohol content and international bitterness unit for each state. Plot a bar chart to compare.
  • 5.Which state has the maximum alcoholic (ABV) beer? Which state has the most bitter (IBU) beer?
  • 6.Comment on the summary statistics and distribution of the ABV variable.
  • 7.Is there an apparent relationship between the bitterness of the beer and its alcoholic content? Draw a scatter plot. Make your best judgment of a relationship and EXPLAIN your answer.
  • 8.Budweiser would also like to investigate the difference with respect to IBU and ABV between IPAs (India Pale Ales) and other types of Ale (any beer with “Ale” in its name other than IPA). You decide to use KNN classification to investigate this relationship. Provide statistical evidence one way or the other.
  • 9.Find one other useful inference from the data that you feel Budweiser may be able to find value in. You must convince them why it is important and back up your conviction with appropriate statistical evidence

This project was done as part of the DS 6306 course at Southern Methodist Univeristy.

Video Presentation R Markdown

Project Car Price Prediction in R

In this project, we want to create a model that can best predict used car prices. Our team was tasked to identify any relationships between the selling prices and the other variables. Our variable selection methodologies includes, but not limited to, forward selection, backward selection, stepwise selection, etc. Since this is an observational study our conclusions are limited to the data included in our analysis.

Video Presentation R Markdown

Bank Marketing Analysis & Predictions in R

In this section you will build 3 additional classification models to compare to your model in objective 1. The goal of this objective is to build a model where prediction performance is prioritized. One model should be an attempt at a complex logistic regression model including interaction terms or polynomial terms. One model should be an LDA or QDA. And the final model should be a nonparametric model such as knn, random forest, classification tree, etc.

Video Presentation R Markdown

Recent Projects

December 10, 2022

Employee Analysis & Predictions

In this project, I will identify the top three factors that contribute to turnover (backed up by evidence provided by analysis). The analysis is backed up by robust experimentation and appropriate visualization. Additionally, we also want to build a model that accurately predicts employee salary.

December 10, 2022

Home Sale Price Predictions

Build a model that get an estimate of how the SalePrice of the house is related to the square footage of the living area of the house and if the SalesPrice depends on which of 3 interested neighborhood the house is located in. Next, build the most predictive model with as many neccessary for sales prices of homes in all of Ames Iowa.

October 22, 2022

Beer Analysis project with R

Assume that the audience is the CEO and CFO of Budweiser (your client) and that they only have had one class in statistics and have indicated that I cannot take more than 7 minutes of their time. They have hired you to address 9 questions / items.

Project Spotlight

Disaster Response Pipeline

Create an NLP pipeline to help people in needs

Create a machine learning/NLP pipeline to categorize these events and build a model to classify messages that are sent during disasters.

By classifying these messages, we can allow these messages to be sent to the appropriate disaster relief agency. The dataset -provided by Figure Eight- is used to build a model that classifies disaster messages, while the web app is where a respondent can input a new message and get classification results in several categories.