Waze Project

Course 5 - Regression analysis: Simplify complex data relationships

Your team is more than halfway through their user churn project. Earlier, you completed a project proposal, used Python to explore and analyze Waze’s user data, created data visualizations, and conducted a hypothesis test. Now, leadership wants your team to build a regression model to predict user churn based on a variety of variables.

You check your inbox and discover a new email from Ursula Sayo, Waze's Operations Manager. Ursula asks your team about the details of the regression model. You also notice two follow-up emails from your supervisor, May Santner. The first email is a response to Ursula, and says that the team will build a binomial logistic regression model. In her second email, May asks you to help build the model and prepare an executive summary to share your results.

A notebook was structured and prepared to help you in this project. Please complete the following questions and prepare an executive summary.

Course 5 End-of-course project: Regression modeling

In this activity, you will build a binomial logistic regression model. As you have learned, logistic regression helps you estimate the probability of an outcome. For data science professionals, this is a useful skill because it allows you to consider several predictor variables at once when estimating the outcome you're interested in, which opens the door to much more thorough and flexible analysis.

The purpose of this project is to demonstrate knowledge of exploratory data analysis (EDA) and binomial logistic regression modeling.

The goal is to build a binomial logistic regression model and evaluate the model's performance.

This activity has three parts:

Part 1: EDA & Checking Model Assumptions

  • What are some purposes of EDA before constructing a binomial logistic regression model?

Part 2: Model Building and Evaluation

  • What resources do you find yourself using as you complete this stage?

Part 3: Interpreting Model Results

  • What key insights emerged from your model(s)?

  • What business recommendations do you propose based on the models built?


Follow the instructions and answer the question below to complete the activity. Then, you will complete an executive summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.

Build a regression model

PACE stages

Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

PACE: Plan

Consider the questions in your PACE Strategy Document to reflect on the Plan stage.

Task 1. Imports and data loading

Import the data and packages that you've learned are needed for building logistic regression models.

In [1]:
# Packages for numerics + dataframes
### YOUR CODE HERE ###

import numpy as np
import pandas as pd

# Packages for visualization
### YOUR CODE HERE ###

import seaborn as sns
import matplotlib.pyplot as plt

# Packages for Logistic Regression & Confusion Matrix
### YOUR CODE HERE ###

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from sklearn.linear_model import LogisticRegression

Import the dataset.

Note: As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# Load the dataset by running this cell

df = pd.read_csv('waze_dataset.csv')

PACE: Analyze

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

In this stage, consider the following question:

  • What are some purposes of EDA before constructing a binomial logistic regression model?

==> ENTER YOUR RESPONSE HERE

Task 2a. Explore data with EDA

Analyze and explore the data, looking for correlations, missing data, potential outliers, and/or duplicates.

Start with .shape and .info().

In [3]:
### YOUR CODE HERE ###

print(df.shape)
print('')
df.info()
(14999, 13)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB

Question: Are there any missing values in your data?

==> ENTER YOUR RESPONSE HERE

Use .head().

In [4]:
### YOUR CODE HERE ###
df.head()
Out[4]:
ID label sessions drives total_sessions n_days_after_onboarding total_navigations_fav1 total_navigations_fav2 driven_km_drives duration_minutes_drives activity_days driving_days device
0 0 retained 283 226 296.748273 2276 208 0 2628.845068 1985.775061 28 19 Android
1 1 retained 133 107 326.896596 1225 19 64 13715.920550 3160.472914 13 11 iPhone
2 2 retained 114 95 135.522926 2651 0 0 3059.148818 1610.735904 14 8 Android
3 3 retained 49 40 67.589221 15 322 7 913.591123 587.196542 7 3 iPhone
4 4 retained 84 68 168.247020 1562 166 5 3950.202008 1219.555924 27 18 Android

Use .drop() to remove the ID column, since this information is not needed for your analysis.

In [5]:
### YOUR CODE HERE ###

# Note: the result is displayed but not assigned back, so `df` itself still contains `ID`
df.drop('ID', axis=1)
Out[5]:
label sessions drives total_sessions n_days_after_onboarding total_navigations_fav1 total_navigations_fav2 driven_km_drives duration_minutes_drives activity_days driving_days device
0 retained 283 226 296.748273 2276 208 0 2628.845068 1985.775061 28 19 Android
1 retained 133 107 326.896596 1225 19 64 13715.920550 3160.472914 13 11 iPhone
2 retained 114 95 135.522926 2651 0 0 3059.148818 1610.735904 14 8 Android
3 retained 49 40 67.589221 15 322 7 913.591123 587.196542 7 3 iPhone
4 retained 84 68 168.247020 1562 166 5 3950.202008 1219.555924 27 18 Android
... ... ... ... ... ... ... ... ... ... ... ... ...
14994 retained 60 55 207.875622 140 317 0 2890.496901 2186.155708 25 17 iPhone
14995 retained 42 35 187.670313 2505 15 10 4062.575194 1208.583193 25 20 Android
14996 retained 273 219 422.017241 1873 17 0 3097.825028 1031.278706 18 17 iPhone
14997 churned 149 120 180.524184 3150 45 0 4051.758549 254.187763 6 6 iPhone
14998 retained 73 58 353.419797 3383 13 51 6030.498773 3042.436423 14 13 iPhone

14999 rows × 12 columns

Now, check the class balance of the dependent (target) variable, label.

In [6]:
### YOUR CODE HERE ###
df['label'].value_counts(normalize=True)
Out[6]:
retained    0.822645
churned     0.177355
Name: label, dtype: float64

Call .describe() on the data.

In [7]:
### YOUR CODE HERE ###

df.describe()
Out[7]:
ID sessions drives total_sessions n_days_after_onboarding total_navigations_fav1 total_navigations_fav2 driven_km_drives duration_minutes_drives activity_days driving_days
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 7499.000000 80.633776 67.281152 189.964447 1749.837789 121.605974 29.672512 4039.340921 1860.976012 15.537102 12.179879
std 4329.982679 80.699065 65.913872 136.405128 1008.513876 148.121544 45.394651 2502.149334 1446.702288 9.004655 7.824036
min 0.000000 0.000000 0.000000 0.220211 4.000000 0.000000 0.000000 60.441250 18.282082 0.000000 0.000000
25% 3749.500000 23.000000 20.000000 90.661156 878.000000 9.000000 0.000000 2212.600607 835.996260 8.000000 5.000000
50% 7499.000000 56.000000 48.000000 159.568115 1741.000000 71.000000 9.000000 3493.858085 1478.249859 16.000000 12.000000
75% 11248.500000 112.000000 93.000000 254.192341 2623.500000 178.000000 43.000000 5289.861262 2464.362632 23.000000 19.000000
max 14998.000000 743.000000 596.000000 1216.154633 3500.000000 1236.000000 415.000000 21183.401890 15851.727160 31.000000 30.000000

Question: Are there any variables that could potentially have outliers just by assessing the quartile values, standard deviations, and max values?

==> ENTER YOUR RESPONSE HERE
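
If you want to quantify this rather than eyeball it, one optional approach (a rough sketch using the common 1.5 * IQR upper-fence heuristic, not a definitive outlier rule) is:

# Count of values above each numeric column's 1.5*IQR upper fence
numeric_cols = df.select_dtypes(include='number')
q1, q3 = numeric_cols.quantile(0.25), numeric_cols.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
(numeric_cols > upper_fence).sum().sort_values(ascending=False)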

Task 2b. Create features

Create features that may be of interest to the stakeholder and/or that are needed to address the business scenario/problem.

km_per_driving_day

You know from earlier EDA that churn rate correlates with distance driven per driving day in the last month. It might be helpful to engineer a feature that captures this information.

  1. Create a new column in df called km_per_driving_day, which represents the mean distance driven per driving day for each user.

  2. Call the describe() method on the new column.

In [8]:
# 1. Create `km_per_driving_day` column
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days']

# 2. Call `describe()` on the new column
### YOUR CODE HERE ###

df['km_per_driving_day'].describe()
Out[8]:
count    1.499900e+04
mean              inf
std               NaN
min      3.022063e+00
25%      1.672804e+02
50%      3.231459e+02
75%      7.579257e+02
max               inf
Name: km_per_driving_day, dtype: float64

Note that some values are infinite. This is the result of zero values in the driving_days column: because division by zero is undefined, the division produces a value of infinity in the corresponding rows of the new column.

  1. Convert these values from infinity to zero. You can use np.inf to refer to a value of infinity.

  2. Call describe() on the km_per_driving_day column to verify that it worked.

In [9]:
# 1. Convert infinite values to zero
### YOUR CODE HERE ###
df.loc[df['km_per_driving_day'] == np.inf, 'km_per_driving_day'] = 0
# 2. Confirm that it worked
### YOUR CODE HERE ###
df['km_per_driving_day'].describe()
Out[9]:
count    14999.000000
mean       578.963113
std       1030.094384
min          0.000000
25%        136.238895
50%        272.889272
75%        558.686918
max      15420.234110
Name: km_per_driving_day, dtype: float64

professional_driver

Create a new, binary feature called professional_driver that is 1 for users who had 60 or more drives and drove on 15 or more days in the last month, and 0 otherwise.

Note: The objective is to create a new feature that separates professional drivers from other drivers. In this scenario, domain knowledge and intuition are used to determine these deciding thresholds, but ultimately they are arbitrary.

To create this column, use the np.where() function. This function accepts as arguments:

  1. A condition
  2. What to return when the condition is true
  3. What to return when the condition is false
Example:
x = np.array([1, 2, 3])
x = np.where(x > 2, 100, 0)
x
array([  0,   0, 100])
In [10]:
# Create `professional_driver` column
### YOUR CODE HERE ###

df['professional_driver'] = np.where((df['drives'] >= 60) & (df['driving_days'] >= 15), 1, 0)

Perform a quick inspection of the new variable.

  1. Check the count of professional drivers and non-professionals.

  2. Within each class (professional and non-professional), calculate the churn rate.

In [11]:
# 1. Check count of professionals and non-professionals
### YOUR CODE HERE ###
print(df['professional_driver'].value_counts())
print('')
# 2. Check in-class churn rate
### YOUR CODE HERE ###

df.groupby(['professional_driver'])['label'].value_counts()
0    12405
1     2594
Name: professional_driver, dtype: int64

Out[11]:
professional_driver  label   
0                    retained    9463
                     churned     2348
1                    retained    2300
                     churned      188
Name: label, dtype: int64

The churn rate for professional drivers is 7.6%, while the churn rate for non-professionals is 19.9%. This seems like it could add predictive signal to the model.
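
For reference, the rates quoted above can be computed directly by normalizing the same groupby (a small optional addition to the cell above):

# In-class churn rate as a proportion of each `professional_driver` group
df.groupby('professional_driver')['label'].value_counts(normalize=True)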

PACE: Construct

After analyzing the data and engineering new features, it is time to begin constructing the model.

Consider the questions in your PACE Strategy Document to reflect on the Construct stage.

In this stage, consider the following question:

  • Why did you select the X variables you did?

==> ENTER YOUR RESPONSE HERE

Task 3a. Preparing variables

Call info() on the dataframe to check the data type of the label variable and to verify if there are any missing values.

In [12]:
### YOUR CODE HERE ###
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
 13  km_per_driving_day       14999 non-null  float64
 14  professional_driver      14999 non-null  int64  
dtypes: float64(4), int64(9), object(2)
memory usage: 1.7+ MB

Because you know from previous EDA that there is no evidence of a non-random cause of the 700 missing values in the label column, and because these observations comprise less than 5% of the data, use the dropna() method to drop the rows that are missing this data.

In [13]:
# Drop rows with missing data in `label` column
### YOUR CODE HERE ###

# Note: the result is not assigned back, so the rows remain in `df` itself
df.dropna(subset=['label'])
Out[13]:
ID label sessions drives total_sessions n_days_after_onboarding total_navigations_fav1 total_navigations_fav2 driven_km_drives duration_minutes_drives activity_days driving_days device km_per_driving_day professional_driver
0 0 retained 283 226 296.748273 2276 208 0 2628.845068 1985.775061 28 19 Android 138.360267 1
1 1 retained 133 107 326.896596 1225 19 64 13715.920550 3160.472914 13 11 iPhone 1246.901868 0
2 2 retained 114 95 135.522926 2651 0 0 3059.148818 1610.735904 14 8 Android 382.393602 0
3 3 retained 49 40 67.589221 15 322 7 913.591123 587.196542 7 3 iPhone 304.530374 0
4 4 retained 84 68 168.247020 1562 166 5 3950.202008 1219.555924 27 18 Android 219.455667 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14994 14994 retained 60 55 207.875622 140 317 0 2890.496901 2186.155708 25 17 iPhone 170.029229 0
14995 14995 retained 42 35 187.670313 2505 15 10 4062.575194 1208.583193 25 20 Android 203.128760 0
14996 14996 retained 273 219 422.017241 1873 17 0 3097.825028 1031.278706 18 17 iPhone 182.225002 1
14997 14997 churned 149 120 180.524184 3150 45 0 4051.758549 254.187763 6 6 iPhone 675.293092 0
14998 14998 retained 73 58 353.419797 3383 13 51 6030.498773 3042.436423 14 13 iPhone 463.884521 0

14299 rows × 15 columns

Impute outliers

You rarely want to drop outliers, and generally will not do so unless there is a clear reason for it (e.g., typographic errors).

At times outliers can be changed to the median, mean, 95th percentile, etc.

Previously, you determined that seven of the variables had clear signs of containing outliers:

  • sessions
  • drives
  • total_sessions
  • total_navigations_fav1
  • total_navigations_fav2
  • driven_km_drives
  • duration_minutes_drives

For this analysis, impute the outlying values for these columns. Calculate the 95th percentile of each column, then replace any value that exceeds it with that 95th-percentile value.

In [14]:
# Impute outliers
### YOUR CODE HERE ###

columns_to_cap = ['sessions', 'drives', 'total_sessions', 'total_navigations_fav1',
                  'total_navigations_fav2', 'driven_km_drives', 'duration_minutes_drives']

for column in columns_to_cap:
    threshold = df[column].quantile(0.95)
    df.loc[df[column] > threshold, column] = threshold

Call describe().

In [15]:
df.describe()
Out[15]:
ID sessions drives total_sessions n_days_after_onboarding total_navigations_fav1 total_navigations_fav2 driven_km_drives duration_minutes_drives activity_days driving_days km_per_driving_day professional_driver
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 7499.000000 76.568705 64.058204 184.031320 1749.837789 114.415228 27.214281 3939.632764 1789.647426 15.537102 12.179879 578.963113 0.172945
std 4329.982679 67.297958 55.306924 118.600463 1008.513876 124.686795 36.735583 2216.041510 1222.705167 9.004655 7.824036 1030.094384 0.378212
min 0.000000 0.000000 0.000000 0.220211 4.000000 0.000000 0.000000 60.441250 18.282082 0.000000 0.000000 0.000000 0.000000
25% 3749.500000 23.000000 20.000000 90.661156 878.000000 9.000000 0.000000 2212.600607 835.996260 8.000000 5.000000 136.238895 0.000000
50% 7499.000000 56.000000 48.000000 159.568115 1741.000000 71.000000 9.000000 3493.858085 1478.249859 16.000000 12.000000 272.889272 0.000000
75% 11248.500000 112.000000 93.000000 254.192341 2623.500000 178.000000 43.000000 5289.861262 2464.362632 23.000000 19.000000 558.686918 0.000000
max 14998.000000 243.000000 201.000000 454.363204 3500.000000 424.000000 124.000000 8889.794236 4668.899349 31.000000 30.000000 15420.234110 1.000000

Encode categorical variables

Change the data type of the label column to be binary. This change is needed to train a logistic regression model.

Assign a 0 for all retained users.

Assign a 1 for all churned users.

Save this variable as label2 so as not to overwrite the original label variable.

Note: There are many ways to do this. Consider using np.where() as you did earlier in this notebook.

In [16]:
# Create binary `label2` column
### YOUR CODE HERE ###

df['label2'] = np.where(df['label'] == 'churned', 1, 0)
df
Out[16]:
ID label sessions drives total_sessions n_days_after_onboarding total_navigations_fav1 total_navigations_fav2 driven_km_drives duration_minutes_drives activity_days driving_days device km_per_driving_day professional_driver label2
0 0 retained 243 201 296.748273 2276 208 0 2628.845068 1985.775061 28 19 Android 138.360267 1 0
1 1 retained 133 107 326.896596 1225 19 64 8889.794236 3160.472914 13 11 iPhone 1246.901868 0 0
2 2 retained 114 95 135.522926 2651 0 0 3059.148818 1610.735904 14 8 Android 382.393602 0 0
3 3 retained 49 40 67.589221 15 322 7 913.591123 587.196542 7 3 iPhone 304.530374 0 0
4 4 retained 84 68 168.247020 1562 166 5 3950.202008 1219.555924 27 18 Android 219.455667 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14994 14994 retained 60 55 207.875622 140 317 0 2890.496901 2186.155708 25 17 iPhone 170.029229 0 0
14995 14995 retained 42 35 187.670313 2505 15 10 4062.575194 1208.583193 25 20 Android 203.128760 0 0
14996 14996 retained 243 201 422.017241 1873 17 0 3097.825028 1031.278706 18 17 iPhone 182.225002 1 0
14997 14997 churned 149 120 180.524184 3150 45 0 4051.758549 254.187763 6 6 iPhone 675.293092 0 1
14998 14998 retained 73 58 353.419797 3383 13 51 6030.498773 3042.436423 14 13 iPhone 463.884521 0 0

14999 rows × 16 columns

Task 3b. Determine whether assumptions have been met

The following are the assumptions for logistic regression:

  • Independent observations (This refers to how the data was collected.)

  • No extreme outliers

  • Little to no multicollinearity among X predictors

  • Linear relationship between X and the logit of y

For the first assumption, you can assume that observations are independent for this project.

The second assumption has already been addressed.

The last assumption will be verified after modeling.

Note: In practice, modeling assumptions are often violated. Depending on the specifics of your use case and the severity of the violation, this might not affect your model much at all, or it might cause the model to fail.

Collinearity

Check the correlation among predictor variables. First, generate a correlation matrix.

In [17]:
# Generate a correlation matrix
### YOUR CODE HERE ###

corr = df.corr()
corr
Out[17]:
ID sessions drives total_sessions n_days_after_onboarding total_navigations_fav1 total_navigations_fav2 driven_km_drives duration_minutes_drives activity_days driving_days km_per_driving_day professional_driver label2
ID 1.000000 0.010622 0.009740 0.002722 -0.005673 0.001089 -0.003190 -0.000773 0.000236 -0.001887 -0.001808 -0.007003 0.011884 0.004777
sessions 0.010622 1.000000 0.996907 0.595796 0.006480 0.003123 0.009746 0.001425 -0.005364 0.024374 0.018518 -0.011099 0.443151 0.033695
drives 0.009740 0.996907 1.000000 0.593945 0.006198 0.002307 0.010365 0.001789 -0.004763 0.023821 0.018036 -0.010609 0.443705 0.034528
total_sessions 0.002722 0.595796 0.593945 1.000000 0.005328 0.002377 0.008061 -0.000835 -0.001456 0.015932 0.012568 -0.014563 0.254599 0.022467
n_days_after_onboarding -0.005673 0.006480 0.006198 0.005328 1.000000 0.001232 -0.007468 -0.002852 -0.009140 -0.009725 -0.008563 0.011631 0.001364 -0.124711
total_navigations_fav1 0.001089 0.003123 0.002307 0.002377 0.001232 1.000000 0.002467 -0.007898 0.002722 0.012176 0.011339 -0.002056 0.001135 0.051766
total_navigations_fav2 -0.003190 0.009746 0.010365 0.008061 -0.007468 0.002467 1.000000 0.003831 -0.000951 -0.003489 0.003197 0.004308 0.006892 0.014263
driven_km_drives -0.000773 0.001425 0.001789 -0.000835 -0.002852 -0.007898 0.003831 1.000000 0.690996 -0.005620 -0.007816 0.343509 0.000219 0.020131
duration_minutes_drives 0.000236 -0.005364 -0.004763 -0.001456 -0.009140 0.002722 -0.000951 0.690996 1.000000 -0.005852 -0.007585 0.238133 -0.010022 0.040527
activity_days -0.001887 0.024374 0.023821 0.015932 -0.009725 0.012176 -0.003489 -0.005620 -0.005852 1.000000 0.947706 -0.397062 0.452947 -0.295190
driving_days -0.001808 0.018518 0.018036 0.012568 -0.008563 0.011339 0.003197 -0.007816 -0.007585 0.947706 1.000000 -0.407633 0.468653 -0.286080
km_per_driving_day -0.007003 -0.011099 -0.010609 -0.014563 0.011631 -0.002056 0.004308 0.343509 0.238133 -0.397062 -0.407633 1.000000 -0.165346 0.146798
professional_driver 0.011884 0.443151 0.443705 0.254599 0.001364 0.001135 0.006892 0.000219 -0.010022 0.452947 0.468653 -0.165346 1.000000 -0.117857
label2 0.004777 0.033695 0.034528 0.022467 -0.124711 0.051766 0.014263 0.020131 0.040527 -0.295190 -0.286080 0.146798 -0.117857 1.000000

Now, plot a correlation heatmap.

In [18]:
# Plot correlation heatmap
### YOUR CODE HERE ###
plt.figure(figsize=(15, 15))
sns.heatmap(corr, vmin=-1, vmax=1, annot=True)
plt.show()

If two predictor variables have a Pearson correlation coefficient with an absolute value greater than 0.7, they are considered strongly multicollinear, and only one of them should be used in your model.

Note: 0.7 is an arbitrary threshold. Some industries may use 0.6, 0.8, etc.
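
If you prefer to pull these pairs out programmatically rather than read them off the heatmap, a minimal sketch (reusing the corr matrix above and the 0.7 threshold) is:

# List predictor pairs whose correlation exceeds the chosen threshold
strong_pairs = [(a, b, round(corr.loc[a, b], 3))
                for i, a in enumerate(corr.columns)
                for b in corr.columns[i + 1:]
                if abs(corr.loc[a, b]) > 0.7]
strong_pairs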

Question: Which variables are multicollinear with each other?

==> ENTER YOUR RESPONSE HERE

Task 3c. Create dummies (if necessary)

If you have selected device as an X variable, you will need to create dummy variables since this variable is categorical.

In cases with many categorical variables, you can use pandas built-in pd.get_dummies(), or you can use scikit-learn's OneHotEncoder() function.

Note: Variables with many categories should only be dummied if absolutely necessary. Each category will result in a coefficient for your model which can lead to overfitting.

Because this dataset only has one remaining categorical feature (device), it's not necessary to use one of these special functions. You can just implement the transformation directly.

Create a new, binary column called device2 that encodes user devices as follows:

  • Android -> 0
  • iPhone -> 1
In [19]:
# Create new `device2` variable
### YOUR CODE HERE ###

# `drop='first'` drops the first category alphabetically ('Android'), so the
# remaining column encodes iPhone as 1 and Android as 0
df['device2'] = OneHotEncoder(drop='first').fit_transform(df[['device']]).toarray()
df = df.astype({'device2': int})
df.tail()
Out[19]:
ID label sessions drives total_sessions n_days_after_onboarding total_navigations_fav1 total_navigations_fav2 driven_km_drives duration_minutes_drives activity_days driving_days device km_per_driving_day professional_driver label2 device2
14994 14994 retained 60 55 207.875622 140 317 0 2890.496901 2186.155708 25 17 iPhone 170.029229 0 0 1
14995 14995 retained 42 35 187.670313 2505 15 10 4062.575194 1208.583193 25 20 Android 203.128760 0 0 0
14996 14996 retained 243 201 422.017241 1873 17 0 3097.825028 1031.278706 18 17 iPhone 182.225002 1 0 1
14997 14997 churned 149 120 180.524184 3150 45 0 4051.758549 254.187763 6 6 iPhone 675.293092 0 1 1
14998 14998 retained 73 58 353.419797 3383 13 51 6030.498773 3042.436423 14 13 iPhone 463.884521 0 0 1
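
For reference, the same encoding can also be implemented directly with np.where, as the note above suggests; this sketch produces an identical device2 column:

# Direct binary encoding: iPhone -> 1, Android -> 0
df['device2'] = np.where(df['device'] == 'iPhone', 1, 0)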

Task 3d. Model building

Assign predictor variables and target

To build your model you need to determine what X variables you want to include in your model to predict your target—label2.

Drop the following variables and assign the results to X:

  • label (this is the target)
  • label2 (this is the target)
  • device (this is the non-binary-encoded categorical variable)
  • sessions (this had high multicollinearity)
  • driving_days (this had high multicollinearity)

Note: Notice that sessions and driving_days were selected to be dropped, rather than drives and activity_days. The reason for this is that the features that were kept for modeling had slightly stronger correlations with the target variable than the features that were dropped.

In [20]:
# Isolate predictor variables
### YOUR CODE HERE ###

X = df.drop(columns = ['label','label2','device','sessions','driving_days'])

Now, isolate the dependent (target) variable. Assign it to a variable called y.

In [21]:
# Isolate target variable
### YOUR CODE HERE ###

y = df['label2']

Split the data

Use scikit-learn's train_test_split() function to perform a train/test split on your data using the X and y variables you assigned above.

Note 1: It is important to do a train-test split so you can get an honest estimate of how the model will perform on unseen data. You always want to fit your model on your training set and evaluate it on your test set to avoid data leakage.

Note 2: Because the target class is imbalanced (82% retained vs. 18% churned), you want to make sure that you don't get an unlucky split that over- or under-represents the frequency of the minority class. Set the function's stratify parameter to y to ensure that the minority class appears in both train and test sets in the same proportion that it does in the overall dataset.

In [22]:
# Perform the train-test split
### YOUR CODE HERE ###

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)
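
An optional sanity check, if you want to confirm the stratification worked: the churned proportion should be nearly identical in both splits.

# Class balance in each split (the churned proportion should match in both)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
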
In [23]:
# Use .head()
### YOUR CODE HERE ###
X_train.head()
Out[23]:
ID drives total_sessions n_days_after_onboarding total_navigations_fav1 total_navigations_fav2 driven_km_drives duration_minutes_drives activity_days km_per_driving_day professional_driver device2
3483 3483 50 90.468920 1039 0 15 6558.564887 1501.683515 2 0.000000 0 0
13354 13354 45 243.720232 1480 0 35 5657.864872 4660.502879 13 471.488739 0 0
6059 6059 48 61.511644 737 362 3 2575.235783 1407.662881 16 234.112344 0 0
198 198 13 186.979465 3306 184 32 905.681067 299.016399 10 90.568107 0 0
12381 12381 2 124.305442 802 0 0 2813.451801 2021.436512 0 0.000000 0 1

Use scikit-learn to instantiate a logistic regression model. Add the argument penalty = None.

It is important to add penalty = None since your predictors are unscaled.
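
As an aside (a sketch, not part of this lab): if you did want to keep scikit-learn's default L2 penalty, you would typically scale the predictors first so the penalty treats all features comparably, for example with a pipeline like the following (max_iter=400 simply mirrors the cell below).

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical alternative: standardize features, then fit a penalized model
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=400))
scaled_logreg.fit(X_train, y_train)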

Refer to scikit-learn's logistic regression documentation for more information.

Fit the model on X_train and y_train.

In [25]:
### YOUR CODE HERE ###

# `penalty='none'` disables regularization; in scikit-learn >= 1.2, pass `penalty=None` instead
model = LogisticRegression(penalty='none', max_iter=400)

model.fit(X_train, y_train)
Out[25]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=400,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Call the .coef_ attribute on the model to get the coefficients of each variable. The coefficients are in the same order as the columns of X. Remember that each coefficient represents the change in the log odds of the target variable for every one-unit increase in the corresponding predictor.

If you want, create a series whose index is the column names and whose values are the coefficients in model.coef_.

In [29]:
### YOUR CODE HERE ###

pd.Series(model.coef_[0], index=X.columns)
Out[29]:
ID                         0.000001
drives                     0.002334
total_sessions            -0.000122
n_days_after_onboarding   -0.000395
total_navigations_fav1     0.001121
total_navigations_fav2     0.000975
driven_km_drives          -0.000036
duration_minutes_drives    0.000127
activity_days             -0.102379
km_per_driving_day         0.000006
professional_driver       -0.001478
device2                   -0.001015
dtype: float64
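
Because these coefficients are on the log-odds scale, an optional follow-up is to exponentiate them into odds ratios, which can be easier to interpret (a value below 1 decreases the odds of churning; above 1 increases them):

# Coefficients expressed as odds ratios
pd.Series(np.exp(model.coef_[0]), index=X.columns).sort_values(ascending=False)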

Call the model's intercept_ attribute to get the intercept of the model.

In [28]:
### YOUR CODE HERE ###
model.intercept_
Out[28]:
array([-0.00147336])

Check final assumption

Verify the linear relationship between X and the estimated log odds (known as logits) by making a regplot.

Call the model's predict_proba() method to generate the probability of response for each sample in the training data. (The training data is the argument to the method.) Assign the result to a variable called training_probabilities. This results in a 2-D array where each row represents a user in X_train. The first column is the probability of the user not churning, and the second column is the probability of the user churning.

In [30]:
# Get the predicted probabilities of the training data
### YOUR CODE HERE ###

training_probabilities = model.predict_proba(X_train)
training_probabilities
Out[30]:
array([[0.63043816, 0.36956184],
       [0.80145633, 0.19854367],
       [0.78919254, 0.21080746],
       ...,
       [0.82233589, 0.17766411],
       [0.95566842, 0.04433158],
       [0.85525192, 0.14474808]])

In logistic regression, the relationship between a predictor variable and the dependent variable does not need to be linear; however, the log-odds (a.k.a. logit) of the dependent variable should be a linear function of the predictor variable. Here is the formula for calculating log-odds, where p is the probability of response:
$$ logit(p) = ln(\frac{p}{1-p}) $$

  1. Create a dataframe called logit_data that is a copy of X_train.

  2. Create a new column called logit in the logit_data dataframe. The data in this column should represent the logit for each user.

In [31]:
# 1. Copy the `X_train` dataframe and assign to `logit_data`
### YOUR CODE HERE ###

logit_data = X_train.copy()

# 2. Create a new `logit` column in the `logit_data` df
### YOUR CODE HERE ###

logit_data['logit'] = [np.log(prob[1] / prob[0]) for prob in training_probabilities]
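
For reference, the same logit column can be computed in a vectorized way, without the list comprehension (this sketch assumes training_probabilities from the earlier cell):

# Equivalent vectorized computation of the log-odds
logit_data['logit'] = np.log(training_probabilities[:, 1] / training_probabilities[:, 0])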

Plot a regplot where the x-axis represents an independent variable and the y-axis represents the log-odds of the predicted probabilities.

In an exhaustive analysis, this would be plotted for each continuous or discrete predictor variable. Here we show only activity_days.

In [33]:
# Plot regplot of `activity_days` log-odds
### YOUR CODE HERE ###
sns.regplot(x='activity_days', y='logit', data=logit_data)
plt.title('Log-odds: activity_days')
Out[33]:
Text(0.5, 1.0, 'Log-odds: activity_days')
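
Optionally, to keep the scatter readable and the plot fast on roughly 11,000 training rows, you could plot a random sample instead (the sample size and random_state below are arbitrary choices):

# Regplot on a random sample of the training logits
sns.regplot(x='activity_days', y='logit',
            data=logit_data.sample(500, random_state=42))
plt.title('Log-odds: activity_days (sample of 500)')
plt.show()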

PACE: Execute

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

Task 4a. Results and evaluation

If the logistic regression assumptions are met, the model results can be appropriately interpreted.

Use the code block below to make predictions on the test data.

In [35]:
# Generate predictions on X_test
### YOUR CODE HERE ###

y_preds = model.predict(X_test)

Now, use the score() method on the model with X_test and y_test as its two arguments. The default score in scikit-learn is accuracy. What is the accuracy of your model?

Consider: Is accuracy the best metric to use to evaluate this model?

In [36]:
# Score the model (accuracy) on the test data
### YOUR CODE HERE ###

model.score(X_test, y_test)
Out[36]:
0.8312
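
For context on that accuracy number, consider comparing it to a trivial baseline that predicts "retained" for every user; because most users are retained, such a baseline scores nearly as high here, which is one reason accuracy alone can be misleading.

# Accuracy of always predicting the majority class (retained = 0)
1 - y_test.mean()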

Task 4b. Show results with a confusion matrix

Use the confusion_matrix function to obtain a confusion matrix. Use y_test and y_preds as arguments.

In [38]:
### YOUR CODE HERE ###
cm = metrics.confusion_matrix(y_test, y_preds)
cm
Out[38]:
array([[3083,   33],
       [ 600,   34]])

Next, use the ConfusionMatrixDisplay() function to display the confusion matrix from the above cell, passing the confusion matrix you just created as its argument.

In [40]:
### YOUR CODE HERE ###

disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                      display_labels=['retained', 'churned'])
disp.plot()
Out[40]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fdb0ee4f3d0>

You can use the confusion matrix to compute precision and recall manually. You can also use scikit-learn's classification_report() function to generate a table from y_test and y_preds.

In [42]:
# Calculate precision manually
### YOUR CODE HERE ###

precision = cm[1,1]/ (cm[0,1] + cm[1,1])
precision
Out[42]:
0.5074626865671642
In [43]:
# Calculate recall manually
### YOUR CODE HERE ###

recall = cm[1,1]/ (cm[1, 0] + cm[1,1])
recall
Out[43]:
0.05362776025236593
In [47]:
# Create a classification report
### YOUR CODE HERE ###

target_labels = ['retained','churned']
print(metrics.classification_report(y_test, y_preds, target_names = target_labels))
              precision    recall  f1-score   support

    retained       0.84      0.99      0.91      3116
     churned       0.51      0.05      0.10       634

    accuracy                           0.83      3750
   macro avg       0.67      0.52      0.50      3750
weighted avg       0.78      0.83      0.77      3750

Note: The model has decent precision but very low recall, which means that it makes a lot of false negative predictions and fails to capture users who will churn.
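
One hedged way to trade some precision for recall, if catching churners is the priority, is to lower the decision threshold on the predicted probabilities (0.5 is scikit-learn's default; the 0.2 below is an arbitrary illustration, not a tuned value):

# Re-classify using a lower probability threshold for the churned class
proba_churn = model.predict_proba(X_test)[:, 1]
y_preds_lower = (proba_churn >= 0.2).astype(int)
print(metrics.classification_report(y_test, y_preds_lower,
                                     target_names=['retained', 'churned']))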

BONUS

Generate a bar graph of the model's coefficients for a visual representation of the importance of the model's features.

In [48]:
# Create a list of (column_name, coefficient) tuples
feature_importance = list(zip(X_train.columns, model.coef_[0]))

# Sort the list by coefficient value
feature_importance = sorted(feature_importance, key=lambda x: x[1], reverse=True)
feature_importance
Out[48]:
[('drives', 0.00233431890503866),
 ('total_navigations_fav1', 0.0011214814300998155),
 ('total_navigations_fav2', 0.0009745306599053898),
 ('duration_minutes_drives', 0.00012689911883806293),
 ('km_per_driving_day', 6.4610483656169894e-06),
 ('ID', 1.411053457057454e-06),
 ('driven_km_drives', -3.5578753005605824e-05),
 ('total_sessions', -0.00012199249898301086),
 ('n_days_after_onboarding', -0.00039489304032060753),
 ('device2', -0.0010149565289877427),
 ('professional_driver', -0.0014780313973741977),
 ('activity_days', -0.10237928220150684)]
In [52]:
# Plot the feature importances
### YOUR CODE HERE ###

sns.barplot(x=[x[1] for x in feature_importance],
            y=[x[0] for x in feature_importance])
plt.show()

Task 4c. Conclusion

Now that you've built your regression model, the next step is to share your findings with the Waze leadership team. Consider the following questions as you prepare to write your executive summary. Think about key points you may want to share with the team, and what information is most relevant to the user churn project.

Questions:

  1. What variable most influenced the model's prediction? How? Was this surprising?

  2. Were there any variables that you expected to be stronger predictors than they were?

  3. Why might a variable you thought to be important not be important in the model?

  4. Would you recommend that Waze use this model? Why or why not?

  5. What could you do to improve this model?

  6. What additional features would you like to have to help improve the model?

==> ENTER YOUR RESPONSES TO QUESTIONS 1-6 HERE

Congratulations! You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.