Logistic Regression in RStudio

The histogram of age for the two income groups is drawn with the following layer:

    geom_histogram(aes(fill = Class), color = "black", binwidth = 2)

The age distribution is much more skewed for the lower-income group than for the high-income group.

We will split the data into training and test sets using the createDataPartition() function from the caret package in R. We will train the model on the training dataset and predict the values on the test dataset. To train the logistic model, we will use the glm() function. The model summary includes the following lines:

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 36113  on 32230  degrees of freedom
    Residual deviance: 28842  on 32223  degrees of freedom

All the variables in the model output have turned out to be significant (the p-values are less than 0.05 for all of them). If you look at the categorical variables, you will notice that n - 1 dummy variables are created for each categorical variable with n levels.
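The split, fit and predict steps described above can be sketched as follows. This is only an illustration under stated assumptions: the data is simulated rather than the Adult data, the column name Marital is a simplified stand-in for Marital-status, and both the 70% training share and the 0.5 cutoff are assumed values that the text does not give.

```r
library(caret)  # provides createDataPartition()

set.seed(42)

# Simulated stand-in for the cleaned data (not the Adult data)
n <- 500
adult <- data.frame(
  Age = round(runif(n, 18, 80)),
  Workclass = factor(sample(c("Private", "SL-gov", "Self-employed"), n, replace = TRUE)),
  Marital = factor(sample(c("Married", "Never-married", "Not-Married"), n, replace = TRUE))
)

# Simulate a binary target whose log-odds depend on Age and marital status
lin_pred <- -3 + 0.05 * adult$Age + 1.0 * (adult$Marital == "Married")
adult$Class <- factor(ifelse(runif(n) < plogis(lin_pred), ">50K", "<=50K"),
                      levels = c("<=50K", ">50K"))

# Stratified split on the target; p = 0.7 is an assumed training share
train_idx <- createDataPartition(adult$Class, p = 0.7, list = FALSE)
train <- adult[train_idx, ]
test  <- adult[-train_idx, ]

# Binary logistic regression; each factor with n levels becomes n - 1 dummy variables
fit <- glm(Class ~ Age + Workclass + Marital, data = train, family = binomial)
summary(fit)

# Predicted probability of the second factor level (">50K") on the test set
test_prob <- predict(fit, newdata = test, type = "response")
test_pred <- ifelse(test_prob > 0.5, ">50K", "<=50K")  # 0.5 cutoff is an assumption
table(Predicted = test_pred, Actual = test$Class)
```

With two three-level factors and one numeric predictor, the fitted model has six coefficients: the intercept, Age, and two dummy variables per factor.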

Marital-status now looks better distributed than Workclass. Next, we must convert both variables to factors using the as.factor() function.

    # Converting to factor variables
    adult$Workclass <- as.factor(adult$Workclass)
    adult$`Marital-status` <- as.factor(adult$`Marital-status`)

We will first convert all ? values to NA and then use na.omit() to keep only the complete observations.

    # Converting ? to NA

Finally, we take a look at the target variable.

To save time, I will go directly to the bivariate analysis. Let us see how the distribution of age looks for the two income groups.
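A minimal sketch of the cleaning steps described above, run on a toy data frame. The rows are made up, and the target column name Class and its labels are assumptions. The ? values are converted to NA before the factor conversion so that "?" does not survive as an unused factor level.

```r
# Toy stand-in for the subsetted data (values are made up for illustration)
adult <- data.frame(
  Age = c(25, 38, 52, 41, 29),
  Workclass = c("Private", "?", "SL-gov", "Self-employed", "Private"),
  `Marital-status` = c("Married", "Never-married", "?", "Not-Married", "Married"),
  Class = c("<=50K", ">50K", "<=50K", ">50K", "<=50K"),  # assumed target labels
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Converting ? to NA, then keeping only the complete observations
adult[adult == "?"] <- NA
adult <- na.omit(adult)

# Converting to factor variables (backticks are needed for the hyphenated name)
adult$Workclass <- as.factor(adult$Workclass)
adult$`Marital-status` <- as.factor(adult$`Marital-status`)

# A look at the target variable
table(adult$Class)
```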

Let us apply a similar treatment to our other categorical variable, Marital-status.

    # Generating the frequency table
    table(adult$`Marital-status`)

We can reduce its levels to never married, married and not married.

    # Combining levels
    adult$`Marital-status`[...] <- "Married"
    adult$`Marital-status`[...] <- "Not-Married"
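The level-combining step for Marital-status might look like the sketch below. The grouping of raw levels into "Married" and "Not-Married" is an assumption, since the text does not show which rows are selected. The raw level names are those of the UCI Adult data, and the toy vector is made up.

```r
# Toy stand-in for the Marital-status column (one row per raw level)
adult <- data.frame(
  `Marital-status` = c("Married-civ-spouse", "Divorced", "Never-married",
                       "Widowed", "Married-AF-spouse", "Separated",
                       "Married-spouse-absent"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Assumed grouping: every "Married-*" level becomes "Married"
married_levels <- c("Married-civ-spouse", "Married-AF-spouse", "Married-spouse-absent")
# Assumed grouping: formerly married levels become "Not-Married"
not_married_levels <- c("Divorced", "Separated", "Widowed")

adult$`Marital-status`[adult$`Marital-status` %in% married_levels] <- "Married"
adult$`Marital-status`[adult$`Marital-status` %in% not_married_levels] <- "Not-Married"

# "Never-married" is left unchanged, giving three levels in total
table(adult$`Marital-status`)
```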

The aim here is to give you a fair idea of how a data scientist or a statistician builds a predictive model, so we will try to demonstrate all the essential tasks that are part of a model-building exercise. However, for demo purposes, we will use only three variables from the whole dataset.

The adult dataset is fairly large, so to read it faster, I will use read_csv() from the readr package to load the data from my local machine.

    library(readr)

As mentioned earlier, we will use three variables, Workclass, Marital-status and Age, to build the model. Of these three, Workclass and Marital-status are categorical variables, whereas Age is a continuous variable.

    # Subsetting the data and keeping the required variables
    adult <- adult[...]

The new dataset has 48842 observations and only 4 variables (the three predictors plus the target).

The best way to summarize a categorical variable is to create a frequency table, and that is what we will do using the table() function. The levels of Workclass include Private, Self-emp-inc, Self-emp-not-inc and State-gov.

The table suggests that there are some 2799 missing values in this variable, which are represented by the (?) symbol. Also, the data is not uniformly distributed. Some of the levels have very few observations, so we have an opportunity to combine similar-looking levels.

    # Combining levels
    adult$Workclass[...] <- "Unemployed"
    adult$Workclass[...] <- "SL-gov"
    adult$Workclass[...] <- "Self-employed"
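The frequency table and the level-combining step for Workclass can be sketched as below. The mapping of raw levels to "Unemployed", "SL-gov" and "Self-employed" is an assumption, because the row selectors are not shown in the text. The toy vector is made up, and the file name in the commented read_csv() call is hypothetical.

```r
# With the real file, loading might look like this (hypothetical path):
# adult <- readr::read_csv("adult.csv")

# Toy stand-in for the Workclass column (values are made up for illustration)
adult <- data.frame(
  Workclass = c("Private", "State-gov", "Local-gov", "Self-emp-inc",
                "Self-emp-not-inc", "Without-pay", "Never-worked",
                "Federal-gov", "?", "Private"),
  stringsAsFactors = FALSE
)

# Frequency table of the raw levels, including the "?" placeholder
table(adult$Workclass)

# Combining levels (assumed grouping of the raw levels)
adult$Workclass[adult$Workclass %in% c("Without-pay", "Never-worked")] <- "Unemployed"
adult$Workclass[adult$Workclass %in% c("State-gov", "Local-gov")] <- "SL-gov"
adult$Workclass[adult$Workclass %in% c("Self-emp-inc", "Self-emp-not-inc")] <- "Self-employed"

# Frequency table after combining
table(adult$Workclass)
```

Using %in% rather than chained == comparisons keeps each reassignment on one line and makes it easy to adjust which raw levels fall into each group.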

Binary logistic regression is used to explain the relationship between a categorical dependent variable and one or more independent variables. When the dependent variable is dichotomous, we use binary logistic regression; in practice, binary logistic regression is almost always simply called logistic regression. The independent variables can be either qualitative or quantitative.

In logistic regression, the model predicts the logit transformation of the probability of the event. The final output is generated with the following formula:

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

In this equation, p is the probability of the event, and the ratio p / (1 - p) is the odds of the event. The beta coefficients in logistic regression are chosen by maximum likelihood estimation.

Case Study: What is UCI Adult Income?

In this tutorial, we will use the Adult Income data from the UCI Machine Learning Repository to predict the income class of an individual based on the information provided in the data. You can download the Adult Income data from the UCI repository.
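Returning to the logit transformation described above, the relationship between a probability, its odds and its log-odds can be checked numerically in base R. The probability, the coefficients b0 and b1, and the age value below are hypothetical numbers chosen only for illustration.

```r
# Probability of the event (illustrative value, not taken from the data)
p <- 0.8

odds  <- p / (1 - p)   # odds of the event
logit <- log(odds)     # log-odds, the quantity the linear predictor models

# qlogis() is base R's logit function and plogis() is its inverse
stopifnot(isTRUE(all.equal(logit, qlogis(p))))
p_back <- plogis(logit)  # recovers the original probability

# Hypothetical coefficients for a model with a single predictor (Age)
b0 <- -3
b1 <- 0.05
age <- 40

eta   <- b0 + b1 * age  # linear predictor on the log-odds scale
p_hat <- plogis(eta)    # predicted probability of the event
```

The model is linear on the log-odds scale, so plogis() is what turns a fitted linear predictor back into a probability between 0 and 1.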