What are R-language and the need to learn it?
R is a statistical programming language used by data scientists for data analysis and statistical computing in nearly every industry and field. It was developed in the early 1990s, and since then numerous efforts have been made to keep improving its capabilities and user interface.
The journey of the R language from a basic text editor to the interactive RStudio and, more recently, Jupyter Notebooks has engaged data science communities across the world.
The inclusion of various high-end packages has made R more and more potent with time. Packages such as dplyr, tidyr, readr, data.table, SparkR, and ggplot2 have made data manipulation, visualization, and computation much more efficient, fast, and accurate.
R makes it fast and simple to implement machine learning algorithms. By the end of this blog post, you will have enough exposure to the fundamentals to start building predictive models with machine learning on your own.
Note: No prior knowledge of data science or analytics is mandatory here. However, previous experience of algebra and statistics will be helpful.
Benefits of using R-language
There are a plethora of benefits R offers in the data science world, but I have chalked out a few basic ones that give a fundamental idea about R.
- The style of coding is quite handy & uncomplicated.
- It’s open-source. Hence, no need to pay any subscription fee to use it.
- Instant access to over 7800 packages customized for various computation tasks.
- The community support is overwhelming. There are plenty of forums to help you out.
- Gives a high-performance computing experience (with the appropriate packages installed).
- One of the highly sought-after & pursued skills by analytics and data science companies right now.
Installation of R / RStudio
You can download and install plain R on its own. But I would insist you start with RStudio, which provides a much better coding experience. (Note that RStudio still needs R installed underneath.)
For Windows users, RStudio is available for Windows Vista and above versions. Follow the steps below for installing RStudio:
- Navigate to https://www.rstudio.com/products/rstudio/download
- In the section Installers for Supported Platforms, select and click on the RStudio installer based on your existing operating system. The download should begin as soon as you click.
- Wait for the download to complete, then run the installer.
- Keep clicking Next till you reach Finish, then click that.
- Once installation completes, click on the generated RStudio desktop icon or use Windows search to access the program. It looks like this:
A quick understanding of the RStudio interface:
- R Script Code Editor: This is the space to write code. To run it, select the line(s) of code and press Ctrl + Enter. Alternatively, you can click on the Run button located at the top right corner of the R script pane.
- R Console: This area shows the output of the code you run. You can also write code directly in the console; however, code entered directly in the console cannot be traced later on. This is where the R script comes into use.
- R Environment: This space displays the set of external elements added, which includes data set, variables, vectors, functions, etc. To check if the data has been successfully loaded in R, always keep an eye on this area.
- Graphical Output: This space displays the graphs generated during exploratory data analysis. The same pane also has tabs for browsing files, managing packages, and reading help pages. For further detail, go through R's official documentation.
Installation of R Packages
The real power of R lies in its incredible packages. In R, most data handling tasks can be performed in two ways: using R packages or using base R functions. In this post, I'll also introduce you to the handiest and most powerful R packages. There are two ways to install packages in R.
1. Using R Script-
To install a package from a script, type install.packages("package name")
As a first-time user, a pop-up might appear asking you to select your CRAN mirror (country server); choose accordingly and press OK.
Note: You can type this either in the console directly and press ‘Enter’ or in the R script and click on ‘Run.’
2. Using Package Library-
- Run RStudio
- Click on the Packages tab in the bottom-right section and then click on install. The following dialog box will appear.
- In the Install Packages dialog, write the package name you want to install under the Packages field and then click Install. This will install the package you searched for or give you a list of matching packages based on the text you entered.
Basic Computations in R (Simple Example)
To familiarize ourselves with the R coding environment, let us start with some basic calculations. R console can be used as an interactive calculator too. Type the following in your console:
> 2 + 3
[1] 5
> 6 / 3
[1] 2
> (3*8)/(2*3)
[1] 4
> log(12)      # natural logarithm
[1] 2.484907
> sqrt(121)
[1] 11
Similarly, you can experiment with different combinations of calculations and check out the results. In case you want to recall a previous calculation, click on the R console and press the Up/Down arrow keys on your keyboard. This cycles through previously executed commands; press Enter to re-run one.
But, what if you have performed too many calculations & want to check out the ones lying way before the last one? In this case, finding out the result by scrolling using arrow keys through every command will turn out to be too tedious.
In such situations, creating a variable is a better way. In R, you can create a variable using <- or = sign. Let’s assume I want to create a variable x to compute the sum of 10 and 15. I’ll write it as:
> x <- 15 + 10
> x
[1] 25
Once you assign to a variable, the result no longer prints directly on the console unless you call the variable on the next line. Always remember: variable names can be alphabetic or alphanumeric, but they cannot start with a number.
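As a minimal sketch of these naming rules (the names here are just illustrative):

```r
x1 <- 5          # valid: a letter followed by a digit
my_score <- 10   # valid: underscores (and dots) are allowed
# 1x <- 5        # invalid: a name cannot start with a digit
x1 + my_score    # variables can be used in further calculations
# [1] 15
```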
Essentials of R-language
A thorough understanding of this section is exceptionally vital. This is one of the building blocks of your R programming knowledge. If you get this right, you will face less trouble in debugging.
R has five basic or atomic classes of objects. But, let us first understand what an object is!
Everything you see or create in R is an object. A vector, matrix, data frame, or even a variable is treated as an object by R. The five atomic classes are:
- Character
- Numeric (Real Numbers)
- Integer (Whole Numbers)
- Complex
- Logical (True / False)
Since these classes are self-explanatory by name, I won't elaborate on them further. These classes have attributes. An attribute is an identifier, such as a name or a number, that describes the object. An object can have the following attributes:
- Names, dimension names
- Dimensions
- Class
- Length
Attributes of an object can be accessed using the attributes() function.
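For instance, a vector created with element names carries a names attribute, which attributes() reports:

```r
v <- c(a = 1, b = 2, c = 3)   # a vector with a names attribute
attributes(v)                  # $names: "a" "b" "c"
names(v)                       # the same attribute, accessed directly
length(v)
# [1] 3
```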
Let us now understand the concept of objects and attributes programmatically. The most fundamental data structure in R is the vector. You can create an empty vector using vector(). Remember, vectors contain objects of the same class.
For example, let's create vectors of different classes. We can create a vector using the c() (concatenate) command.
> a <- c(1.8, 4.5)         # numeric
> b <- c(1 + 2i, 3 - 6i)   # complex
> d <- c(23L, 44L)         # integer (the L suffix makes these integers)
> e <- vector("logical", length = 5)
Similarly, you can create a vector of various classes.
Data Types in R language
R has various data types, including vectors (numeric, integer, etc.), matrices, data frames, and lists. Let's get a brief idea about each one of them.
Vector:
A vector contains objects of the same class. But if you try to mix objects of different classes, coercion occurs: R converts all the elements to a single common class. For example:
> qt <- c("Time", 24, "October", TRUE, 3.33)   # coerced to character
> ab <- c(TRUE, 24)                            # coerced to numeric
> cd <- c(2.5, "May")                          # coerced to character
To check the class of any object, use the class() function.
> class(qt)
[1] "character"
To convert the class of a vector, you can use the as.* family of functions (as.numeric(), as.character(), and so on). Note that these return a converted copy, so you must reassign the result:
> bar <- 0:5
> class(bar)
[1] "integer"
> bar <- as.numeric(bar)
> class(bar)
[1] "numeric"
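Putting coercion and conversion together: mixing classes always converts everything to the most flexible class involved, following a fixed order (logical, then integer, then numeric, then character). A quick sketch:

```r
class(c(TRUE, 2L))    # logical + integer   -> "integer" (TRUE becomes 1L)
class(c(1L, 2.5))     # integer + numeric   -> "numeric"
class(c(1, "two"))    # numeric + character -> "character"
as.integer("5") + 1   # explicit coercion with an as.* function
# [1] 6
```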
List:
A list is a special type of vector that contains elements of different data types. For example:
> my_list <- list(22, "ab", TRUE, 1 + 2i)
> my_list
[[1]]
[1] 22
[[2]]
[1] "ab"
[[3]]
[1] TRUE
[[4]]
[1] 1+2i
The double bracket [[1]] shows the index of the first element, and so on. Hence, you can easily extract elements of a list by their index, like this:
> my_list[[3]]
[1] TRUE
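List elements can also be named, and then extracted by name with $ or [[ ]]; the element names below are just illustrative:

```r
student <- list(name = "ash", marks = c(67, 88, 91), passed = TRUE)
student$name            # extract by name with $
# [1] "ash"
student[["marks"]][2]   # second element of the marks vector
# [1] 88
```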
Matrices:
When a vector is given a dimension attribute (rows & columns), it becomes a matrix: a two-dimensional data structure represented as a set of rows and columns, all of whose elements belong to the same class. Let's create a matrix of three rows and two columns:
> my_matrix <- matrix(1:6, nrow=3, ncol=2)
> my_matrix
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
The dimensions of a matrix can be obtained using either the dim() or attributes() command.
> dim(my_matrix)
[1] 3 2
> attributes(my_matrix)
$dim
[1] 3 2
To extract a particular element from a matrix:
> my_matrix[,2]   # extracts second column
> my_matrix[,1]   # extracts first column
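Individual elements and rows can be extracted the same way, using the [row, column] convention:

```r
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)  # filled column by column
my_matrix[1, 2]           # element in row 1, column 2
# [1] 4
my_matrix[2, ]            # entire second row
# [1] 2 5
my_matrix[my_matrix > 3]  # logical indexing: all elements greater than 3
# [1] 4 5 6
```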
Data Frame:
This is the most commonly used data type, used to store tabular data. In a matrix, every element must have the same class, but in a data frame you can put together vectors containing different classes; every column of a data frame acts like a list. Every time you read data into R, it will be stored as a data frame. Hence, it is important to understand the most commonly used commands on data frames:
> df <- data.frame(name = c("ash","jane","paul","mark"), score = c(67,56,87,91))
> df
name score
1 ash 67
2 jane 56
3 paul 87
4 mark 91
> str(df)   # df is the name of the data frame
'data.frame': 4 obs. of 2 variables:
 $ name : chr "ash" "jane" "paul" "mark"
 $ score: num 67 56 87 91
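A few more everyday data frame commands, recreating the same df so the snippet is self-contained:

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(67, 56, 87, 91))
head(df, 2)          # first two rows
nrow(df)             # number of rows: 4
df$score             # a single column, as a vector
df[df$score > 80, ]  # rows where score exceeds 80 (paul and mark)
mean(df$score)
# [1] 75.25
```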
Control Structures in R
As the name suggests, a control structure controls the flow of the code or commands written inside a function. A function is a set of multiple commands written to automate a repetitive coding task. Some of the important Control Structures in R include- if/else, for, while, etc. Let us understand these in brief:
if / else: used to test a condition
if (<condition>) {
  # statements
} else {
  # statements
}
Example-
#initialize a variable
N <- 10
#check if this variable * 5 is > 40
if (N * 5 > 40){
print("This is easy!")
} else {
print ("It's not easy!")
}
[1] "This is easy!"
for:
used when a loop is to be executed a fixed number of times. We commonly use a for loop for iterating over the elements of an object (list, vector).
for (<variable> in <sequence>) {
  # statements
}
Example-
#initialize a vector
y <- c(99,45,34,65,76,23)
#print the first 4 numbers of this vector
for(i in 1:4){
print (y[i])
}
[1] 99
[1] 45
[1] 34
[1] 65
while:
It first tests a condition and executes the body only if the condition is true. After each iteration of the loop, the condition is rechecked. Hence, it is necessary to update the condition inside the loop so that it doesn't run infinitely.
while (<condition>) {
  # statements
}
Example-
#initialize a condition
Age <- 12
#check if age is less than 17
while(Age < 17){
print(Age)
Age <- Age + 1   # incrementing Age ensures the loop eventually stops
}
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
There are a few other control structures as well, though they are less frequently used than the ones explained above:
- repeat – executes an infinite loop
- break – breaks the execution of a loop
- next – allows skipping an iteration in a loop
- return – helps exit a function
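A brief sketch tying three of these together: repeat loops forever until break exits it, and next skips ahead to the following iteration:

```r
i <- 0
repeat {
  i <- i + 1
  if (i > 5) break        # exit the otherwise infinite loop
  if (i %% 2 == 0) next   # skip even numbers
  print(i)                # prints 1, 3, 5
}
```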
Useful R Packages
R is supported by various packages that complement its base functionality. Some of the most basic and commonly used packages in predictive modelling are as follows:
- Importing Data: R provides a wide range of packages for importing data available in any format such as .txt, .csv, .json, .sql etc. In order to import large files of data quickly, it is advisable to install and use data.table, readr, RMySQL, sqldf, jsonlite.
- Data Visualization: R provides some in-built plotting commands, which are suitable for generating simple graphs. But they get complicated when it comes to creating advanced graphics. Hence, you should install ggplot2.
- Data Manipulation: R has a vast collection of packages for data manipulation. These packages enable you to perform basic & advanced computations in a spree. They are dplyr, plyr, tidyr, lubridate, stringr.
- Modelling/ Machine Learning: For modelling, the caret package in R is powerful enough to cater to every need for creating a machine learning model. However, you can install packages based specifically on algorithmic requirements, such as randomForest, rpart, gbm, etc.
Exploratory Data Analysis in R-language
From this section onwards, we're diving deep into the primary stages of predictive modeling. Data exploration is a crucial stage: you can't build efficient models unless you learn to explore your data thoroughly. This stage forms a concrete foundation for data manipulation (the very next step). Let's understand it in R.
Before we start, you must get familiar with these terms:
- Response Variable (i.e., Dependent Variable): In a data set, the response variable (y) is the one on which we make predictions.
- Predictor Variable (i.e., Independent Variable): In a data set, predictor variables (Xi) are those using which we predict the response variable.
- Train Data: The predictive model is always built on the train data set. An intuitive way to identify the train data is that it always has the response variable included.
- Test Data: Once the model is built, its accuracy is verified on test data. This data always contains a lesser number of observations than the train data set. Also, it does not include the response variable.
Graphical Representation of Variables
It is easier & clearer to understand all the variables with the help of graphical/visual aids. Using graphs, we can analyze the data in two ways: univariate analysis and bivariate analysis.
Univariate analysis is done with one variable. It is a lot easier to implement. Bivariate analysis is done with two variables. Let’s now experiment by performing bivariate analysis & check out the results.
For visualization, I’ll use ggplot2 package. These graphs would help us understand the distribution and frequency of variables in the data set.
> ggplot(train, aes(x= Item_Visibility, y = Item_Outlet_Sales)) + geom_point(size = 2.5, color="navy") + xlab("Item Visibility") + ylab("Item Outlet Sales") + ggtitle("Item Visibility vs Item Outlet Sales")
Here we can see that the majority of sales has been obtained from products having a visibility of less than 0.2. This suggests that Item_Visibility < 0.2 must be an essential factor in determining sales.
Let’s plot another interesting sample graph in order to strengthen our understanding of concepts.
> ggplot(train, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_bar(stat = "identity", color = "purple") +theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black")) + ggtitle("Outlets vs Total Sales") + theme_bw()
Here we can infer that OUT027 has contributed the majority of sales, followed by OUT035. OUT010 and OUT019 probably have the least footfall, thereby contributing the least to outlet sales.
Now, we have an idea of the variables and their importance on the response variable.
Let us now combine the data sets, which will save time since we won't need to write separate code for the train and test sets. To combine two data frames with rbind(), they must have exactly the same columns. The test set is missing the response variable, so we add it (filled with NA) before combining:
> dim(train)
[1] 8523 12
> dim(test)
[1] 5681 11
> test$Item_Outlet_Sales <- NA
> combi <- rbind(train, test)
Data Manipulation in R
This section covers how to execute the most frequently used data manipulation tasks in R, with various examples and code. It gives you a quick look at several functions used in R.
- Replacing / Recoding values: It means replacing existing value(s) with new value(s).
Create Dummy Data-
mydata = data.frame(State = ifelse(sign(rnorm(25))==-1,'Delhi','Goa'), Q1= sample(1:25))
In this example, we are replacing 1 with 6 in Q1 Variable-
mydata$Q1[mydata$Q1==1] <- 6
In this example, we are replacing "Delhi" with "Mumbai" in the State variable. We first need to convert the variable from factor to character-
mydata$State = as.character(mydata$State)
mydata$State[mydata$State=='Delhi'] <- 'Mumbai'
Another method uses the recode() function, for which you first have to install the car package.
# Install the car package
install.packages("car")
# Load the car package
library("car")
Recoding to a new column:
# Create a new column called Ques1
mydata$Ques1<- recode(mydata$Q1, "1:4=0; 5:6=1")
Note: Make sure you have installed and loaded the “car” package before running the above syntax.
- Renaming variables: To rename variables, you first have to install the dplyr package.
install.packages("dplyr")
# Load the dplyr package
library(dplyr)
# Rename Q1 variable to var1
mydata <- rename(mydata, var1 = Q1)
- Keeping and Dropping Variables
In this example, we keep only the first two variables.
mydata1 <- mydata[1:2]
In this example, we keep the first and third through sixth variables.
mydata1 <- mydata[c(1,3:6)]
In this example, we select variables using their names such as v1, v2, v3.
newdata <- mydata[c("v1", "v2", "v3")]
Deleting a particular column (the fifth column)
mydata[-5]
Dropping Q3 variable
mydata$Q3 <- NULL
Deleting multiple columns
mydata[-(3:4)]
Dropping multiple variables by their names
df = subset(mydata, select = -c(x,z) )
- Subset data (Selecting Observations): this means filtering rows (observations).
Create Sample Data
mydata = data.frame(Name = ifelse(sign(rnorm(25))==-1,'ABC','DEF'), age = sample(1:25))
Selecting first 10 observations
newdata <- mydata[1:10,]
Selecting observations where age equals 3 into a new data frame called 'newdata'
newdata <- subset(mydata, age==3)
Conditional Statement (AND) while selecting observations
newdata<-subset(mydata, Name=="ABC" & age==3)
Conditional Statement (OR) while selecting observations
newdata<-subset(mydata, Name=="ABC" | age==3)
- Sorting: it is one of the most common data manipulation tasks. It is generally used when we want to see the top 5 highest / lowest values of a variable.
Sorting a vector
x= sample(1:50)
x = sort(x, decreasing = TRUE)
Note: The sort() function is used for sorting a one-dimensional vector. It cannot be used on objects with more than one dimension; for data frames, use order(), as shown next.
Sorting a data frame
mydata = data.frame(Gender = ifelse(sign(rnorm(25))==-1,'F','M'), SAT= sample(1:25))
Sort by the Gender variable in ascending order
mydata.sorted <- mydata[order(mydata$Gender),]
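To get the top 5 highest values mentioned above, combine order() (descending via the minus sign) with head(); since sample(1:25) is a permutation of 1 through 25, the top five SAT values here are always 25 down to 21:

```r
mydata <- data.frame(Gender = ifelse(sign(rnorm(25)) == -1, 'F', 'M'),
                     SAT = sample(1:25))
top5 <- head(mydata[order(-mydata$SAT), ], 5)  # five highest SAT scores
top5$SAT
# [1] 25 24 23 22 21
```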
- Dealing with missing data:
Number of missing values in a variable
colSums(is.na(mydata))
Number of missing values in a row
rowSums(is.na(mydata))
List rows of data that have missing values
mydata[!complete.cases(mydata),]
Creating a new dataset without missing data
mydata1 <- na.omit(mydata)
Convert a value to missing
mydata[mydata$Q1==999,"Q1"] <- NA
- Aggregate by groups:
The following code calculates the mean of variable "x" grouped by variable "y".
samples = data.frame(x =c(rep(1:10)), y=round((rnorm(10))))
mydata <- aggregate(x~y, samples, mean, na.rm = TRUE)
Frequency for a vector: to calculate frequency for State vector, you can use table() function.
> State <- c("DL","MU","NY","DL","NY","MU")
> table(State)
State
DL MU NY
2 2 2
- Merging (Matching): merge() joins two datasets on a common key column; by default it keeps only the cases common to both datasets.
mydata <- merge(mydata1, mydata2, by=c("ID"))
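merge() also supports other join types through its all arguments; a small sketch with two hypothetical data frames:

```r
df1 <- data.frame(ID = 1:3, x = c("a", "b", "c"))
df2 <- data.frame(ID = 2:4, y = c(10, 20, 30))
merge(df1, df2, by = "ID")                # inner join: IDs 2 and 3 only
merge(df1, df2, by = "ID", all.x = TRUE)  # left join: all of df1, NA for ID 1
merge(df1, df2, by = "ID", all = TRUE)    # full join: IDs 1 through 4
```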
- Removing Duplicates:
data = read.table(text="
X Y Z
6 5 0
6 5 0
6 1 5
8 5 3
1 NA 1
8 7 2
2 0 2", header=TRUE)
In the example below, we are removing duplicates in a whole data set.
mydata1 <- unique(data)
In the example below, we are removing duplicates by “Y” column.
mydata2 <- subset(data, !duplicated(data[,"Y"]))
- Combining Columns and Rows: If two matrices have the same number of rows, they can be combined into a larger matrix using the cbind() function. In the example below, A and B are matrices.
newdata<- cbind(A, B)
Similarly, we can combine the rows of two matrices if they have the same number of columns with the rbind() function. In the example below, A and B are matrices.
newdata<- rbind(A, B)
- Combining rows with different sets of columns: rbind() doesn't work when the column names of the two datasets do not match. For example, dataframe1 has three columns A, B, C, while dataframe2 has three columns A, D, E; used here, rbind() will throw an error. The smartbind() function from the gtools package would combine on column A and return NA where column names do not match.
install.packages("gtools")   # if not installed
library(gtools)
mydata <- smartbind(mydata1, mydata2)
So, we have finally come to the end of this blog post. I hope I was able to impart some basic ideas & fundamental knowledge about R, enough to get you started on your own learning spree through data munging & modeling with the language. If you take my advice, don't jump straight to building a complex model. Simple models give you fundamental learning, a benchmark score, and a threshold to work from.
In this brief tutorial, I have demonstrated the steps used in data exploration, data visualization, and data manipulation. Happy learning, folks!