What is R?
R is a programming language and environment for statistical analysis and graphical representation. It is used extensively by data scientists and machine learning engineers. R is a dialect of S, a language for statistical modelling that was developed by John Chambers at Bell Labs. According to some surveys, R is the second most used language in data science after Python.
To the audience
This tutorial is designed for people who want to learn a language for statistics and graphical representation. It takes you from the basics to more advanced topics and will be useful for data miners, statisticians, data scientists, and software engineers. If you already know another programming language you will move faster, but you don’t need to worry if you are a newcomer, because we start from the basics. Now let’s get started.
R is an interpreted programming language that supports loops, conditionals, nested constructs, and functions.
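As a rough preview of those building blocks, the sketch below combines a function, a loop, and a conditional. The function name count_even is made up for this illustration and is not part of R.

```r
# Hypothetical helper: counts how many values in a vector are even
count_even <- function(values) {
  n_even <- 0
  for (v in values) {          # loop over every element
    if (v %% 2 == 0) {         # conditional: is the remainder zero?
      n_even <- n_even + 1
    }
  }
  n_even                       # the last expression is the return value
}

result <- count_even(1:10)
print(result)                  # the numbers 2, 4, 6, 8, 10 are even
```

Each of these constructs is covered in more detail later in the tutorial.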
History
R was created in New Zealand in 1991 by Ross Ihaka and Robert Gentleman and was made public in 1993. In 1995, R became free software under the GNU General Public License. In 1997 the R Core Group was formed; it now controls the source code of R and updates it from time to time.
Features of R
R has a number of features that make it a popular language. Some of them are:
- Syntax of R is similar to S, which makes it easy for S-PLUS users to switch over.
- R is a simple and well-developed programming language which includes conditionals, loops, user-defined functions, and input and output facilities.
- It runs on almost every standard computing platform.
- The development team of R is very active and has frequent releases.
- It handles data effectively and has an excellent storage facility.
- R also contains a number of data types such as strings, vectors, arrays, matrices, lists, etc.
- It provides sophisticated graphical features that compare well with those of most statistical packages.
- R is quite a lean software and the functionality of R is divided into modular packages.
- The user community of R is very active on the R-help and R-devel mailing lists, as well as on Stack Overflow.
How R is better than other languages
R is one of the most popular programming languages for statistical analysis and data modeling, which is why it is often preferred for machine learning applications. Here we discuss some of the major advantages of the R programming language.
- Open source language: R is an open source programming language which means that we can work with R for free and without any need of a license or fee to buy the services offered. Moreover, we can contribute towards the development of R through package optimization, developing new packages and libraries, or by resolving current issues.
- Platform independent: R is a platform independent programming language which means that a code written in R will run on all systems irrespective of their operating systems. This is very beneficial as programmers will only need to develop a software once and it can easily be run on any operating system.
- Machine learning based operations: R provides many packages, libraries and features that are useful for developing machine learning applications and for performing tasks such as classification and regression.
- Data wrangling: R supports data wrangling through various packages such as dplyr which is very beneficial while converting the data into a structural form.
- Plotting graphs: R provides libraries such as ggplot2 and plotly which are very beneficial for visualizing the data through various graphs and plots.
- Statistics: R is one of the most popular languages for statistics and is often called the language of statistics. This is why R is frequently chosen over other programming languages for statistical applications.
- Packages: R has a large number of packages that support machine learning and other operations. Most of them are available from the CRAN repository.
- Support and growth: R is a continuously growing language: it is under active development, and new packages and libraries are still being added and updated. The R development team works continuously on the growth of R.
Applications of R Programming
R is widely used by data scientists and machine learning engineers, which makes it a useful language for many machine learning and data science tasks. Tech companies such as Facebook, Google, and Accenture use R for developing reliable tools. R is also used in many other domains, some of which are discussed here:
- Finance: R is well suited to finance, because finance relies heavily on statistics and R is known for statistical work. R is very useful for tasks such as plotting density plots, drawdown plots and various other statistical plots. R also provides tools for time-series analysis, moving averages, autoregression, etc. These features make it very useful for financial tasks.
- Healthcare: R is used in many healthcare applications. Fields such as genetics, epidemiology, and drug research rely on R for research and statistical analysis. Companies use the data analysis and processing capabilities of R in these fields.
- Banking: R is widely used in the banking sector for finance-related work. Banks use R in conjunction with Hadoop for customer segmentation, assessing the creditworthiness of loan applicants, customer sentiment analysis, etc.
- E-commerce: The e-commerce sector applies data science with the help of R to many tasks, from handling product data to segmenting customer reviews. R is widely used in e-commerce because of its support for the machine learning techniques used in this field.
- Social media: Social media companies such as Facebook, Twitter and Instagram use R for various machine learning applications. Data visualization and data segmentation are some of the most popular uses of R in social media. R also helps in the development of features that require machine learning, such as suggestions of videos, posts, and followers.
Hence, we can conclude that R is a very good programming language for statistical uses and also for graphical representations which makes it very beneficial for data scientists and machine learning engineers.
How to download and install R studio on Windows and Mac
R is the primary statistical programming language for performing modeling and graphical tasks. With its extensive support for performing matrix computation, R is used for a variety of tasks that involve complex datasets.
There are two pieces of software that we are going to install for R programming:
- R – R is a free software environment for statistical computing and graphics that you can use to clean, analyze, and graph your data.
- RStudio – It is a comprehensive environment for R scripting and has more features than the basic R console.
Environment Setup
We are going to guide you through the installation of R and RStudio on Mac and Windows.
In order to install R go to this website here, or simply type this link in your web browser: https://cran.r-project.org.
It will take you to this landing page from where you can download R for windows or mac.
Installing R on Mac
After clicking on the above link you will be taken to this page, from where you can click Download R for macOS. It will take you to the next page:
Click on the link beside the red arrow and a file will start downloading automatically. As time passes, a newer version may be made available however you can download the latest version by just clicking on the first link. After downloading the file run it, accept all the terms and conditions, and click “Continue” in order to move forward to the installation.
Installing R on Windows
After clicking on the above link you will be taken to this page, from where you can click Download R for Windows. It will take you to the next page. Click on the link beside the red arrow, which will open a new page. Here,
Click on the link beside the red arrow and a file will start downloading automatically. As time passes, a newer version may be made available however you can download the latest version by just clicking on the first link. After downloading the file run it and accept all the terms and conditions and click “Continue” in order to move forward to the installation.
Installing R studio
RStudio is recommended for Mac users; however, it is not mandatory for Windows users. In order to download RStudio click on this link here, or type this in your web browser: https://rstudio.com. This will take you to the following page:
Go to the products tab and hover on it, a drop-down menu like this will appear from which select RStudio as shown below:
Next, this page appears:
Then, scroll down to click download RStudio desktop, and download the free version:
On the next page, scroll down to the section headed All installers and click on the installer that matches the operating system of your machine:
This will start downloading the required file. To further install click on “Continue” or “Next” while accepting the terms and conditions and the installer will install the software onto your system.
Now, launch the R IDE to start coding. The console appears automatically in the R GUI, where you can run your code and experiment with it. There is no need to install any additional IDE, as R is enough on its own. Now, type the command “getwd()” to check the current working directory. Files that your programs read or write using relative paths are looked up in this directory, so additional files needed in future work should be placed there. An overview of the landing page is shown below:
Components of RStudio
- Source – In the top left corner of the screen is the text editor that allows you to work within source scripting. You can enter multiple lines in this source. Furthermore, users can save the R scripts to files that are stored in local memory.
- Console – This is present on the bottom left corner of the main window of R Studio. It facilitates interactive scripting in R.
- Workspace and History – In the top right corner, you will find the R workspace and the history window. This will give you the list of all the variables that were created in the environment session. Furthermore, you can also view the list of past commands that were executed by R.
The Files, Plots, Packages, and Help tabs in the bottom right corner give access to the following tools:
- Files – A user can browse the various files and folders on a computer.
- Plots – We obtain the user plots here.
- Packages – Here, we can view the list of all the installed packages.
- Help – We can browse the built-in help system of R from this tab.
Getting Started with R
Let’s get started with R programming. We will try to print “Hello World!” as our first program in R programming.
Type the following lines on the R command prompt:
hello <- "Hello World!"
print ( hello )
[1] "Hello World!"
In the first line of the code above, “hello” is a variable that has been assigned the string “Hello World!”. We will discuss what a variable means in the upcoming sections, so let’s skip it for now. The “<-” symbol is the assignment operator, which assigns the given string to the variable hello.
Note that in R the assignment operator is written “<-”.
After assigning a value to the variable we print it. To get the output, we use the print() function. Unlike some other programming languages, R does not require any particular indentation.
So, we can see that the output of the code is printed on the third line. Here we see “[1]” before our output. The reason is that R treats every value as a vector, even a single value, and the number in brackets is the index of the first element shown on that line. We will see further cases of this in the upcoming code.
Comments
Comments are pieces of text in the code whose sole purpose is to explain the code to other programmers and to improve readability. In R a comment starts with the “#” symbol. Comments are ignored by the R interpreter, so you don’t need to worry about errors caused by commenting. Anything that follows a “#” on a line is considered a comment. If a line contains a double “#” symbol, the comment still starts at the first one. Let’s practice commenting.
Another set of examples in R is here, try to run it for yourself and get yourself ready.
> x <- 4 # Assigning x with value 4
> print ( x ) # printing the value of x
[1] 4
> x # we can also print the value of x just by typing the name of the variable in R
[1] 4
> x <- ## This is an incomplete expression, a “+” prompt appears which asks us to complete it.
+ "hello" # Here, we complete the assignment, so x now holds "hello".
> x
[1] "hello"
In the above code segment, we get the output just by typing the name of the variable, i.e., x in this case. This way of printing is known as auto printing, while using the print() function to get the output is called explicit printing.
Although the above comments are written within the code, they do not interfere with the actual code. The comments shown above are examples of single-line comments. If you are using RStudio, use the shortcut “Ctrl + Shift + C” to comment or uncomment multiple lines. R has no dedicated syntax for multi-line comments, so let’s see an example of a common workaround at the R prompt.
> x <- 5
> if ( x == 5 ) {
+ "This is a multiline comment,
+ to show the example of a comment.
+ This thing really looks good"
+ print ( x )
+ }
[1] 5
The above code segment shows how the workaround behaves. The text is not a true comment but a string literal: R evaluates it and discards it because it is neither printed nor assigned, so only the value of x is printed.
Let’s take a case in which we have to print a set of integers. We can use the “:” operator to create the integer sequences. For example, run the below code:
x <- 5 : 24
print ( x )
[1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Here, we are trying to print the integer sequence starting with “5” till “24”.
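The colon operator can be explored a little further. The sketch below is only an illustration: the variable names are arbitrary, and seq() is shown as the base R function to reach for when a step other than 1 is needed.

```r
x <- 5:24
n <- length(x)                   # number of elements in the sequence
kind <- typeof(x)                # the colon operator yields integers here

rev_x <- 24:5                    # the operator also counts downwards
by_two <- seq(5, 24, by = 2)     # 5, 7, 9, ... with a step of 2

print(n)
print(kind)
print(by_two)
```

Note that seq() with a by argument stops at the last value that does not exceed the upper limit, so 24 itself is not part of by_two.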
Variables and Constants in R
A variable is a named location used to store a value, and this stored value can be changed or manipulated. A variable may store a vector, a list, a string, or other R objects. There is a set of rules for writing variable names that must be followed in order to avoid errors. A valid variable name consists of letters, digits, periods (.), and underscores (_). It must start with a letter or a period, and if it starts with a period, the period must not be followed immediately by a digit. A name that starts with a digit or an underscore is invalid. Reserved words of R cannot be used as variable names. The table below lists examples of valid and invalid variable names.
Variable Name | Validity | Reason |
Hello | valid | Contains letters |
HelLo | valid | Contains Letters |
Hello_hi | valid | Contains letters and underscore |
Hello_1 | valid | Contains letters, underscore and number |
Hello_1. | valid | Contains letters, underscore, numbers, and period(.) |
Hello_hi% | invalid | Contains an invalid special character |
_hello | invalid | Starts with an underscore |
.hello | valid | Can start with a dot but shouldn’t be followed by a digit immediately |
.7hello | invalid | Can start with a dot but shouldn’t be followed by a digit immediately |
Hello.1_r | valid | Starts with letters and doesn’t have invalid characters. |
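The rules in the table can be checked roughly from within R. The sketch below assumes that a name is acceptable when the base function make.names() leaves it unchanged; the helper is_valid_name is hypothetical and written only for this illustration.

```r
# Hypothetical helper: TRUE when make.names() does not need to alter the name
is_valid_name <- function(name) {
  make.names(name) == name
}

candidates <- c("Hello", "Hello_1.", "Hello_hi%", "_hello",
                ".hello", ".7hello", "if")
valid <- is_valid_name(candidates)

# One row per candidate name with its validity flag
print(data.frame(name = candidates, valid = valid))
```

The last candidate, if, fails because make.names() also alters reserved words. This is a convenience check, not a full statement of the language grammar.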
Reserved words are words that are fixed by the language for syntax purposes. Some of the reserved words are if, else, for, while, function, TRUE, FALSE, and NULL.
Constants in R are entities whose values are fixed and cannot be changed. There are two types of constants in R, namely numeric constants and character constants. Numeric constants contain numbers and are of type integer, double or complex. Numeric constants that are followed by L are integers, plain numeric constants are of type double, and constants that are followed by i are of type complex. Numeric constants that are preceded by 0x or 0X are interpreted as hexadecimal numbers. The typeof() function is used to check the data type of a variable. Let’s see some code segments for this concept:
> var_r <- "Hello"
> typeof ( var_r )
[1] "character"
> var_r <- 34
> typeof ( var_r )
[1] "double"
> var_r <- 34L
> typeof ( var_r )
[1] "integer"
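The complex and hexadecimal constants mentioned above can be inspected in the same way. This is a small sketch with arbitrary variable names.

```r
z <- 3+2i        # the i suffix makes a complex constant
h <- 0x1F        # hexadecimal constant, the same number as 31
hl <- 0x1FL      # hexadecimal constant with the L suffix

print(typeof(z))     # "complex"
print(typeof(h))     # "double"
print(typeof(hl))    # "integer"
print(h == 31)       # TRUE
```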
Finding the variables: In order to find the variables used in our workspace, we use the ls() function.
> print ( ls ( ) )
[1] "myString" "var_r" "x"
Deleting the variables in R: In order to delete a variable in R we use the rm() function. If we pass the result of ls() to rm() through its list argument, all the variables are deleted.
> rm ( myString )
> print ( myString )
Error in print(myString) : object 'myString' not found
> rm ( list = ls ( ) )
> ls ( )
character(0)
In the first set of commands, we deleted the variable “myString”, so it can no longer be found. In the second set, we deleted the whole list of variables, which is why we get the output “character(0)” on running the ls() command. It means there are no more variables.
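Printing a deleted variable raises an error, so a gentler way to check whether a variable is still defined is the base function exists(). The sketch below uses a throwaway variable name, tmp_value, chosen only for this example.

```r
tmp_value <- 10
before <- exists("tmp_value")    # TRUE: the variable is defined

rm(tmp_value)                    # delete just this one variable
after <- exists("tmp_value")     # FALSE: it is gone

print(before)
print(after)
```

Unlike rm ( list = ls ( ) ), this removes a single variable and leaves the rest of the workspace untouched.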
Operators in R
Now that we have learned about variables in R, let us learn what operators are and how they help us while programming. An operator is a built-in symbol that performs a specific calculation, such as a mathematical or logical one. Operators are also used to assign values to variables and to compare two values. Operators fall into five categories:
- Arithmetic Operators
- Relational Operators
- Logical Operators
- Assignment Operators
- Miscellaneous Operators
We will now look upon each and every operator and discuss them briefly.
Arithmetic operators are used for arithmetic calculations such as sum, difference, product, etc. The table below describes the arithmetic operators.
OPERATOR | DEFINITION | EXAMPLE |
+ | Used to add two vectors | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Adding both the vectors > sum <- v1 + v2 > print ( sum ) [1] 8 13 11 |
- | It is used to find the difference between the two vectors | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Subtracting v2 from v1 > diff <- v1 - v2 > print ( diff ) [1] -2 -1 7 |
* | It is used to find the product of two vectors | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Multiplying both the vectors > product <- v1 * v2 > print ( product ) [1] 15 42 18 |
/ | It is used to divide the first vector by the second vector | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Dividing vector v1 by v2 > div <- v1 / v2 > print ( div ) [1] 0.6000000 0.8571429 4.5000000 |
%% | Gives the remainder when the first vector is divided by the second | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Remainder of v1 when divided by v2 > r <- v1 %% v2 > print ( r ) [1] 3 6 1 |
%/% | It gives the quotient when the first vector is divided by the second vector | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Quotient resulted when v1 is divided by v2 > q <- v1 %/% v2 > print ( q ) [1] 0 0 4 |
^ | It gives the result of the first vector raised to the exponent of the second vector. | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Exponent resulted is > exp <- v1 ^ v2 > print ( exp ) [1] 243 279936 81 |
Relational Operators are those operators which are used to compare one vector to another or one value to another. The following table will give us an insight into the relational operators.
OPERATOR | DEFINITION | EXAMPLE |
> | Compares the vectors element by element and gives TRUE wherever the element of the first vector is greater than the corresponding element of the second vector | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Checks which elements of v1 are greater than those of v2 > print ( v1 > v2 ) [1] FALSE FALSE TRUE |
< | Compares the vectors element by element and gives TRUE wherever the element of the first vector is less than the corresponding element of the second vector | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Checks which elements of v1 are less than those of v2 > print ( v1 < v2 ) [1] TRUE TRUE FALSE |
== | Compares the vectors element by element and gives TRUE wherever the element of the first vector is equal to the corresponding element of the second vector | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Checks which elements of v1 are equal to those of v2 > print ( v1 == v2 ) [1] FALSE FALSE FALSE |
<= | Compares the vectors element by element and gives TRUE wherever the element of the first vector is less than or equal to the corresponding element of the second vector | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Checks which elements of v1 are less than or equal to those of v2 > print ( v1 <= v2 ) [1] TRUE TRUE FALSE |
>= | Compares the vectors element by element and gives TRUE wherever the element of the first vector is greater than or equal to the corresponding element of the second vector | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Checks which elements of v1 are greater than or equal to those of v2 > print ( v1 >= v2 ) [1] FALSE FALSE TRUE |
!= | Compares the vectors element by element and gives TRUE wherever the element of the first vector is not equal to the corresponding element of the second vector | > v1 <- c ( 3 , 6 , 9 ) > v2 <- c ( 5 , 7 , 2 ) > # Checks which elements of v1 are unequal to those of v2 > print ( v1 != v2 ) [1] TRUE TRUE TRUE |
Logical operators are the symbols that implement logical AND, OR, and NOT. Applying a logical operator to an expression gives a Boolean result, TRUE or FALSE. When numbers are used as operands, zero is treated as FALSE and every non-zero number is treated as TRUE. Logical operators apply only to vectors of numeric, logical or complex type. The following table describes each operator with an example.
OPERATOR | DEFINITION | EXAMPLE |
& | It is the logical AND operator, which is applied as element wise in R. It combines every element of the first vector with that of the second vector and results in TRUE if both of the compared elements are TRUE. | > v1 <- c ( 24 , FALSE , TRUE , 5+3i ) > v2 <- c ( 9 , TRUE , TRUE , 3+6i ) > # Applying the ‘&’ operator > print ( v1 & v2 ) [1] TRUE FALSE TRUE TRUE |
| | It is the logical OR operator, which is applied as element wise in R. It combines every element of the first vector with that of the second vector and results in TRUE even if at least one of the compared elements are TRUE. | > v1 <- c ( 24 , FALSE , TRUE , 5+3i ) > v2 <- c ( 9 , TRUE , TRUE , 3+6i ) > # Applying the ‘|’ operator > print ( v1 | v2 ) [1] TRUE TRUE TRUE TRUE |
! | It is the logical NOT operator which takes an input of one vector and negates the result i.e., gives the opposite result or opposite logical value. | > v1 <- c ( 24 , FALSE , TRUE , 5+3i ) > # Applying the ‘!’ operator > print ( !v1 ) [1] FALSE TRUE FALSE FALSE |
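&& | It is the scalar logical AND operator. It expects operands of length one, gives a single TRUE or FALSE, and evaluates the right-hand side only if the left-hand side is TRUE. Older versions of R used only the first element of longer vectors; since R 4.3.0 longer operands are an error, so the elements are selected explicitly here. | > v1 <- c ( 24 , FALSE , TRUE , 5+3i ) > v2 <- c ( 9 , TRUE , TRUE , 3+6i ) > # Applying the ‘&&’ operator to the first elements > print ( v1[1] && v2[1] ) [1] TRUE |
|| | It is the scalar logical OR operator. It expects operands of length one, gives TRUE if at least one of them is TRUE, and evaluates the right-hand side only if the left-hand side is FALSE. As with ‘&&’, operands longer than one element are an error since R 4.3.0. | > v1 <- c ( 24 , FALSE , TRUE , 5+3i ) > v2 <- c ( 9 , TRUE , TRUE , 3+6i ) > # Applying the ‘||’ operator to the first elements > print ( v1[1] || v2[1] ) [1] TRUE |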
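The difference between the element-wise and the scalar operators is easy to mix up, so here is a short sketch. It assumes R 4.3.0 or later, where && and || reject operands longer than one element; the variable names are arbitrary.

```r
v1 <- c(TRUE, FALSE, TRUE)
v2 <- c(TRUE, TRUE, FALSE)

elementwise <- v1 & v2           # one result per pair of elements
single <- v1[1] && v2[1]         # scalar operator on length-one operands

# Short-circuit: the right-hand side is never evaluated here,
# so the call to stop() does not raise an error
short <- FALSE && stop("not evaluated")

# To reduce a whole vector to one value, combine & or | with all() or any()
all_true <- all(v1 & v2)
any_true <- any(v1 | v2)

print(elementwise)
print(single)
```

In conditions such as if ( ... ), which need a single TRUE or FALSE, the scalar forms are the usual choice, while & and | are meant for vector results.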
Assignment operators are used to assign a value, such as a vector or another R object, to a variable. The following table explains the assignment operators with examples.
OPERATOR | DEFINITION | EXAMPLE |
<- or <<- or = | They are called the left assignment operators. <- and = assign in the current environment, while <<- assigns in an enclosing environment (the global environment when used at the prompt). | > v1 <- c ( -2 , 7 , TRUE , 3-4i ) > v2 = c ( -2 , 7 , TRUE , 3-4i ) > v3 <<- c ( -2 , 7 , TRUE , 3-4i ) > # printing the results > print ( v1 ) [1] -2+0i 7+0i 1+0i 3-4i > print ( v2 ) [1] -2+0i 7+0i 1+0i 3-4i > print ( v3 ) [1] -2+0i 7+0i 1+0i 3-4i |
-> or ->> | They are called the right assignment operators. They work like <- and <<- with the value on the left and the variable on the right. | > c ( -2 , 7 , TRUE , 3-4i ) -> v1 > c ( -2 , 7 , TRUE , 3-4i ) ->> v2 > # printing the results > print ( v1 ) [1] -2+0i 7+0i 1+0i 3-4i > print ( v2 ) [1] -2+0i 7+0i 1+0i 3-4i |
Miscellaneous operators are the operators which are used for other general purposes and they do not deal with mathematical or logical operations. The following table shows the explanation and example of miscellaneous operators.
OPERATOR | DEFINITION | EXAMPLE |
: (colon operator) | It is used to create a sequence of numbers, which can then be stored in a vector or another data type. | > v <- c ( 1:8 ) > print ( v ) [1] 1 2 3 4 5 6 7 8 |
%in% | This operator identifies whether an element belongs to a vector or not. | > v1 <- 5 > v2 <- 10 > v3 <- 1:7 > print ( v1 %in% v3 ) [1] TRUE > print ( v2 %in% v3 ) [1] FALSE |
%*% | It is the matrix multiplication operator. In this example it multiplies a matrix by the transpose of the same matrix. | > mat <- matrix ( c ( -3 , 5 , 7 , -4 , 0 , 9 ) , nrow = 2 , ncol = 3 , byrow = TRUE ) > trans <- mat %*% t(mat) > print ( trans ) [,1] [,2] [1,] 83 75 [2,] 75 97 |
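To make the distinction between * and %*% concrete, the following sketch applies both to two small matrices. The matrices a and b are made up for this illustration.

```r
a <- matrix(1:4, nrow = 2)              # filled by column: rows are (1, 3) and (2, 4)
b <- matrix(c(0, 1, 1, 0), nrow = 2)    # swaps the two columns when multiplied

elementwise <- a * b      # multiplies matching positions
product <- a %*% b        # true matrix multiplication

print(elementwise)
print(product)
```

Here a * b keeps only the off-diagonal entries of a, while a %*% b returns a with its columns swapped.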
Data Types in R
In R programming, values are stored as R objects. Memory is reserved for a variable according to the data type of the object it holds. R has five basic or “atomic” classes of R objects:
- Character : All strings are stored as this R object.
- Numeric ( only real numbers ) : All real numbers are stored as this R object. This data type has special values such as “Inf”, which represents infinity, and “NaN” (not a number), which represents an undefined result such as 0/0. Missing data items are represented by NA.
- Integer : It contains only the integer form of a number. For example: 5L, 36L
- Complex : It contains complex numbers, with i marking the imaginary part of the number. For example: 5+10i, 4+3i
- Logical : It holds TRUE or FALSE and is typically used to check a particular condition.
Data Type | Example | Code |
Character | 'apple', "boy", "TRUE", "86.5" | > var <- "apple" > print ( class ( var ) ) [1] "character" |
Numeric | 35.4, 999 | > var <- 35.4 > print ( class ( var ) ) [1] "numeric" |
Integer | 7L, 32L | > var <- 35L > print ( class ( var ) ) [1] "integer" |
Complex | 5+3i, 7+2i | > var <- 5+2i > print ( class ( var ) ) [1] "complex" |
Logical | TRUE, FALSE | > var <- TRUE > print ( class ( var ) ) [1] "logical" |
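The table uses class(), while the earlier section on constants used typeof(). The sketch below compares the two on one value of each atomic class; the variable names are arbitrary.

```r
values <- list("apple", 35.4, 35L, 5+2i, TRUE)

classes <- sapply(values, class)     # the class of each value
types <- sapply(values, typeof)      # the internal storage type of each value

print(classes)
print(types)
```

The two functions agree except for real numbers, which have class "numeric" but type "double".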
Attributes in R
R objects can carry attributes, which store additional information about the object. They are accessed using the attributes() function. Some of the attributes of R objects are:
- Names, dimnames
- Dimensions : These are present on arrays and matrices
- Class
- Length
- User defined attributes
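A minimal sketch of how these attributes can be inspected follows. The matrix m and the attribute name "source" are made up for this example; attributes(), attr(), dim() and length() are base R functions.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

all_attrs <- attributes(m)          # a list holding dim and dimnames
dims <- dim(m)                      # the dimensions: 2 rows, 3 columns
n <- length(m)                      # the total number of elements

attr(m, "source") <- "example"      # a user-defined attribute
src <- attr(m, "source")

print(names(all_attrs))
print(dims)
print(src)
```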
Apart from the basic data types that we saw earlier, R offers six other frequently used data types, which are listed below:
- Vectors
- Matrices
- Lists
- Factors
- Arrays
- Data Frames
Let’s study these data types in detail for a better understanding of the terms:
Vectors
In order to create a vector of multiple R objects, we use the c() function. The c() function combines its arguments into a vector. Look at the code below and try to execute it.
#create a vector
car <- c ( "red" , "blue" , "green" , "orange" )
print ( car ) #printing the vector
[1] "red" "blue" "green" "orange"
print ( class ( car ) ) #Finding the class of the vector
[1] "character"
In order to produce a vector of a given length and mode, we use the vector() function.
x <- vector ( mode = "numeric" , length = 5 )
x
[1] 0 0 0 0 0
x <- vector ( mode = "logical" , length = 5 )
x
[1] FALSE FALSE FALSE FALSE FALSE
When objects of different classes are mixed in a vector, R converts them so that every element of the vector has the same class. This process is called coercion. Some examples of coercion follow:
y <- c ( 54 , "amaze" ) #returns the character class
y <- c ( FALSE , 0 ) #returns the numeric class
y <- c ( "base" , TRUE ) #returns the character class
R objects can be explicitly coerced from one class to another using the “as.*” functions. In place of the ‘*’ we put the class to which we want to convert the initial R object, for example as.numeric() or as.character().
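The results of implicit coercion can be confirmed with class(). The sketch below repeats the three cases above and adds two more combinations for comparison; the variable names are arbitrary.

```r
mixed_chr <- c(54, "amaze")      # number and string
mixed_num <- c(FALSE, 0)         # logical and number
mixed_lgl <- c("base", TRUE)     # string and logical
int_dbl <- c(1L, 2.5)            # integer and double
lgl_int <- c(TRUE, 2L)           # logical and integer

print(class(mixed_chr))          # "character"
print(class(mixed_num))          # "numeric"
print(class(mixed_lgl))          # "character"
print(class(int_dbl))            # "numeric"
print(class(lgl_int))            # "integer"
print(mixed_chr)                 # the number 54 has become the string "54"
```

The pattern is that R picks the most general type present: logical gives way to integer, integer to numeric, and everything gives way to character.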
x <- 0:5
class ( x )
[1] "integer"
as.logical(x)
[1] FALSE TRUE TRUE TRUE TRUE TRUE
as.character ( x )
[1] "0" "1" "2" "3" "4" "5"
as.numeric ( x )
[1] 0 1 2 3 4 5
The coercions that have no meanings at all often result in NA’s.
x<-c ( "x" , "y", "z" )
as.logical ( x )
[1] NA NA NA
as.numeric ( x )
[1] NA NA NA
Warning message:
NAs introduced by coercion
as.complex ( x )
[1] NA NA NA
Warning message:
NAs introduced by coercion
Accessing a vector: Execute the following codes.
#Accessing the vector elements using position
d <- c ("Sunday" , "Monday" , "Tuesday" , "Wednesday" , "Thursday" , "Friday" , "Saturday")
f <- d [ c ( 2 , 4 , 5 ) ]
print(f)
[1] "Monday" "Wednesday" "Thursday"
#accessing the vector elements using negative indexing
x <- d [ c ( -3 , -6 ) ]
print ( x )
[1] "Sunday" "Monday" "Wednesday" "Thursday" "Saturday"
# Accessing vector elements using 0/1 indexing: zeros are dropped and 1 selects the first element.
y <- d [ c ( 0 , 0 , 0 , 0 , 0 , 0 , 1 ) ]
print ( y )
[1] "Sunday"
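Numeric 0/1 indexes are easy to confuse with logical indexes, which select elements by position with TRUE and FALSE. The sketch below shows logical indexing on the same vector of days; it is an additional illustration, not part of the original example.

```r
d <- c("Sunday", "Monday", "Tuesday", "Wednesday",
       "Thursday", "Friday", "Saturday")

# TRUE keeps the element at that position, FALSE drops it
weekend <- d[c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)]

# A logical index is usually produced by a comparison
long_names <- d[nchar(d) > 7]

print(weekend)
print(long_names)
```

With a logical index the last position really does select "Saturday", whereas the numeric index 1 always refers to the first element.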
Performing arithmetic operations on Vectors: We can perform arithmetic operations on vectors as well like addition, subtraction, multiplication, and division.
#creating the two vectors for arithmetic operations
y <- c ( 3 , 6 , 2 , 7, 5 , 9 )
z <- c ( 4 , 7 , 3 ,5 , 1 , 2 )
#Addition operation
sum <- y + z
sum
[1] 7 13 5 12 6 11
#Subtraction operation
diff <- y - z
diff
[1] -1 -1 -1 2 4 7
#Multiplication Operation
mul <- y * z
mul
[1] 12 42 6 35 5 18
#Division Operation
div <- y / z
div
[1] 0.7500000 0.8571429 0.6666667 1.4000000 5.0000000 4.5000000
Vector Sorting: We can perform a sorting operation on the vectors using the sort() function.
#sorting the elements of a vector
#Initializing the vector
v <- c ( 4 , 8 , 2 , 0 , 5 , 6 , 9 )
#Sorting the given vector
result <- sort( v )
print ( result )
[1] 0 2 4 5 6 8 9
#sorting the elements of vector in reverse order
result <- sort( v , decreasing = TRUE )
print ( result )
[1] 9 8 6 5 4 2 0
#Initializing a character vector
v <- c ( 'red' , 'yellow' , 'brown' , 'black' , 'white', 'blue' )
#Sorting the character vector
result <- sort ( v )
print ( result )
[1] "black" "blue" "brown" "red" "white" "yellow"
#Sorting the character vector in reverse form
result <- sort ( v , decreasing = TRUE )
print ( result )
[1] "yellow" "white" "red" "brown" "blue" "black"
Matrices
A matrix is a two-dimensional data set, which is created by arranging the elements of a vector in rows and columns. All the elements of a matrix must be of the same atomic type. Character and logical matrices are possible, but they are used much less frequently, so the types used most often are numeric, integer and complex. A matrix can be created using the matrix() function. The syntax for the matrix function in R is:
matrix ( data , nrow , ncol , byrow , dimnames )
The parameters used in the definition of the matrix are:
- data: It is the input for the data elements of the matrix.
- nrow: It is the number of rows of the matrix.
- ncol: It is the number of columns of the matrix.
- byrow: It is a logical value which decides how the matrix is filled. If byrow is TRUE the matrix is filled row by row; otherwise (the default) it is filled column by column.
- dimnames: It is a list holding the names assigned to the dimensions of the matrix, i.e., the row names and the column names.
Now, let’s create a matrix with the above parameters:
# What if we don’t define the data of the matrix.
> matrix ( nrow = 2 , ncol = 2 )
[,1] [,2]
[1,] NA NA
[2,] NA NA
# The resulting matrix will be filled by NA which means the value is not available
# Row matrix
> matrix ( c ( 4 : 15 ) , nrow = 4 , byrow = TRUE)
[,1] [,2] [,3]
[1,] 4 5 6
[2,] 7 8 9
[3,] 10 11 12
[4,] 13 14 15
# Column Matrix
> matrix ( c ( 4 : 15 ) , nrow = 4 , byrow = FALSE)
[,1] [,2] [,3]
[1,] 4 8 12
[2,] 5 9 13
[3,] 6 10 14
[4,] 7 11 15
In the row matrix, the data is supplied through the c() function, nrow sets the number of rows, and byrow = TRUE fills the matrix one row at a time. The column matrix uses the same data with byrow = FALSE, so the values run down each column instead. Now, let us define a matrix whose row names and column names are supplied beforehand.
# Define the row names and the column names
rn <- c ( "A" , "B" , "C" , "D" )
cn <- c ( "X" , "Y" , "Z" )
# Defining the matrix and storing it as m so that it can be indexed later
m <- matrix ( c ( 4 : 15 ) , nrow = 4 , byrow = FALSE , dimnames = list ( rn , cn ))
print ( m )
X Y Z
A 4 8 12
B 5 9 13
C 6 10 14
D 7 11 15
Now that we have learned how to define a matrix, let’s move further and access the elements of the matrix:
# Accessing the elements at 4th row and 2nd column
print ( m [ 4 , 2 ])
[1] 11
# Accessing the element at 3rd row and 3rd column
print ( m [ 3 , 3 ])
[1] 14
# Accessing the elements of 2nd row
print ( m [ 2 , ])
X Y Z
5 9 13
# Accessing the elements of 3rd column
print ( m [ , 3 ])
A B C D
12 13 14 15
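When a matrix has dimnames, its elements can also be selected by row and column name instead of by position. A small sketch, rebuilding the same matrix under the assumed name `m2`:

```r
rn <- c("A", "B", "C", "D")
cn <- c("X", "Y", "Z")
m2 <- matrix(4:15, nrow = 4, dimnames = list(rn, cn))

# Single element selected by row name and column name
print(m2["D", "Y"])

# Several rows of one column; the result keeps the row names
print(m2[c("A", "C"), "Z"])

# drop = FALSE keeps the result as a one-row matrix instead of a vector
print(m2["B", , drop = FALSE])
```

Selecting by name tends to be more robust than selecting by position when rows or columns may later be reordered.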
Now that we know how to access an element of a matrix, let’s move to the computations performed on matrices. The arithmetic operators for addition, subtraction, multiplication and division all work on matrices element by element, so the dimensions of both matrices should be the same. Let’s start with addition and subtraction.
# creating two matrices of the order 2 x 3 to perform addition and subtraction
matrix1 <- matrix ( c ( 5 , -3 , 8 , 0 , -3 , 6) , nrow = 2 )
print ( matrix1 )
[,1] [,2] [,3]
[1,] 5 8 -3
[2,] -3 0 6
matrix2 <- matrix ( c ( -3 , 7 , -2 , -4 , 8 , -7 ) , nrow = 2 )
print ( matrix2 )
[,1] [,2] [,3]
[1,] -3 -2 8
[2,] 7 -4 -7
#Adding the two matrices
sum <- matrix1 + matrix2
print ( sum )
[,1] [,2] [,3]
[1,] 2 6 5
[2,] 4 -4 -1
# Subtracting matrix2 from matrix1
diff <- matrix1 - matrix2
print ( diff )
[,1] [,2] [,3]
[1,] 8 10 -11
[2,] -10 4 13
Now, let’s do multiplication and division on the two defined matrices. The * and / operators also work element by element, so the two matrices must again have the same dimensions. This is different from the matrix product of linear algebra, which R provides through the %*% operator: there the number of columns in the first matrix should be equal to the number of rows in the second matrix, and the resulting matrix has the number of rows of the first matrix and the number of columns of the second matrix. Now, let’s apply element-wise multiplication and division to the two matrices.
# Create a matrix of dimension 2x3
matrix1 <- matrix ( c ( 5 , -3 , 8 , 0 , -3 , 6) , nrow = 2 )
print ( matrix1 )
[,1] [,2] [,3]
[1,] 5 8 -3
[2,] -3 0 6
# Create another matrix of dimension 2x3
matrix2 <- matrix ( c ( -3 , 7 , -2 , -4 , 8 , -7 ) , nrow = 2 )
print ( matrix2 )
[,1] [,2] [,3]
[1,] -3 -2 8
[2,] 7 -4 -7
# Multiplying matrix1 with matrix2 element by element
multiplication <- matrix1 * matrix2
print ( multiplication )
[,1] [,2] [,3]
[1,] -15 -16 -24
[2,] -21 0 -42
# Dividing matrix1 by matrix2 element by element
division <- matrix1 / matrix2
print ( division )
[,1] [,2] [,3]
[1,] -1.6666667 -4 -0.3750000
[2,] -0.4285714 0 -0.8571429
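For comparison, here is a brief sketch of the linear-algebra matrix product using the %*% operator. The names `a`, `b` and `product` are chosen for this illustration; because both matrices are 2x3, the second one is transposed with t() so that the dimensions line up.

```r
a <- matrix(c(5, -3, 8, 0, -3, 6), nrow = 2)    # 2 x 3
b <- matrix(c(-3, 7, -2, -4, 8, -7), nrow = 2)  # 2 x 3

# t(b) is 3 x 2, so a %*% t(b) is a valid matrix product of size 2 x 2
product <- a %*% t(b)
print(product)
print(dim(product))
```

Calling `a %*% b` directly would fail with a "non-conformable arguments" error, since a 2x3 matrix cannot be multiplied by another 2x3 matrix.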
Matrices can also be built using the functions cbind() and rbind(), which bind vectors together as columns and as rows respectively. The following example shows the use of rbind() and cbind().
# Declare x and y
x <- 5 : 7
y <- 12 : 14
cbind( x , y )
x y
[1,] 5 12
[2,] 6 13
[3,] 7 14
rbind ( x , y )
[,1] [,2] [,3]
x 5 6 7
y 12 13 14
Lists
Lists are among the most widely used objects in R programming. A list can contain numbers, strings, vectors, matrices, functions and even other lists as its elements, which makes it very useful in data science. A list can be created using the list() function. Let’s create a list for ourselves.
# Creating a list containing numeric , string and vector values
l <- list ( 23.5 , 34L , "Apple" , c ( 32 , 7 , 21 ) )
print(l)
[[1]]
[1] 23.5
[[2]]
[1] 34
[[3]]
[1] "Apple"
[[4]]
[1] 32 7 21
Now that we know how to create a list, let’s give names to the elements of a list.
# Creating a list containing a matrix , a vector and a list
l <- list ( matrix ( c ( 3 , -4 , 7 , 5 , -9 , 0 ) , nrow = 2 ) , c ( "Hello" , "32.5" , "Bye" ) , list ( 76.2 , "Hey" ) )
# Giving names to the elements in the list
names ( l ) <- c ( "a_matrix" , "a_vector" , "a_list" )
# Printing the list
print(l)
$a_matrix
[,1] [,2] [,3]
[1,] 3 7 -9
[2,] -4 5 0
$a_vector
[1] "Hello" "32.5" "Bye"
$a_list
$a_list[[1]]
[1] 76.2
$a_list[[2]]
[1] "Hey"
Let’s continue moving on, and now we will try to access the elements of a list.
# Creating a list containing a matrix , list and vector
l <- list ( matrix ( c ( 0 , 8 , -4 , 6 , -1 , 7 ) , nrow = 2 ) , c ( 34.5 , 52 , 37 ) , list ( 76.2 , " red " , 34L ))
# Accessing the first element of the list
print ( l[1] )
[[1]]
[,1] [,2] [,3]
[1,] 0 -4 -1
[2,] 8 6 7
# Accessing the third element of the list which is a list itself
print ( l[3] )
[[1]]
[[1]][[1]]
[1] 76.2
[[1]][[2]]
[1] " red "
[[1]][[3]]
[1] 34
# We can also access an element of a list by the name given to it.
names ( l ) <- c ( "a_matrix" , "a_vector" , "a_list" )
# Let's try to access an element according to the names of the list
print ( l$a_matrix )
[,1] [,2] [,3]
[1,] 0 -4 -1
[2,] 8 6 7
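Note that single brackets return a smaller list, which is why the printed results above start with [[1]], while double brackets (or $) return the element itself. A minimal sketch of the difference, using a hypothetical list named `items`:

```r
items <- list(a_matrix = matrix(1:4, nrow = 2),
              a_vector = c(34.5, 52, 37))

sub_list <- items["a_vector"]    # single brackets: a list of length 1
element  <- items[["a_vector"]]  # double brackets: the numeric vector itself

print(class(sub_list))
print(class(element))

# Only the extracted element can be indexed as a vector directly
print(element[2])
```

In practice `[[ ]]` or `$` is the usual choice when the element is going to be used in a computation.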
Now, let’s see what else we can do with lists. We can add a new element to an existing list, remove an element from it, update an element, merge two lists and convert a list into a vector. Let’s see these in the code provided below:
# Creating a list containing a matrix , a vector and a list
l <- list ( matrix ( c ( 4 , -5 , -6 , -7 , 9 , -1 ) , nrow = 2 ) , c ( 73 , 37 , 64.3 ) , list ( "Monday" , "Holiday" , 76L , 10+5i ))
# Adding new element at the end of the list
l[4] <- 32
print ( l[4] )
[[1]]
[1] 32
# Removing the 4th element from the list
l[4] <- NULL
print ( l[4] )
[[1]]
NULL
# Updating the second element of the list
l[2] <- "This is a string instead of a vector"
print ( l[2] )
[[1]]
[1] "This is a string instead of a vector"
#let us now learn how to merge two lists
l1 <- list ( 23 , 45 , 56 )
l2 <- list ( "Jan" , "Feb" , "Mar" )
l3 <- c ( l1 , l2 )
print ( l3 )
# Now, let's try to convert a list into a vector
l1 <- list ( 6:15 )
print( l1 )
[[1]]
[1] 6 7 8 9 10 11 12 13 14 15
# Convert list l1 to vector v1
v1 <- unlist ( l1 )
print ( v1 )
[1] 6 7 8 9 10 11 12 13 14 15
# A list can be converted into a vector using the unlist() function.
Factors
Factors are R objects used to represent categorical data. Internally a factor is a vector of integer codes, each paired with a label called a level. Factors can be ordered as well as unordered. They are helpful for columns that take a limited set of values. Suppose a column holds “male” and “female”: a factor keeps these descriptive labels, whereas coding the values as the bare integers 1 and 2 would hide their meaning.
Factors can be created using the factor() function.
# Creating a vector in order to be converted into a factor
v <- c ( "Red" , "Yellow" , "Blue" , "Brown" , "Pink" , "Black" )
# Check that the vector is not already a factor using the is.factor() function
print( is.factor(v) )
[1] FALSE
# As we know that v is a vector and not a factor, let's convert it into a factor using the factor() function
fact <- factor(v)
print ( fact )
[1] Red Yellow Blue Brown Pink Black
Levels: Black Blue Brown Pink Red Yellow
# The output of the factor fact shows us the elements of the factor and then lists down the levels of the factor.
# checking if the results of fact is a factor or not using the is.factor() function
print( is.factor(fact) )
[1] TRUE
Factors can also be created from data frames, but we will see that after learning about data frames. We can change the levels of a factor by applying the factor() function again with the levels argument. Any value that is not listed in the new levels becomes NA.
# Creating a factor by initializing a vector
v <- c ( "Red" , "Yellow" , "Blue" , "Brown" , "Pink" , "Black" )
fact <- factor(v)
print ( fact )
[1] Red Yellow Blue Brown Pink Black
Levels: Black Blue Brown Pink Red Yellow
# Applying the factor function again with a new set of levels
fact1 <- factor ( fact , levels = c ( "Yellow", "Blue" , "Pink" ) )
print ( fact1 )
[1] <NA> Yellow Blue <NA> Pink <NA>
Levels: Yellow Blue Pink
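To reorder levels without losing any values, every value present in the data has to appear in the levels argument. The sketch below uses a hypothetical `sizes` vector to show this, and also shows what ordered = TRUE adds.

```r
sizes <- c("medium", "small", "large", "small")

# Listing every value in levels reorders the levels without creating NAs
size_factor <- factor(sizes, levels = c("small", "medium", "large"))
print(levels(size_factor))

# The integer codes follow the new level order
print(as.integer(size_factor))

# ordered = TRUE additionally allows comparisons between levels
size_ordered <- factor(sizes, levels = c("small", "medium", "large"),
                       ordered = TRUE)
print(size_ordered[1] > size_ordered[2])  # is "medium" greater than "small"?
```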
Factor levels can be generated using the gl() function. The syntax for using the gl() function is:
gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)
The description of the parameters used:
- n: It takes an integer value which determines the number of levels.
- k: It takes an integer value which describes the number of replications.
- length: It provides the length of the final result.
- labels: It is an optional vector of labels for the resulting factor levels.
- ordered: It is a logical value which indicates whether the result should be ordered or not.
An example showing the use of gl() function is as follows:
v <- gl ( 3 , 4 , labels = c ( "Red" , "Blue" , "Orange" ) )
print ( v )
[1] Red Red Red Red Blue Blue Blue Blue Orange Orange Orange Orange
Levels: Red Blue Orange
Arrays
Arrays are R objects which are capable of storing data in more than two dimensions. For example, an array of dimension (3,3,2) holds two matrices of dimension 3×3, i.e., 3 rows and 3 columns each.
An array can be created using the array() function, which takes the data in the form of vectors and the dimensions in the dim argument. If the data is shorter than the array, it is recycled to fill it. The following example creates an array of dimension (3, 3, 4).
# Create two vectors for the array input
v1 <- c ( 2 ,5 ,7 )
v2 <- c ( 3 , -4 , 7 , 8 , 2 , 5 )
# Taking the above two vectors as input for the array where the dimension is of order 3x3 and four matrices are created.
arr1 <- array ( c (v1 , v2 ) , dim = c ( 3 , 3 , 4 ) )
print ( arr1 )
, , 1
[,1] [,2] [,3]
[1,] 2 3 8
[2,] 5 -4 2
[3,] 7 7 5
, , 2
[,1] [,2] [,3]
[1,] 2 3 8
[2,] 5 -4 2
[3,] 7 7 5
, , 3
[,1] [,2] [,3]
[1,] 2 3 8
[2,] 5 -4 2
[3,] 7 7 5
, , 4
[,1] [,2] [,3]
[1,] 2 3 8
[2,] 5 -4 2
[3,] 7 7 5
After creating an array, let us now name its rows, columns and matrices using the dimnames parameter.
# Naming the columns, rows and matrices respectively
column.names <- c ( "column1" , "column2" , "column3" )
row.names <- c ( "row1" , "row2" , "row3" )
matrix.names <- c( "matrix1" , "matrix2" , "matrix3" , "matrix4" )
# Taking these inputs to the final array
result <- array ( c ( v1 , v2 ) , dim = c ( 3 , 3 , 4 ) , dimnames = list ( row.names , column.names , matrix.names ) )
print ( result )
, , matrix1
column1 column2 column3
row1 2 3 8
row2 5 -4 2
row3 7 7 5
, , matrix2
column1 column2 column3
row1 2 3 8
row2 5 -4 2
row3 7 7 5
, , matrix3
column1 column2 column3
row1 2 3 8
row2 5 -4 2
row3 7 7 5
, , matrix4
column1 column2 column3
row1 2 3 8
row2 5 -4 2
row3 7 7 5
Now, let us try to access the elements of an array which is really useful while performing array operations and calculations.
# printing the third column of the second matrix
print ( result[,3,2] )
row1 row2 row3
8 2 5
# Printing the third matrix
print ( result[,,3] )
column1 column2 column3
row1 2 3 8
row2 5 -4 2
row3 7 7 5
# Printing the element in second row, third column of the fourth matrix
print ( result[2,3,4] )
[1] 2
We can perform calculations across the elements of an array using the apply() function. The syntax for the apply() function is:
apply ( X , MARGIN , FUN )
The parameters used in the apply() function are defined as follows:
- X: this is the array (or matrix) to be used in the apply() function.
- MARGIN: this is the dimension over which the function is applied; 1 means rows, 2 means columns, and c ( 1 , 2 ) means rows and columns.
- FUN: it is the function to be applied or the operation that has to be done on the array elements.
An example below shows the use of apply() function:
# Create two vectors for the creation of an array
v1 <- c ( 35 , 12 , 9 )
v2 <- c ( 4 , 9 , 0 , 8 , 32 , 18 )
arr <- array ( c ( v1 , v2 ) , dim = c ( 3 , 3 , 2 ) )
print ( arr )
, , 1
[,1] [,2] [,3]
[1,] 35 4 8
[2,] 12 9 32
[3,] 9 0 18
, , 2
[,1] [,2] [,3]
[1,] 35 4 8
[2,] 12 9 32
[3,] 9 0 18
# Using the apply() function to compute the sum of each row across the two matrices
> sum.rows <- apply ( arr , c ( 1 ) , sum )
> print ( sum.rows )
[1] 94 106 54
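The same array can be summarised along other margins as well. The following sketch rebuilds the array and applies sum() over columns, over the third dimension, and over each (row, column) cell; the result names are chosen for illustration only.

```r
arr <- array(c(35, 12, 9, 4, 9, 0, 8, 32, 18), dim = c(3, 3, 2))

col_sums   <- apply(arr, 2, sum)        # one sum per column, across both matrices
layer_sums <- apply(arr, 3, sum)        # one sum per matrix (third dimension)
cell_sums  <- apply(arr, c(1, 2), sum)  # each cell summed across the two matrices

print(col_sums)
print(layer_sums)
print(cell_sums)
```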
Dataframes
A data frame is a table-like, two-dimensional structure in which each column contains the values of one variable and each row contains one set of values, one from each column. The rules to be followed while creating a data frame are listed as follows:
- The names of the rows should be unique.
- The names of the columns should be non-empty.
- Every column in the data frame should contain the same number of items.
- The data stored in a column can be of type numeric, factor, logical, character, etc.
Let us now try to execute the below code and create a data frame:
# Creating a data frame
student.data <- data.frame ( Roll_no = c ( 1 : 6 ) , stu_name = c ( "Ajay" , "Michael" , "Dani" , "John" , "Jack" , "kamran" ) , Course = c ( "P325" , "R126" , "P325" , "R213" , "R213" , "P210" ) , start_date = as.Date ( c ( "2016-01-01" , "2016-07-23" , "2016-01-15" , "2016-08-11" , "2017-01-11" , "2017-07-25" ) ) , stringsAsFactors = FALSE )
print ( student.data )
Roll_no stu_name Course start_date
1 1 Ajay P325 2016-01-01
2 2 Michael R126 2016-07-23
3 3 Dani P325 2016-01-15
4 4 John R213 2016-08-11
5 5 Jack R213 2017-01-11
6 6 kamran P210 2017-07-25
Now, let us look at the structure of the data frame using the str() function.
# Creating the data frame
student.data <- data.frame ( Roll_no = c ( 1 : 6 ) , stu_name = c ( "Ajay" , "Michael" , "Dani" , "John" , "Jack" , "kamran" ) , Course = c ( "P325" , "R126" , "P325" , "R213" , "R213" , "P210" ) , start_date = as.Date ( c ( "2016-01-01" , "2016-07-23" , "2016-01-15" , "2016-08-11" , "2017-01-11" , "2017-07-25" ) ) , stringsAsFactors = FALSE )
# Printing the structure of data frame student.data
str(student.data)
# output
'data.frame': 6 obs. of 4 variables:
$ Roll_no : int 1 2 3 4 5 6
$ stu_name : chr "Ajay" "Michael" "Dani" "John" ...
$ Course : chr "P325" "R126" "P325" "R213" ...
$ start_date: Date, format: "2016-01-01" "2016-07-23" "2016-01-15" "2016-08-11" ...
Let’s retrieve the statistical summary of the data frame using the summary() function.
# Printing the summary of the student.data data frame.
print ( summary (student.data ) )
Roll_no stu_name Course start_date
Min. :1.00 Length:6 Length:6 Min. :2016-01-01
1st Qu.:2.25 Class :character Class :character 1st Qu.:2016-03-02
Median :3.50 Mode :character Mode :character Median :2016-08-01
Mean :3.50 Mean :2016-08-19
3rd Qu.:4.75 3rd Qu.:2016-12-03
Max. :6.00 Max. :2017-07-25
Now that we have learnt about the structure and summary functions, let us see how to extract data from a data frame.
# Extracting the student name and course code from the previously defined data frame
ext <- data.frame ( student.data$stu_name , student.data$Course )
print ( ext )
student.data.stu_name student.data.Course
1 Ajay P325
2 Michael R126
3 Dani P325
4 John R213
5 Jack R213
6 kamran P210
# Extracting data from rows 2,3 and 4
ext <- student.data[2:4,]
print ( ext )
Roll_no stu_name Course start_date
2 2 Michael R126 2016-07-23
3 3 Dani P325 2016-01-15
4 4 John R213 2016-08-11
# Adding a new column named "dept" to the existing data frame.
student.data$dept <- c( "IT" , "ECE" , "IT" , "CSE" , "ECE" , "CSE" )
print ( student.data )
Roll_no stu_name Course start_date dept
1 1 Ajay P325 2016-01-01 IT
2 2 Michael R126 2016-07-23 ECE
3 3 Dani P325 2016-01-15 IT
4 4 John R213 2016-08-11 CSE
5 5 Jack R213 2017-01-11 ECE
6 6 kamran P210 2017-07-25 CSE
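Rows can also be extracted by a condition rather than by position. The sketch below uses a small hypothetical data frame named `students` (not the one above) to show condition-based row selection, column selection by name, and the subset() helper.

```r
students <- data.frame(
  roll_no = 1:4,
  name    = c("Ajay", "Michael", "Dani", "John"),
  course  = c("P325", "R126", "P325", "R213"),
  stringsAsFactors = FALSE
)

# Rows where course equals "P325", keeping all columns
p325 <- students[students$course == "P325", ]
print(p325)

# Selected columns by name
name_course <- students[, c("name", "course")]
print(name_course)

# subset() combines a row condition and a column selection
late_rolls <- subset(students, roll_no > 2, select = c(roll_no, name))
print(late_rolls)
```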
In order to add new rows to the data frame we need to use the rbind() function to bind the new rows to the existing data frame. We will create a second data frame with the same columns and then bind the two together.
# Creating first data frame
student.data <- data.frame ( Roll_no = c ( 1 : 6 ) , stu_name = c ( "Ajay" , "Michael" , "Dani" , "John" , "Jack" , "kamran" ) , Course = c ( "P325" , "R126" , "P325" , "R213" , "R213" , "P210" ) , start_date = as.Date ( c ( "2016-01-01" , "2016-07-23" , "2016-01-15" , "2016-08-11" , "2017-01-11" , "2017-07-25" ) ) , dept = c ( "IT" , "CSE" , "ECE" , "IT" , "CSE" , "ECE" ) , stringsAsFactors = FALSE )
print ( student.data )
Roll_no stu_name Course start_date dept
1 1 Ajay P325 2016-01-01 IT
2 2 Michael R126 2016-07-23 CSE
3 3 Dani P325 2016-01-15 ECE
4 4 John R213 2016-08-11 IT
5 5 Jack R213 2017-01-11 CSE
6 6 kamran P210 2017-07-25 ECE
# creating a second data frame
student.new_data <- data.frame ( Roll_no = c ( 7 : 9 ) , stu_name = c ( "Ravi" , "Harry" , "Shanin" ) , Course = c ( "P325" , "R126" , "P325" ) , start_date = as.Date ( c ( "2017-07-11" , "2017-07-13" , "2017-01-03" ) ) , dept = c ( "IT" , "CSE" , "ECE" ) , stringsAsFactors = FALSE )
print ( student.new_data )
Roll_no stu_name Course start_date dept
1 7 Ravi P325 2017-07-11 IT
2 8 Harry R126 2017-07-13 CSE
3 9 Shanin P325 2017-01-03 ECE
#Now, binding both the data frames
student.finaldata <- rbind ( student.data , student.new_data)
print ( student.finaldata )
Roll_no stu_name Course start_date dept
1 1 Ajay P325 2016-01-01 IT
2 2 Michael R126 2016-07-23 CSE
3 3 Dani P325 2016-01-15 ECE
4 4 John R213 2016-08-11 IT
5 5 Jack R213 2017-01-11 CSE
6 6 kamran P210 2017-07-25 ECE
7 7 Ravi P325 2017-07-11 IT
8 8 Harry R126 2017-07-13 CSE
9 9 Shanin P325 2017-01-03 ECE
Now, we will learn about an advanced concept in data frames related to data mining: the merge() function, which is used to merge two data frames. The example below uses the Animals and mammals data sets, which are available in the library ‘MASS’ and which both record the body weight and brain weight of animals. We merge these data sets on the values of body and brain, so the result keeps the rows that appear in both. To try this with other data, consult the documentation of the MASS package and select appropriate data sets. Let us now execute the code to see the merged data.
> library(MASS)
> me <- merge(x = Animals, y = mammals,
+ by.x = c ( "body", "brain" ),
+ by.y = c("body", "brain")
+ )
> print(me)
body brain
1 0.023 0.4
2 0.120 1.0
3 0.122 3.0
4 0.280 1.9
5 1.040 5.5
6 1.350 8.1
7 10.000 115.0
8 100.000 157.0
9 187.100 419.0
10 192.000 180.0
11 2.500 12.1
12 207.000 406.0
13 2547.000 4603.0
14 27.660 115.0
15 3.300 25.6
16 35.000 56.0
17 36.330 119.5
18 465.000 423.0
19 52.160 440.0
20 521.000 655.0
21 529.000 680.0
22 55.500 175.0
23 6.800 179.0
24 62.000 1320.0
25 6654.000 5712.0
> nrow(me)
[1] 25
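By default merge() keeps only the rows whose key values appear in both data frames. The all, all.x and all.y arguments change this. A minimal sketch with two small hypothetical data frames, `scores` and `depts`:

```r
scores <- data.frame(id = c(1, 2, 3), score = c(90, 75, 82))
depts  <- data.frame(id = c(2, 3, 4), dept = c("IT", "CSE", "ECE"))

inner <- merge(scores, depts, by = "id")                # ids present in both
left  <- merge(scores, depts, by = "id", all.x = TRUE)  # every row of scores
full  <- merge(scores, depts, by = "id", all = TRUE)    # every id from either side

print(inner)
print(left)
print(full)
```

Rows without a partner in the other data frame are filled with NA in the columns that come from it.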
While practicing data science we often need to reshape and refine our data set. Here is a brief look at melting and casting, using the bacteria data set from the MASS library. The functions which are used to do this are called melt() and cast(). Follow the code shown below:
> library(MASS)
> print(bacteria)
y ap hilo week ID trt
1 y p hi 0 X01 placebo
2 y p hi 2 X01 placebo
3 y p hi 4 X01 placebo
4 y p hi 11 X01 placebo
5 y a hi 0 X02 drug+
6 y a hi 2 X02 drug+
7 n a hi 6 X02 drug+
8 y a hi 11 X02 drug+
9 y a lo 0 X03 drug
10 y a lo 2 X03 drug
11 y a lo 4 X03 drug
12 y a lo 6 X03 drug
13 y a lo 11 X03 drug
14 y p lo 0 X04 placebo
15 y p lo 2 X04 placebo
16 y p lo 4 X04 placebo
…. …. …. ….. …. …… ………….
…. …. …. ….. …. …… ………….
211 y a hi 0 Z24 drug+
212 y a hi 2 Z24 drug+
213 y a hi 4 Z24 drug+
214 n a hi 6 Z24 drug+
215 n a hi 11 Z24 drug+
216 y a hi 0 Z26 drug+
217 y a hi 2 Z26 drug+
218 y a hi 4 Z26 drug+
219 n a hi 6 Z26 drug+
220 y a hi 11 Z26 drug+
First of all, we will melt the data: the factor columns (y, ap, hilo, ID and trt) are kept as id variables, and the remaining column, week, is stacked into a pair of columns named variable and value. The melt() and cast() functions come from the reshape package, so we need to install it and load it into our R session. The code below shows, step by step, how to melt the data set.
> install.packages ( "reshape" )
> library ( MASS )
> library ( reshape )
> molten.data <- melt ( bacteria )
#Using y, ap, hilo, ID, trt as id variables
> print ( molten.data )
y ap hilo ID trt variable value
1 y p hi X01 placebo week 0
2 y p hi X01 placebo week 2
3 y p hi X01 placebo week 4
4 y p hi X01 placebo week 11
5 y a hi X02 drug+ week 0
6 y a hi X02 drug+ week 2
7 n a hi X02 drug+ week 6
8 y a hi X02 drug+ week 11
9 y a lo X03 drug week 0
10 y a lo X03 drug week 2
…. …. …. ….. …. …… ………….
…. …. …. ….. …. …… ………….
215 n a hi Z24 drug+ week 11
216 y a hi Z26 drug+ week 0
217 y a hi Z26 drug+ week 2
218 y a hi Z26 drug+ week 4
219 n a hi Z26 drug+ week 6
220 y a hi Z26 drug+ week 11
Now that we have the molten data, we will cast it using the cast() function. The formula ID~variable asks for one row per ID and one column per melted variable, and sum adds up the week values recorded for each ID:
> casted.data <- cast ( molten.data , ID~variable,sum)
> print ( casted.data )
ID week
1 X01 17
2 X02 19
3 X03 23
…. …. ….
R Control Structures
Control structures in R are of two types: decision making and loops. The decision-making constructs are if, if-else, nested if-else and switch. The loops are for, while and repeat, and the statements break, next and return control how a loop or a function exits. We will start with decision making and then move on to loops.
Decision-making constructs evaluate one or more conditions and, depending on whether the result is TRUE or FALSE, execute the statements associated with that outcome.
IF statement
It is a conditional statement: the condition in the if statement is checked and, if the result is TRUE, the statements in its body are executed; otherwise the body is skipped. An example of the IF statement is:
> if ( 3==3 )
+ {
+ print ( "hello" )
+ }
[1] "hello"
IF – ELSE statement
It works like the IF statement, the difference being that when the condition in the IF statement is FALSE, the ELSE block is executed instead. The syntax of the if-else and the if-else-if blocks is given below:
if(<condition>) {
# executed when the condition is TRUE
} else {
# executed when the condition is FALSE
}
IF – ELSE IF statement
if(<condition1>) {
# executed when condition1 is TRUE
} else if(<condition2>) {
# executed when condition1 is FALSE and condition2 is TRUE
} else {
# executed when neither condition is TRUE
}
The flow chart showing the working of if-else condition is given below.
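As a concrete instance of the if-else-if syntax, here is a short sketch. The function name `grade` and the thresholds 75 and 40 are made up for the illustration.

```r
# Hypothetical grading rule used only to illustrate if / else if / else
grade <- function(marks) {
  if (marks >= 75) {
    "distinction"
  } else if (marks >= 40) {
    "pass"
  } else {
    "fail"
  }
}

print(grade(82))
print(grade(55))
print(grade(12))
```

The conditions are checked from top to bottom and only the first matching branch runs, so the order of the tests matters.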
Switch case
It evaluates an expression and selects one of the listed cases based on the result: a number selects the case at that position, and a character string selects the case with the matching name. The syntax for the Switch case is shown below:
switch ( expression , case1 , case2 , case3 …. )
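A small sketch of both forms of switch() follows. The helper name `day_type` is an assumption made for this example.

```r
# Character expression: the case with the matching name is selected.
# An empty case ("Saturday" = ,) falls through to the next one, and an
# unnamed final case acts as the default.
day_type <- function(day) {
  switch(day,
         "Saturday" = ,
         "Sunday"   = "weekend",
         "weekday")
}

print(day_type("Sunday"))
print(day_type("Monday"))

# Numeric expression: the case at that position is selected
second <- switch(2, "first", "second", "third")
print(second)
```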
Looping statements
Loops are useful when a particular piece of code or a statement has to be executed many times, which would be impractical to write out by hand. If we have to print a series of numbers from a starting point to an ending point, a loop helps by executing the statement the required number of times. There are three kinds of loop structures in R, which we are going to discuss one by one.
For loop: This loop repeats a statement or a group of statements once for each element of a sequence, assigning the current element to the loop variable before every pass. The syntax for the “for” loop is:
for ( value in sequence )
{
statement
}
We can nest for loops as well. The syntax below shows a nested for loop.
for ( val1 in sequence )
{
for ( val2 in sequence )
{
Statement
}
}
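A runnable instance of the nested for loop, filling a small multiplication table. The name `table3` is chosen only for this sketch.

```r
# 3 x 3 matrix that will hold the products i * j
table3 <- matrix(0, nrow = 3, ncol = 3)

for (i in 1:3) {      # outer loop walks over the rows
  for (j in 1:3) {    # inner loop walks over the columns
    table3[i, j] <- i * j
  }
}

print(table3)
```

The inner loop runs to completion for every single pass of the outer loop, so the assignment executes nine times here.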
While loop: A while loop repeats its body as long as a condition remains TRUE. The condition is tested before each pass, so the body may not run at all. The syntax for a while loop is:
while (test_expression)
{
statement
}
Example of while loop:
i <- 5
while (i < 15) {
print(i)
i = i+2
}
OUTPUT:
[1] 5
[1] 7
[1] 9
[1] 11
[1] 13
Repeat loop: This loop runs a block of code again and again. It cannot exit on its own once it has started, so a break statement is needed to leave the loop. The syntax for the repeat loop is:
repeat {
statement
}
An example showing how to use repeat loop is:
x <- 5
repeat {
print(x)
x = x+2
if (x == 13){
break}}
OUTPUT:
[1] 5
[1] 7
[1] 9
[1] 11
The loop stops before printing 13, because the break runs as soon as x reaches 13. Now that we have studied the loops, let us look at the statements that control them: break, next and return. Let’s learn briefly about these statements.
- Break: The break statement terminates the loop immediately and transfers the execution to the first statement following the loop.
- Next: The next statement skips the rest of the current iteration and moves on to the next iteration of the loop.
- Return: The return statement makes a function exit and return the given value.
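The effect of next and break can be seen together in one short loop. In this sketch the variable name `odd_values` is an assumption made for the example.

```r
odd_values <- c()

for (i in 1:10) {
  if (i %% 2 == 0) next   # skip the rest of this pass for even numbers
  if (i > 7) break        # leave the loop entirely once i exceeds 7
  odd_values <- c(odd_values, i)
}

print(odd_values)
```

Here next only abandons the current pass, while break ends the loop for good, so 9 is never collected.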
All these loops are very beneficial while programming in R, but writing them out is awkward when working interactively on the R command line. R therefore has a set of functions which perform a loop in a single call. These functions are described below:
- lapply: The lapply() function loops over a list and evaluates a function on each of the list elements. It always returns a list, even if the input is provided in another data type. The below example shows the use of lapply().
> x <- list(a = 1:9, b = rnorm(10))
> lapply(x, mean)
$a
[1] 5
$b
[1] -0.1628759
- sapply: It works like the lapply() function but tries to simplify the result, for example into a vector, before returning it. The code below shows us the difference between the lapply() and the sapply() functions.
> a <- list ( b = 1:4 , c = rnorm(10) , d = rnorm ( 20 , 1 ) )
> lapply ( a , mean )
$b
[1] 2.5
$c
[1] -0.437244
$d
[1] 0.9667708
> sapply ( a , mean )
b c d
2.5000000 -0.4372440 0.9667708
> mean ( a )
[1] NA
Warning message:
In mean.default(a) : argument is not numeric or logical: returning NA
- apply: The apply() function is applied over the margins of a defined array.
- tapply: The tapply() function is applied over the subsets of a vector.
- mapply: The mapply() function is a multivariate form of the lapply() function.
An auxiliary function split() is sometimes used in conjunction with the lapply() function, which is also very useful.
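The remaining members of this family can be sketched in a few lines. The vectors `marks` and `dept` below are hypothetical data made up for the illustration.

```r
marks <- c(70, 85, 60, 90, 75)
dept  <- c("IT", "CSE", "IT", "CSE", "IT")

# tapply: apply mean() to the subsets of marks defined by dept
dept_means <- tapply(marks, dept, mean)
print(dept_means)

# mapply: call rep() with paired arguments, i.e. rep(1, 3), rep(2, 2), rep(3, 1)
repeated <- mapply(rep, 1:3, 3:1)
print(repeated)

# split() breaks the vector into a list of groups, which sapply() then loops over
by_dept  <- split(marks, dept)
dept_max <- sapply(by_dept, max)
print(dept_max)
```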
R Functions
A function is a group of statements written to perform a particular task; once defined, it can be called wherever that task is required. Functions reduce the complexity of a program, because instead of writing the same block of code every time we can just call the function. There are two types of functions in R, i.e., built-in functions and user-defined functions. An R function is created using the keyword function. The syntax for the function is as follows:
function_name <- function ( arg_1 , arg_2 , … ) {
Function body
}
The components of an R function include:
- Function name: It is the name of the function. The function is stored as an R object under this name in the R environment. The rules for naming a function are the same as those for naming a variable.
- Function arguments: Function arguments are named variables that are local to the function. They receive the input values supplied when the function is called, and the function performs its calculations through them. Arguments are optional: a function can take no argument, one argument or several arguments.
- Function body: The function body is the collection of statements which defines what the function does.
- Return value: The return value is the result the function hands back; by default it is the value of the last expression evaluated in the body.
R functions are “first class objects”, which means that they can be treated like any other R object. Functions can be passed as arguments to other functions and can also be nested. Now, let’s discuss the built-in functions and the user-defined functions.
Built-in functions are those functions which have already been defined in R. The programmer can directly call them in their programs. Some examples of built-in functions are seq(), min(), max(), mean(), sum(), etc. Some working examples of built-in functions are shown below:
# Find the mean of numbers between 5 and 18
print ( mean ( 5 : 18 ) )
[1] 11.5
# Find the sum of first 15 natural numbers
print ( sum ( 1 : 15 ) )
[1] 120
# Print the sequence of even numbers between 2 and 24
print ( seq ( 2 , 24 , 2 ) )
[1] 2 4 6 8 10 12 14 16 18 20 22 24
User-defined functions are functions written by the programmer. Once created, they behave and work like the built-in functions. To create and run a function, go to the File menu of your R GUI, select New script and write your code in the script window. Given below is an example of how a function can be created and used.
# Defining a function with an argument, then calling it to print the first six multiples of 2.
> fun <- function(x) {
+ for ( i in 1:x ) {
+ b <- i * 2
+ print ( b )}}
> fun(6)
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
# Defining and calling a function without an argument
> fun <- function() {
+ for ( i in 1:6 ) {
+ b <- i * 2
+ print ( b )
+ }}
> fun()
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
This function runs without any input because the value 6 is written directly in its body instead of being passed through an argument. Now, let’s define a function whose arguments have default values; such a function can be called either without any argument or with arguments.
# Creating the function with pre-defined arguments
> fun <- function ( a=2 , b=5) {
+ sum <- a + b
+ print ( sum )
+ }
> # Call the function fun having no arguments
> fun()
[1] 7
> # Call the function by passing the arguments as 5 and 9
> fun( 5 , 9 )
[1] 14
In R programming the arguments to a function are evaluated lazily, that is, only when they are actually needed. This property of R functions is known as lazy evaluation. The example below illustrates it.
> fun <- function ( a , b ) {
+ a^2
+ }
> fun ( 3 )
[1] 9
Here, the missing argument b does not create any problem because the body of the function never uses it, so R never tries to evaluate it.
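The other side of lazy evaluation is that a missing argument does cause an error as soon as the function body actually uses it. The sketch below illustrates this with two hypothetical functions, uses_b and check_b; the error message text written in the handler is our own wording, not the one R prints.

```r
# uses_b is a hypothetical function that really needs its second argument
uses_b <- function(a, b) {
  a^2 + b
}

# The error is raised only at the point where b is evaluated
result <- tryCatch(uses_b(3), error = function(e) "b is missing")
print(result)

# missing() reports whether an argument was supplied, without evaluating it
check_b <- function(a, b) {
  if (missing(b)) a^2 else a^2 + b
}
print(check_b(3))
print(check_b(3, 1))
```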
Let’s suppose a case where you want to extend the arguments of a function at a later point of time. In this case, we can use the ‘...’ argument, which accepts any number of additional arguments and lets the function pass them on to another function. It is commonly used when a function wraps another function and should not have to list all of its arguments. The following code block shows an example of the ‘...’ argument:
newplot <- function ( a , b , type = "l" , ... ) {
plot ( a , b , type = type , ... )
}
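The same idea can be tried without drawing a plot. The following is a small sketch, assuming two hypothetical helper functions, describe and count_extra, which show how extra arguments travel through ‘...’.

```r
# describe is a hypothetical wrapper; anything passed through ... goes on to mean()
describe <- function(x, ...) {
  mean(x, ...)
}

values <- c(4, 8, NA, 12)
describe(values)                # NA, because of the missing value
describe(values, na.rm = TRUE)  # na.rm is forwarded to mean()

# list(...) collects the extra arguments so that they can be inspected
count_extra <- function(...) {
  length(list(...))
}
count_extra(1, "a", TRUE)
```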
Dates and Time
R has a set of specific functions used to define dates and times wherever required. Dates in R are represented by the Date class and are created using the as.Date() function.
> x <- as.Date("2020-04-11")
> print ( x )
[1] "2020-04-11"
Time is represented in R using the POSIXct and POSIXlt classes. POSIXct stores the time as a single number, the number of seconds since 1970-01-01 (UTC), and is used when the time has to be stored in something like a data frame. POSIXlt stores the time as a list underneath and also keeps other useful information such as the day of the week, day of the year, month, day of the month, etc. There are other functions such as weekdays(), which gives the day of the week, months(), which gives the name of the month, and quarters(), which gives the quarter, i.e., Q1, Q2, Q3 or Q4. An example showing the use of POSIXlt is given below.
> a <- Sys.time()
> a
[1] "2021-06-14 12:05:43 IST"
> b <- as.POSIXlt( a )
> names(unclass(b))
[1] "sec" "min" "hour" "mday" "mon" "year" "wday" "yday"
[9] "isdst" "zone" "gmtoff"
> b$min
[1] 5
Suppose that your date is written in a different format and you want to convert it into a date-time object; then you can use the strptime() function, passing it a format string that describes how the text is written.
> date <- c ( "January 08, 2021 12:11", "March 16, 2021 13:15")
> x <- strptime(date, "%B %d, %Y %H:%M")
> print ( x )
[1] "2021-01-08 12:11:00 IST" "2021-03-16 13:15:00 IST"
> class ( x )
[1] "POSIXlt" "POSIXt"
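A few more date operations are sketched below under simple assumptions: the dates are the ones used in this section, and the time zone is fixed to UTC so that the result does not depend on the local settings. The names printed by weekdays() and months() depend on the locale of the system, so the sketch reads the numeric fields instead.

```r
# Create a Date object and pull out some of its parts
d <- as.Date("2020-04-11")
quarters(d)            # quarter of the year
as.POSIXlt(d)$wday     # day of the week as a number, 0 = Sunday
format(d, "%d/%m/%Y")  # the same date written in another layout

# Difference between two dates in days
gap <- as.Date("2021-03-16") - as.Date("2021-01-08")
as.numeric(gap)

# Parse a date-time written with a numeric month; tz = "UTC" avoids
# depending on the local time zone
stamp <- strptime("16/03/2021 13:15", "%d/%m/%Y %H:%M", tz = "UTC")
stamp$hour
```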
R Packages
R has a collection of functions, data, documentation and shareable code contained in bundles known as R packages. R packages help us while writing code related to statistical data. These packages are stored under the directory named ‘library’ in the R environment. R installs some of the basic packages by default during installation; however, we need to manually install the packages which are required for specific purposes. These packages are what make R a strong and powerful language. The details of all the available R packages are listed on the CRAN website. Now, let’s check where and what packages are preinstalled on our system.
The location or Path where our packages are stored in our system can be known by executing the following command:
> .libPaths()
[1] "C:/Program Files/R/R-4.1.0/library"
The path of the package library can be different for different users depending upon their system and where they have installed R. Now, in order to know what packages are preinstalled in our system, run the following command:
> library()
This results in providing the following packages:
Packages in library ‘C:/Program Files/R/R-4.1.0/library’:
base The R Base Package
boot Bootstrap Functions (Originally by Angelo Canty for S)
class Functions for Classification
cluster “Finding Groups in Data”: Cluster Analysis Extended Rousseeuw et al.
codetools Code Analysis Tools for R
compiler The R Compiler Package
datasets The R Datasets Package
foreign Read Data Stored by ‘Minitab’, ‘S’, ‘SAS’, ‘SPSS’, ‘Stata’, ‘Systat’, ‘Weka’, ‘dBase’
graphics The R Graphics Package
grDevices The R Graphics Devices and Support for Colours and Fonts
grid The Grid Graphics Package
KernSmooth Functions for Kernel Smoothing Supporting Wand & Jones (1995)
lattice Trellis Graphics for R
MASS Support Functions and Datasets for Venables and Ripley’s MASS
Matrix Sparse and Dense Matrix Classes and Methods
methods Formal Methods and Classes
mgcv Mixed GAM Computation Vehicle with Automatic Smoothness Estimation
nlme Linear and Nonlinear Mixed Effects Models
nnet Feed-Forward Neural Networks and Multinomial Log-Linear Models
parallel Support for Parallel computation in R
rpart Recursive Partitioning and Regression Trees
spatial Functions for Kriging and Point Pattern Analysis
splines Regression Spline Functions and Classes
stats The R Stats Package
stats4 Statistical Functions using S4 Classes
survival Survival Analysis
tcltk Tcl/Tk Interface
tools Tools for Package Development
translations The R Translations Package
utils The R Utils Package
In order to know which packages are currently loaded in your R session, run the command given below:
> search()
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
So far we have seen how to find out which packages are on our system and which ones are currently loaded. Let us now see how to install a new package, as this is going to be an important task in the future once we get into the field of data science and need to train our models. A package can be installed into the R environment in two ways: first, we can install it directly from the CRAN repository, and second, we can download it first and then install it manually.
In order to install a package directly into the system from the CRAN server, we need to run the command given below. The syntax for the command is:
> install.packages("Package Name")
Here, “Package Name” is the name of the package that you want to install in your R environment; the list of available packages can be found on the CRAN website. Let’s try to install a package named “abc” in our environment. This package is a tool for Approximate Bayesian Computation (ABC).
> install.packages("abc")
On running this command a prompt will appear on your screen which will ask for permission to make the environment writeable. Select “Yes” and then one more prompt will appear which will ask for the nearest location of yours in order to get the package installed from the nearest CRAN mirror. Once the location is provided to it the package will be installed automatically.
Now, let us try to install an R package manually. To do this, first of all we need to download the package from the CRAN website as a zip file. After this, run the following command in order to install the package in your environment. The syntax of the code is:
install.packages ( "Zip file name with path" , repos = NULL , type = "win.binary" )
# Run the below command to install the package "abc" to your environment
# (for a source archive ending in .tar.gz, use type = "source" instead)
install.packages( "D:/abc_2.1.zip" , repos = NULL , type = "win.binary" )
After we are done with installing the packages, we need to load a package from our library in order to use it. To do so, the following syntax must be followed:
library ( "Name of package" , lib.loc = "path of directory" )
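Before loading a package it can be useful to check whether it is installed at all. The sketch below shows one way of doing this with requireNamespace(); the stats package is used only because it ships with R, and the variable names pkg and loaded are our own.

```r
# Check whether a package is installed before trying to load it
pkg <- "stats"  # stats ships with R, so it is used here as a safe example
if (requireNamespace(pkg, quietly = TRUE)) {
  # character.only is needed when the name is held in a variable
  library(pkg, character.only = TRUE)
  loaded <- TRUE
} else {
  message("Package ", pkg, " is not installed; run install.packages(\"", pkg, "\")")
  loaded <- FALSE
}
print(loaded)
print(paste0("package:", pkg) %in% search())
```

The same pattern works for any package name, for example "abc" from the installation step above.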
Reading Data in R
R is a programming language which is best suited to statistical calculations and data mining. So, in order to perform operations on data we first need to learn how to read data in R. We can also write data from R, and we will practice both reading and writing data in the upcoming part. R can read and write data in several formats such as plain-text tables, csv, Excel, binary, xml, json, etc.
R – CSV file
Most of the available datasets are stored as csv files, so it becomes important for a data scientist to be able to read and write data to and from a csv dataset. A csv file is a text file in which the values are separated by commas. Let’s create a csv file in Notepad by the name student.csv.
roll,name,section,course,start_date
1,Rohit,17,CSE,2017-01-09
2,Raushan,16,IT,2016-07-13
3,Anjali,18,CSE,2018-07-13
4,Ashwin,17,IT,2017-01-09
5,Ajay,18,IT,2018-01-09
6,Gouri,17,CSE,2017-01-09
7,Shanon,16,IT,2016-07-13
8,Manpreet,17,CSE,2017-01-09
In order to be able to read our saved csv file we need to save it in the current working directory of R, which can be found using getwd(). We will use the read.csv() function in order to read the csv file.
> data <- read.csv("student.csv")
> print ( data )
roll name section course start_date
1 1 Rohit 17 CSE 2017-01-09
2 2 Raushan 16 IT 2016-07-13
3 3 Anjali 18 CSE 2018-07-13
4 4 Ashwin 17 IT 2017-01-09
5 5 Ajay 18 IT 2018-01-09
6 6 Gouri 17 CSE 2017-01-09
7 7 Shanon 16 IT 2016-07-13
8 8 Manpreet 17 CSE 2017-01-09
# Checks if the csv file is a data frame or not
print ( is.data.frame ( data ) )
[1] TRUE
# Prints the number of columns
print ( ncol ( data ) )
[1] 5
# Prints the number of rows
print ( nrow ( data ) )
[1] 8
Now, we can perform operations on this data frame just like on any other data frame. Let’s try to print the details of the students in the “CSE” course from the student.csv file using the subset() function.
> subset( data, course == "CSE")
roll name section course start_date
1 1 Rohit 17 CSE 2017-01-09
3 3 Anjali 18 CSE 2018-07-13
6 6 Gouri 17 CSE 2017-01-09
8 8 Manpreet 17 CSE 2017-01-09
Similarly, we can apply other calculations as well but let us now move further in order to learn how to write in a csv file. We can write or create a new csv file using the write.csv() function. The following example shows how to use the write.csv() function.
> data <- read.csv("student.csv")
> write.csv(data, "mydata.csv")
> read.csv("mydata.csv")
X roll name section course start_date
1 1 1 Rohit 17 CSE 2017-01-09
2 2 2 Raushan 16 IT 2016-07-13
3 3 3 Anjali 18 CSE 2018-07-13
4 4 4 Ashwin 17 IT 2017-01-09
5 5 5 Ajay 18 IT 2018-01-09
6 6 6 Gouri 17 CSE 2017-01-09
7 7 7 Shanon 16 IT 2016-07-13
8 8 8 Manpreet 17 CSE 2017-01-09
In the above output, we see that an additional column X has been added automatically. It holds the row names that write.csv() stores by default, so we need to remove it in order to get back the original data. We can set row.names to FALSE in order to leave the row labels out of the csv file.
> write.csv(data, "mydata.csv", row.names = FALSE)
> read.csv("mydata.csv")
roll name section course start_date
1 1 Rohit 17 CSE 2017-01-09
2 2 Raushan 16 IT 2016-07-13
3 3 Anjali 18 CSE 2018-07-13
4 4 Ashwin 17 IT 2017-01-09
5 5 Ajay 18 IT 2018-01-09
6 6 Gouri 17 CSE 2017-01-09
7 7 Shanon 16 IT 2016-07-13
8 8 Manpreet 17 CSE 2017-01-09
Now we don’t have any unwanted row labels in our csv file. Similarly, we can remove the double quotes around text values in our csv file by setting quote = FALSE. For more control, the write.table() function additionally accepts the append, col.names and sep arguments; write.csv() is a wrapper around it that fixes the separator to a comma.
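The whole write and read cycle can be tried without touching any existing file. The following sketch assumes a small data frame typed in by hand, standing in for student.csv, and writes it to a temporary file created with tempfile().

```r
# A small data frame standing in for student.csv
students <- data.frame(
  roll = 1:3,
  name = c("Rohit", "Raushan", "Anjali"),
  course = c("CSE", "IT", "CSE")
)

# tempfile() gives a throwaway path so that no existing file is overwritten
path <- tempfile(fileext = ".csv")
write.csv(students, path, row.names = FALSE)

# Read the file back and filter it
back <- read.csv(path)
cse <- subset(back, course == "CSE")
print(cse)
print(nrow(cse))
```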
R – Excel file
Now we will see how to read data from Excel sheets, as Microsoft Excel is one of the most widely used programs for storing data in a tabular form. It is used almost everywhere in order to save data about employees, students, work, etc. It stores the data in .xls or .xlsx form. There are some Excel-specific packages with which R can read these files directly; these packages include readxl, gdata, XLConnect, xlsx, etc. We are going to work with the “readxl” package in the example below. In order to work with any other package you can simply install that package, load it and voila! You are ready to go. So first we will install our package.
> install.packages ( "readxl" )
It may take a while installing this package as it will also install other useful packages with it automatically including cli, utf8, rematch, ellipsis, etc. Let’s try to verify whether our system has installed the correct package or not.
> any ( grepl ( "readxl" , installed.packages () ) )
[1] TRUE
# Load the library “readxl”
> library ( "readxl" )
Now, create a .xlsx file in Microsoft Excel. You can copy the previous data into an Excel sheet if you want to speed things up, or you can type the data in manually, which will help you to understand your data. After saving your .xlsx file, run the statement given below to read the data from it.
> read_excel("C:\\Users\\…\\student.xlsx")
roll name section course start_date
1 1 Rohit 17 CSE 2017-01-09
2 2 Raushan 16 IT 2016-07-13
3 3 Anjali 18 CSE 2018-07-13
4 4 Ashwin 17 IT 2017-01-09
5 5 Ajay 18 IT 2018-01-09
6 6 Gouri 17 CSE 2017-01-09
7 7 Shanon 16 IT 2016-07-13
8 8 Manpreet 17 CSE 2017-01-09
R – binary files
As computer science learners we are all familiar with terms such as bits, bytes and binary. For those who are not, a binary file stores its data as raw bytes rather than as readable text. When such a file is opened in a text editor, its bytes show up as non-printable characters and symbols like Ø and ð, which cannot be read and understood by humans. In order to read a binary file, we need a program which understands its specific format. For example, in order to read a Word document, we need to open it in an application which supports the Word format.
R has functions which allow it to read and write data to and from a binary file. In order to write data we use the writeBin() function, and the readBin() function reads the data from the binary file. The syntax for these functions is given below:
writeBin ( object , con )
readBin ( con , what , n )
The parameters used in the syntax are described below:
- object: It is the R object (for example a vector) which has to be written to the binary file.
- con: It is the connection which enables reading and writing on a binary file.
- what: It is the type of the data to be read, like character, integer, etc.
- n: It represents the maximum number of values that need to be read from the specified binary file.
Let us now take an example in which we will try to write and read data in a binary file. First we will load the MASS library and save its motors dataset in the csv format; then we will write a part of it to a binary file using the writeBin() function.
# Read the dataset from the library MASS
> library ( MASS )
# Print the dataset motors
> print ( motors )
temp time cens
1 150 8064 0
2 150 8064 0
3 150 8064 0
4 150 8064 0
5 150 8064 0
6 150 8064 0
…. … … … … … ….
# Write the data frame "motors" to a comma-separated csv file
> write.table ( motors , file = "motors.csv" , row.names = FALSE , na = "" , col.names = TRUE , sep = "," )
# Store the first 10 records of the csv file in the data frame motor
> motor <- read.table ( "motors.csv" , sep = "," , header = TRUE , nrows = 10 )
# Add a serial number and rename the columns (temp, time and cens in the original dataset)
> motor <- data.frame ( sr_no = 1:10 , temperature = motor$temp , time_in_sec = motor$time , d = motor$cens )
# Creating the connection object and setting the mode to "wb" to write in binary
> write.filename = file ( "C:/…/R/win-library/4.1/binmotors.dat" , "wb" )
# Writing the column names to the connection object
> writeBin ( colnames ( motor ) , write.filename )
# Writing the column values, one column after another
> writeBin ( c ( motor$sr_no , motor$temperature , motor$time_in_sec , motor$d ) , write.filename )
# Now, close the file in order to start the read operation
> close ( write.filename )
The binary file which has been created in the above process contains the column names followed by the values of the data frame in a column-wise manner. Now, we will read the data back in the same order.
# Creating the connection object for reading using mode "rb"
> read.file <- file ( "C:/…/R/win-library/4.1/binmotors.dat" , "rb" )
# Now, we will read the column names. Here we have n = 4 as we have to read 4 column names.
> col.name <- readBin ( read.file , character() , n = 4 )
# Now, we will read the column values from the same connection. Here n = 40 as we have 4 columns of 10 values each.
> binary.data <- readBin ( read.file , integer() , n = 40 )
> close ( read.file )
> print ( binary.data )
# Combining all the data into a matrix, one column at a time.
> finaldata = cbind ( binary.data[1:10] , binary.data[11:20] , binary.data[21:30] , binary.data[31:40] )
# Setting the column names.
> colnames ( finaldata ) = col.name
# Printing the whole data finally
> print ( finaldata )
 sr_no temperature time_in_sec d
[1,] 1 150 8064 0
[2,] 2 150 8064 0
[3,] 3 150 8064 0
[4,] 4 150 8064 0
[5,] 5 150 8064 0
[6,] 6 150 8064 0
[7,] 7 150 8064 0
[8,] 8 150 8064 0
[9,] 9 150 8064 0
[10,] 10 150 8064 0
Thus, we get back the original data which had previously been written to the binary file.
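The order of the reads has to mirror the order of the writes, because a binary file carries no labels of its own. The following is a minimal, self-contained sketch of that rule; it uses a temporary file and a handful of made-up integer values rather than the motors dataset.

```r
# Write two column names and four made-up integers to a temporary binary file
path <- tempfile(fileext = ".dat")
out <- file(path, "wb")
writeBin(c("temp", "time"), out)            # two column names as character strings
writeBin(c(150L, 170L, 8064L, 1764L), out)  # four integers, 4 bytes each
close(out)

# Read everything back in the same order from a single connection
inp <- file(path, "rb")
col_names <- readBin(inp, character(), n = 2)
values <- readBin(inp, integer(), n = 4)
close(inp)

# Rebuild the two columns; matrix() fills column by column
result <- matrix(values, ncol = 2, dimnames = list(NULL, col_names))
print(result)
```

Reading both parts from one open connection keeps the position in the file right after the column names, so the integers are picked up from the correct byte.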
Reading XML files in R
XML stands for Extensible Markup Language. It contains markup tags similar to HTML, but in XML the markup tags describe the data contained in the file. XML is also a common format for sharing data on the internet.
In order to read an “xml” file we need to download and install the “XML” package. To install it, run the following command:
> install.packages("XML")
Now, we need an xml file to work on. Copy the xml code given below, paste it in Notepad and then save it as student.xml, choosing All Files as the file type.
<RECORDS>
<STUDENT>
<ROLL>1</ROLL>
<NAME>Rohit</NAME>
<SECTION>17</SECTION>
<COURSE>CSE</COURSE>
<STARTDATE>2017-01-09</STARTDATE>
</STUDENT>
<STUDENT>
<ROLL>2</ROLL>
<NAME>Raushan</NAME>
<SECTION>16</SECTION>
<COURSE>IT</COURSE>
<STARTDATE>2016-07-13</STARTDATE>
</STUDENT>
<STUDENT>
<ROLL>3</ROLL>
<NAME>Anjali</NAME>
<SECTION>18</SECTION>
<COURSE>CSE</COURSE>
<STARTDATE>2018-07-13</STARTDATE>
</STUDENT>
<STUDENT>
<ROLL>4</ROLL>
<NAME>Ashwin</NAME>
<SECTION>17</SECTION>
<COURSE>IT</COURSE>
<STARTDATE>2017-01-09</STARTDATE>
</STUDENT>
<STUDENT>
<ROLL>5</ROLL>
<NAME>Ajay</NAME>
<SECTION>18</SECTION>
<COURSE>IT</COURSE>
<STARTDATE>2018-01-09</STARTDATE>
</STUDENT>
</RECORDS>
Now, we will read the xml file using the R function xmlParse(). So, now we will load the packages and check out our xml file.
# Load the package “xml”
> library("XML")
# Load the package “methods”
> library("methods")
# Parse the input file and store it in data
> data <- xmlParse ( file = "student.xml" )
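# Print the results
> print(data)
<?xml version="1.0"?>
<RECORDS>
  <STUDENT>
    <ROLL>1</ROLL>
    <NAME>Rohit</NAME>
    <SECTION>17</SECTION>
    <COURSE>CSE</COURSE>
    <STARTDATE>2017-01-09</STARTDATE>
  </STUDENT>
  <STUDENT>
    <ROLL>2</ROLL>
    <NAME>Raushan</NAME>
    <SECTION>16</SECTION>
    <COURSE>IT</COURSE>
    <STARTDATE>2016-07-13</STARTDATE>
  </STUDENT>
  <STUDENT>
    <ROLL>3</ROLL>
    <NAME>Anjali</NAME>
    <SECTION>18</SECTION>
    <COURSE>CSE</COURSE>
    <STARTDATE>2018-07-13</STARTDATE>
  </STUDENT>
  <STUDENT>
    <ROLL>4</ROLL>
    <NAME>Ashwin</NAME>
    <SECTION>17</SECTION>
    <COURSE>IT</COURSE>
    <STARTDATE>2017-01-09</STARTDATE>
  </STUDENT>
  <STUDENT>
    <ROLL>5</ROLL>
    <NAME>Ajay</NAME>
    <SECTION>18</SECTION>
    <COURSE>IT</COURSE>
    <STARTDATE>2018-01-09</STARTDATE>
  </STUDENT>
</RECORDS>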
Now that we have the contents of our xml file printed as expected, let us get the number of nodes present under the root of the XML file:
# Load the required packages
> library ( "XML" )
> library ( "methods" )
# Run the function to read the xml file
> data <- xmlParse ( file = "student.xml" )
# Extracting the root node now in rn
> rn <- xmlRoot ( data )
# Counting the number of nodes present
> nnodes <- xmlSize ( rn )
> print ( nnodes )
[1] 5
Now, let us try and get the details of the first student or the first node from the “student.xml” file:
# Load the required packages
> library ( "XML" )
> library ( "methods" )
# Run the function to read the xml file
> data <- xmlParse ( file = "student.xml" )
# Extracting the root node now in rn
> rn <- xmlRoot ( data )
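# Print the first node under the root node rn
> print(rn[1])
$STUDENT
<STUDENT>
  <ROLL>1</ROLL>
  <NAME>Rohit</NAME>
  <SECTION>17</SECTION>
  <COURSE>CSE</COURSE>
  <STARTDATE>2017-01-09</STARTDATE>
</STUDENT>
attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
Now, let us convert our XML file to a data frame, as a data frame is required for the analysis of large amounts of data.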
# Load the required packages
> library ( "XML" )
> library ( "methods" )
# Converting the xml file to the data frame
> df <- xmlToDataFrame("student.xml")
> print(df)
ROLL NAME SECTION COURSE STARTDATE
1 1 Rohit 17 CSE 2017-01-09
2 2 Raushan 16 IT 2016-07-13
3 3 Anjali 18 CSE 2018-07-13
4 4 Ashwin 17 IT 2017-01-09
5 5 Ajay 18 IT 2018-01-09
Now, this data frame is available to be used and manipulated as we wish.
Reading JSON files in R
JavaScript Object Notation or JSON files can be read by R using the rjson package. We will install the rjson package using the following command:
> install.packages("rjson")
Create an input file of type json by copying the text below into Notepad and saving it as student.json, choosing All Files as the file type.
{
"Roll" : [ "1" , "2" , "3" , "4" , "5" , "6" , "7" , "8" ],
"Name" : [ "Rohit" , "Raushan" , "Anjali" , "Ashwin" , "Ajay" , "Gouri" , "Shanon" , "Manpreet" ],
"Section" : [ "17" , "16" , "18" , "17" , "18" , "17" , "16" , "17" ],
"Dept" : [ "CSE" , "IT" , "CSE" , "IT" , "IT" , "CSE" , "IT" , "CSE" ],
"StartDate" : [ "1/9/2017" , "7/13/2016" , "7/13/2018" , "1/9/2017" , "1/9/2018" , "1/9/2017" , "7/13/2016" , "1/9/2017" ]
}
# Load the downloaded rjson package
> library ( "rjson" )
# Storing the result in data
> data <- fromJSON ( file = "student.json" )
# Printing the data
> print ( data )
$Roll
[1] "1" "2" "3" "4" "5" "6" "7" "8"

$Name
[1] "Rohit" "Raushan" "Anjali" "Ashwin" "Ajay" "Gouri" "Shanon" "Manpreet"

$Section
[1] "17" "16" "18" "17" "18" "17" "16" "17"

$Dept
[1] "CSE" "IT" "CSE" "IT" "IT" "CSE" "IT" "CSE"

$StartDate
[1] "1/9/2017" "7/13/2016" "7/13/2018" "1/9/2017" "1/9/2018" "1/9/2017" "7/13/2016" "1/9/2017"
Now that we have read a json file, we will convert the result into a data frame for further manipulation.
# Load the downloaded rjson package
> library ( "rjson" )
# Storing the result in data
> data <- fromJSON ( file = "student.json" )
# Converting the parsed json data into a data frame
> df <- as.data.frame ( data )
> print(df)
Roll Name Section Dept StartDate
1 1 Rohit 17 CSE 1/9/2017
2 2 Raushan 16 IT 7/13/2016
3 3 Anjali 18 CSE 7/13/2018
4 4 Ashwin 17 IT 1/9/2017
5 5 Ajay 18 IT 1/9/2018
6 6 Gouri 17 CSE 1/9/2017
7 7 Shanon 16 IT 7/13/2016
8 8 Manpreet 17 CSE 1/9/2017
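Since every value in the json file is written as text, the columns of the resulting data frame usually need to be converted to proper types. The sketch below shows this step in base R only; the list named parsed is typed in by hand as a stand-in for the parsed file, so the rjson package is not needed to run it.

```r
# A named list shaped like the result of parsing student.json (written by hand here)
parsed <- list(
  Roll = c("1", "2", "3"),
  Name = c("Rohit", "Raushan", "Anjali"),
  StartDate = c("1/9/2017", "7/13/2016", "7/13/2018")
)

df <- as.data.frame(parsed, stringsAsFactors = FALSE)

# Every value arrives as text, so convert the columns that need another type
df$Roll <- as.integer(df$Roll)
df$StartDate <- as.Date(df$StartDate, format = "%m/%d/%Y")
str(df)
```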
Reading Web data in R
A lot of data is published by many websites in the form of csv, xls, etc. files, which are freely available to be extracted and used. For example, the World Health Organization (WHO) publishes a lot of data regarding health all over the world. Using R we can extract such data from web pages. Some of the packages that are most widely used in order to connect to these URLs and download the relevant files are stringr, XML, RCurl, plyr, etc. We will first install these packages using the commands given below:
> install.packages("stringr")
> install.packages("XML")
> install.packages("RCurl")
> install.packages("plyr")
Now, we will provide the URL to R, and R will then download the required files for us into the current working directory.
The function getHTMLLinks() will help us in gathering the urls of the files. Then the function download.file() will save the files to the local systems. So let’s get started with the coding part.
# Load the required packages
> library ( XML )
> library ( stringr )
> library ( plyr )
# First step will be to store the URL
> url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1561/datasets"
# Get the HTML links which are present in this url's webpage
> links <- getHTMLLinks ( url )
# Pointing the links that we want to download
> link_name <- links [ str_detect ( links , "horsebeans" ) ]
# Arrange the names in a list
> lst <- as.list ( link_name )
# Function which downloads the specified file (assuming the links are bare file names)
> dwnld <- function ( mainurl , filename ) {
+ filedetails <- str_c ( mainurl , "/" , filename )
+ download.file ( filedetails , filename )
+ }
# Downloading the files from the url
> l_ply ( link_name , dwnld , mainurl = url )
You can verify the downloaded files in the current working directory of R.
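The filtering and URL-building steps can be tried offline in base R before any download is attempted. In the sketch below the link names and the base address are hypothetical placeholders, not a real data source, and the download itself is left commented out.

```r
# Hypothetical link names, standing in for the result of getHTMLLinks()
links <- c("horsebeans.csv", "chickwts.csv", "horsebeans_notes.txt")
base_url <- "http://example.com/datasets"  # placeholder address, not a real data source

# grepl() plays the role of str_detect(), paste0() the role of str_c()
wanted <- links[grepl("horsebeans", links, fixed = TRUE)]
full_urls <- paste0(base_url, "/", wanted)
print(full_urls)

# The actual download step is left commented out because it needs a real URL
# for (i in seq_along(wanted)) download.file(full_urls[i], wanted[i])
```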
Reading Databases in R
In R we often have to work with data that is stored in databases, and re-creating data frames by hand from such data can be very tiring for a bigger database. R can connect to many relational databases, including MySQL, SQL Server, Oracle, etc., and it fetches the query results directly as data frames. Here we are going to work with the MySQL database, which is one of the most popular databases in use, and we will practice querying, updating, creating and dropping tables in MySQL.
In order to provide connectivity with the database we are going to install and load a package named “RMySQL”. In order to install this package run the following command:
> install.packages("RMySQL")
After installing the package we will make a connection to the database using the function dbConnect(), which takes the username, password, database name and host name as inputs in order to make the connection. The syntax of the function dbConnect() is:
dbConnect ( drv , user , password , dbname , host )
The parameters used here are:
- drv: It represents the database driver
- user: It takes the username as input
- password: It takes the password as input
- dbname: It takes the name of the database as input
- host: It takes the name of the host on which the database is running as input
Let us now try to make a connection to the database:
# Install the package
> install.packages ( "RMySQL" )
# Loading the library
> library ( "RMySQL" )
# Connecting the sample database named ’world’ that is available with MySQL installation
> conn = dbConnect(MySQL(), user = 'root', password = '', dbname = 'world',host = 'localhost')
# Listing the databases tables
> dbListTables ( conn )
[1] "City" "Country" "CountryInfo" "Countrylanguage"
We have thus listed the names of all the tables in the database ‘world’. We will now query a particular table using the R function dbSendQuery(); the query is executed in MySQL, and the result set is then returned by using the function fetch() and finally stored in a data frame.
NOTE: While executing these commands and functions one should keep in mind to execute the dbConnect() function beforehand, otherwise the functions will not work.
# Querying the "city" table to get the rows.
> result = dbSendQuery(conn, "select * from city")
# Storing the result in the data frame df. Trying to get the first 5 rows of the ‘city’ table
> df = fetch(result, n = 5)
> print ( df )
city city_name country_name
1 1 Kabul Afghanistan
2 2 Qandahar Afghanistan
3 3 Mumbai India
4 4 Chittagong India
5 5 New Jersey USA
Let us now see how to update rows in a table. The same function dbSendQuery() is used to send an update statement to the database.
The syntax for the function dbSendQuery() is:
dbSendQuery(conn, statement)
where conn is the connection to the database and statement is the SQL query, as a string, that needs to be executed.
> dbSendQuery(conn, "update city set city_name = 'Muzaffarpur' where country_name = 'India'")
Now, after the table has been updated, let us see how to create a table in MySQL. The function dbWriteTable() can be used in order to create a table from a data frame. With overwrite = TRUE, this function overwrites the table if it already exists and creates a new one if there isn’t any table of this name.
# Here city is assumed to be a data frame which holds the rows to be written
> dbWriteTable(conn, "city", city[, ], overwrite = TRUE)
We can use the same function dbSendQuery() to drop a table. Let us try to drop the table city to see an example of how to drop a table.
> dbSendQuery(conn, 'DROP TABLE city')
The above query can be used to drop the table named ‘city’.
Charts and Graphs in R
R is a statistical programming language, which means that R deals best with tasks such as the interpretation of statistical data, which can then be converted into graphs or charts in order to represent the data in a pictorial form. We can represent the data in the form of pie charts, bar charts, histograms, line graphs, scatter plots and box plots.
Pie charts in R
A pie chart is one of the most used forms of data representation. It has a circular shape, and the circle is divided into slices according to the percentage of the corresponding data. The slices of a pie chart can be drawn in different colors, which makes them easy to tell apart. The slices of the pie chart carry their information through labels and numbers.
The function pie() is used to create a pie chart in R. It takes a vector of non-negative numbers as input, as the size of a slice can’t be negative. The syntax for the function pie() is given below:
pie ( x , labels , radius , main , col , clockwise )
The parameters used in this function are defined as follows:
- x: It is used to take the vector input used for the representation of the pie chart.
- labels: It is used for labelling the pie chart which describes the slices of the chart.
- radius: It is used to set the size of the pie chart. The pie is drawn inside a square box whose sides range from -1 to 1, so a radius below 1 makes the circle smaller than the box.
- main: It is used to set the heading or title of the pie chart.
- col: It is used to provide the color palette to be used for the slices of the pie chart.
- clockwise: It is a logical value which indicates whether the slices are drawn in the clockwise or in the anticlockwise direction.
As we now know the basic syntax of a pie chart, let us now try creating a very basic pie chart using the input vector and labels. Follow the below example as we create a pie chart:
# Initialize a vector input and the labels for the data
x <- c ( 32 , 10 , 27 , 52 , 15 )
labels <- c ( "Tiger" , "Lion" , "Fox" , "Elephant" , "Bear" )
# Providing a name for our pie chart
png ( file = "forest.png" )
# Plotting the chart
pie ( x , labels )
# Saving the file
dev.off()
This will save the chart in the file forest.png. In order to see the result, you need to go to the current working directory of R.
Next, we will work on the colors and main (title) of the pie chart. The parameter col will be used for coloring the pie chart and we will now use the rainbow color palette to color our pie chart. The main parameter will be used to provide title to the pie chart.
# Initialize a vector input and the labels for the data
x <- c ( 32 , 10 , 27 , 52 , 15 )
labels <- c ( "Tiger" , "Lion" , "Fox" , "Elephant" , "Bear" )
# Providing a name for our pie chart
png ( file = "forest_color.png" )
# Plotting the chart with the rainbow color palette.
# The number of colors passed to rainbow() should be the same as the length of the vector.
pie(x, labels, main = "Forest animal data", col = rainbow(length(x)))
# Saving the file
dev.off()
The result is as shown below:
Now we will add two more features to our pie chart after color and title: the percentage of each slice, and a legend that describes which color stands for which data. The vector piepercent holds the percentages, and the function legend() describes the data accordingly. The code block below shows an example of their usage:
x <- c ( 32 , 10 , 27 , 52 , 15 )
labels <- c ( "Tiger" , "Lion" , "Fox" , "Elephant" , "Bear" )
# Calculating the percentage of each slice
piepercent<- round(100*x/sum(x), 1)
# Giving the chart a name
png(file = "Forest_percentage.png")
# Plotting the chart
pie ( x , labels = piepercent , main = "Forest animal data" , col = rainbow ( length ( x ) ) )
# Describing data through legend and putting the description in the top left corner
# Here the argument cex = 0.8 is the character expansion factor.
legend ( "topleft" , c ( "Tiger" , "Lion" , "Fox" , "Elephant" , "Bear" ) , cex = 0.8 , fill = rainbow ( length ( x ) ) )
# Saving the file
dev.off()
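The percentages can also be combined with the animal names so that each slice carries both pieces of information and no separate legend is needed. The following is a small sketch of that variation; the names animals, slice_labels and out_file are our own, and the chart is written to a temporary PDF file only because the pdf() device is available on every platform.

```r
x <- c(32, 10, 27, 52, 15)
animals <- c("Tiger", "Lion", "Fox", "Elephant", "Bear")

# Share of each slice, rounded to one decimal place
piepercent <- round(100 * x / sum(x), 1)

# Combine the name and the percentage into one label per slice
slice_labels <- paste0(animals, " (", piepercent, "%)")
print(slice_labels)

# pdf() is used here only because it is available on every platform
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
pie(x, labels = slice_labels, main = "Forest animal data", col = rainbow(length(x)))
dev.off()
```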
Now, as we have learnt enough parameters for a successful plot, let’s see how to plot a 3D pie chart, which enhances the pictorial representation of the plot. In order to plot a 3D pie chart we need to install and load the library “plotrix” beforehand. Then we will use the function pie3D() for 3D plotting of the chart.
# installing the library
install.packages ( "plotrix" )
library ( plotrix )
# Declaring vector and input data as labels.
x <- c ( 32 , 10 , 27 , 52 , 15 )
lbl <- c ( "Tiger" , "Lion" , "Fox" , "Elephant" , "Bear" )
# Providing a name for our chart
png ( file = "Forest_3d.png")
# In the function the explode parameter indicates the amount by which the slices are pushed apart, in user units.
pie3D ( x , labels = lbl , explode = 0.1 , main = "Pie Chart of animals" )
dev.off()
Bar charts in R
Another form of representation used in R programming is the bar chart or bar graph. A bar chart represents categorical data, i.e., data grouped into categories, as rectangular bars whose lengths are proportional to the values they represent. The bars can be arranged vertically or horizontally, depending on the preference of the user. A chart with vertical bars is also called a column chart, and horizontal bars are sometimes called row bars. The function barplot() is used to create bar charts in R. We can also color the bars as we choose.
The syntax for the function barplot() is:
barplot ( H , xlab , ylab , main , names.arg , col )
The parameters that are used in the function barplot() are defined as follows:
- H: It is the numeric vector or matrix that holds the values to be plotted in the bar chart
- xlab: It represents the label for the x axis
- ylab: It represents the label for the y axis
- main: It provides title for the bar chart
- names.arg: It represents the names of every particular bar in the bar chart
- col: It provides colors to the bars of the bar chart.
Let us now create a basic bar chart in which each bar is labelled with its category name.
# Initialize the vector for the data and its labels
H <- c ( 23 , 17 , 9 , 32 , 14 )
n <- c ( "rice" , "wheat" , "barley" , "corn" , "grains" )
# Providing a name for the bar chart
png ( file = "bar_chart.png" )
# Plotting the specific bar chart
barplot( H , names.arg = n )
# Saving the required bar chart
dev.off()
Now, we will try adding some more parameters to the bar chart once we are ready with the basic bar chart. In the below code the parameter border adds a color to the border of the bars in the bar chart. Let’s practice the code below:
# Initialize the vector for the data and its labels
H <- c ( 23 , 17 , 9 , 32 , 14 )
n <- c ( "rice" , "wheat" , "barley" , "corn" , "grains" )
# Providing a name for the bar chart
png ( file = "bar_chart2.png" )
# Plotting the specific bar chart
barplot ( H , names.arg=n , xlab="Crops" , ylab="Tonnes" , col="Green" , main="Crops Production chart" , border = "red" )
# Saving the required bar chart
dev.off()
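The examples above draw vertical bars. As a hedged sketch of the horizontal arrangement mentioned earlier, the snippet below reuses the same data with horiz = TRUE. The name bar_mid and the decision to print each value on its bar are our own additions for illustration.

```r
H <- c(23, 17, 9, 32, 14)
n <- c("rice", "wheat", "barley", "corn", "grains")

# horiz = TRUE draws the bars horizontally, so the value axis becomes the x axis
# las = 1 keeps the crop names horizontal so they are easy to read
bar_mid <- barplot(H, names.arg = n, horiz = TRUE, las = 1,
                   xlab = "Tonnes", col = "green", border = "red",
                   main = "Crops Production chart")

# barplot() returns the midpoint of each bar; here we use them to print the values
text(x = H, y = bar_mid, labels = H, pos = 2)
```

The plot is drawn on the current graphics device. To save it to a file, wrap the calls in png() and dev.off() as in the other examples.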
Now, let us look at the case where each category has several values, which can be shown as a stacked or a grouped bar chart. For this, the values are passed to barplot() as a matrix instead of a vector. The example below creates a stacked bar chart:
# Creating the vectors to add colors, crops and quarters.
colors <- c ( "red" , "blue" , "green" )
crops <- c ( "rice" , "wheat" , "barley" , "corn" , "grains" )
quarters <- c ( "q1" , "q2" , "q3" )
# Creating the matrix to list the values for the bar chart
v <- matrix ( c ( 3 , 16 , 3 , 8 , 13 , 7 , 5 , 15 , 4 , 3 ,11 , 17 , 8 , 4 , 5 ) , nrow = 3 , ncol = 5 , byrow = TRUE )
# Giving a name for our bar chart
png ( file = "stackd_crops.png" )
# Plotting the bar chart according to preferences
barplot ( v , main = "Total Production of Crops" , names.arg = crops , xlab = "Crops" , ylab = "Tonnes" , col = colors )
# Adding the legend to the bar chart
legend ( "topleft" , quarters , cex = 1.3 , fill = colors )
# Saving the file finally
dev.off()
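The example above stacks the quarters on top of each other. A grouped bar chart can be sketched with the same matrix by passing beside = TRUE. The row and column names given through dimnames, and the names totals and group_mid, are our own additions for illustration.

```r
colors <- c("red", "blue", "green")
crops <- c("rice", "wheat", "barley", "corn", "grains")
quarters <- c("q1", "q2", "q3")
v <- matrix(c(3, 16, 3, 8, 13, 7, 5, 15, 4, 3, 11, 17, 8, 4, 5),
            nrow = 3, ncol = 5, byrow = TRUE,
            dimnames = list(quarters, crops))

# Height of each stacked bar = column total across the three quarters
totals <- colSums(v)
print(totals)

# beside = TRUE places the quarters next to each other instead of stacking them
group_mid <- barplot(v, beside = TRUE, main = "Production of Crops by Quarter",
                     xlab = "Crops", ylab = "Tonnes", col = colors)
legend("topleft", quarters, cex = 0.8, fill = colors)
```

Because the matrix has column names, barplot() uses them as bar labels, so names.arg is not needed here. A stacked chart makes the totals easy to compare, while a grouped chart makes it easier to compare individual quarters.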
Histograms in R
A histogram is similar to a bar chart in many ways, except that a histogram groups numeric values into continuous ranges. It shows the frequency of the values that fall into each range (bin) as adjacent columns along the X-axis. A histogram in R is created by the hist() function. The syntax of the hist() function is given below:
hist ( v , main , xlab , ylab , xlim , ylim , breaks , col , border )
The parameters used in the hist() function are listed below:
- v: It is the vector used to store the entries of data for the histogram.
- main: It is used to provide the title for your histogram.
- xlab: It represents the description of the x axis.
- ylab: It represents the description of the y axis.
- xlim: It specifies the range of values displayed on the X axis.
- ylim: It specifies the range of values displayed on the Y axis.
- breaks: It controls how the values are divided into bins. A single number is taken as the suggested number of bins, while a vector gives the exact break points between the bins.
- col: It represents the color of each bar.
- border: It is used to color the borders of the histogram bars.
An example of creating a histogram in R is given below:
# Creating the vector for plotting of histogram
v <- c ( 13 , 12 , 15 , 16 , 13 , 15 , 17 , 5 , 14 , 3 , 13 )
# Saving the histogram with a name
png ( file = "hist.png" )
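# Creating the histogram and mentioning the attributes
# The y axis shows frequencies; ylim must cover the tallest bar, which here has a frequency of 7
hist ( v , main = "Rainfall" , xlab = "Rainfall (cm)" , ylab = "Frequency" , col = "red" , border = "yellow" , xlim = c ( 0 , 40 ) , ylim = c ( 0 , 8 ) , breaks = 3 )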
# Saving the histogram finally
dev.off()
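To see what the breaks parameter actually does, hist() can be called with plot = FALSE, which returns the computed bins instead of drawing them. The sketch below uses the same rainfall vector. The names h and h4 and the bin width of 4 in the second call are our own choices for illustration.

```r
v <- c(13, 12, 15, 16, 13, 15, 17, 5, 14, 3, 13)

# plot = FALSE returns the computed bins without drawing anything
h <- hist(v, breaks = 3, plot = FALSE)
print(h$breaks)  # bin edges chosen by R
print(h$counts)  # number of values in each bin

# A vector of breaks fixes the bin edges exactly (here: width 4, from 0 to 20)
h4 <- hist(v, breaks = seq(0, 20, by = 4), plot = FALSE)
print(h4$counts)
```

A single number is only a suggestion: with breaks = 3, R rounds the edges to the convenient values 0, 5, 10, 15 and 20, which gives four bins rather than three.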
Line graphs in R
A line graph is used to show how data varies over time. It is plotted by joining a series of points with straight line segments, where each point marks one value of the data. A line graph can be created using the plot() function. The syntax for the plot function is as follows:
plot ( v , main , type , col , xlab , ylab )
The parameters used in the above syntax are defined as follows:
- v: It is the numeric vector that holds the values to be plotted in the line graph.
- main: It is used to provide a title for the line graph created.
- type: It is one of the most important parameters of this function, as it decides what is drawn. To plot only the points, type must be “p”; to draw only the lines, type should be “l”; and to draw both points and lines, type should be “o”.
- col: It represents the color to be applied to the line graph.
- xlab: It is used to label the x axis.
- ylab: It is used to label the y axis.
Let us create a line graph using the parameters given in the syntax above to understand how line graphs are created in R programming.
# Initialize the vector v to be represented in the line graph
v <- c ( 37 , 19 ,13 , 17 , 29 , 15 )
# Giving a name to our line graph.
png ( file = "l_gph.png" )
# Plotting the line graph with the defined parameters.
plot ( v , type = "o" , col = "blue", xlab = "Year starting from 2013", ylab = "Rice Production", main = "Rice production line Graph")
# Saving the file finally
dev.off()
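The effect of the type parameter is easiest to see side by side. The following is a minimal sketch that draws the same vector three times, once for each of the values “p”, “l” and “o”. The panel layout set with par(mfrow = ...) and the names types, ty and old_par are our own choices for illustration.

```r
v <- c(37, 19, 13, 17, 29, 15)

# Draw three panels side by side, one for each value of type
old_par <- par(mfrow = c(1, 3))
types <- c(p = "points only", l = "lines only", o = "points and lines")
for (ty in names(types)) {
  plot(v, type = ty, col = "blue", xlab = "Year", ylab = "Rice Production",
       main = paste0('type = "', ty, '": ', types[[ty]]))
}
# Restore the previous layout so later plots are not affected
par(old_par)
```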
Now we will draw multiple lines in a single graph, i.e., on the same x and y axes. This can be done with the help of the lines() function, which adds a line to an existing plot. The example below shows how to create such a graph.
# Initialize the vectors a, b and c to be represented in the line graph
a <- c ( 37 , 19 , 13 , 17 , 29 , 15 )
b <- c ( 23 , 34 , 15 , 19 , 20 , 28 )
c <- c ( 35 , 26 , 16 , 18 , 24 , 25 )
# Giving a name to our line graph.
png ( file = "l3_gph.png" )
# Plotting the line graph with the defined parameters.
plot ( a , type = "o" , col = "blue", xlab = "Year", ylab = "Rice Production", main = "Rice production line Graph")
# Defining the additional line graphs b and c
lines ( b , type = "o" , col= "red" )
lines ( c , type = "o" , col= "brown" )
# Saving the file finally
dev.off()
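One thing to watch for is that plot() sizes the axes from the first series only, so a later series with larger values would be cut off. The sketch below guards against this by computing a common range first, and adds a legend. The legend labels are hypothetical, and the third vector is named d here instead of c so that it does not share a name with the c() function.

```r
a <- c(37, 19, 13, 17, 29, 15)
b <- c(23, 34, 15, 19, 20, 28)
d <- c(35, 26, 16, 18, 24, 25)  # named d rather than c to avoid shadowing c()

# Compute one y range that covers all three series
y_range <- range(a, b, d)

plot(a, type = "o", col = "blue", ylim = y_range, xlab = "Year",
     ylab = "Rice Production", main = "Rice production line Graph")
lines(b, type = "o", col = "red")
lines(d, type = "o", col = "brown")

# The series names below are hypothetical labels for illustration
legend("topright", legend = c("Series a", "Series b", "Series d"),
       col = c("blue", "red", "brown"), lty = 1, pch = 1, cex = 0.8)
```

In this data set the first series happens to contain both the smallest and the largest value, so the result looks the same with or without ylim. With other data the explicit range matters.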
Scatter plots in R
A scatter plot is a diagram created by plotting points using Cartesian coordinates. Each point represents the values of two variables, one on the horizontal axis and the other on the vertical axis. They are called scatter plots because the points are plotted individually and appear scattered over the graph. A line can then be drawn that passes as close as possible to most of the points. This line is known as the best fit line. The function plot() is used to draw scatter plots. The syntax of the function plot() is as shown below:
plot ( x , y , main , xlab , ylab , xlim , ylim , axes , pch )
The parameters in the syntax are defined as follows:
- x: It represents the data set whose values lie on the horizontal axis, i.e., the x coordinates.
- y: It represents the data set whose values lie on the vertical axis, i.e., the y coordinates.
- main: It represents the title of the scatter plot diagram.
- xlab: It represents the label in the x axis or horizontal axis.
- ylab: It represents the label in the y axis or vertical axis.
- xlim: It represents the limits of values of x which are used for plotting.
- ylim: It represents the limits of values of y which are used for plotting.
- axes: It is a logical value that indicates whether the x axis and the y axis should be drawn on the plot.
- pch: It represents the shape of the scatter points in the graph. The default value of pch is 1, which is an empty circle. Other commonly used values are pch = 19, which is a solid circle, and pch = 21, which is a circle that can be filled with a separate background color.
Let us now take a simple example in which we will learn how to create a simple scatter plot:
# Getting the input data for our plots
attach(mtcars)
# Naming the plot to be saved by that name
png ( file = "carsplot.png" )
# Plotting the scatter plot with the below specifications
plot ( wt , mpg , xlab = "Weight of car" , ylab = "Mileage of car" , xlim = c ( 3 , 6 ) ,ylim = c ( 10 , 20 ) , main = "Comparison Scatterplot" , pch = 19 )
# Saving the file finally
dev.off()
Let us now try to draw the best fit lines in the above scatter plots which is a very important task for a data scientist.
# Getting the input data for our plots
# (if mtcars is already attached, R prints a harmless message that its objects are masked)
attach(mtcars)
# Naming the plot to be saved by that name
png(file = "carsplot_fit.png")
# Plotting the scatterplot for the mtcars
plot(wt, mpg,xlab = "Weight of car",ylab = "Mileage of car",xlim = c(3,6),ylim = c(10,20),main = "Comparison Scatterplot",pch=19)
# Adding the regression line to the graph (y~x)
abline ( lm ( mpg~wt ) , col="brown" )
# Adding the lowess line to the graph (x,y)
lines ( lowess ( wt , mpg ) , col = "green" )
# Saving the file in the directory
dev.off()
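The regression line drawn by abline() comes from a fitted linear model, and it can be useful to look at the numbers behind it. The following is a minimal sketch that uses data = mtcars instead of attach(). The names fit and new_car and the example weight are our own choices for illustration.

```r
# Fit mpg as a linear function of wt; data = mtcars avoids the need for attach()
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope of the best fit line drawn by abline()
print(coef(fit))

# Predicted mileage for a hypothetical car weighing 3,000 lbs (wt is in 1000 lbs)
new_car <- data.frame(wt = 3)
print(predict(fit, newdata = new_car))

plot(mtcars$wt, mtcars$mpg, xlab = "Weight of car", ylab = "Mileage of car",
     main = "Comparison Scatterplot", pch = 19)
abline(fit, col = "brown")
```

The slope is negative, which matches what the scatter plot shows: heavier cars tend to have lower mileage.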
Now that we have learnt how to create a scatter plot, let us learn how to create a scatterplot matrix, which shows a scatter plot for every pair of several variables. The function used to create a scatterplot matrix is pairs(). The syntax for the function pairs() is as given below:
pairs( formula, data )
where, formula is used to represent the variables that are to be used in the pairs() function and data is the data set to be taken as input. The example below shows how to create a scatterplot matrix:
# Naming the scatterplot matrix
png ( file = "matrix.png" )
# Plotting the matrix with 4 variables.
pairs ( ~wt+mpg+cyl+hp , data = mtcars , main = "Mtcars Matrix Scatterplot" )
# Saving the png file in the directory
dev.off()
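A scatterplot matrix is often read together with the correlations between the same variables. As a hedged companion to the plot above, the sketch below computes them with cor(). The names vars and cor_matrix and the rounding to two decimals are our own choices for illustration.

```r
# The same four variables that were passed to pairs()
vars <- c("wt", "mpg", "cyl", "hp")

# Pairwise correlations, rounded for readability
cor_matrix <- round(cor(mtcars[, vars]), 2)
print(cor_matrix)
```

Each cell of the matrix corresponds to one panel of the scatterplot matrix. Values close to 1 or -1 indicate panels where the points lie close to a straight line.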
Boxplots in R
A boxplot is an important data representation method that displays the distribution of data in a standardized manner. Boxplots provide a visual summary of the data based on five values: the minimum, the first quartile, the median, the third quartile and the maximum. The three quartiles divide the data into four parts. Boxplots can also be used to compare data sets by drawing a box for each one of them. Boxplots are created using the boxplot() function. The syntax for the function is given below:
boxplot ( x , main , data , notch , varwidth , names )
The parameters used in the above syntax are defined as follows:
- x: It represents a vector or a formula, such as mpg ~ gear, that describes the data to be plotted.
- main: It represents the title being provided to the boxplots.
- data: It represents the data frame that contains the variables used in the formula.
- notch: It takes a logical value. If the value is TRUE then a notch is drawn on each side of the box, else the notch is not drawn.
- varwidth: It takes a logical value. If the value is TRUE, the width of each box is drawn proportional to the square root of the sample size of its group.
- names: These are the group labels which are to be printed under the specified box plots.
Let us now create a basic boxplot without the notch using the mtcars dataset as we have done in the scatterplots. Follow the example below for the creation of a box plot:
# Naming our plotted chart
png ( file = "mtcars_box.png" )
# Plotting the chart as per specifications
boxplot ( mpg ~ gear , data = mtcars , xlab = "Gear of cars", ylab = "Mileage of the cars", main = "Mileage by number of gears" )
# Saving the file finally
dev.off()
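The numbers behind each box can be inspected by calling boxplot() with plot = FALSE. The sketch below does this for the same formula. The names b and five are our own choices for illustration.

```r
# plot = FALSE returns the numbers behind each box without drawing it
b <- boxplot(mpg ~ gear, data = mtcars, plot = FALSE)

# Rows: lower whisker, lower hinge (about Q1), median, upper hinge (about Q3), upper whisker
five <- b$stats
colnames(five) <- b$names
print(five)

# Number of cars in each gear group
print(b$n)
```

Each column of the printed matrix describes one box, in the same left-to-right order as in the plot.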
Now, let us plot the boxplot with the notch included.
# Naming our plotted chart
png(file = "mtcars_box2.png")
# Plotting the chart as per specification
boxplot(mpg ~ gear , main = "Mileage by number of gears" , data = mtcars ,
xlab = "Gear of cars", ylab = "Mileage of the cars" , notch = TRUE, varwidth = TRUE,
col = c("red","blue","green"), names = c("High","Medium","Low"))
# Saving the file finally
dev.off()
This is the final box plot that we get after applying the notch and coloring the boxes of the box plot.
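When the notched plot is drawn, R may warn that some notches went outside the hinges. The sketch below shows where this comes from, using the rule documented for boxplot.stats(): the notch extends 1.58 * IQR / sqrt(n) on either side of the median. The names s, hinges and notch are our own choices for illustration.

```r
# Mileage of the cars with 5 gears, the smallest group (5 cars)
x <- mtcars$mpg[mtcars$gear == 5]

s <- boxplot.stats(x)
hinges <- s$stats[c(2, 4)]   # lower and upper edge of the box

# Notch limits: median -/+ 1.58 * IQR / sqrt(n), with the IQR taken between the hinges
notch <- s$stats[3] + c(-1.58, 1.58) * diff(hinges) / sqrt(s$n)

print(hinges)
print(notch)    # same values as s$conf
```

With only five cars in this group the notch is wider than the box itself, which is why the box looks folded back on itself. With larger groups the notch stays inside the box.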