This post covers the ABCs of R, with some information about how to use it. More will be covered in the following sections. If you are not able to understand a concept, don't worry: we will see every concept in action while discussing stats and predictive analytics.
The R language is a platform for mathematical and statistical computations.
• R is free.
• It's open source.

Reasons to use R over other languages: R is an extremely powerful and flexible tool for data analysis, and it contains extensive capabilities for statistical modeling. The Comprehensive R Archive Network (CRAN) web site contains the source code for the program, as well as compiled versions that are ready to use: http://cran.r-project.org/

I will explain basic concepts and syntax for R in this post.

1. Getting Started and Getting Help

CRAN contains pre-compiled versions of R for Microsoft Windows, Apple OS X, and several versions of Linux. For Windows and OS X, the program comes with a graphical user interface (GUI). When installing compiled versions of R for these two operating systems, an icon for R is installed on the computer. To start an interactive session, launch the program using the icon. Alternatively, R can be started at the command line by typing R. Once the program is started, the q function (for quit) ends the session.

> # Comments occur after '#' symbols and are not executed
> # You can use this command to quit
> q()

You will be prompted for options for saving your current work. Note that the language is case-sensitive: Q cannot be used to quit the session.

To get help on a specific topic, such as a function, put a question mark before the function name and press enter:

> # Get help on the sd function
> ?sd

This opens the sd help page. One common challenge with R is finding an appropriate function.
To search within all the local R functions on your computer, apropos will match a keyword against the available functions:

> apropos("prop")
[1] "apropos"            "getProperties"
[3] "pairwise.prop.test" "power.prop.test"
[5] "prop.table"         "prop.test"
[7] "prop.trend.test"    "reconcilePropertiesAndPrototype"

Alternatively, the RSiteSearch function conducts an online search of all functions, manuals, contributed documentation, the R-Help newsgroup, and other sources for a keyword. For example, to search for different methods to produce ROC curves,

> RSiteSearch("roc")

will open a web browser and show the matches. The restrict argument of this function controls which of these sources are searched (see ?RSiteSearch for more details).

2. Packages

Base R is the set of core language features that comes with the software (e.g., the executable program, the fundamental programming framework). Most of the actual R code is contained in separate modules called packages. When R is installed, a small set of core packages is also installed. However, a large number of packages exist outside of this set. The CRAN web site contains over 6000 packages for download. To load a package, the library function is used:

> # Load the random forests package
> library(randomForest)
> # Show the list of currently loaded packages and other information
> sessionInfo()

The function install.packages can be used to install additional modules. For example, to install the rpart package for classification and regression trees, the code

> install.packages("rpart")

can be used. Alternatively, the CRAN web site includes “task views” which group similar packages together. For example, installing the task view for “Machine Learning” installs a set of predictive modeling packages:

> # First install the task view package
> install.packages("ctv")
> # Load the library prior to first use
> library(ctv)
> install.views("MachineLearning")

Some packages depend on other packages (or specific versions).
The functions install.packages and install.views will determine additional package requirements and install the necessary dependencies.

3. Creating Objects

Anything created in R is an object. Objects can be assigned values using “<-”. For example:

> count <- 97
> name <- "Richa"
> # Equals also works, but I will explain it later.

To see the value of an object, simply type its name and hit enter. You can also explicitly tell R to print the value of the object with the print function. Another helpful function for understanding the contents of an object is str (for structure). As an example, R automatically comes with an object that contains the abbreviated month names.

> month.abb
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
[12] "Dec"
> str(month.abb)
 chr [1:12] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" ...

This shows that month.abb is a character object with twelve elements. We can also determine the structure of objects that do not contain data, such as the print function:

> str(print)
function (x, ...)
> str(sessionInfo)
function (package = NULL)

This is useful for seeing the names of the function arguments.

4. Data Types and Basic Structures

There are many different core data types in R. The relevant types here are numeric, character, factor, and logical. Logical data can take on a value of TRUE or FALSE. For example, these values can be used to make comparisons or can be assigned to an object:

> if(3 > 2) print("greater") else print("less")
[1] "greater"
> isGreater <- 3 > 2
> isGreater
[1] TRUE
> is.logical(isGreater)
[1] TRUE

Numeric data encompass integers and double precision (i.e., decimal valued) numbers.
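The difference between the two numeric types can be sketched as follows. The object names n and d are illustrative only; the L suffix is base R syntax for requesting an integer.

```r
# The L suffix asks for an integer; a plain number literal is stored as a double
n <- 3L
d <- 3
typeof(n)
typeof(d)
# Both still count as numeric
is.numeric(n)
is.numeric(d)
# Arithmetic that mixes the two is carried out in double precision
typeof(n + d)
```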
To assign a single numeric value to an R object:

> x <- 3.6
> is.numeric(x)
[1] TRUE
> is.integer(x)
[1] FALSE
> is.double(x)
[1] TRUE
> typeof(x)
[1] "double"

Character strings can be created by putting text inside of quotes:

> y <- "your ad here"
> typeof(y)
[1] "character"
> z <- "you can also 'quote' text too"
> z
[1] "you can also 'quote' text too"

Note that R does not restrict the length of character strings. There are several helpful functions that work on strings. First, nchar counts the number of characters:

> nchar(y)
[1] 12
> nchar(z)
[1] 29

The grep function can be used to determine if a substring exists in the character string:

> grep("ad", y)
[1] 1
> grep("my", y)
integer(0)
> # If the string is present, return the whole value
> grep("too", z, value = TRUE)
[1] "you can also 'quote' text too"

So far, the R objects shown have a single value or element. The most basic data structure for holding multiple values of the same type of data is a vector. The most basic method of creating a vector is to use the c function (for combine). To create a vector of numeric data:

> weights <- c(90, 150, 111, 123)
> is.vector(weights)
[1] TRUE
> typeof(weights)
[1] "double"
> length(weights)
[1] 4
> weights + .25
[1]  90.25 150.25 111.25 123.25

Note that the last command is an example of a vector operation. Vector operations are more concise and more efficient than looping over the elements of the vector. Many functions work on vectors:

> mean(weights)
[1] 118.5
> colors <- c("green", "red", "blue", "red", "white")
> grep("red", colors)
[1] 2 4
> nchar(colors)
[1] 5 3 4 3 5

An alternate method for storing character data in a vector is to use factors. Factors store character data by first determining all unique values in the data, called the factor levels.
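As a rough sketch of where the levels come from, the distinct values of the color vector can be compared with the levels that as.factor assigns by default. The vector is redefined here so the snippet runs on its own.

```r
# Same color vector as in the text, redefined so this snippet is self-contained
colors <- c("green", "red", "blue", "red", "white")
# The distinct values, sorted, are what as.factor() uses as its levels by default
sort(unique(colors))
levels(as.factor(colors))
```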
The character data is then stored as integers that correspond to the factor levels:

> colors2 <- as.factor(colors)
> colors2
[1] green red   blue  red   white
Levels: blue green red white
> levels(colors2)
[1] "blue"  "green" "red"   "white"
> as.numeric(colors2)
[1] 2 3 1 3 4

There are a few advantages to storing data in factors. First, less memory is required to store the values since potentially long character strings are saved only once (in the levels) and their occurrences are saved as integers. Second, the factor vector “remembers” all of the possible values. Suppose we subset the factor vector by removing the first value using a negative integer value:

> colors2[-1]
[1] red   blue  red   white
Levels: blue green red white

Even though the element with a value of “green” was removed, the factor still keeps the same levels. Factors are the primary means of storing discrete variables in R, and many classification models use them to specify the outcome data.

To work with a subset of a vector, single brackets can be used in different ways:

> weights
[1]  90 150 111 123
> # positive integers indicate which elements to keep
> weights[c(1, 4)]
[1]  90 123
> # negative integers correspond to elements to exclude
> weights[-c(1, 4)]
[1] 150 111
> # A vector of logical values can be used also but there should
> # be as many logical values as elements
> weights[c(TRUE, TRUE, FALSE, TRUE)]
[1]  90 150 123

Vectors must store the same type of data. An alternative is a list; this is a type of vector that can store objects of any type as elements:

> both <- list(colors = colors2, weight = weights)
> is.vector(both)
[1] TRUE
> is.list(both)
[1] TRUE
> length(both)
[1] 2
> names(both)
[1] "colors" "weight"

Lists can be filtered in a similar manner as vectors.
However, double brackets return only the element, while single brackets return another list:

> both[[1]]
[1] green red   blue  red   white
Levels: blue green red white
> is.list(both[[1]])
[1] FALSE
> both[1]
$colors
[1] green red   blue  red   white
Levels: blue green red white

> is.list(both[1])
[1] TRUE
> # We can also subset using the name of the list
> both[["colors"]]
[1] green red   blue  red   white
Levels: blue green red white

Missing values in R are encoded as NA values:

> probabilities <- c(.05, .67, NA, .32, .90)
> is.na(probabilities)
[1] FALSE FALSE  TRUE FALSE FALSE
> # NA is not treated as a character string
> probabilities == "NA"
[1] FALSE FALSE    NA FALSE FALSE
> # Most functions propagate missing values...
> mean(probabilities)
[1] NA
> # ... unless told otherwise
> mean(probabilities, na.rm = TRUE)
[1] 0.485

5. Working with Rectangular Data Sets

Rectangular data sets usually refer to situations where samples are in rows of a data set while columns correspond to variables (in some domains, this convention is reversed). There are two main structures for rectangular data: matrices and data frames. The main difference between these two types of objects is the type of data that can be stored within them. A matrix can only contain data of a single type (e.g., character or numeric), while in a data frame each column must hold a single data type but different columns may hold different types. Matrices are more computationally efficient but are obviously limited. We can create a matrix using the matrix function.
Here, we create a numeric vector of integers from one to twelve and use three rows and four columns:

> mat <- matrix(1:12, nrow = 3)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

The rows and columns can be given names:

> rownames(mat) <- c("row 1", "row 2", "row 3")
> colnames(mat) <- c("col1", "col2", "col3", "col4")
> mat
      col1 col2 col3 col4
row 1    1    4    7   10
row 2    2    5    8   11
row 3    3    6    9   12
> rownames(mat)
[1] "row 1" "row 2" "row 3"

Matrices can be subset using methods similar to those for vectors, but rows and columns can be subset separately:

> mat[1, 2:3]
col2 col3
   4    7
> mat["row 1", "col3"]
[1] 7
> mat[1,]
col1 col2 col3 col4
   1    4    7   10

One difficulty with subsetting matrices is that dimensions can be dropped; if either a single row or column is produced by subsetting a matrix, then a vector is the result:

> is.matrix(mat[1,])
[1] FALSE
> is.vector(mat[1,])
[1] TRUE

One method for avoiding this is to pass the drop option to the matrix when subsetting:

> mat[1,]
col1 col2 col3 col4
   1    4    7   10
> mat[1,,drop = FALSE]
      col1 col2 col3 col4
row 1    1    4    7   10
> is.matrix(mat[1,,drop = FALSE])
[1] TRUE
> is.vector(mat[1,,drop = FALSE])
[1] FALSE

Data frames can be created using the data.frame function:

> df <- data.frame(colors = colors2,
+                  time = 1:5)
> df
  colors time
1  green    1
2    red    2
3   blue    3
4    red    4
5  white    5
> dim(df)
[1] 5 2
> colnames(df)
[1] "colors" "time"
> rownames(df)
[1] "1" "2" "3" "4" "5"

In addition to the subsetting techniques previously shown for matrices, the $ operator can be used to return single columns while the subset function can be used to return more complicated subsets of rows:

> df$colors
[1] green red   blue  red   white
Levels: blue green red white
> subset(df, colors %in% c("red", "green") & time <= 2)
  colors time
1  green    1
2    red    2

A helpful function for determining if there are any missing values in a row of a matrix or data frame is the complete.cases function, which returns TRUE for each row that has no missing values:

> df2 <- df
> # Add missing values to the data frame
> df2[1, 1] <- NA
> df2[5, 2] <- NA
> df2
  colors time
1   <NA>    1
2    red    2
3   blue    3
4    red    4
5  white   NA
> complete.cases(df2)
[1] FALSE  TRUE  TRUE  TRUE FALSE

6. Objects and Classes

Each object has at least one type or class associated with it. The class of an object declares what it is (e.g., a character string, linear model, web site URL). The class defines the structure of an object (i.e., how it is stored) and the possible operations associated with this type of object (called methods for the class). For example, if some sort of model object is created, it may be of interest to:
• Print the model details for understanding
• Plot the model for visualization, or
• Predict new samples

In this case, print, plot, and predict are some of the possible methods for that particular type of model (as determined by its class). This paradigm is called object-oriented programming.

We can quickly determine the class of the objects created earlier:

> count
[1] 97
> class(count)
[1] "numeric"
> name
[1] "Richa"
> class(name)
[1] "character"

When the user directs R to perform some operation, such as creating predictions from a model object, the class determines the specific code for the prediction equation. This is called method dispatch. There are two main techniques for object-oriented programming in R: S3 classes and S4 classes. The S3 approach is simpler than S4 and is used by many packages. S4 methods are more powerful than S3 methods but are too complex to adequately describe in this overview. With S3 methods, the naming convention is to use dots to separate classes and methods. For example, summary.lm is the function that is used to compute summary values for an object that has the lm class (this class is used for fitted linear models, such as linear regression analysis). Suppose a user created an object called myModel using the lm function. The command

modelSummary <- summary(myModel)

calculates the common descriptive statistics for the model.
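A minimal sketch of this, assuming the built-in cars data set stands in for the user's data (the data set and formula are illustrative and not part of the original example):

```r
# Fit a linear model on the built-in 'cars' data set (an assumption for illustration)
myModel <- lm(dist ~ speed, data = cars)
class(myModel)

# The generic summary() picks its method from the class of its argument
modelSummary <- summary(myModel)
class(modelSummary)

# Calling the lm method directly yields the same coefficient table
identical(coef(modelSummary), coef(summary.lm(myModel)))
```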
R sees that myModel has class lm, so it executes the code in the function summary.lm.

For this text, it is important to understand the concept of objects, classes, and methods. However, these concepts will be used at a high level; the code contained in this series rarely delves into the technical minutiae “under the hood.” For example, the predict function will be used extensively, but the user will not be required to know which specific method is executed.

7. R Functions

In R, modular pieces of code can be collected in functions. Many functions have already been used in this section, such as the library function that loads packages. Functions have arguments: specific slots that are used to pass objects into the function. In R, arguments are named (unlike other languages, such as MATLAB). For example, the function for reading data stored in comma delimited format (CSV) into an R object has these arguments:

> str(read.csv)
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)

where file is a character string that points to the CSV file and header indicates whether the initial row corresponds to variable names. The file argument has no default value and the function will result in an error if no file name is specified. Since these arguments are named, the function can be called in several different ways:

> read.csv("data.csv")
> read.csv(header = FALSE, file = "data.csv")

Notice that the read.csv function has an argument at the end with three dots. This means that other arguments can be added to the read.csv function call that are passed to a specific function within the code for read.csv. In this case, the code uses another function called read.table that is more general. The read.table function has an argument called na.strings that is absent from read.csv. This argument tells R which character values indicate a missing value in the file.
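A small sketch of this pass-through, using made-up CSV content supplied through the text argument (itself forwarded to read.table) rather than a real file. The names csvText and clean are illustrative.

```r
# Hypothetical CSV content in which "?" marks a missing value
csvText <- "id,score\n1,10\n2,?\n3,7"

# na.strings is not one of read.csv's own arguments;
# it travels through the three dots to read.table
clean <- read.csv(text = csvText, na.strings = "?")
clean$score
is.na(clean$score)
```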
Using

> read.csv("data.csv", na.strings = "?")

has the effect of passing the argument na.strings = "?" from the read.csv function to the read.table function. Note that this argument must be named if it is to be passed through. The three dots are used extensively in the computing sections of this series.

8. The Equal Sign

So far, the = symbol has been used in several different contexts:
1. Creating objects, such as x = 3
2. Testing for equivalence: x == 4
3. Specifying values to function arguments: read.csv(header = FALSE)

This can be confusing for newcomers. For example:

> new = subset(old, subset = value == "blue", drop = FALSE)

uses the symbol four times across all three cases. One method for avoiding confusion is to use <- as the assignment operator.

This was just an introduction to how you can use R. Learning a language takes practice: the more you practice, the better you become at it.

Subscribe to get the next topic (EDA with R) delivered directly to your mailbox!
Author: Ashish Kumar