Tarea


RPy2

RPy2 is a Python module for interacting with R from Python. It exposes R functions, packages, and other objects to Python and lets you reference them directly. For packages loaded with `importr`, dots (`.`) in R function names are automatically converted to underscores (`_`). Additionally, data conversion can be enabled for various types, first and foremost NumPy arrays.
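
The name translation can be pictured with a few lines of plain Python. This is only a sketch of the idea, not RPy2's actual implementation: `r_name_to_python` is a hypothetical helper, and the sample names are R functions that show up later in this worksheet.

```python
def r_name_to_python(r_name):
    """Illustrative only: map an R identifier to a Python-friendly name."""
    # Dots are legal inside R identifiers but mean attribute access in Python.
    return r_name.replace(".", "_")

for r_name in ["dev.off", "upper.tri", "weekdays.Date"]:
    print("%-14s -> %s" % (r_name, r_name_to_python(r_name)))
```

Names without dots pass through unchanged, which is why most R functions keep their usual spelling on the Python side.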

%auto
%default_mode python #pure python mode
import numpy as np
import rpy2
import rpy2.robjects as robjects
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()

rpy2.__version__

'2.8.5'

Referencing R functions

RPy2's `robjects` module (sometimes imported as `ro`) exposes the running R instance as `.r`. It is easy to look up R functions and call them from Python:

c = robjects.r['c']
summary = robjects.r['summary']

v1 = c(5,4.4,1,-1.8)
sumv1 = summary(v1)
print repr(sumv1)

R object with classes: ('summaryDefault', 'table') mapped to:
<FloatVector - Python:0x7f506223f4d0 / R:0x634af40>
[-1.800000, 0.300000, 2.700000, 2.150000, 4.550000, 5.000000]

print sumv1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -1.80    0.30    2.70    2.15    4.55    5.00 

sumv1[3]

2.15
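
The six numbers returned by `summary` can be cross-checked without R. The sketch below assumes R's default quantile definition (type 7, linear interpolation between order statistics). `quantile7` and `six_number_summary` are hypothetical helper names introduced only for this illustration.

```python
import math

def quantile7(values, p):
    """Quantile by linear interpolation between order statistics
    (assumed here to match R's default, type 7)."""
    data = sorted(values)
    h = (len(data) - 1) * p              # fractional 0-based position
    lo = int(math.floor(h))
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (h - lo) * (data[hi] - data[lo])

def six_number_summary(values):
    """Min, 1st Qu., Median, Mean, 3rd Qu., Max, in the order R prints them."""
    return [min(values),
            quantile7(values, 0.25),
            quantile7(values, 0.5),
            sum(values) / float(len(values)),
            quantile7(values, 0.75),
            max(values)]

v1_py = [5, 4.4, 1, -1.8]
print([round(s, 2) for s in six_number_summary(v1_py)])
```

The entry at 0-based position 3 is the mean, which is why `sumv1[3]` returns 2.15.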

Evaluating R code directly

robjects.reval("""\
zz <- 1:10
print(paste("sd(zz) = ", sd(zz)))
""")

[1] "sd(zz) =  3.02765035409749"
<rpy2.rinterface.StrSexpVector - Python:0x7f506e5eac00 / R:0x5a8ceb8>

myfunc = robjects.r("""\
function(x) {
   a <- x^2 + rnorm(1)
   k <- 2 * a + 1
   return(k)
}""")

myfunc(2.5)

R object with classes: ('numeric',) mapped to:
<FloatVector - Python:0x7f506223fb48 / R:0x56db568>
[12.440484]

Vectorization

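Automatic conversion from NumPy arrays to R arrays was enabled at the top of the worksheet by `pandas2ri.activate()`. With it, R functions accept NumPy arrays directly, and even the custom function `myfunc` works out of the box.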

xx = np.array([5,4,2.2,-1,-5.5])
print "Data Type:   ", type(xx)
print "Element Type:", xx.dtype
print "Array Shape: ", xx.shape

Data Type:    <type 'numpy.ndarray'>
Element Type: float64
Array Shape:  (5,)

summary(xx)

R object with classes: ('summaryDefault', 'table') mapped to:
<FloatVector - Python:0x7f506223fd40 / R:0x5bdd258>
[-5.500000, -1.000000, 2.200000, 0.940000, 4.000000, 5.000000]

myfunc(xx)

R object with classes: ('array',) mapped to:
<Array - Python:0x7f506223fc20 / R:0x5b0df58>
[52.274453, 34.274453, 11.954453, 4.274453, 62.774453]
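
In the output above, every element exceeds 2x² + 1 by the same amount (about 1.274453). The reason is that `rnorm(1)` draws a single random number, which R then recycles across the whole vector. The plain-Python analogue below is only a sketch of that behaviour: `myfunc_py` is a hypothetical name, and `random.gauss` stands in for `rnorm`.

```python
import random

def myfunc_py(xs, rng=random):
    """Plain-Python analogue of the R function above (illustrative only)."""
    noise = rng.gauss(0.0, 1.0)          # drawn once, like rnorm(1)
    return [2 * (x ** 2 + noise) + 1 for x in xs]

xx_py = [5, 4, 2.2, -1, -5.5]
result = myfunc_py(xx_py)

# Subtracting the deterministic part leaves 2 * noise for every element.
offsets = [k - (2 * x ** 2 + 1) for k, x in zip(result, xx_py)]
print([round(o, 6) for o in offsets])
```

All printed offsets are identical within one run and change from run to run. To get independent noise per element, the R function would need `rnorm(length(x))` instead.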

Types of Vectors

R's `[ ]` and `[[ ]]` indexing operators are available in Python as `.rx` and `.rx2`.

# Python style: (10 exclusive)
# v1 = robjects.IntVector(range(1,10))
# R style: (10 inclusive)
v1 = robjects.r.seq(1,10)
print v1

 [1]  1  2  3  4  5  6  7  8  9 10

# Python style, 0-based indexing of vectors
print v1[0]
v1[0] = -99
print v1

1
 [1] -99   2   3   4   5   6   7   8   9  10

# R style, 1-based indexing
v1.rx[3] = 99
print v1.rx(3)

[1] 99

print v1

 [1] -99   2  99   4   5   6   7   8   9  10
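
The difference between the two indexing styles is a shift by one. A minimal sketch, using a hypothetical `OneBasedView` class that only mimics positive R-style indices (R gives negative indices a different meaning, exclusion, which is not modelled here):

```python
class OneBasedView(object):
    """Hypothetical helper: 1-based access to a 0-based Python list."""

    def __init__(self, items):
        self.items = items

    def rx(self, i):
        if i < 1:
            raise IndexError("this sketch only handles positive 1-based indices")
        return self.items[i - 1]

values = list(range(1, 11))   # 1..10, like seq(1, 10) in R
view = OneBasedView(values)

print(values[0])                 # Python style: first element
print(view.rx(1))                # R style: the same first element
print(view.rx(3) == values[2])   # third element either way
```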

l1 = robjects.r("list(aa = c(1,2,3), bb = c(-5,5), cc = 'help')")
print l1

$aa
[1] 1 2 3

$bb
[1] -5  5

$cc
[1] "help"

# R's [[1]]
print l1.rx2(1)

[1] 1 2 3

# indexing into the element [[1]]
print l1.rx2(1).rx(2)

[1] 2

# versus
print l1.rx2(1)[1]

2.0

# Constructing the same from Python is harder, since we need an ordered dictionary
import rpy2.rlike.container as rlc
l2 = robjects.ListVector(
         rlc.OrdDict((
              ('aa', robjects.IntVector([1,2,3])),
              ('bb', robjects.IntVector([-5,5])),
              ('cc', "help"))))
print l2

$aa
[1] 1 2 3

$bb
[1] -5  5

$cc
[1] "help"

# assigning a new string vector to "bb"
l1.rx2["bb"] = robjects.StrVector("this is a short sentence".split())
print(l1[l1.names.index("bb")])

[1] "this"     "is"       "a"        "short"    "sentence"

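# Matrix
# Note: the left-aligned entries in the output below suggest that the Python range
# was converted to an R list, so `m` is a list-matrix; passing
# robjects.IntVector(range(10)) instead should give a plain integer matrix.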

m = robjects.r.matrix(range(10), nrow=5)
print(m)

     [,1] [,2]
[1,] 0    5   
[2,] 1    6   
[3,] 2    7   
[4,] 3    8   
[5,] 4    9   

type(m)

<class 'rpy2.robjects.vectors.Matrix'>

m.rx2(4,2)

R object with classes: ('integer',) mapped to:
<IntVector - Python:0x7f506224acf8 / R:0x60528c8>
[       8]
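
R's `matrix()` fills column by column by default, which is why 0 to 4 end up in the first column. The layout and the 1-based lookup `[4, 2]` can be reproduced in plain Python. `column_major` is a hypothetical helper written for this sketch.

```python
def column_major(values, nrow):
    """Arrange a flat sequence into rows, filling column by column
    as R's matrix() does by default."""
    values = list(values)
    ncol = len(values) // nrow
    return [[values[col * nrow + row] for col in range(ncol)]
            for row in range(nrow)]

m_py = column_major(range(10), nrow=5)
for row in m_py:
    print(row)

# R's 1-based [4, 2] corresponds to Python's 0-based [3][1].
print(m_py[3][1])
```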

# R-operators work, too
print(m.ro > 5)

      [,1]  [,2]
[1,] FALSE FALSE
[2,] FALSE  TRUE
[3,] FALSE  TRUE
[4,] FALSE  TRUE
[5,] FALSE  TRUE

print m.rx((m.ro > 3).ro & (m.ro <= 6))

[[1]]
[1] 4

[[2]]
[1] 5

[[3]]
[1] 6
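
The selection above combines two element-wise comparisons into a boolean mask and keeps the entries where the mask is true. The same logic in plain Python, as a sketch over the flattened matrix entries (variable names are illustrative):

```python
flat = list(range(10))   # the matrix entries in R's column-major storage order

# Element-wise masks, combined position by position like R's `&`.
greater_than_3 = [v > 3 for v in flat]
at_most_6 = [v <= 6 for v in flat]
mask = [a and b for a, b in zip(greater_than_3, at_most_6)]

selected = [v for v, keep in zip(flat, mask) if keep]
print(selected)
```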

sv = robjects.StrVector('xyyyxyzyzyxx')
fac = robjects.FactorVector(sv)
print(fac)

 [1] x y y y x y z y z y x x
Levels: x y z

print(summary(fac))

x y z 
4 6 2 
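
Passing a plain string to `StrVector` yields one element per character, as the twelve entries above show. The levels are the distinct values in sorted order, and `summary` counts the occurrences of each level. A plain-Python sketch of the same bookkeeping:

```python
from collections import Counter

letters = list('xyyyxyzyzyxx')   # a string iterates character by character
levels = sorted(set(letters))    # distinct values in sorted order, like factor levels
counts = Counter(letters)

print(levels)
print([counts[level] for level in levels])
```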

Packages

The idea is to get a reference to an R package. The reference behaves like a Python module namespace, populated with all of the package's members.

from rpy2.robjects.packages import importr
r_base = importr("base")

# a bit of the namespace
print(dir(r_base)[-50:-40])

['upper_tri', 'url', 'utf8ToInt', 'vapply', 'vector', 'version', 'warning', 'warnings', 'weekdays', 'weekdays_Date']

print(r_base.version)

               _                                          
platform       x86_64-pc-linux-gnu                        
arch           x86_64                                     
os             linux-gnu                                  
system         x86_64, linux-gnu                          
status         Revised                                    
major          3                                          
minor          2.4                                        
year           2016                                       
month          03                                         
day            16                                         
svn rev        70336                                      
language       R                                          
version.string R version 3.2.4 Revised (2016-03-16 r70336)
nickname       Very Secure Dishes                         

# use Python's `getattr` to access members whose names are not valid Python identifiers,
# e.g. matrix multiplication
A = np.array([[1, 1],
              [1, 7]])
B = np.array([[4, 5],
              [6, 7]])
matrix_mult = getattr(r_base, "%*%")
print(matrix_mult(A, B))

     [,1] [,2]
[1,]   10   12
[2,]   46   54
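
The result of `%*%` can be verified by hand with the textbook row-by-column rule. The sketch below uses nested lists; `matmul` is a hypothetical helper, not part of RPy2 or NumPy.

```python
def matmul(a, b):
    """Textbook row-by-column product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A_py = [[1, 1],
        [1, 7]]
B_py = [[4, 5],
        [6, 7]]
print(matmul(A_py, B_py))
```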

r_base.rep(r_base.c("x", "y", "z"), 10)

R object with classes: ('character',) mapped to:
<StrVector - Python:0x7f5061ddff80 / R:0x65295e0>
[str, str, str, ..., str, str, str]

from rpy2.robjects.packages import importr

# datasets
datasets = importr('datasets')
# Note: `__rdata__` should be a plain `data` attribute, but that does not work in this development version.
faithful = datasets.__rdata__.fetch("faithful")["faithful"]
print type(faithful)

<class 'rpy2.robjects.vectors.DataFrame'>

# len() of a data frame is the number of columns, not rows
len(faithful)

2

# S3 datatypes in R for each column
[column.rclass[0] for column in faithful]

['numeric', 'numeric']

# extract some rows
print(faithful.rx(robjects.IntVector([2,3,4,10]), True))

   eruptions waiting
2      1.800      54
3      3.333      74
4      2.283      62
10     4.350      85

# extract part of a column
print(faithful.rx2("eruptions")[:10])

 [1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600 1.950 4.350

Example: `lm`

data = robjects.DataFrame({
       'y' : np.array([4, 5, 5.5, 7, 7.6, 8, 11, 12, 13]),
       'x' : np.array([1, 2,   3, 4, 4.4, 5, 7,  8, 8.5])
       })

lmod = robjects.r.lm("y ~ x", data = data)

print lmod.names

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

coeffs = lmod.rx2("coefficients")
print "R's representation via 'print'"
print(coeffs)
print
print "Same coefficients in Python's floats:"
print ([x for x in coeffs])

R's representation via 'print'
(Intercept)           x 
   2.328485    1.215469 


Same coefficients in Python's floats:
[2.3284853249475894, 1.2154692791485244]

# Python's built-in max iterates naturally over all residuals
print max(lmod.rx2("residuals"))

0.456045395904
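
For a single predictor, the coefficients reported by `lm` follow from the closed-form least-squares formulas: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below recomputes them in plain Python from the same data. `simple_ols` is a hypothetical helper, not how R fits the model internally.

```python
def simple_ols(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 4.4, 5, 7, 8, 8.5]
ys = [4, 5, 5.5, 7, 7.6, 8, 11, 12, 13]
intercept, slope = simple_ols(xs, ys)

# Residual = observed value minus fitted value.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(round(intercept, 6), round(slope, 6), round(max(residuals), 6))
```

The largest residual belongs to the first observation (x = 1), where the fitted line lies furthest below the data.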

Plot

grdevices = importr('grDevices')

# just calling "plot" on the dataframe

_ = grdevices.png(file="rpy2_plot.png", width=640, height=320)
_ = robjects.r.plot(data)
grdevices.dev_off()

R object with classes: ('integer',) mapped to:
<IntVector - Python:0x7f5061dfb830 / R:0x75524b8>
[       1]

salvus.file("rpy2_plot.png")

# Plot of the linear model lmod

_ = grdevices.png(file="rpy2_plot_2.png", width=640, height=520)
_ = robjects.reval("par(mfrow=c(2,2))")
_ = robjects.r.plot(lmod)
grdevices.dev_off()

R object with classes: ('integer',) mapped to:
<IntVector - Python:0x7f5061ec8878 / R:0x78dad28>
[       1]

salvus.file("rpy2_plot_2.png")

# get R's "print" via globalenv; writing robjects.r.print would be a syntax error in Python 2, where print is a keyword
rprint = robjects.globalenv.get("print")
volcano = datasets.__rdata__.fetch("volcano")["volcano"]
lattice = importr("lattice")

_ = grdevices.png(file="rpy2_plot_wireframe.png", width=480, height=480)
p = lattice.wireframe(volcano,
                      shade = True,
                      zlab = "",
                      aspect = robjects.FloatVector((61.0/87, 0.5)),
                      light_source = robjects.IntVector((10,0,10)))
_ = rprint(p)
grdevices.dev_off()

R object with classes: ('integer',) mapped to:
<IntVector - Python:0x7f5061ebcd40 / R:0x83f4d78>
[       1]

salvus.file("rpy2_plot_wireframe.png")

Advanced: PCA

USArrests = datasets.__rdata__.fetch("USArrests")["USArrests"]
r_stats = importr("stats")
pca_usarrest = r_stats.princomp(USArrests, cor=True)
print(summary(pca_usarrest))

Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4
Standard deviation     1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion  0.6200604 0.8675017 0.9566425 1.00000000
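
The last two rows of the table follow from the first: each component's variance is its squared standard deviation, and the proportions divide by the total variance. The sketch below redoes that arithmetic with the standard deviations copied from the printed table, so it agrees with R only up to the displayed precision.

```python
# Standard deviations copied from the printed summary above.
std_devs = [1.5748783, 0.9948694, 0.5971291, 0.41644938]

variances = [s ** 2 for s in std_devs]
total = sum(variances)   # close to 4: cor=True standardises the four variables

proportions = [v / total for v in variances]

# Running sum of the proportions gives the cumulative row.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

print([round(p, 7) for p in proportions])
print([round(c, 7) for c in cumulative])
```

The first two components together account for roughly 87% of the variance, which is why a two-dimensional biplot is a reasonable summary of this data set.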

grdevices = importr('grDevices')
_ = grdevices.png(file="rpy2_plot_pca.png", width=480, height=480)
_ = robjects.r.biplot(pca_usarrest)
_ = grdevices.dev_off()
salvus.file("rpy2_plot_pca.png")

#low level

print(robjects.r.help("sum"))

R Help on ‘sum’

sum                    package:base                    R Documentation

Sum of Vector Elements

Description:

     ‘sum’ returns the sum of all the values present in its arguments.

Usage:

     sum(..., na.rm = FALSE)
     
Arguments:

     ...: numeric or complex or logical vectors.

   na.rm: logical.  Should missing values (including ‘NaN’) be removed?

Details:

     This is a generic function: methods can be defined for it directly
     or via the ‘Summary’ group generic.  For this to work properly,
     the arguments ‘...’ should be unnamed, and dispatch is on the
     first argument.

     If ‘na.rm’ is ‘FALSE’ an ‘NA’ or ‘NaN’ value in any of the
     arguments will cause a value of ‘NA’ or ‘NaN’ to be returned,
     otherwise ‘NA’ and ‘NaN’ values are ignored.

     Logical true values are regarded as one, false values as zero.
     For historical reasons, ‘NULL’ is accepted and treated as if it
     were ‘integer(0)’.

     Loss of accuracy can occur when summing values of different signs:
     this can even occur for sufficiently long integer inputs if the
     partial sums would cause integer overflow.  Where possible
     extended-precision accumulators are used, but this is
     platform-dependent.

Value:

     The sum. If all of ‘...’ are of type integer or logical, then the
     sum is integer, and in that case the result will be ‘NA’ (with a
     warning) if integer overflow occurs.  Otherwise it is a length-one
     numeric or complex vector.

     *NB:* the sum of an empty set is zero, by definition.

S4 methods:

     This is part of the S4 ‘Summary’ group generic.  Methods for it
     must use the signature ‘x, ..., na.rm’.

     ‘plotmath’ for the use of ‘sum’ in plot annotation.

References:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_.  Wadsworth & Brooks/Cole.

See Also:

     ‘colSums’ for row and column sums.

Examples:

     ## Pass a vector to sum, and it will add the elements together.
     sum(1:5)
     
     ## Pass several numbers to sum, and it also adds the elements.
     sum(1, 2, 3, 4, 5)
     
     ## In fact, you can pass vectors into several arguments, and everything gets added.
     sum(1:2, 3:5)
     
     ## If there are missing values, the sum is unknown, i.e., also missing, ....
     sum(1:5, NA)
     ## ... unless  we exclude missing values explicitly:
     sum(1:5, NA, na.rm = TRUE)
     

# via RPy2 wrappers
help_sum = robjects.help.Package("base").fetch("sum")

print(help_sum.title())

Sum of Vector Elements

print(help_sum.description())

  \code{sum} returns the sum of all the values
  present in its arguments.

print(help_sum.usage())

sum(\dots, na.rm = FALSE)

for arg, descr in help_sum.arguments():
    print("%-10s: %s" % (arg, descr))

...       : numeric or complex or logical vectors.
na.rm     : logical.  Should missing values (including \code{NaN}) be
    removed?

print(help_sum.seealso())

  \code{\link{colSums}} for row and column sums.

print(help_sum.value())

  The sum. If all of \code{\dots} are of type integer or logical, then
  the sum is integer, and in that case the result will be \code{NA} (with a
  warning) if integer overflow occurs.  Otherwise it is a length-one
  numeric or complex vector.

  \strong{NB:} the sum of an empty set is zero, by definition.

help_sum.sections.keys()

('title', 'name', 'alias', 'keyword', 'description', 'usage', 'arguments', 'details', 'value', 'section', 'references', 'seealso', 'examples')

print(''.join(help_sum.to_docstring(("title", "usage", 'details', "references", "section"))))

title
-----

Sum of Vector Elements 

usage
-----


 sum( , na.rm = FALSE)
 

details
-------


   This is a generic function: methods can be defined for it
   directly or via the  Summary  group generic.
   For this to work properly, the arguments   should be
   unnamed, and dispatch is on the first argument.
 
   If  na.rm  is  FALSE  an  NA  or  NaN  value in
   any of the arguments will cause a value of  NA  or  NaN  to
   be returned, otherwise  NA  and  NaN  values are ignored.
 
   Logical true values are regarded as one, false values as zero.
   For historical reasons,  NULL  is accepted and treated as if it
   were  integer(0) .
 
   Loss of accuracy can occur when summing values of different signs:
   this can even occur for sufficiently long integer inputs if the
   partial sums would cause integer overflow.  Where possible
   extended-precision accumulators are used, but this is
   platform-dependent.
 

references
----------


   Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)
    The New S Language .
   Wadsworth & Brooks/Cole.
 

section
-------

S4 methods 
   This is part of the S4  Summary 
   group generic.  Methods for it must use the signature
    x,  , na.rm .
 
    plotmath  for the use of  sum  in plot annotation.