
Github repo cloud-examples: https://github.com/sagemath/cloud-examples

License: MIT

RPy2

RPy2 is a Python module for interacting with R from Python. It exposes R functions, packages, and other objects in Python and lets you reference them. Dots (.) in the names of R functions are automatically converted to underscores. Additionally, data conversions for various types can be enabled, first and foremost for NumPy arrays.
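The naming rule can be pictured with a few lines of plain Python. This is only a sketch of the mapping, not RPy2's actual implementation; the helper name r_name_to_python is made up for illustration. The sample names are R identifiers whose converted forms (upper_tri, weekdays_Date, dev_off) appear later in this worksheet.

```python
# Illustration only: mimics the naming rule described above.
# r_name_to_python is a hypothetical helper, not part of RPy2.
def r_name_to_python(r_name):
    """Map an R identifier to the attribute name used on the Python side."""
    return r_name.replace(".", "_")

for r_name in ("upper.tri", "weekdays.Date", "dev.off"):
    print("%-14s -> %s" % (r_name, r_name_to_python(r_name)))
```

Names that are not valid Python identifiers at all (such as R operators) need a different access path, for example getattr.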

%auto
%default_mode python
# pure Python mode
import numpy as np
import rpy2
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
rpy2.__version__
'2.3.8'

Referencing R functions

RPy2’s robjects module (sometimes imported under the shorter name ro) exposes the R instance as .r. This makes it easy to get hold of R functions and reference them from Python:

c = robjects.r['c']
summary = robjects.r['summary']
v1 = c(5,4.4,1,-1.8)
sumv1 = summary(v1)
print sumv1.__repr__()
<FloatVector - Python:0x92155f0 / R:0xcc41fa0>
[-1.800000, 0.300000, 2.700000, 2.150000, 4.550000, 5.000000]
print sumv1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -1.80    0.30    2.70    2.15    4.55    5.00
sumv1[3]
2.15
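To make the six numbers returned by summary less opaque, here is a minimal pure-Python sketch that recomputes them for the same input. It assumes R's default quantile rule (linear interpolation between order statistics, known as "type 7"); the function name r_style_summary is made up for this illustration, and the code needs neither R nor RPy2.

```python
# Sketch only: recomputes min, quartiles, median, mean and max in plain Python.
# r_style_summary is a hypothetical helper, not an RPy2 or R function.
def r_style_summary(values):
    """Return (min, 1st quartile, median, mean, 3rd quartile, max)."""
    data = sorted(float(v) for v in values)
    n = len(data)

    def quantile(p):
        # Linear interpolation between neighbouring order statistics.
        h = (n - 1) * p
        lo = int(h)
        hi = min(lo + 1, n - 1)
        return data[lo] + (h - lo) * (data[hi] - data[lo])

    mean = sum(data) / n
    return (data[0], quantile(0.25), quantile(0.5), mean, quantile(0.75), data[-1])

result = r_style_summary([5, 4.4, 1, -1.8])
print([round(v, 2) for v in result])
```

Rounded to two decimals, the values agree with the R output above; index 3 is the mean, which is why sumv1[3] returns 2.15.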

Evaluating R code directly

robjects.reval("""\
zz <- 1:10
print(paste("sd(zz) = ", sd(zz)))
""")
[1] "sd(zz) = 3.02765035409749"
<rpy2.rinterface.SexpVector - Python:0x68177f8 / R:0xc1150e8>
myfunc = robjects.r("""\
function(x) {
    a <- x^2 + rnorm(1)
    k <- 2 * a + 1
    return(k)
}""")
myfunc(2.5)
<FloatVector - Python:0xcc7f3b0 / R:0xc0c8778>
[15.198918]

Vectorization

First, enable automatic conversion from NumPy arrays to R arrays. Then, even the custom function works out of the box.

from rpy2.robjects.numpy2ri import numpy2ri
robjects.conversion.py2ri = numpy2ri
xx = np.array([5,4,2.2,-1,-5.5])
print "Data Type: ", type(xx)
print "Element Type:", xx.dtype
print "Array Shape: ", xx.shape
Data Type: <type 'numpy.ndarray'>
Element Type: float64
Array Shape: (5,)
summary(xx)
<FloatVector - Python:0xcc9b9e0 / R:0xbd2cae0>
[-5.500000, -1.000000, 2.200000, 0.940000, 4.000000, 5.000000]
myfunc(xx)
<Array - Python:0xcc9b758 / R:0xb7fb220>
[50.887895, 32.887895, 10.567895, 2.887895, 61.387895]
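One detail of the vectorized call is easy to miss: rnorm(1) draws a single random number, so every element of the result shares the same noise term. The following plain-Python sketch checks this against the printed output by inverting k = 2 * (x^2 + noise) + 1 for each element. The input and output values are copied from the cells above; nothing here calls R.

```python
# Values copied from the worksheet output above.
xx = [5, 4, 2.2, -1, -5.5]
kk = [50.887895, 32.887895, 10.567895, 2.887895, 61.387895]

# Invert k = 2 * (x^2 + noise) + 1 to recover the noise used for each element.
noise = [(k - 1) / 2.0 - x ** 2 for x, k in zip(xx, kk)]
spread = max(noise) - min(noise)

print("recovered noise terms: %s" % [round(e, 6) for e in noise])
print("spread between them:   %.2e" % spread)
```

The recovered terms coincide up to the printing precision, which is consistent with one draw being recycled across the whole vector. If independent noise per element were wanted, the R function would presumably need something like rnorm(length(x)) instead.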

Types of Vectors

R's [ ] and [[ ]] operators are available as rx and rx2.

# Python style: (10 exclusive)
# v1 = robjects.IntVector(range(1,10))
# R style: (10 inclusive)
v1 = robjects.r.seq(1,10)
print v1
[1] 1 2 3 4 5 6 7 8 9 10
# Python style, 0-based indexing of vectors
print v1[0]
v1[0] = -99
print v1
1
[1] -99 2 3 4 5 6 7 8 9 10
# R style, 1-based indexing
v1.rx[3] = 99
print v1.rx(3)
[1] 99
print v1
[1] -99 2 99 4 5 6 7 8 9 10
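The two indexing conventions differ only by an offset of one. As a rough illustration, the same sequence of edits can be replayed on an ordinary Python list; rx_get and rx_set are made-up helper names that mimic 1-based access, not RPy2 functions.

```python
# Hypothetical helpers that mimic R-style 1-based access on a Python list.
def rx_get(seq, i):
    """Read element i, counting from 1 as R does."""
    return seq[i - 1]

def rx_set(seq, i, value):
    """Write element i, counting from 1 as R does."""
    seq[i - 1] = value

v1 = list(range(1, 11))   # 1..10, like seq(1, 10) in R
v1[0] = -99               # Python style: first element has index 0
rx_set(v1, 3, 99)         # R style: third element has index 3
print(rx_get(v1, 3))
print(v1)
```

The final list matches the R vector printed above: the first and third elements were replaced.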
l1 = robjects.r("list(aa = c(1,2,3), bb = c(-5,5), cc = 'help')")
print l1
$aa
[1] 1 2 3
$bb
[1] -5 5
$cc
[1] "help"
# R's [[1]]
print l1.rx2(1)
[1] 1 2 3
# indexing into the element [[1]]
print l1.rx2(1).rx(2)
[1] 2
# versus
print l1.rx2(1)[1]
2.0
# Constructing the same from Python is harder, since we need an ordered dictionary
import rpy2.rlike.container as rlc
l2 = robjects.ListVector(
    rlc.OrdDict((
        ('aa', robjects.IntVector([1,2,3])),
        ('bb', robjects.IntVector([-5,5])),
        ('cc', "help"))))
print l2
$aa
[1] 1 2 3
$bb
[1] -5 5
$cc
[1] "help"
# assigning a new string vector to "bb"
l1.rx2["bb"] = robjects.StrVector("this is a short sentence".split())
print(l1[l1.names.index("bb")])
[1] "this" "is" "a" "short" "sentence"
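The reason an ordered container is needed is that an R list is addressed both by name and by position, so the order of the names matters. A small standard-library sketch of that idea, using collections.OrderedDict as a stand-in for the R list (this is an analogy, not how RPy2 stores the data):

```python
from collections import OrderedDict

# Stand-in for the R list: names keep their insertion order.
l2 = OrderedDict([
    ("aa", [1, 2, 3]),
    ("bb", [-5, 5]),
    ("cc", "help"),
])

names = list(l2.keys())
pos = names.index("bb")            # 0-based position of the entry named "bb"

# Replace the entry by name, then read it back by position.
l2["bb"] = "this is a short sentence".split()
print(names)
print(pos)
print(list(l2.values())[pos])
```

Replacing a value under an existing key does not change its position, which mirrors looking up l1.names.index("bb") after the assignment above.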
# Matrix
m = robjects.r.matrix(range(10), nrow=5)
print(m)
     [,1] [,2]
[1,]    0    5
[2,]    1    6
[3,]    2    7
[4,]    3    8
[5,]    4    9
type(m)
<class 'rpy2.robjects.vectors.Matrix'>
m.rx2(4,2)
<IntVector - Python:0xccaa560 / R:0xbfdf248>
[ 8]
# R-operators work, too
print(m.ro > 5)
      [,1]  [,2]
[1,] FALSE FALSE
[2,] FALSE  TRUE
[3,] FALSE  TRUE
[4,] FALSE  TRUE
[5,] FALSE  TRUE
print m.rx((m.ro > 3).ro & (m.ro <= 6))
[[1]]
[1] 4
[[2]]
[1] 5
[[3]]
[1] 6
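The matrix output follows from R filling matrices column by column. The sketch below rebuilds the same 5x2 layout with nested Python lists and repeats the element lookup and the two comparisons; column_major is a made-up helper name, and the code is an analogy rather than a description of RPy2 internals.

```python
# Hypothetical helper: arrange a flat sequence into rows, filling column by column.
def column_major(values, nrow):
    values = list(values)
    ncol = len(values) // nrow
    return [[values[col * nrow + row] for col in range(ncol)]
            for row in range(nrow)]

m = column_major(range(10), nrow=5)
for row in m:
    print(row)

# Row 4, column 2 in R's 1-based terms.
element = m[4 - 1][2 - 1]
print(element)

# Element-wise comparison, like m.ro > 5.
greater_than_5 = [[v > 5 for v in row] for row in m]

# Values with 3 < v <= 6, visited in column-major order as R does.
selected = [m[r][c] for c in range(2) for r in range(5) if 3 < m[r][c] <= 6]
print(selected)
```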

Bonus: Factors

sv = robjects.StrVector('xyyyxyzyzyxx')
fac = robjects.FactorVector(sv)
print(fac)
[1] x y y y x y z y z y x x
Levels: x y z
print(summary(fac))
x y z
4 6 2
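Conceptually, a factor is a set of sorted distinct levels plus one integer code per observation, and its summary is a count per level. A standard-library sketch of that idea for the same input string (an analogy only; the variable names are chosen for this illustration):

```python
from collections import Counter

values = list("xyyyxyzyzyxx")

# Levels: the sorted distinct values.
levels = sorted(set(values))

# Codes: 1-based position of each value within the levels, as in R.
codes = [levels.index(v) + 1 for v in values]

# Summary: number of observations per level.
counts = Counter(values)

print(levels)
print(codes)
print([counts[level] for level in levels])
```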

Packages

The idea is to get hold of a reference to a package. The reference is like a module-namespace and populated with all the members.

from rpy2.robjects.packages import importr
r_base = importr("base")
# a bit of the namespace
print(dir(r_base)[-50:-40])
['upper_tri', 'url', 'utf8ToInt', 'vapply', 'vector', 'version', 'warning', 'warnings', 'weekdays', 'weekdays_Date']
print(r_base.version)
               _
platform       x86_64-unknown-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          2
minor          15.2
year           2012
month          10
day            26
svn rev        61015
language       R
version.string R version 2.15.2 (2012-10-26)
nickname       Trick or Treat
# use Python's `getattr` to access identifiers with non-standard names,
# e.g. matrix multiplication
A = np.array([[1, 1], [1, 7]])
B = np.array([[4, 5], [6, 7]])
matrix_mult = getattr(r_base, "%*%")
print(matrix_mult(A, B))
     [,1] [,2]
[1,]   10   12
[2,]   46   54
r_base.rep(r_base.c("x", "y", "z"), 10)
<StrVector - Python:0xcd00ea8 / R:0xce81440>
['x', 'y', 'z', ..., 'x', 'y', 'z']
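The getattr trick works because Python attribute names are just strings in an object's namespace; only the dotted syntax requires a valid identifier. A self-contained sketch with a made-up Namespace class and a plain-Python matrix product shows the mechanism and reproduces the product of the matrices A and B used above. None of this is RPy2 code.

```python
# Hypothetical stand-in for a package namespace.
class Namespace(object):
    pass

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

ns = Namespace()
# "%*%" is not a valid Python identifier, so ns.%*% would be a syntax error,
# but setattr/getattr accept any string as the attribute name.
setattr(ns, "%*%", matmul)
matrix_mult = getattr(ns, "%*%")

A = [[1, 1], [1, 7]]
B = [[4, 5], [6, 7]]
product = matrix_mult(A, B)
for row in product:
    print(row)
```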

Datasets

Use importr from rpy2.robjects.packages to get hold of the datasets package. Then fetch a dataset and retrieve the named entry to get hold of the data frame.

from rpy2.robjects.packages import importr
# datasets
datasets = importr('datasets')
# Note: the __rdata__ should be a plain "data", but doesn't work in this development version.
faithful = datasets.__rdata__.fetch("faithful")["faithful"]
print type(faithful)
<class 'rpy2.robjects.vectors.DataFrame'>
# number of columns!
len(faithful)
2
# S3 datatypes in R for each column
[column.rclass[0] for column in faithful]
['numeric', 'numeric']
# extract some rows
print(faithful.rx(robjects.IntVector([2,3,4,10]), True))
   eruptions waiting
2      1.800      54
3      3.333      74
4      2.283      62
10     4.350      85
# extract part of a column
print(faithful.rx2("eruptions")[:10])
[1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600 1.950 4.350
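That len returns the number of columns, not rows, makes sense once a data frame is viewed as an ordered collection of equally long columns. A rough standard-library analogy, filled only with the four rows printed above (the helper name get_row is made up, and this is not how RPy2 represents a DataFrame internally):

```python
from collections import OrderedDict

# The four rows shown above, stored column by column.
row_names = [2, 3, 4, 10]
frame = OrderedDict([
    ("eruptions", [1.800, 3.333, 2.283, 4.350]),
    ("waiting", [54, 74, 62, 85]),
])

# Hypothetical helper: look up one row by its row name.
def get_row(frame, row_names, label):
    i = row_names.index(label)
    return dict((name, column[i]) for name, column in frame.items())

print(len(frame))                      # counts columns, not rows
print(list(frame.keys()))
print(get_row(frame, row_names, 10))
```

Iterating over such a structure yields columns, which is also why the list comprehension over faithful above produces one class name per column.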

Example: lm

data = robjects.DataFrame({
    'y' : np.array([4, 5, 5.5, 7, 7.6, 8, 11, 12, 13]),
    'x' : np.array([1, 2, 3, 4, 4.4, 5, 7, 8, 8.5])
})
lmod = robjects.r.lm("y ~ x", data = data)
print lmod.names
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
coeffs = lmod.rx2("coefficients")
print "R's representation via 'print'"
print(coeffs)
print
print "Same coefficients in Python's floats:"
print ([x for x in coeffs])
R's representation via 'print'
(Intercept)           x
   2.328485    1.215469

Same coefficients in Python's floats:
[2.328485324947587, 1.2154692791485247]
# max is from Python, iterates naturally over the entries in all residuals
print max(lmod.rx2("residuals"))
0.456045395904
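For a single predictor, the fit that lm performs reduces to the textbook least-squares formulas, so the coefficients and the largest residual can be cross-checked without R. The sketch below uses the same x and y values as the data frame above; it is a plain-Python verification under the assumption of ordinary least squares with an intercept, not a replacement for lm.

```python
# Same data as in the DataFrame above.
y = [4, 5, 5.5, 7, 7.6, 8, 11, 12, 13]
x = [1, 2, 3, 4, 4.4, 5, 7, 8, 8.5]

n = float(len(x))
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form ordinary least squares for y = intercept + slope * x.
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals: observed minus fitted values.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print("intercept:    %.6f" % intercept)
print("slope:        %.6f" % slope)
print("max residual: %.6f" % max(residuals))
```

The printed values should agree with the coefficients and the maximum residual reported by R above, up to floating-point rounding.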

Plot

from rpy2.robjects.packages import importr
grdevices = importr('grDevices')
# just calling "plot" on the dataframe
_ = grdevices.png(file="rpy2_plot.png", width=640, height=320)
_ = robjects.r.plot(data)
grdevices.dev_off()
<IntVector - Python:0xb6beab8 / R:0xab10878>
[ 2]
salvus.file("rpy2_plot.png")
# Plot of the linear model lmod
_ = grdevices.png(file="rpy2_plot_2.png", width=640, height=520)
_ = robjects.reval("par(mfrow=c(2,2))")
_ = robjects.r.plot(lmod)
grdevices.dev_off()
<IntVector - Python:0xb6da1b8 / R:0xb073838>
[ 2]
salvus.file("rpy2_plot_2.png")
# get R's "print" via globalenv, otherwise it's a syntax error in Python!
rprint = robjects.globalenv.get("print")
volcano = datasets.__rdata__.fetch("volcano")["volcano"]
lattice = importr("lattice")
_ = grdevices.png(file="rpy2_plot_wireframe.png", width=480, height=480)
p = lattice.wireframe(volcano, shade = True, zlab = "",
                      aspect = robjects.FloatVector((61.0/87, 0.5)),
                      light_source = robjects.IntVector((10,0,10)))
_ = rprint(p)
grdevices.dev_off()
<IntVector - Python:0xc559d40 / R:0xc7bcc38>
[ 2]
salvus.file("rpy2_plot_wireframe.png")

Advanced: PCA

USArrests = datasets.__rdata__.fetch("USArrests")["USArrests"]
r_stats = importr("stats")
pca_usarrest = r_stats.princomp(USArrests, cor=True)
print(summary(pca_usarrest))
Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4
Standard deviation     1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion  0.6200604 0.8675017 0.9566425 1.00000000
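The three rows of this summary are tied together by simple arithmetic: each component's variance is its squared standard deviation, the proportion is that variance divided by the total, and the cumulative row is a running sum. The sketch below recomputes the last two rows from the first, using the standard deviations printed above. Because the PCA was run on the correlation matrix (cor=True) of four variables, the total variance should come out close to 4.

```python
# Standard deviations copied from the summary output above.
sdev = [1.5748783, 0.9948694, 0.5971291, 0.41644938]

variances = [s ** 2 for s in sdev]
total = sum(variances)

# Share of the total variance explained by each component.
proportion = [v / total for v in variances]

# Running sum of the shares.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

print("total variance: %.4f" % total)
print("proportion:     %s" % [round(p, 7) for p in proportion])
print("cumulative:     %s" % [round(c, 7) for c in cumulative])
```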
grdevices = importr('grDevices')
_ = grdevices.png(file="rpy2_plot_pca.png", width=480, height=480)
_ = robjects.r.biplot(pca_usarrest)
_ = grdevices.dev_off()
salvus.file("rpy2_plot_pca.png")
# low level
print(robjects.r.help("sum"))
R Help on ‘sum’
sum    package:base    R Documentation
Sum of Vector Elements
Description:
     ‘sum’ returns the sum of all the values present in its arguments.
Usage:
     sum(..., na.rm = FALSE)
Arguments:
     ...: numeric or complex or logical vectors.
     na.rm: logical. Should missing values (including ‘NaN’) be removed?
Details:
     This is a generic function: methods can be defined for it directly or via the ‘Summary’ group generic. For this to work properly, the arguments ‘...’ should be unnamed, and dispatch is on the first argument.
     If ‘na.rm’ is ‘FALSE’ an ‘NA’ or ‘NaN’ value in any of the arguments will cause a value of ‘NA’ or ‘NaN’ to be returned, otherwise ‘NA’ and ‘NaN’ values are ignored.
     Logical true values are regarded as one, false values as zero. For historical reasons, ‘NULL’ is accepted and treated as if it were ‘integer(0)’.
Value:
     The sum. If all of ‘...’ are of type integer or logical, then the sum is integer, and in that case the result will be ‘NA’ (with a warning) if integer overflow occurs. Otherwise it is a length-one numeric or complex vector.
     *NB:* the sum of an empty set is zero, by definition.
S4 methods:
     This is part of the S4 ‘Summary’ group generic. Methods for it must use the signature ‘x, ..., na.rm’.
     ‘plotmath’ for the use of ‘sum’ in plot annotation.
References:
     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S Language_. Wadsworth & Brooks/Cole.
See Also:
     ‘colSums’ for row and column sums.
# via RPy2 wrappers
help_sum = robjects.help.Package("base").fetch("sum")
print(help_sum.title())
Sum of Vector Elements
print(help_sum.description())
\code{sum} returns the sum of all the values present in its arguments.
print(help_sum.usage())
for arg, descr in help_sum.arguments():
    print("%-10s: %s" % (arg, descr))
...       : numeric or complex or logical vectors.
na.rm     : logical. Should missing values (including \code{NaN}) be removed?
print(help_sum.seealso())
\code{\link{colSums}} for row and column sums.
print(help_sum.value())
The sum. If all of \code{\dots} are of type integer or logical, then the sum is integer, and in that case the result will be \code{NA} (with a warning) if integer overflow occurs. Otherwise it is a length-one numeric or complex vector. \strong{NB:} the sum of an empty set is zero, by definition.
help_sum.sections.keys()
('title', 'name', 'alias', 'keyword', 'description', 'usage', 'arguments', 'details', 'value', 'section', 'references', 'seealso')
print(''.join(help_sum.to_docstring(("title", "usage", 'details', "references", "section"))))
title
-----
Sum of Vector Elements
usage
-----
sum( , na.rm = FALSE)
details
-------
This is a generic function: methods can be defined for it directly or via the Summary group generic. For this to work properly, the arguments should be unnamed, and dispatch is on the first argument.
If na.rm is FALSE an NA or NaN value in any of the arguments will cause a value of NA or NaN to be returned, otherwise NA and NaN values are ignored.
Logical true values are regarded as one, false values as zero. For historical reasons, NULL is accepted and treated as if it were integer(0) .
references
----------
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language . Wadsworth & Brooks/Cole.
section
-------
S4 methods
This is part of the S4 Summary group generic. Methods for it must use the signature x, , na.rm .
plotmath for the use of sum in plot annotation.