Kernel: Apache Spark (Anaconda Python 3)
Apache Spark on Anaconda Python 3
© 2016 - Harald Schilly <[email protected]> - CC BY-SA 4.0
Configuration
Anaconda Python 3 + scientific Python stack
Apache Spark 1.6.1
~/.zshrc
:
$SPARK_HOME/conf/spark-env.sh
:
Start using: $ pyspark
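The two elided config snippets above typically just point Spark at the Anaconda interpreter. A minimal sketch, assuming a standard install layout (all paths here are assumptions, not the author's actual setup):

```shell
# ~/.zshrc — make Spark and Anaconda visible (paths are assumptions)
export SPARK_HOME="$HOME/spark-1.6.1"
export PATH="$HOME/anaconda3/bin:$SPARK_HOME/bin:$PATH"

# $SPARK_HOME/conf/spark-env.sh — tell PySpark which Python to use
export PYSPARK_PYTHON="$HOME/anaconda3/bin/python3"
export PYSPARK_DRIVER_PYTHON=ipython
```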
In [63]:
3.5.1 |Anaconda 2.4.1 (64-bit)| (default, Dec 7 2015, 11:16:01)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Spark Context
In [64]:
(<pyspark.context.SparkContext at 0x7f90492d71d0>,
<pyspark.sql.context.SQLContext at 0x7f90262efe80>)
In [65]:
'Apache Spark Version 1.6.1'
Values as parallelized vectors
In [66]:
ParallelCollectionRDD[208] at parallelize at PythonRDD.scala:423
In [68]:
[3, 5, 2, 3, 1, 5, 2, 3, 2, 3, 1]
In [69]:
11
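The three cells above create an RDD from a Python list, collect it back to the driver, and count its elements. The inputs were lost in export; a sketch of what they likely contained, with the Spark calls in comments and the equivalent results computed locally:

```python
values = [3, 5, 2, 3, 1, 5, 2, 3, 2, 3, 1]

# In Spark:
#   rdd = sc.parallelize(values)   # -> ParallelCollectionRDD
#   rdd.collect()                  # -> the list, back on the driver
#   rdd.count()                    # -> 11

# Equivalent results locally:
collected = list(values)
print(collected)        # [3, 5, 2, 3, 1, 5, 2, 3, 2, 3, 1]
print(len(collected))   # 11
```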
Plot a function
by mapping a range of values with a function
In [6]:
PythonRDD[2] at RDD at PythonRDD.scala:43
In [7]:
In [8]:
(1) PythonRDD[3] at RDD at PythonRDD.scala:43 []
| ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:423 []
In [9]:
In [10]:
[<matplotlib.lines.Line2D at 0x7f9029f775c0>]
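The lineage printout above (`PythonRDD` on top of a `ParallelCollectionRDD`) shows a parallelized range mapped through a function and then plotted. The exact function is unknown; a sketch with an assumed `f(x) = sin(x)`:

```python
import math

# In Spark (sketch; the actual function f was not preserved):
#   xs = sc.parallelize(np.linspace(0, 10, 100))
#   ys = xs.map(f)                       # -> PythonRDD over ParallelCollectionRDD
#   plt.plot(xs.collect(), ys.collect())

# The same map over a range, locally:
xs = [i * 10.0 / 99 for i in range(100)]
ys = [math.sin(x) for x in xs]
print(len(ys))   # 100
```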
Broadcasting
Distribute the value to all workers in the cluster.
In [11]:
In [12]:
In [13]:
(20) UnionRDD[23] at union at NativeMethodAccessorImpl.java:-2 []
| UnionRDD[14] at union at NativeMethodAccessorImpl.java:-2 []
| PythonRDD[12] at RDD at PythonRDD.scala:43 []
| EmptyRDD[4] at emptyRDD at NativeMethodAccessorImpl.java:-2 []
| PythonRDD[13] at RDD at PythonRDD.scala:43 []
| goethe-faust2.txt MapPartitionsRDD[6] at textFile at NativeMethodAccessorImpl.java:-2 []
| goethe-faust2.txt HadoopRDD[5] at textFile at NativeMethodAccessorImpl.java:-2 []
| PythonRDD[22] at RDD at PythonRDD.scala:43 []
| goethe-faust1.txt MapPartitionsRDD[16] at textFile at NativeMethodAccessorImpl.java:-2 []
| goethe-faust1.txt HadoopRDD[15] at textFile at NativeMethodAccessorImpl.java:-2 []
In [14]:
75971
In [15]:
82136
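The lineage above shows both Faust texts read with `textFile` and combined with `union`; a broadcast variable ships one read-only value to every worker so each task can use it without re-serializing it per task. A sketch of the idea (file names from the lineage; the broadcast content is an assumption):

```python
# In Spark:
#   stop = sc.broadcast({"und", "der", "die"})          # assumed example value
#   words = (sc.textFile("goethe-faust1.txt")
#              .union(sc.textFile("goethe-faust2.txt"))
#              .flatMap(lambda line: line.split()))
#   words.filter(lambda w: w not in stop.value).count()

# Locally, a "broadcast" is simply a shared read-only object:
stop = {"und", "der", "die"}
words = "die Nacht und der Tag".split()
kept = [w for w in words if w not in stop]
print(kept)   # ['Nacht', 'Tag']
```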
Accumulators
A global, commutative counter variable.
In [16]:
In [17]:
In [18]:
400674
In [19]:
574503
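An accumulator is a write-only variable that workers `add` to and only the driver reads; because addition is commutative, the order in which tasks contribute does not matter. A sketch of the mechanism (what exactly was accumulated above is an assumption):

```python
# In Spark:
#   acc = sc.accumulator(0)
#   words.foreach(lambda w: acc.add(len(w)))
#   acc.value

# The same idea locally:
class Accumulator:
    def __init__(self, value=0):
        self.value = value
    def add(self, x):
        self.value += x

acc = Accumulator(0)
for w in ["Habe", "nun", "ach"]:
    acc.add(len(w))
print(acc.value)   # 10
```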
Map/Reduce
Filter out all non-letter characters.
In [20]:
In [70]:
[('und', 1264),
('ich', 1085),
('der', 1063),
('die', 1041),
('zu', 910),
('nicht', 874),
('Und', 822),
('sich', 732),
('ist', 667),
('ein', 626),
('das', 623),
('sie', 554),
('in', 542),
('es', 542),
('den', 498),
('Die', 496),
('du', 494),
('mich', 463),
('MEPHISTOPHELES', 442),
('mir', 437),
('so', 403),
('Ich', 396),
('er', 395),
('dem', 395),
('mit', 393),
('Das', 385),
('Der', 385),
('auf', 360),
('wie', 350),
('FAUST', 349)]
In [22]:
442
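The frequency list above is the classic word count: split on non-letters, emit `(word, 1)` pairs, and sum per key. Note that case is preserved ('und' and 'Und' are counted separately, as in the output). The same pipeline locally, on a sample line (the exact regex is an assumption):

```python
import re
from collections import Counter

# In Spark:
#   counts = (text.flatMap(lambda line: re.split(r"[^A-Za-zÄÖÜäöüß]+", line))
#                 .filter(bool)
#                 .map(lambda w: (w, 1))
#                 .reduceByKey(lambda a, b: a + b))
#   counts.takeOrdered(30, key=lambda kv: -kv[1])

# Same logic locally:
line = "Und die Sonne und der Mond und die Sterne"
words = [w for w in re.split(r"[^A-Za-zÄÖÜäöüß]+", line) if w]
counts = Counter(words)
print(counts["und"], counts["Und"])   # 2 1
```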
Sanity check: sum of all word lengths
In [23]:
75971
Longest words
In [24]:
[(28, 'Untätigkeits-Entschuldigung:'),
(28, 'Fettbauch-Krummbein-Schelme.'),
(25, 'Kalenderei--Chymisterei--'),
(25, 'Einsiedlerisch-beschränkt'),
(25, 'Dreinamig-Dreigestaltete,'),
(23, 'Schneckeschnickeschnack'),
(22, 'allerliebst-geselliger'),
(22, 'Flügelflatterschlagen!'),
(22, 'Bürger-Nahrungs-Graus,'),
(21, 'heimlich-kätzchenhaft')]
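Pairs sort lexicographically, so mapping each word to `(len(word), word)` and taking the top entries yields the longest words first. The same trick locally, on a small sample:

```python
# In Spark:
#   words.map(lambda w: (len(w), w)).top(10)

# Locally:
words = ["Flügelflatterschlagen!", "Faust", "Schneckeschnickeschnack"]
longest = sorted(((len(w), w) for w in words), reverse=True)[:2]
print(longest)   # [(23, 'Schneckeschnickeschnack'), (22, 'Flügelflatterschlagen!')]
```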
Iterations
In [25]:
UnionRDD[23] at union at NativeMethodAccessorImpl.java:-2
In [26]:
x[ 0] = 500.00 → v = 5575 (d = 0.50)
x[ 1] = 666.67 → v = 2817 (d = 0.33)
x[ 2] = 500.00 → v = 3802 (d = 0.25)
x[ 3] = 600.00 → v = 2817 (d = 0.20)
x[ 4] = 500.00 → v = 3429 (d = 0.17)
x[ 5] = 571.43 → v = 2817 (d = 0.14)
x[ 6] = 500.00 → v = 3239 (d = 0.12)
x[ 7] = 555.56 → v = 2817 (d = 0.11)
x[ 8] = 500.00 → v = 3154 (d = 0.10)
x[ 9] = 545.45 → v = 2817 (d = 0.09)
x[10] = 500.00 → v = 3094 (d = 0.08)
x[11] = 538.46 → v = 2822 (d = 0.08)
x[12] = 500.00 → v = 3049 (d = 0.07)
x[13] = 533.33 → v = 2822 (d = 0.07)
x[14] = 500.00 → v = 3022 (d = 0.06)
x[15] = 529.41 → v = 2822 (d = 0.06)
x[16] = 500.00 → v = 3001 (d = 0.06)
x[17] = 526.32 → v = 2822 (d = 0.05)
x[18] = 552.63 → v = 2988 (d = 0.05)
x[19] = 526.32 → v = 3130 (d = 0.05)
PageRank
see spark-pagerank.ipynb
In [27]:
In [28]:
In [29]:
In [30]:
0 has rank: 1.4925426310288215.
1 has rank: 0.569011704714285.
2 has rank: 0.569011704714285.
3 has rank: 1.5663083638833293.
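The four ranks above come from the Spark PageRank example: every page splits its rank evenly among its outgoing links, and each page's new rank is `0.15 + 0.85 * (received contributions)`. The algorithm in plain Python, on an assumed four-node example graph (not necessarily the graph used above):

```python
# links: node -> outgoing neighbors (example graph, an assumption)
links = {0: [1, 2], 1: [0], 2: [0, 3], 3: [0]}
ranks = {u: 1.0 for u in links}

for _ in range(20):
    contribs = {u: 0.0 for u in links}
    for u, nbrs in links.items():
        share = ranks[u] / len(nbrs)   # split rank among outgoing links
        for v in nbrs:
            contribs[v] += share
    # damping, as in Spark's example
    ranks = {u: 0.15 + 0.85 * c for u, c in contribs.items()}

print(ranks)
```

Because every node here has at least one outgoing link, the total rank mass stays equal to the number of nodes across iterations.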
Histogram
In [31]:
In [32]:
['F', 'A', 'U', 'S', 'T', 'D', 'E', 'R', 'T', 'R', 'A', 'G', 'D', 'I', 'E', 'Z', 'W', 'E', 'I', 'T', 'E', 'R', 'T', 'E', 'I', 'L', 'V', 'O', 'N', 'J', 'O', 'H', 'A', 'N', 'N', 'W', 'O', 'L', 'F', 'G', 'A', 'N', 'G', 'V', 'O', 'N', 'G', 'O', 'E', 'T', 'H', 'E', 'A', 'N', 'M', 'U', 'T', 'I', 'G', 'E', 'G', 'E', 'G', 'E', 'N', 'D', 'H', 'O', 'C', 'H', 'G', 'E', 'W', 'L', 'B', 'T', 'E', 'S', 'E', 'N', 'G', 'E', 'S', 'G', 'O', 'T', 'I', 'S', 'C', 'H', 'E', 'S', 'Z', 'I', 'M', 'M', 'E', 'R', 'V', 'O']
In [33]:
defaultdict(int,
{'A': 18156,
'B': 6805,
'C': 14955,
'D': 17899,
'E': 58860,
'F': 6385,
'G': 11132,
'H': 24159,
'I': 30167,
'J': 712,
'K': 4438,
'L': 15773,
'M': 11007,
'N': 35768,
'O': 8644,
'P': 3231,
'Q': 134,
'R': 25718,
'S': 25371,
'T': 23848,
'U': 14859,
'V': 2613,
'W': 7172,
'X': 97,
'Y': 199,
'Z': 4197})
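The letter histogram above keeps only the plain uppercase letters A–Z; umlauts fall outside that range, which is why the character stream shows `GEWLBT` where the text has `GEWÖLBT`. A local sketch of the same counting (the Spark route was presumably `flatMap` to characters plus `countByValue` or an accumulated dict; that detail is an assumption):

```python
from collections import defaultdict

# In Spark (sketch):
#   text.flatMap(lambda line: line.upper()).filter(str.isalpha).countByValue()

# Locally:
text = "Faust. Der Tragödie zweiter Teil"
hist = defaultdict(int)
for ch in text.upper():
    if "A" <= ch <= "Z":   # umlauts like Ö fall outside A–Z, as in the output above
        hist[ch] += 1
print(hist["T"])   # 4
```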
In [34]:
[]
Linear Regression
Generate random data …
In [35]:
<matplotlib.collections.PathCollection at 0x7f90264a3278>
In [36]:
Data points and the corresponding feature vector (exogenous x → endogenous y)
In [37]:
In [38]:
(weights=[3.0204407579], intercept=0.0)
In [39]:
Mean Squared Error = 1.02845039572
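The fitted model above (weight ≈ 3.02, intercept fixed at 0.0) was presumably trained with MLlib's `LinearRegressionWithSGD` on `LabeledPoint` data. The underlying least-squares fit through the origin, in plain Python on freshly generated data (the true slope 3 and the noise model are assumptions inferred from the output above):

```python
import random

random.seed(42)
# synthetic data: y = 3x + Gaussian noise
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [3 * x + random.gauss(0, 1) for x in xs]

# closed-form regression through the origin (intercept = 0.0, as above)
w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
mse = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(w, mse)
```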
In [40]:
In [41]:
Isotonic Regression
In [42]:
In [43]:
In [44]:
In [45]:
Mean Squared Error = [ 0.64607526]
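MLlib's `IsotonicRegression` fits a non-decreasing step function; the standard algorithm is pool adjacent violators (PAV). The core idea in plain Python: whenever a new point breaks monotonicity, merge it with the previous block and replace both by their mean:

```python
def pav(ys):
    """Pool Adjacent Violators: non-decreasing fit minimizing squared error."""
    blocks = []                      # each block: [sum, count]
    for y in ys:
        blocks.append([y, 1])
        # merge while the last two block means violate monotonicity
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)      # expand each block back to its points
    return fit

print(pav([1, 3, 2, 4]))   # [1.0, 2.5, 2.5, 4.0]
```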
In [46]:
In [47]:
In [48]:
In [49]:
Electronegativity, Calculated Radius, First Ionization, Core Configuration, Heat of Vapor, Covalent Radius, Heat of Fusion, Bulk Modulus, Boiling Point, Brinell Hardness, Melting Point, Symbol, STP Density, Young Modulus, Shear Modulus, Vickers Hardness, Name, Common Ions, Second Ionization, Mass, Van der Waals Radius, Specific Heat, Thermal Cond., Third Ionization, Series, Electron Affinity, Atomic Number, Mohs Hardness, Empirical Radius
Converting the second through the last row into `Row` objects
In [52]:
Registering the schema
In [54]:
Printing the table
In [55]:
+----------+----------+------+------+------+
| mass| name|number|radius|symbol|
+----------+----------+------+------+------+
| 1.00794| Hydrogen| 1| 25.0| H|
| 4.002602| Helium| 2| null| He|
| 6.941| Lithium| 3| 145.0| Li|
| 9.012182| Beryllium| 4| 105.0| Be|
| 10.811| Boron| 5| 85.0| B|
| 2352.6| Carbon| 6| 70.0| C|
| 14.0067| Nitrogen| 7| 65.0| N|
| 15.9994| Oxygen| 8| 60.0| O|
|18.9984032| Fluorine| 9| 50.0| F|
| 20.1797| Neon| 10| null| Ne|
| 22.9898| Sodium| 11| 180.0| Na|
| 24.305| Magnesium| 12| 150.0| Mg|
| 26.9815| Aluminum| 13| 125.0| Al|
| 28.0855| Silicon| 14| 110.0| Si|
| 30.9738|Phosphorus| 15| 100.0| P|
| 32.065| Sulfur| 16| 100.0| S|
| 35.453| Chlorine| 17| 100.0| Cl|
| 39.948| Argon| 18| 71.0| Ar|
| 39.0983| Potassium| 19| 220.0| K|
| 40.078| Calcium| 20| 180.0| Ca|
+----------+----------+------+------+------+
only showing top 20 rows
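In Spark 1.6 the table above would be built by wrapping each parsed line in a `Row` and calling `sqlContext.createDataFrame(...)` followed by `df.show()`. A `Row` behaves much like a named tuple; a local sketch (the column subset is taken from the table above):

```python
from collections import namedtuple

# In Spark:
#   from pyspark.sql import Row
#   Element = Row("mass", "name", "number", "radius", "symbol")
#   df = sqlContext.createDataFrame(rows)
#   df.show()

# Locally, a Row is essentially a named tuple:
Element = namedtuple("Element", ["mass", "name", "number", "radius", "symbol"])
h = Element(1.00794, "Hydrogen", 1, 25.0, "H")
print(h.name, h.number)   # Hydrogen 1
```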
Registering Schema
In [56]:
Inferring types
In [57]:
root
|-- mass: double (nullable = true)
|-- name: string (nullable = true)
|-- number: long (nullable = true)
|-- radius: double (nullable = true)
|-- symbol: string (nullable = true)
Query DataFrame with SQL
In [58]:
[Row(number=6, name='Carbon', mass=2352.6),
Row(number=22, name='Titanium', mass=1309.8),
Row(number=24, name='Chromium', mass=1590.6),
Row(number=26, name='Iron', mass=1561.9),
Row(number=27, name='Cobalt', mass=1648.0),
Row(number=29, name='Copper', mass=1957.9),
Row(number=50, name='Tin', mass=1411.8),
Row(number=58, name='Cerium', mass=1050.0),
Row(number=79, name='Gold', mass=1980.0),
Row(number=80, name='Mercury', mass=1810.0),
Row(number=81, name='Thallium', mass=1971.0),
Row(number=82, name='Lead', mass=1450.5)]
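The rows above come back from an SQL query against the registered table; every mass shown exceeds 1000, so the predicate was presumably something like `mass > 1000` (an assumption). The same selection over plain records:

```python
# In Spark:
#   df.registerTempTable("elements")
#   sqlContext.sql("SELECT number, name, mass FROM elements "
#                  "WHERE mass > 1000").collect()

# The equivalent filter locally:
rows = [(1, "Hydrogen", 1.00794), (6, "Carbon", 2352.6), (79, "Gold", 1980.0)]
heavy = [r for r in rows if r[2] > 1000]
print(heavy)   # [(6, 'Carbon', 2352.6), (79, 'Gold', 1980.0)]
```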
In [59]:
[<matplotlib.lines.Line2D at 0x7f9026210668>]
Query DataFrame
In [60]:
+------+---------+------+----------+
|number| name|radius| mass|
+------+---------+------+----------+
| 1| Hydrogen| 25.0| 1.00794|
| 2| Helium| null| 4.002602|
| 3| Lithium| 145.0| 6.941|
| 4|Beryllium| 105.0| 9.012182|
| 5| Boron| 85.0| 10.811|
| 7| Nitrogen| 65.0| 14.0067|
| 8| Oxygen| 60.0| 15.9994|
| 9| Fluorine| 50.0|18.9984032|
+------+---------+------+----------+
In [61]:
+------+---------+------+----------+
|number| name|radius| mass|
+------+---------+------+----------+
| 1| Hydrogen| 25.0| 1.00794|
| 2| Helium| 0.0| 4.002602|
| 3| Lithium| 145.0| 6.941|
| 4|Beryllium| 105.0| 9.012182|
| 5| Boron| 85.0| 10.811|
| 7| Nitrogen| 65.0| 14.0067|
| 8| Oxygen| 60.0| 15.9994|
| 9| Fluorine| 50.0|18.9984032|
+------+---------+------+----------+
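The second table is the first with `null` radii replaced by `0.0`, presumably via `DataFrame.fillna` (an assumption). The equivalent over plain records:

```python
# In Spark:
#   df.select("number", "name", "radius", "mass").fillna(0.0).show()

# Locally, replace missing values with a default:
records = [{"name": "Hydrogen", "radius": 25.0},
           {"name": "Helium", "radius": None}]
filled = [{k: (0.0 if v is None else v) for k, v in r.items()}
          for r in records]
print(filled[1]["radius"])   # 0.0
```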
Future: Spark 2.0 has Datasets