Shared3DEuclideanSpace_1MSongsKMeansClustering.sagews
Description: This is a SageMath worksheet that accompanies the Apache Spark Databricks notebook in Scala for understanding K-Means clustering on a small sample of the 1M Songs data, used in the course on Scalable Data Engineering Science, freely available from https://lamastex.github.io/scalable-data-science/sds/2/x/
Author: Raazesh Sainudiin

This is a SageMath (Python) worksheet to get started with interactively visualizing points in 3D

This is a support notebook in SageMath (actually a worksheet) for the Scalable Data Science course. It is mostly used as a visual cognitive tool.

SageMath is perhaps the largest open-source effort to do mathematical computing and you can use it for serious mathematical computing:

See

For the plotting we do next, see the docs here: https://doc.sagemath.org/html/en/reference/plot3d/sage/plot/plot3d/shapes2.html

For 3D interactive visualization possibilities, see: http://sagemath.wikispaces.com/point3d http://sagemath.wikispaces.com/plot3d (see the 10-minute YouTube video in the link).

pts = point3d((4,3,2),size=20,color='red',opacity=.5)
show(pts)
# understand the elements of how (1,0,0)+(0,1,0)+(0,0,1)=(1,1,1)
point3d([(1,0,0),(0,1,0),(0,0,1),(1,1,1)],size=20,color='red',opacity=.5)
# you can also use list comprehensions as well as anonymous functions using Python lambda functions
pts2 = sum([point3d((i,i^2,i^3), size=5) for i in range(100)])
show(pts2)
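For readers following along in plain Python rather than Sage, here is a minimal sketch of the two ideas in the cells above: coordinate-wise vector addition and the list comprehension that generates the curve points. Sage's preparser turns `^` into exponentiation, whereas plain Python treats `^` as bitwise XOR, so `**` is used here. The helper name `add_vectors` is hypothetical and not part of Sage.

```python
def add_vectors(*vectors):
    """Add any number of equal-length tuples coordinate by coordinate."""
    return tuple(sum(coords) for coords in zip(*vectors))

# (1,0,0) + (0,1,0) + (0,0,1), the three unit vectors along the axes
basis_sum = add_vectors((1, 0, 0), (0, 1, 0), (0, 0, 1))
print(basis_sum)

# The same points as in the Sage cell, written with ** for plain Python
curve_points = [(i, i**2, i**3) for i in range(100)]
print(curve_points[:4])
```

In Sage itself, each tuple in `curve_points` could be passed to `point3d` exactly as in the cell above.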

Plotting the points from a csv file

See https://ask.sagemath.org/question/9393/how-to-plot-data-from-a-file/.

The file has been downloaded from the display in the databricks notebook from https://lamastex.github.io/scalable-data-science/sds/2/2/.

The first 10 lines of the file look like this:

prediction,loudness,tempo,log_duration
0,-11.422,113.924,5.715779455566171
1,-9.086,149.709,5.128421707524712
1,-12.934,134.957,5.246686488869869
1,-6.552,130.152,5.700695916131519
0,-8.849,96.006,5.337347484616152
0,-20.277,100.777,4.290263120508892
0,-7.877,109.267,5.0043102665494
0,-5.989,114.493,5.630042029827832
0,-11.66,125.022,6.007690245599277

The file has 1000 rows and has been uploaded to this SageMath worksheet in CoCalc. It sits in the current directory, at the path passed to the Python open() function below.

f = open('./KMeansClusters10003DFeatures_loudness-tempologDuration_Of1MSongsKMeansfor_015_sds2-2.csv', 'r')
# This is just a pedantic implementation for clarity
loudnessArrayCluster0=[]
tempoArrayCluster0=[]
log_durationArrayCluster0=[]
loudnessArrayCluster1=[]
tempoArrayCluster1=[]
log_durationArrayCluster1=[]
line=f.readline() # read the first header line away - rolling in a hurry here
line=f.readline()
dataPointRowNumber = 0
maxdataPointsToPlot = 10 # set the maximum number of points you want to plot
while( line !='' and dataPointRowNumber < maxdataPointsToPlot):
    pltd = line.split(',')
    if (float(pltd[0])==0):
        loudnessArrayCluster0.append(float(pltd[1]))
        tempoArrayCluster0.append(float(pltd[2]))
        log_durationArrayCluster0.append(float(pltd[3]))
    if (float(pltd[0])==1):
        loudnessArrayCluster1.append(float(pltd[1]))
        tempoArrayCluster1.append(float(pltd[2]))
        log_durationArrayCluster1.append(float(pltd[3]))
    line=f.readline()
    dataPointRowNumber=dataPointRowNumber+1
f.close() # release the file handle once the rows have been read
zip(loudnessArrayCluster0,tempoArrayCluster0)
[(-11.422, 113.924), (-8.849, 96.006), (-20.277, 100.777), (-7.877, 109.267), (-5.989, 114.493), (-11.66, 125.022), (-14.1, 90.442)]
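The list printed above is what a Python 2 based Sage shows for `zip`. Under Python 3, `zip` returns a lazy iterator, so the same cell would print a zip object instead. A small sketch, using the first three cluster-0 pairs from the output above, of wrapping the call in `list(...)` to get the same display:

```python
# First three (loudness, tempo) pairs of cluster 0, copied from the output above
loudness = [-11.422, -8.849, -20.277]
tempo = [113.924, 96.006, 100.777]

# list(...) materializes the iterator so the pairs can be printed and reused
pairs = list(zip(loudness, tempo))
print(pairs)
```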
list_plot(zip(loudnessArrayCluster0,tempoArrayCluster0), color="red", size=20) # a pair-wise scatter plot
list_plot(zip(loudnessArrayCluster1,tempoArrayCluster1), color="blue", size=20) # do the same for cluster 1 in blue color
# let's overlay the two images by using the '+' operator defined for the objects
myFig = list_plot(zip(loudnessArrayCluster1,tempoArrayCluster1), color="blue", size=20)
myFig += list_plot(zip(loudnessArrayCluster0,tempoArrayCluster0), color="red", size=20)
show(myFig)

Let's just see these points in 3D using primitive graphics objects interactively

We can compare the clusterings in 3D with and without taking log of duration and understand their 2D scatter plots

def readCSVFromKMeansAndReturnPointsToPlot(filename,maxdataPointsToPlot):
    '''this is a function form of the previous step by step explanation of reading CSV from the output of Spark's K-means algorithm
    inputs:
    filename is PATH to known CSV - no exception handling implemented since the file is known output of Spark DataFrame
    maxdataPointsToPlot is the number of rows starting from the second line you want to plot - maximum is 1000 - no error checks!
    '''
    f = open(filename, 'r')
    # This is just a pedantic declaration of arrays by cluster for clarity
    lArrayCluster0=[]
    tArrayCluster0=[]
    dArrayCluster0=[]
    lArrayCluster1=[]
    tArrayCluster1=[]
    dArrayCluster1=[]
    line=f.readline() # read the first header line away - rolling in a hurry here
    line=f.readline()
    dataPointRowNumber = 0
    #maxdataPointsToPlot = 1000 # set the maximum number of points you want to plot
    while( line !='' and dataPointRowNumber < maxdataPointsToPlot):
        pltd = line.split(',')
        if (float(pltd[0])==0):
            lArrayCluster0.append(float(pltd[1]))
            tArrayCluster0.append(float(pltd[2]))
            dArrayCluster0.append(float(pltd[3]))
        if (float(pltd[0])==1):
            lArrayCluster1.append(float(pltd[1]))
            tArrayCluster1.append(float(pltd[2]))
            dArrayCluster1.append(float(pltd[3]))
        line=f.readline()
        dataPointRowNumber=dataPointRowNumber+1
    f.close() # release the file handle once the rows have been read
    return [lArrayCluster0, tArrayCluster0, dArrayCluster0, lArrayCluster1, tArrayCluster1, dArrayCluster1]
# load the CSV file with the results of the k-means algorithm with three features (loudness, tempo, log(duration))
[lArrayCl0_logDuration, tArrayCl0_logDuration, dArrayCl0_logDuration, lArrayCl1_logDuration, tArrayCl1_logDuration, dArrayCl1_logDuration] = \
    readCSVFromKMeansAndReturnPointsToPlot('./KMeansClusters10003DFeatures_loudness-tempologDuration_Of1MSongsKMeansfor_015_sds2-2.csv',1000)
# load the CSV file with the results of the k-means algorithm with three features (loudness, tempo, duration)
[lArrayCl0_Duration, tArrayCl0_Duration, dArrayCl0_Duration, lArrayCl1_Duration, tArrayCl1_Duration, dArrayCl1_Duration] = \
    readCSVFromKMeansAndReturnPointsToPlot('./KMeansClusters10003DFeatures_loudness-tempoDuration_Of1MSongsKMeansfor_015_sds2-2.csv',1000)
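The hand-rolled reader above can also be written with Python's standard `csv` module. The following is only a sketch, not the worksheet's own method: the function name `read_clusters` is hypothetical, it assumes Python 3, and it reads from an in-memory copy of the sample rows shown earlier instead of the uploaded file. It groups points by cluster label rather than hard-coding two clusters.

```python
import csv
import io
from collections import defaultdict

# The header and the nine data rows displayed earlier in this worksheet
SAMPLE_CSV = """prediction,loudness,tempo,log_duration
0,-11.422,113.924,5.715779455566171
1,-9.086,149.709,5.128421707524712
1,-12.934,134.957,5.246686488869869
1,-6.552,130.152,5.700695916131519
0,-8.849,96.006,5.337347484616152
0,-20.277,100.777,4.290263120508892
0,-7.877,109.267,5.0043102665494
0,-5.989,114.493,5.630042029827832
0,-11.66,125.022,6.007690245599277
"""

def read_clusters(handle, max_points):
    """Return {cluster label: [(loudness, tempo, third feature), ...]}."""
    clusters = defaultdict(list)
    reader = csv.reader(handle)
    next(reader)  # skip the header line
    for row_number, row in enumerate(reader):
        if row_number >= max_points:
            break
        label = int(float(row[0]))  # accepts "0" as well as "0.0"
        clusters[label].append(tuple(float(value) for value in row[1:4]))
    return dict(clusters)

clusters = read_clusters(io.StringIO(SAMPLE_CSV), max_points=1000)
print(len(clusters[0]), len(clusters[1]))
```

To read the uploaded file instead, one would pass an open file handle, for example inside a `with open(filename) as handle:` block, so the file is closed automatically.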
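# min() on the zipped (loudness, tempo, log_duration) tuples compares them lexicographically,
# so this prints the point with the smallest loudness in each cluster, not a per-coordinate minimum
print(min(zip(lArrayCl0_logDuration,tArrayCl0_logDuration, dArrayCl0_logDuration)))
print(min(zip(lArrayCl1_logDuration,tArrayCl1_logDuration, dArrayCl1_logDuration)))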
(-52.781, 76.42, 6.292396371381782)
(-32.349, 134.157, 5.87741203189931)
# max() on the zipped tuples is likewise lexicographic,
# so this prints the point with the largest loudness in each cluster, not a per-coordinate maximum
print(max(zip(lArrayCl0_logDuration,tArrayCl0_logDuration, dArrayCl0_logDuration)))
print(max(zip(lArrayCl1_logDuration,tArrayCl1_logDuration, dArrayCl1_logDuration)))
(-3.069, 95.323, 5.554721829705378)
(-2.385, 209.986, 5.201678098897631)
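Note that `min` and `max` applied to a list of tuples compare the tuples lexicographically, so the outputs above are whole data points chosen by their first coordinate (loudness). If the minimum and maximum of each feature separately are wanted, one possible sketch is to transpose the points with `zip(*points)` first. The three points below are cluster-0 rows from the sample shown earlier.

```python
# Three cluster-0 points (loudness, tempo, log_duration) from the sample rows
points = [
    (-11.422, 113.924, 5.715779455566171),
    (-8.849, 96.006, 5.337347484616152),
    (-20.277, 100.777, 4.290263120508892),
]

# Lexicographic: the whole point with the smallest first coordinate
lexicographic_min = min(points)

# Per coordinate: transpose into one sequence per feature, then reduce each
per_coordinate_min = tuple(min(feature) for feature in zip(*points))
per_coordinate_max = tuple(max(feature) for feature in zip(*points))

print(lexicographic_min)
print(per_coordinate_min)
print(per_coordinate_max)
```

The per-coordinate values give the bounding box of a cluster, which is usually what one needs when choosing axis ranges for a plot.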
myFigLogDuration = point3d(zip(lArrayCl0_logDuration,tArrayCl0_logDuration, dArrayCl0_logDuration), color="red", opacity=.5, size=10)
myFigLogDuration += point3d(zip(lArrayCl1_logDuration,tArrayCl1_logDuration, dArrayCl1_logDuration), color="blue", opacity=.5, size=10)
show(myFigLogDuration)
myFigDuration = point3d(zip(lArrayCl0_Duration,tArrayCl0_Duration, dArrayCl0_Duration), color="red", opacity=.5, size=10)
myFigDuration += point3d(zip(lArrayCl1_Duration,tArrayCl1_Duration, dArrayCl1_Duration), color="blue", opacity=.5, size=10)
show(myFigDuration)
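One way to see why the two clusterings can differ is to look at how much each feature contributes to the Euclidean distance that K-Means relies on. The sketch below is illustrative only. It assumes that `log_duration` is the natural logarithm of the duration, so the raw duration is recovered with `exp`; that is an assumption, not something stated in the CSV. The function name `squared_share_of_last` is hypothetical, and the two points are cluster-0 rows from the sample shown earlier.

```python
import math

# Two sample rows: (loudness, tempo, log_duration)
a_log = (-11.422, 113.924, 5.715779455566171)
b_log = (-20.277, 100.777, 4.290263120508892)

# ASSUMPTION: natural log, so exp() recovers the raw duration
a_raw = a_log[:2] + (math.exp(a_log[2]),)
b_raw = b_log[:2] + (math.exp(b_log[2]),)

def squared_share_of_last(p, q):
    """Fraction of the squared Euclidean distance due to the last coordinate."""
    squared_diffs = [(x - y) ** 2 for x, y in zip(p, q)]
    return squared_diffs[-1] / sum(squared_diffs)

log_share = squared_share_of_last(a_log, b_log)
raw_share = squared_share_of_last(a_raw, b_raw)
print(log_share, raw_share)
```

Under this assumption the duration coordinate accounts for almost all of the squared distance in the raw version and very little of it in the log version, which is consistent with the two 3D plots looking different.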

To manipulate the rendering of the interactive plots above, uncomment the line below, put the cursor after the '.', and hit TAB to see the available methods

Also don't forget the SageMath docs at http://doc.sagemath.org/html/en/index.html (Sage covers arithmetic, geometry, cryptography, calculus, and a lot more; finally, CoCalc is free for small learning workloads).

#myFig. # uncomment and put the cursor after the '.' and hit TAB to see methods
# putting a '?' after the chosen method and evaluating the cell gives the documentation, BUT no comments can follow the '?' when evaluating the cell
myFig.aspect_ratio?
File: /ext/sage/sage-8.0/local/lib/python2.7/site-packages/sage/plot/graphics.py
Signature : myFig.aspect_ratio(self)
Docstring :
Get the current aspect ratio, which is the ratio of height to width of a unit square, or 'automatic'.

OUTPUT: a positive float (height/width of a unit square), or 'automatic' (expand to fill the figure).

EXAMPLES:

The default aspect ratio for a new blank Graphics object is 'automatic':

    sage: P = Graphics()
    sage: P.aspect_ratio()
    'automatic'

The aspect ratio can be explicitly set different than the object's default:

    sage: P = circle((1,1), 1)
    sage: P.aspect_ratio()
    1.0
    sage: P.set_aspect_ratio(2)
    sage: P.aspect_ratio()
    2.0
    sage: P.set_aspect_ratio('automatic')
    sage: P.aspect_ratio()
    'automatic'