CoCalc shared file: SDSS / hr03.sagews
Author: Hal Snyder
Description: use clustering to color star data in H-R diagram subset

# Using Machine Learning in an HR Diagram

This worksheet is a quick exploration of clustering unlabeled data to classify stars in a Hertzsprung-Russell diagram (HRD) for a sample set of stars.

In this simple test, agglomerative clustering identifies two clusters of stars. A more sophisticated approach is probably needed to separate the other customary groups.

Clustering code is from the scikit-learn Python package.


%auto
%default_mode python3

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

# Import the Hipparcos-2 star catalog
# ftp://cdsarc.u-strasbg.fr/pub/cats/I/311
# one name per entry in h2colspecs below (26 columns)
h2colnames = [
"HIP",
"Sn",
"So",
"Nc",
"RArad",
"DErad",
"Plx",
"pmRA",
"pmDE",
"e_RArad",
"e_DErad",
"e_Plx",
"e_pmRA",
"e_pmDE",
"Ntr",
"F2",
"F1",
"var",
"ic",
"Hpmag",
"e_Hpmag",
"sHp",
"VA",
"B-V",
"e_B-V",
"V-I",
]
print('number of columns',len(h2colnames))
h2colspecs = [
(1,7),
(8,11),
(12,13),
(14,15),
(16,29),
(30,43),
(44,51),
(52,60),
(61,69),
(70,76),
(77,83),
(84,90),
(91,97),
(98,104),
(105,108),
(109,114),
(115,117),
(118,124),
(125,129),
(130,137),
(138,144),
(145,150),
(151,152),
(153,159),
(160,165),
(166,172)
]
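# the def line is missing from this export; the name read_hip2 and the
# nrows default are assumptions
def read_hip2(fname, nrows=None):
    # shift 1-based column positions to the 0-based half-open intervals read_fwf expects
    h2cols2 = [(x-1,y-1) for (x,y) in h2colspecs]
    df = pd.read_fwf(fname, names=h2colnames, colspecs=h2cols2, index_col=0, nrows=nrows)
    return df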

number of columns 26
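
The `h2colspecs` tuples appear to be written as 1-based start positions with an exclusive end, and the list comprehension shifts both ends by one to get the 0-based half-open intervals that `pandas.read_fwf` expects. A minimal sketch of the same conversion using plain string slicing follows. The sample record and the four-field layout are made up for illustration and are not catalog data.

```python
# Hypothetical fixed-width record with four right-aligned fields;
# real catalog rows are much longer.
line = f"{1:>6} {5:>3} {0} {1}"

# (start, end) pairs in the worksheet's style: 1-based start, exclusive end.
specs_1based = [(1, 7), (8, 11), (12, 13), (14, 15)]

# Shift both ends by one to get 0-based half-open slices.
specs_0based = [(start - 1, end - 1) for (start, end) in specs_1based]

# Python slices are half-open too, so each pair can be used directly.
fields = [line[a:b].strip() for (a, b) in specs_0based]
print(specs_0based)
print(fields)
```

Because both Python slices and `read_fwf` column specs exclude the end index, the shifted pairs cover exactly the bytes listed for each field.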
# create dataframe from the data

# distance and magnitude calculations

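# compute distance in parsecs from observed parallax (Plx is in milliarcseconds)
# keep only stars with positive parallax; copy so new columns can be added safely
df = df.loc[df['Plx'] > 0, ['Hpmag','Plx','B-V']].copy()
df['Distance'] = 1000.0/df['Plx']

# compute absolute magnitude from apparent magnitude (distance modulus)
df['AbsMag'] = df['Hpmag'] - 5*np.log10(df['Distance']) + 5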
df.shape

(113942, 5)
# display values for the star Sirius
df.loc[32349]

Hpmag        -1.087600
Plx         379.210000
B-V           0.009000
Distance      2.637061
AbsMag        1.806799
Name: 32349, dtype: float64
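
As a check on the two formulas used above, the Sirius values can be recomputed by hand with only the standard library. This is a minimal sketch; the helper names `distance_pc` and `absolute_magnitude` are illustrative and do not appear in the worksheet.

```python
import math

def distance_pc(parallax_mas):
    """Distance in parsecs from a parallax given in milliarcseconds."""
    return 1000.0 / parallax_mas

def absolute_magnitude(apparent_mag, dist_pc):
    """Distance modulus: M = m - 5*log10(d) + 5, with d in parsecs."""
    return apparent_mag - 5.0 * math.log10(dist_pc) + 5.0

# Hpmag and Plx for HIP 32349 (Sirius), taken from the output shown above.
d = distance_pc(379.21)
M = absolute_magnitude(-1.0876, d)
print(round(d, 6), round(M, 4))
```

The results should agree with the `Distance` and `AbsMag` entries in the row displayed above.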
# for this quick test, plot a random sample of the catalog
df3 = df.sample(n=200)
df3.shape

(200, 5)
# plot absolute magnitude vs b-v color for sample set
# this is an HRD subset plot
# use agglomerative clustering to color stars in two groups
# feature matrix: one row per star, columns AbsMag and B-V
X = df3[['AbsMag','B-V']].to_numpy()
scaler = preprocessing.StandardScaler().fit(X)
XT = scaler.transform(X)
connectivity = kneighbors_graph(XT, n_neighbors=4, include_self=False)
ac = AgglomerativeClustering(n_clusters=2, connectivity=connectivity).fit(XT)
df3['color'] = ac.labels_
#cm = 'bgrgrcmyk'
cm = [
'crimson',
'darkblue'
]
cmap = df3['color'].apply(lambda x: cm[x])
df3.plot(x='B-V', y='AbsMag', kind='scatter', figsize=(6,4),
title='AbsMag vs. B-V', color=cmap,
grid=True, legend=False).invert_yaxis()
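
The scaling and clustering steps can also be tried without downloading the catalog. The sketch below runs the same sequence (standardize, build a k-nearest-neighbors connectivity graph, agglomerative clustering with two clusters) on synthetic data. The two Gaussian blobs and their parameters are invented stand-ins for (AbsMag, B-V) pairs, not real stars.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Synthetic stand-in for (AbsMag, B-V): two blobs, not catalog data.
blob_a = rng.normal(loc=[5.0, 0.6], scale=[0.5, 0.1], size=(60, 2))
blob_b = rng.normal(loc=[0.5, 1.2], scale=[0.5, 0.1], size=(40, 2))
X = np.vstack([blob_a, blob_b])

# Standardize so both features contribute on a comparable scale.
XT = preprocessing.StandardScaler().fit_transform(X)

# Restrict merges to each point's 4 nearest neighbors, as in the worksheet.
connectivity = kneighbors_graph(XT, n_neighbors=4, include_self=False)
ac = AgglomerativeClustering(n_clusters=2, connectivity=connectivity).fit(XT)

labels = ac.labels_
print(np.bincount(labels))  # number of points assigned to each cluster
```

If the neighbor graph turns out to be disconnected, scikit-learn may emit a warning and complete the graph itself before clustering. Raising `n_neighbors` is one way to avoid that.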