Using Machine Learning in an HR Diagram

This worksheet is a quick exploration of clustering unlabeled data to classify stars in a Hertzsprung-Russell Diagram (HRD), using a random sample of stars from the Hipparcos-2 catalog.

In this simple test, only two clusters of stars are identified. I think a more sophisticated approach is needed to separate the other customary groups.

Clustering code is from the scikit-learn Python package.

References

Introduction to HRD: The Hertzsprung-Russell Diagram
Sloan Digital Sky Survey DR14 Projects: The Hertzsprung-Russell Diagram
scikit-learn documentation: Clustering
Hubble Space Telescope: Hertzsprung-Russell diagram animation

%auto
%default_mode python3

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

# Read the Hipparcos-2 star catalog (hip2.dat)
# ftp://cdsarc.u-strasbg.fr/pub/cats/I/311
h2colnames = [
    "HIP",  
    "Sn",   
    "So",   
    "Nc",   
    "RArad",
    "DErad",
    "Plx",  
    "pmRA", 
    "pmDE", 
    "e_RArad",
    "e_DErad",
    "e_Plx",
    "e_pmRA",
    "e_pmDE",
    "Ntr",  
    "F2",   
    "F1",   
    "var",  
    "ic",   
    "Hpmag",
    "e_Hpmag",
    "sHp",  
    "VA",   
    "B-V",  
    "e_B-V",
    "V-I", 
]
print('number of columns',len(h2colnames))
h2colspecs = [
    (1,7),
    (8,11),
    (12,13),
    (14,15),
    (16,29),
    (30,43),
    (44,51),
    (52,60),
    (61,69),
    (70,76),
    (77,83),
    (84,90),
    (91,97),
    (98,104),
    (105,108),
    (109,114),
    (115,117),
    (118,124),
    (125,129),
    (130,137),
    (138,144),
    (145,150),
    (151,152),
    (153,159),
    (160,165),
    (166,172)
]
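# h2colspecs is 1-based with an exclusive end; shift to the 0-based
# half-open intervals that pd.read_fwf expects
h2cols2 = [(x-1, y-1) for (x, y) in h2colspecs]
def read_hip2(fname="hip2.dat", nrows=10):
    # nrows=None reads the whole catalog; the index is the HIP number
    df = pd.read_fwf(fname, names=h2colnames, colspecs=h2cols2, index_col=0, nrows=nrows)
    return df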

number of columns 26
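
The column positions above can be checked by hand. As far as I can tell, the catalog ReadMe gives byte ranges that are 1-based and inclusive (for example HIP in bytes 1-6 and Sn in bytes 8-10), while pandas read_fwf wants 0-based half-open intervals. The sketch below shows that conversion with plain string slicing. The helper name to_fwf_colspec and the sample record are made up for illustration and are not part of the worksheet or the catalog.

```python
def to_fwf_colspec(first_byte, last_byte):
    """Convert a 1-based inclusive byte range to a 0-based half-open (start, stop)."""
    return (first_byte - 1, last_byte)

# Hypothetical record fragment: HIP in bytes 1-6, a blank in byte 7, Sn in bytes 8-10
line = "     1" + " " + "  5"

hip_spec = to_fwf_colspec(1, 6)
sn_spec = to_fwf_colspec(8, 10)

# Slicing with the half-open interval recovers each field
hip = int(line[hip_spec[0]:hip_spec[1]])
sn = int(line[sn_spec[0]:sn_spec[1]])
print(hip_spec, sn_spec, hip, sn)
```

Both routes agree: (1,7) and (8,11) in h2colspecs become (0,6) and (7,10) after subtracting one from each end, which is what the helper returns for bytes 1-6 and 8-10.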

# create dataframe from the data
df = read_hip2(nrows=None)

# distance and magnitude calculations
# http://skyserver.sdss.org/dr14/en/proj/advanced/hr/hipparcos2.aspx

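# keep stars with a positive parallax, then compute distance in parsecs
# from the observed parallax (Plx is in milliarcseconds)
# .copy() avoids assigning new columns to a view of the original frame
df = df.loc[df['Plx'] > 0, ['Hpmag','Plx','B-V']].copy()
df['Distance'] = 1000.0/df['Plx']

# compute absolute magnitude from apparent magnitude (distance modulus)
df['AbsMag'] = df['Hpmag'] - 5*np.log10(df['Distance']) + 5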
df.shape

(113942, 5)

# display values for the star Sirius
df.loc[32349]

Hpmag        -1.087600
Plx         379.210000
B-V           0.009000
Distance      2.637061
AbsMag        1.806799
Name: 32349, dtype: float64
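
As a sanity check, the Sirius row can be reproduced by hand with the same two formulas used above: distance in parsecs is 1000 divided by the parallax in milliarcseconds, and absolute magnitude is M = m - 5*log10(d) + 5. This is a minimal standalone sketch; the function name absolute_magnitude is my own and does not appear in the worksheet.

```python
import math

def absolute_magnitude(apparent_mag, parallax_mas):
    """Absolute magnitude from apparent magnitude and parallax in milliarcseconds."""
    distance_pc = 1000.0 / parallax_mas  # parsecs
    return apparent_mag - 5.0 * math.log10(distance_pc) + 5.0

# Hpmag and Plx for Sirius (HIP 32349), copied from the output above
sirius = absolute_magnitude(-1.0876, 379.21)
print(round(sirius, 4))  # 1.8068, agreeing with AbsMag in the output above
```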

# for this quick test, plot a random sample of the catalog
df3 = df.sample(n=200)
df3.shape

(200, 5)

# plot absolute magnitude vs B-V color for the sample set
# this is an HRD subset plot
# use agglomerative clustering to color stars in two groups
# (DataFrame.as_matrix was removed in pandas 1.0; use to_numpy instead)
X = df3[['AbsMag','B-V']].to_numpy()
scaler = preprocessing.StandardScaler().fit(X)
XT = scaler.transform(X)
connectivity = kneighbors_graph(XT, n_neighbors=4, include_self=False)
ac = AgglomerativeClustering(n_clusters=2, connectivity=connectivity).fit(XT)
df3['color'] = ac.labels_
#cm = 'bgrgrcmyk'
cm = [
    'crimson',
    'darkblue'
]
cmap = df3['color'].apply(lambda x: cm[x])
df3.plot(x='B-V', y='AbsMag', kind='scatter', figsize=(6,4),
        title='AbsMag vs. B-V', color=cmap,
        grid=True, legend=False).invert_yaxis()
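
To see what the clustering step does apart from the catalog, here is a small sketch that runs the same pipeline (standardize, build a 4-nearest-neighbor connectivity graph, agglomerative clustering into two groups) on synthetic data. The two blobs are made-up points standing in for (AbsMag, B-V) pairs, not catalog values, and the seed and blob positions are arbitrary choices of mine.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for (AbsMag, B-V): two well-separated blobs of 50 points.
# These numbers are invented for illustration only.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Same steps as the worksheet cell: scale, build the neighbor graph, cluster
XT = preprocessing.StandardScaler().fit_transform(X)
connectivity = kneighbors_graph(XT, n_neighbors=4, include_self=False)
ac = AgglomerativeClustering(n_clusters=2, connectivity=connectivity).fit(XT)
labels = ac.labels_

# Count how many points landed in each cluster
print(np.bincount(labels))
```

Because the blobs are far apart, the neighbor graph has no edges between them, and scikit-learn may warn that it is completing the connectivity matrix. On the real sample the groups overlap far more than these blobs do, so the split is less clear-cut.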