Welcome to actsnclass !¶
Active Learning for Supernova Photometric Classification¶
This tool allows you to reproduce the results presented in Ishida et al., 2019. It is based on the original prototype developed during the COIN Residence Program #4, which took place in Clermont Ferrand, France, in August 2017.
The code has since been updated to make it user-friendly and easy to expand.
Getting started¶
In order to setup a suitable working environment, clone this repository and make sure you have the necessary packages installed.
Dependencies¶
actsnclass was developed under Python 3. The complete list of dependencies is given below:
- Python>=3.7
- matplotlib>=3.1.1
- numpy>=1.17.0
- pandas>=0.25.0
- setuptools>=41.0.1
- scipy>=1.3.0
- sklearn>=0.20.3
- seaborn>=0.9.0
Installing¶
Clone this repository,
>>> git clone https://github.com/COINtoolbox/ActSNClass
We recommend the use of Anaconda environments to ensure the proper version of all dependencies are installed and do not interfere in your other applications. You can find instructions on how to install it here.
If you wish to use this option, simply navigate to the directory of the repository and do:
>>> conda env create -f environment.yml
Once the environment is set up you can activate it:
>>> conda activate ActSNClass
If everything goes well you will see the name of the environment at the leftmost side of your command line.
You can now install actsnclass with:
(ActSNClass) >> python setup.py install
Setting up a working directory¶
In another location of your choosing, create the following directory structure:
work_dir
├── plots
├── results
The outputs of actsnclass will be stored in these directories.
To set things up properly, move the data directory from the repository you just cloned into your chosen working directory and unpack the data.
>>> mv -f actsnclass/data/ work_dir/
>>> cd work_dir/data
>>> tar -xzvf SIMGEN_PUBLIC_DES.tar.gz
This data was provided by Rick Kessler, after the publication of results from the SuperNova Photometric Classification Challenge.
Analysis steps¶

The actsnclass pipeline is composed of 4 important steps:
- Feature extraction
- Classifier
- Query Strategy
- Metric evaluation
These are arranged in an adaptive learning process (figure to the right).
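The four steps above form a closed loop: train a classifier, query the most informative object, update the training sample and re-evaluate. A minimal, self-contained sketch of that cycle on toy data (the nearest-centroid "classifier", the Gaussian blobs and the distance-gap query rule are illustrative stand-ins, not the actsnclass internals):

```python
import numpy as np

rng = np.random.default_rng(0)

# two Gaussian blobs standing in for already-extracted features (step 1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# small initial training set containing both classes
train_idx = [0, 1, 50, 51, 99]
test_idx = [i for i in range(100) if i not in train_idx]

for loop in range(10):
    # step 2: "train" a toy nearest-centroid classifier
    centroids = np.array([X[[i for i in train_idx if y[i] == c]].mean(axis=0)
                          for c in (0, 1)])
    dists = np.linalg.norm(X[test_idx][:, None, :] - centroids[None], axis=2)
    pred = dists.argmin(axis=1)

    # step 3: query the most ambiguous object (smallest distance gap)
    query = test_idx[np.abs(dists[:, 0] - dists[:, 1]).argmin()]

    # step 4: evaluate the metric of choice (plain accuracy here)
    acc = (pred == y[test_idx]).mean()

    # add the queried object to training, remove it from test
    train_idx.append(query)
    test_idx.remove(query)

print(f"final training size: {len(train_idx)}, accuracy: {acc:.2f}")
```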
Using this package¶
Step 1 is considered pre-processing. The current code does the feature extraction using the Bazin parametric function for the complete training and test sample before any machine learning application is used.
Details of the tools available to evaluate different steps on feature extraction can be found in the Feature extraction page.
Alternatively, you can also perform the full light curve fit for the entire sample from the command line:
>>> fit_dataset.py -dd <path_to_data_dir> -o <output_file>
Once the data has been processed you can apply the full Active Learning loop according to your needs. A detailed description of how to use this tool is provided in the Learning Loop page.
The command line option requires a few more inputs than the feature extraction stage, but it is also available:
>>> run_loop.py -i <input features file> -b <batch size> -n <number of loops>
>>> -d <output metrics file> -q <output queried sample file>
>>> -s <learning strategy> -t <choice of initial training>
We also provide detailed explanations on how to use this package to produce other stages of the pipeline, such as preparing the Canonical sample, preparing data for time domain and producing plots.
Detailed descriptions on how to contribute other modules can be found in the How to contribute tab.
Enjoy!!
Acknowledgements¶
This work is heavily based on the first prototype developed during COIN Residence Program (CRP#4), held in Clermont Ferrand, France, 2017 and financially supported by Universite Clermont Auvergne and La Region Auvergne-Rhone-Alpes. We thank Emmanuel Gangler for encouraging the realization of this event.
The COsmostatistics INitiative (COIN) receives financial support from CNRS as part of its MOMENTUM programme over the 2018-2020 period, under the project Active Learning for Large Scale Sky Surveys.
This work would not be possible without intensive consultation of online platforms and discussion forums. Although it is not possible to provide a complete list of the open source material consulted in the construction of this tool, we recognize its importance and deeply thank all those who contribute to open learning platforms.
Table of Contents¶
Feature Extraction¶
The first stage consists of transforming the raw data into a uniform data matrix which will subsequently be given as input to the learning algorithm.
The current implementation of actsnclass handles text-like data from the SuperNova Photometric Classification Challenge (SNPCC), which is described in Kessler et al., 2010.
Processing one light curve¶
The raw data looks like this:
SURVEY: DES
SNID: 848233
IAUC: UNKNOWN
PHOTOMETRY_VERSION: DES
SNTYPE: 22
FILTERS: griz
RA: 36.750000 deg
DECL: -4.500000 deg
MAGTYPE: LOG10
MAGREF: AB
FAKE: 2 (=> simulated LC with snlc_sim.exe)
MWEBV: 0.0283 MW E(B-V)
REDSHIFT_HELIO: 0.50369 +- 0.00500 (Helio, z_best)
REDSHIFT_FINAL: 0.50369 +- 0.00500 (CMB)
REDSHIFT_SPEC: 0.50369 +- 0.00500
REDSHIFT_STATUS: OK
HOST_GALAXY_GALID: 17173
HOST_GALAXY_PHOTO-Z: 0.4873 +- 0.0318
SIM_MODEL: NONIA 10 (name index)
SIM_NON1a: 30 (non1a index)
SIM_COMMENT: SN Type = II , MODEL = SDSS-017564
SIM_LIBID: 2
SIM_REDSHIFT: 0.5029
SIM_HOSTLIB_TRUEZ: 0.5000 (actual Z of hostlib)
SIM_HOSTLIB_GALID: 17173
SIM_DLMU: 42.276020 mag [ -5*log10(10pc/dL) ]
SIM_RA: 36.750000 deg
SIM_DECL: -4.500000 deg
SIM_MWEBV: 0.0256 (MilkyWay E(B-V))
SIM_PEAKMAG: 22.48 22.87 22.70 22.82 (griz obs)
SIM_EXPOSURE: 1.0 1.0 1.0 1.0 (griz obs)
SIM_PEAKMJD: 56251.609375 days
SIM_SALT2x0: 1.229e-17
SIM_MAGDIM: 0.000
SIM_SEARCHEFF_MASK: 3 (bits 1,2=> found by software,humans)
SIM_SEARCHEFF: 1.0000 (spectro-search efficiency (ignores pipelines))
SIM_TRESTMIN: -38.24 days
SIM_TRESTMAX: 64.80 days
SIM_RISETIME_SHIFT: 0.0 days
SIM_FALLTIME_SHIFT: 0.0 days
SEARCH_PEAKMJD: 56250.734
# ============================================
# TERSE LIGHT CURVE OUTPUT:
#
NOBS: 108
NVAR: 9
VARLIST: MJD FLT FIELD FLUXCAL FLUXCALERR SNR MAG MAGERR SIM_MAG
OBS: 56194.145 g NULL 7.600e+00 4.680e+00 1.62 99.000 5.000 98.926
OBS: 56194.156 r NULL 3.875e+00 2.752e+00 1.41 99.000 5.000 98.953
OBS: 56194.172 i NULL 3.585e+00 4.628e+00 0.77 99.000 5.000 99.033
OBS: 56194.188 z NULL -2.203e+00 4.463e+00 -0.49 99.000 5.000 98.983
OBS: 56207.188 g NULL -7.008e+00 4.367e+00 -1.60 99.000 5.000 98.926
OBS: 56207.195 r NULL -1.189e+00 3.459e+00 -0.34 99.000 5.000 98.953
OBS: 56207.203 i NULL 8.799e+00 6.249e+00 1.41 99.000 5.000 99.033
You can load this data using:
>>> from actsnclass.fit_lightcurves import LightCurve

>>> path_to_lc = 'data/SIMGEN_PUBLIC_DES/DES_SN848233.DAT'

>>> lc = LightCurve()                  # create light curve instance
>>> lc.load_snpcc_lc(path_to_lc)       # read data
>>> lc.photometry                      # check structure of photometry
           mjd band     flux  fluxerr   SNR
0    56194.145    g    7.600    4.680  1.62
1    56194.156    r    3.875    2.752  1.41
...        ...  ...      ...      ...   ...
106  56348.008    z   70.690    6.706 10.54
107  56348.996    g   26.000    5.581  4.66

[108 rows x 5 columns]
Once the data is loaded, you can fit one specific filter to the parametric function proposed by Bazin et al., 2009:
>>> rband_features = lc.fit_bazin('r')
>>> print(rband_features)
[159.25796385, -13.39398527, 55.16210333, 111.81204143, -20.13492354]
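The five numbers returned above correspond to the parameters (a, b, t0, tfall, trise) listed in the API reference for bazin.bazin. For reference, a minimal sketch of that parametric form, assuming the standard Bazin et al., 2009 parametrization (the function name here is illustrative, not the package's API):

```python
import numpy as np

def bazin_flux(time, a, b, t0, tfall, trise):
    """Sketch of the Bazin et al., 2009 parametric light curve:

    f(t) = a * exp(-(t - t0) / tfall) / (1 + exp(-(t - t0) / trise)) + b
    """
    rise = 1.0 + np.exp(-(time - t0) / trise)     # sigmoid rise term
    fall = np.exp(-(time - t0) / tfall)           # exponential decline
    return a * fall / rise + b

# evaluate the r-band fit from the example above on a coarse time grid
params = [159.25796385, -13.39398527, 55.16210333, 111.81204143, -20.13492354]
time_grid = np.linspace(0.0, 100.0, 5)
print(bazin_flux(time_grid, *params))
```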
The designations of the parameters are stored as attributes of the LightCurve object.
It is possible to perform the fit in all filters at once and visualize the result using:
>>> lc.fit_bazin_all()                 # perform Bazin fit in all filters
>>> lc.plot_bazin_fit(save=True, show=True,
>>>                   output_file='plots/SN' + str(lc.id) + '.png')   # save to file

Processing all light curves in the data set¶
There are two ways to perform the Bazin fits for the entire SNPCC data set. Using a Python interpreter,
>>> from actsnclass import fit_snpcc_bazin

>>> path_to_data_dir = 'data/SIMGEN_PUBLIC_DES/'   # raw data directory
>>> output_file = 'results/Bazin.dat'              # output file

>>> fit_snpcc_bazin(path_to_data_dir=path_to_data_dir, features_file=output_file)
The above will produce a file called Bazin.dat in the results directory.
The same result can be achieved using the command line:
>>> fit_dataset.py -dd <path_to_data_dir> -o <output_file>
Building the Canonical sample¶
According to the nomenclature used in Ishida et al., 2019, the Canonical sample is a subset of the test sample chosen to hold the same characteristics as the training sample. It was used to mimic the effect of continuously adding elements to the training sample under the traditional strategy.
It was constructed using the following steps:
- From the raw light curve files, build a metadata matrix containing:
[snid, sample, sntype, z, g_pkmag, r_pkmag, i_pkmag, z_pkmag, g_SNR, r_SNR, i_SNR, z_SNR]
where z corresponds to redshift, x_pkmag is the simulated peak magnitude and x_SNR denotes the mean SNR, both in filter x;
- Separate the original training and test sets into 3 subsets according to SN type: [Ia, Ibc, II];
- For each object in the training sample, find its nearest neighbor among objects of the test sample of the same SN type, considering the photometric parameter space built in step 1.
This will allow you to construct a Canonical sample holding the same characteristics and size of the original training sample but composed of different objects.
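The construction steps above can be sketched with plain numpy. In the sketch below, random numbers stand in for the real [z, pkmag, SNR] metadata matrix; this illustrates the nearest-neighbor idea, not the package code:

```python
import numpy as np

rng = np.random.default_rng(42)

# toy metadata matrices replacing the real [z, pkmag, SNR] features (step 1)
n_train, n_test, n_feat = 10, 200, 4
train = rng.normal(size=(n_train, n_feat))
test = rng.normal(size=(n_test, n_feat))
train_type = rng.choice(['Ia', 'Ibc', 'II'], size=n_train)
test_type = rng.choice(['Ia', 'Ibc', 'II'], size=n_test)

canonical = []
for i in range(n_train):
    # restrict to test objects of the same SN type (step 2)
    same = np.where(test_type == train_type[i])[0]
    # nearest neighbor in the photometric parameter space (step 3)
    d = np.linalg.norm(test[same] - train[i], axis=1)
    canonical.append(same[d.argmin()])

print(len(canonical))  # same size as the training sample
```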
actsnclass allows you to perform this task using the actsnclass.build_snpcc_canonical module:
>>> from actsnclass import build_snpcc_canonical

>>> # define variables
>>> data_dir = 'data/SIMGEN_PUBLIC_DES/'
>>> output_sample_file = 'results/Bazin_SNPCC_canonical.dat'
>>> output_metadata_file = 'results/Bazin_metadata.dat'
>>> features_file = 'results/Bazin.dat'

>>> sample = build_snpcc_canonical(path_to_raw_data=data_dir, path_to_features=features_file,
>>>                                output_canonical_file=output_sample_file,
>>>                                output_info_file=output_metadata_file,
>>>                                compute=True, save=True)
Once the sample is constructed you can compare the distributions in [z, g_pkmag, r_pkmag] with a plot:
>>> from actsnclass import plot_snpcc_train_canonical

>>> plot_snpcc_train_canonical(sample, output_plot_file='plots/compare_canonical_train.png')

In the command line, using the same parameters as in the code above, you can do all at once:
>>> build_canonical.py -c <if True compute metadata>
>>> -d <path to raw data dir>
>>> -f <input features file> -m <output file for metadata>
>>> -o <output file for canonical sample> -p <comparison plot file>
>>> -s <if True save metadata to file>
You can check that the file results/Bazin_SNPCC_canonical.dat
is very similar to the original features file.
The only difference is that now a few of the sample variables are set to queryable:
id redshift type code sample gA gB gt0 gtfall gtrise rA rB rt0 rtfall rtrise iA iB it0 itfall itrise zA zB zt0 ztfall ztrise
116537 0.5547 II 36 test 10.969309008063526 -2.505571025776927 33.36879338510094 89.92091344407919 -1.4070121479476083 35.57261257957346 -0.97172916012906 47.691951316763436 37.48483229249487 -7.146619117223875 41.16723042762342 0.14005823764049471 47.983238664813264 39.02626334489017 -6.096248676680143 36.82968789062783 -0.373638211418927 48.438610651533445 41.8848763303308 -7.169183522127793
855370 0.5421 Ibc 23 test -5.514648646328689 3.370545820694393 6.890703579070343 127.00223553079377 -0.04721760599586505 27.987087830949765 -0.4446376337515848 51.06299616763716 13.46475077451422 -0.7802021055103384 42.50390337399486 1.217778587283846 63.88727539461748 4.425762504064253 -2.826164280709543 57.08358377564619 -0.9866672975549484 65.3976378960504 2.88096954432307 -2.211749860376304
328118 0.3131 II 37 test 28.134338786167365 0.7147372066065217 45.830405214215425 15.850284787778433 -0.0005766632993162762 29.225476277548275 -1.9734118280637896 45.83230493446332 87.25700882127312 -0.00025821702716214264 24.95217257542528 -0.3731568724509137 40.527255841246365 311.42509172517947 -3.099534332677601 46.782672798921226 -0.05678675661798624 53.51739930097104 50.76716462668245 -4.572685479766832
704481 0.4665 Ia 0 queryable -50.86850174812521 2.4148469184147547 16.05240678384717 3.5459318666713298 -0.4734666030325012 74.65602268994473 -3.763616485144308 48.208444944828855 24.3318092539982 -4.452287612782472 83.29745588526693 -7.371877954771961 50.92270461365078 38.76468635410394 -10.931632426569717 73.35112115534632 -1.2509966370291774 40.053959252846106 44.453394158157614 -0.18674652754319326
43679 0.5756 II 33 test 28.24271470397688 -3.438072722932048 23.521675700587007 32.4401288159836 -0.2295765027048151 -37.86668398190429 6.8580060036559365 22.252525376185087 2.3940753934318044 -1.611409074593934 -20.420915833911547 9.2659565057976 7.0218302478113035 23.713442135755557 -0.027609543521757457 14.76124690750807 -5.175821286895905 32.58560788340983 115.86494837233313 -0.2587648450330448
172648 0.7592 II 31 test -5.942021205498 2.5808480681448858 72.24216865162195 83.43696533883242 -0.04052830859563986 17.05919635984848 1.005755998955811 18.148318002391164 33.16119808959254 -0.12803412647871454 12.153694246253906 -1.2962293252577974 17.493068792921502 89.98548146319197 -0.14950758787782462 13.355316206445695 -2.4143982246591293 23.84002028961246 118.87985827861259 -1.4858837093947788
762146 0.7245 Ibc 22 test 10.734014319410377 -0.696725251384634 92.36623187978644 0.5112285753252996 -0.4900012400030447 12.968161724599275 -0.94670057528261 55.7516252880299 25.59410571631452 -1.971658945324412 19.03779421586546 -2.228264147418322 57.66412361316971 35.09222360219662 -3.325944814741228 22.877393161444374 0.3070939958786501 57.9675613727551 49.63155574417528 -1.8832424871891025
This means that you can use the actsnclass.learn_loop module in combination with the RandomSampling strategy while reading data from the canonical sample. In this way, at each iteration the code will select a random object from the test sample, but a query will only be made if the selected object belongs to the canonical sample.
In the command line, this looks like:
>>> run_loop.py -i results/Bazin_SNPCC_canonical.dat -b <batch size> -n <number of loops>
>>> -d <output metrics file> -q <output queried sample file>
>>> -s RandomSampling -t <choice of initial training>
Prepare data for time domain¶
In order to mimic the realistic situation where only a limited number of observed epochs is available on any given day, it is necessary to prepare our simulated data to resemble this scenario. In actsnclass this is done in 5 steps:
- Determine minimum and maximum MJD for the entire SNPCC sample;
- For each day of the survey, run through the entire data sample and select only the observed epochs which were obtained prior to it;
- Perform the feature extraction process considering only the photometric points which survived item 2.
- Check whether, at the MJD in question, the object is available for querying.
- Join all information in a standard features file.
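As an illustration of step 2 above, the cut on observed epochs amounts to filtering the photometry table by MJD. The column names follow the photometry table shown earlier; the day-to-MJD conversion and flux values are assumptions made for this toy example:

```python
import pandas as pd

# toy photometry table with the columns used in earlier examples
photometry = pd.DataFrame({
    'mjd': [56194.145, 56207.188, 56245.100, 56261.300, 56348.008],
    'band': ['g', 'g', 'r', 'i', 'z'],
    'flux': [7.6, -7.0, 80.1, 95.2, 70.69],
})

# one "day of the survey" counted from the first observation (an assumption)
min_mjd = photometry['mjd'].min()
day_of_survey = 20
cutoff = min_mjd + day_of_survey

# keep only epochs obtained prior to the chosen day (step 2)
available = photometry[photometry['mjd'] < cutoff]
print(len(available))
```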
You can perform the entire analysis for one day of the survey using the actsnclass.time_domain
module:
>>> from actsnclass.time_domain import SNPCCPhotometry

>>> path_to_data = 'data/SIMGEN_PUBLIC_DES/'
>>> output_dir = 'results/time_domain/'
>>> day = 20

>>> data = SNPCCPhotometry()
>>> data.create_daily_file(output_dir=output_dir, day=day)
>>> data.build_one_epoch(raw_data_dir=path_to_data, day_of_survey=day,
>>>                      time_domain_dir=output_dir)
Alternatively you can use the command line to prepare a sequence of days in one batch:
>>> build_time_domain.py -d 20 21 22 23 -p <path to raw data dir> -o <path to output time domain dir>
Active Learning loop¶
Details on running 1 loop¶
Once the data has been pre-processed, analysis steps 2-4 can be performed directly using the DataBase
object.
To start, we can load the feature information:
>>> from actsnclass import DataBase

>>> path_to_features_file = 'results/Bazin.dat'

>>> data = DataBase()
>>> data.load_features(path_to_features_file, method='Bazin')
Loaded 21284 samples!
Notice that this data has a pre-determined separation between training and test samples:
>>> data.metadata['sample'].unique()
array(['test', 'train'], dtype=object)
You can choose to start your first iteration of the active learning loop from the original training sample flagged in the file OR from scratch. As this is our first example, let’s do the simple thing and start from the original training sample. The code below builds the respective samples and performs the classification:
>>> data.build_samples(initial_training='original', nclass=2)
Training set size:  1093
Test set size:  20191

>>> data.classify(method='RandomForest')
>>> data.classprob                      # check classification probabilities
array([[0.461, 0.539],
       [0.346, 0.654],
       ...,
       [0.398, 0.602],
       [0.396, 0.604]])
Hint
If you wish to start from scratch, just set initial_training=N, where N is the number of objects you want in the initial training sample. The code will then randomly select N objects from the entire sample as the initial training sample. It will also impose that at least half of them are SNe Ia.
For a binary classification, the output from the classifier for each object (line) is presented as a pair of floats, the first column corresponding to the probability of the given object being a Ia and the second column its complement.
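A quick illustration of how to turn these probability pairs into hard predictions, using the example probabilities shown above (plain numpy, not an actsnclass call):

```python
import numpy as np

# the first four rows of classprob from the example above
classprob = np.array([[0.461, 0.539],
                      [0.346, 0.654],
                      [0.398, 0.602],
                      [0.396, 0.604]])

# first column: probability of being a Ia; second column: its complement
p_ia = classprob[:, 0]

# predicted class per object: the column with the highest probability
# (column 0 = Ia, column 1 = non-Ia)
pred = classprob.argmax(axis=1)
print(pred)   # all four example objects lean towards non-Ia
```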
Given the output from the classifier we can calculate the metric(s) of choice:
>>> data.evaluate_classification(metric_label='snpcc')
>>> print(data.metrics_list_names)      # check metric header
['acc', 'eff', 'pur', 'fom']

>>> print(data.metrics_list_values)     # check metric values
[0.5975434599574068, 0.9024767801857585, 0.34684684684684686, 0.13572404702012383]
and save results for this one loop to file:
>>> metrics_file = 'results/metrics.dat'
>>> queried_sample_file = 'results/queried_sample.dat'

>>> data.save_metrics(loop=0, output_metrics_file=metrics_file)
>>> data.save_queried_sample(loop=0, queried_sample_file=queried_sample_file,
>>>                          full_sample=False)
You should now have in your results directory a metrics.dat file which looks like this:
day accuracy efficiency purity fom query_id
0 0.4560942994403447 0.5545490350531705 0.23933367329593744 0.05263972502898026 81661
Running a number of iterations in sequence¶
We provide a function in which all the above steps can be performed in sequence for a number of iterations.
In interactive mode, you must define the required variables and use the actsnclass.learn_loop function:
>>> from actsnclass.learn_loop import learn_loop

>>> nloops = 1000                     # number of iterations
>>> method = 'Bazin'                  # only option in v1.0
>>> ml = 'RandomForest'               # only option in v1.0
>>> strategy = 'RandomSampling'       # learning strategy
>>> input_file = 'results/Bazin.dat'  # input features file
>>> metrics = 'results/metrics.dat'   # output metrics file
>>> queried = 'results/queried.dat'   # output query file
>>> train = 'original'                # initial training
>>> batch = 1                         # size of batch

>>> learn_loop(nloops=nloops, features_method=method, classifier=ml,
>>>            strategy=strategy, path_to_features=input_file, output_metrics_file=metrics,
>>>            output_queried_file=queried, training=train, batch=batch)
Alternatively you can also run everything from the command line:
>>> run_loop.py -i <input features file> -b <batch size> -n <number of loops>
>>> -d <output metrics file> -q <output queried sample file>
>>> -s <learning strategy> -t <choice of initial training>
The queryable sample¶
In the example shown above, when reading the data from the features file there were only two possibilities for the sample variable:
>>> data.metadata['sample'].unique()
array(['test', 'train'], dtype=object)
This corresponds to an unrealistic scenario where we are able to obtain spectra for any object at any time.
Hint
If you wish to restrict the sample available for querying, just change the sample variable to queryable for the objects available for querying. Whenever this keyword is encountered in a file of extracted features, the code automatically restricts the query selection to the objects flagged as queryable.
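A hypothetical sketch of how such a flag could be set with pandas before writing the features file back to disk. The column names follow the examples above; the id values and the list of observable objects are invented for illustration:

```python
import pandas as pd

# toy metadata table with the 'sample' column used by actsnclass
metadata = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'sample': ['train', 'test', 'test', 'test', 'test'],
})

# ids you could actually target for spectroscopy (purely illustrative)
observable_tonight = [2, 4]

# flag those objects as available for querying
mask = metadata['id'].isin(observable_tonight)
metadata.loc[mask, 'sample'] = 'queryable'

print(metadata['sample'].tolist())
```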
Active Learning loop in time domain¶
Considering that you have previously prepared the time domain data, you can run the active learning loop
in its current form either by using the actsnclass.time_domain_loop
or by using the command line
interface:
>>> run_time_domain.py -d <first day of survey> <last day of survey>
>>> -m <output metrics file> -q <output queried file> -f <features directory>
>>> -s <learning strategy> -t <choice of initial training>
Make sure you check the full documentation of the module to understand which variables are required depending on the case you wish to run.
For example, to run with SNPCC data, the largest survey interval you can run is between 20 and 182 days; the corresponding option is -d 20 182.
In the example above, if you choose to start from the original training sample, -t original, you must also input the path to the file containing the full light curve analysis, so the full initial training can be read. This option corresponds to -t original -fl <path to full lc features>.
More details can be found in the corresponding docstring.
Once you ran one or more options, you can use the actsnclass.plot_results
module, as described in the produce plots page.
The result will be something like the plot below (accounting for variations due to initial training).

Warning
At this point there is no Canonical sample option implemented for the time domain module.
Plotting¶
Once you have the metrics results for a set of learning strategies, you can plot the evolution of the following metrics:
- Accuracy: fraction of correct classifications;
- Efficiency: fraction of total SN Ia correctly classified;
- Purity: fraction of correct Ia classifications;
- Figure of merit: efficiency x purity with a penalty factor of 3 for false positives (contamination).
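These four definitions can be written down directly. The sketch below assumes binary labels with 1 = Ia and 0 = non-Ia, and the SNPCC penalty factor W = 3; it is illustrative code, not the package's actsnclass.metrics implementation:

```python
import numpy as np

def snpcc_metrics(label_pred, label_true, penalty=3.0):
    """Toy version of the four metrics listed above (1 = Ia, 0 = non-Ia)."""
    pred = np.asarray(label_pred)
    true = np.asarray(label_true)

    tp = np.sum((pred == 1) & (true == 1))   # Ia correctly classified
    fp = np.sum((pred == 1) & (true == 0))   # non-Ia flagged as Ia

    accuracy = np.mean(pred == true)                  # fraction correct
    efficiency = tp / np.sum(true == 1)               # fraction of all Ia found
    purity = tp / (tp + fp)                           # fraction of Ia calls correct
    # efficiency times pseudo-purity, with false positives penalized by W
    fom = efficiency * tp / (tp + penalty * fp)
    return accuracy, efficiency, purity, fom

pred = [1, 1, 0, 1, 0, 0]
true = [1, 0, 0, 1, 1, 0]
print(snpcc_metrics(pred, true))
```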
The class Canvas (https://actsnclass.readthedocs.io/en/latest/api/actsnclass.Canvas.html) enables you to do this using:
>>> from actsnclass.plot_results import Canvas

>>> # define parameters
>>> path_to_files = ['results/metrics_canonical.dat',
>>>                  'results/metrics_random.dat',
>>>                  'results/metrics_unc.dat']
>>> strategies_list = ['Canonical', 'RandomSampling', 'UncSampling']
>>> output_plot = 'plots/metrics.png'

>>> # initiate the Canvas object, read and plot the results
>>> # for each metric and strategy
>>> cv = Canvas()
>>> cv.load_metrics(path_to_files=path_to_files,
>>>                 strategies_list=strategies_list)
>>> cv.set_plot_dimensions()
>>> cv.plot_metrics(output_plot_file=output_plot,
>>>                 strategies_list=strategies_list)
This will generate:

Alternatively, you can use it directly from the command line.
For example, the result above could also be obtained doing:
>>> make_metrics_plots.py -m <path to canonical metrics> <path to rand sampling metrics> <path to unc sampling metrics>
>>> -o <path to output plot file> -s Canonical RandomSampling UncSampling
OBS: the color palette for this project was chosen to honor the work of Piet Mondrian.
How to contribute¶
Below you will find general guidance on how to prepare your piece of code to be integrated into the actsnclass environment.
Add a new data set¶
The main challenge of adding a new data set is to build the infrastructure necessary to handle the new data.
The function below shows the basic structure required to deal with one light curve:
>>> import pandas as pd

>>> def load_one_lightcurve(path_to_data, *args):
>>>     """Load 1 light curve at a time.
>>>
>>>     Parameters
>>>     ----------
>>>     path_to_data: str
>>>         Complete path to data file.
>>>     ...
>>>     ...
>>>
>>>     Returns
>>>     -------
>>>     pd.DataFrame
>>>     """
>>>
>>>     ####################
>>>     # Do something #####
>>>     ####################
>>>
>>>     # structure of light curve
>>>     lc = {}
>>>     lc['dataset_name'] = XXXX           # name of the data set
>>>     lc['filters'] = [X, Y, Z]           # list of filters
>>>     lc['id'] = XXX                      # identification number
>>>     lc['redshift'] = X                  # redshift (optional, important for building canonical)
>>>     lc['sample'] = XXXXX                # train, test or queryable (none is mandatory)
>>>     lc['sntype'] = X                    # Ia or non-Ia
>>>     lc['photometry'] = pd.DataFrame()   # min keys: MJD, filter, FLUX, FLUXERR
>>>                                         # bonus: MAG, MAGERR, SNR
>>>     return lc
Feel free to also provide other keywords which might be important to handle your data. Given a function like this we should be capable of incorporating it into the pipeline.
Please refer to the actsnclass.fit_lightcurves
module for a closer look at this part of the code.
Add a new feature extraction method¶
Currently actsnclass only deals with Bazin features.
The snippet below shows an example of friendly code for a new feature extraction method.
>>> def new_feature_extraction_method(time, flux, *args):
>>>     """Extract features from light curve.
>>>
>>>     Parameters
>>>     ----------
>>>     time: 1D - np.array
>>>         Time of observation.
>>>     flux: 1D - np.array of floats
>>>         Measured flux.
>>>     ...
>>>     ...
>>>
>>>     Returns
>>>     -------
>>>     set of features
>>>     """
>>>
>>>     ################################
>>>     ###### Do something ############
>>>     ################################
>>>
>>>     return features
You can check the current feature extraction tools for the Bazin parametrization at actsnclass.bazin
module.
Add a new classifier¶
A new classifier should be wrapped in a function such as:
>>> def new_classifier(train_features, train_labels, test_features, *args):
>>>     """New classifier.
>>>
>>>     Parameters
>>>     ----------
>>>     train_features: np.array
>>>         Training sample features.
>>>     train_labels: np.array
>>>         Training sample classes.
>>>     test_features: np.array
>>>         Test sample features.
>>>     ...
>>>     ...
>>>
>>>     Returns
>>>     -------
>>>     predictions: np.array
>>>         Predicted classes - 1 class per object.
>>>     probabilities: np.array
>>>         Classification probability for all objects, [pIa, pnon-Ia].
>>>     """
>>>
>>>     #######################################
>>>     ####### Do something ##################
>>>     #######################################
>>>
>>>     return predictions, probabilities
The only classifier implemented at this point is a Random Forest and can be found at the
actsnclass.classifiers
module.
Important
Remember that in order to be effective in the active learning framework, a classifier should not be heavy on the required computational resources and must be sensitive to small changes in the training sample. Otherwise the evolution will be difficult to tackle.
Add a new query strategy¶
A query strategy is a protocol which evaluates the current state of the machine learning model and makes an informed decision about which objects should be included in the training sample.
This is very general, and the function can receive as input any information regarding the physical properties of the test and/or target samples and current classification results.
A minimum structure for such function would be:
>>> def new_query_strategy(class_prob, test_ids, queryable_ids, batch, *args):
>>>     """New query strategy.
>>>
>>>     Parameters
>>>     ----------
>>>     class_prob: np.array
>>>         Classification probability. One value per class per object.
>>>     test_ids: np.array
>>>         Set of ids for objects in the test sample.
>>>     queryable_ids: np.array
>>>         Set of ids for objects available for querying.
>>>     batch: int
>>>         Number of objects to be chosen in each batch query.
>>>     ...
>>>     ...
>>>
>>>     Returns
>>>     -------
>>>     query_indx: list
>>>         List of indexes identifying the objects from the test sample
>>>         to be queried in decreasing order of importance.
>>>     """
>>>
>>>     ############################################
>>>     ##### Do something #########################
>>>     ############################################
>>>
>>>     return query_indx     # list of indexes of size batch
The currently available strategies are Passive Learning (or Random Sampling) and Uncertainty Sampling. Both can be scrutinized in the actsnclass.query_strategies module.
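As an illustration, a minimal uncertainty-sampling query following the template above could look like this. It picks the queryable objects whose Ia probability is closest to 0.5, i.e. where the classifier is least confident; this is a sketch, not the package's exact implementation:

```python
import numpy as np

def uncertainty_sampling(class_prob, test_ids, queryable_ids, batch=1):
    """Toy uncertainty sampling following the query-strategy template."""
    class_prob = np.asarray(class_prob)

    # distance of the Ia probability (first column) from total indecision
    uncertainty = np.abs(class_prob[:, 0] - 0.5)
    order = uncertainty.argsort()           # most uncertain objects first

    # keep only objects available for querying, in order of importance
    queryable = set(queryable_ids)
    chosen = [i for i in order if test_ids[i] in queryable]
    return chosen[:batch]

probs = [[0.90, 0.10], [0.55, 0.45], [0.20, 0.80], [0.48, 0.52]]
ids = [101, 102, 103, 104]
chosen = uncertainty_sampling(probs, ids, queryable_ids=[101, 102, 104], batch=2)
print(chosen)
```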
Add a new diagnostic metric¶
Beyond the criteria for choosing an object to be queried, one could also think about testing different metrics to evaluate the performance of the classifier at each learning loop.
A new diagnostic metric can then be provided in the form:
>>> def new_metric(label_pred: list, label_true: list, ia_flag, *args):
>>>     """Calculate a new metric.
>>>
>>>     Parameters
>>>     ----------
>>>     label_pred: list
>>>         Predicted labels.
>>>     label_true: list
>>>         True labels.
>>>     ia_flag: number, symbol
>>>         Flag used to identify Ia objects.
>>>     ...
>>>     ...
>>>
>>>     Returns
>>>     -------
>>>     a number or set of numbers
>>>         Tells us how good the fit was.
>>>     """
>>>
>>>     ###########################################
>>>     ##### Do something ! ######################
>>>     ###########################################
>>>
>>>     return metric_values     # a number or set of numbers
The currently implemented diagnostic metrics are those used in the SNPCC (Kessler et al., 2009) and can be found in the actsnclass.metrics module.
Reference / API¶
Pre-processing¶
Light curve analysis¶
Performing feature extraction for 1 light curve
LightCurve(): Light Curve object, holding meta and photometric data.
LightCurve.load_snpcc_lc(path_to_data): Reads one LC from SNPCC data.
LightCurve.fit_bazin(band): Extract Bazin features for one filter.
LightCurve.fit_bazin_all(): Perform Bazin fit for all filters independently and concatenate results.
LightCurve.plot_bazin_fit([save, show, …]): Plot data and Bazin fitted function.
Fitting an entire data set
fit_snpcc_bazin(path_to_data_dir, features_file): Fit Bazin functions to all filters in training and test samples.
Basic light curve analysis tools
bazin.bazin(time, a, b, t0, tfall, trise): Parametric light curve function proposed by Bazin et al., 2009.
bazin.errfunc(params, time, flux): Absolute difference between theoretical and measured flux.
bazin.fit_scipy(time, flux): Find best-fit parameters using scipy.least_squares.
Canonical sample¶
The Canonical object for holding the entire sample.
Canonical(): Canonical sample object.
Canonical.snpcc_get_canonical_info(…[, …]): Load SNPCC metadata required to characterize objects.
Canonical.snpcc_identify_samples(): Identify training and test samples.
Canonical.find_neighbors(): Identify 1 nearest neighbor for each object in training.
Functions to populate the Canonical object
build_snpcc_canonical(path_to_raw_data, …): Build canonical sample for SNPCC data.
plot_snpcc_train_canonical(sample[, …]): Plot comparison between training and canonical samples.
Build time domain data base¶
SNPCCPhotometry(): Handles photometric information for the entire SNPCC data.
SNPCCPhotometry.get_lim_mjds(raw_data_dir): Get minimum and maximum MJD for the complete sample.
SNPCCPhotometry.create_daily_file(…[, header]): Create one file for a given day of the survey.
SNPCCPhotometry.build_one_epoch(…[, …]): Fit Bazin for all objects with enough points in a given day.
DataBase¶
Object upon which the learning process is performed
DataBase(): DataBase object, upon which the active learning loop is performed.
DataBase.load_bazin_features(path_to_bazin_file): Load Bazin features from file.
DataBase.load_features(path_to_file[, …]): Load features according to the chosen feature extraction method.
DataBase.build_samples(initial_training[, …]): Separate train and test samples.
DataBase.classify([method, screen, n_est, …]): Apply a machine learning classifier.
DataBase.evaluate_classification([…]): Evaluate results from classification.
DataBase.make_query([strategy, batch, seed, …]): Identify new object to be added to the training sample.
DataBase.update_samples(query_indx, loop[, …]): Add the queried obj(s) to training and remove them from test.
DataBase.save_metrics(loop, …[, batch]): Save current metrics to file.
DataBase.save_queried_sample(…[, …]): Save queried sample to file.
Classifiers¶
random_forest(train_features, train_labels, …): Random Forest classifier.
Query strategies¶
random_sampling(test_ids, queryable_ids[, …]): Randomly choose an object from the test sample.
uncertainty_sampling(class_prob, test_ids, …): Search for the sample with highest uncertainty in predicted class.
Metrics¶
Individual metrics
accuracy(label_pred, label_true): Calculate accuracy.
efficiency(label_pred, label_true[, ia_flag]): Calculate efficiency.
purity(label_pred, label_true[, ia_flag]): Calculate purity.
fom(label_pred, label_true[, ia_flag, penalty]): Calculate figure of merit.
Metrics aggregated by category or use
get_snpcc_metric(label_pred, label_true[, …]): Calculate the metric parameters used in the SNPCC.
Active Learning loop¶
Full light curve
learn_loop(nloops, strategy, …[, …]): Perform the active learning loop.
Time domain
get_original_training(path_to_features[, …]): Read original full light curve training sample.
time_domain_loop(days, output_metrics_file, …): Perform the active learning loop.
Plotting¶
Canvas(): Canvas object, handles and plots information from multiple strategies.
Canvas.load_metrics(path_to_files, …[, …]): Load and identify set of metrics.
Canvas.set_plot_dimensions(): Set directives for plot sizes.
Canvas.plot_metrics(output_plot_file, …): Generate plot for all metrics in files and strategies given as input.
Scripts¶
build_canonical(user_choices): Build canonical sample for SNPCC data set fitted with Bazin features.
build_time_domain(user_choice): Generate features files for a list of days of the survey.
fit_dataset(user_choices): Fit the entire sample with the Bazin function.
make_metrics_plots(user_input): Generate metric plots.
run_loop(args): Command line interface to run the active learning loop.
run_time_domain(user_choice): Command line interface to the Time Domain Active Learning scenario.