Workflow mode
Phenonaut may be used in a workflow mode, removing the requirement for users to write Python.
With the definition of workflows in simple YAML files (https://en.wikipedia.org/wiki/YAML/), standardised workflows can be created that have access to all Phenonaut functionality. Whilst YAML files are a simple and convenient format, JSON dictionaries represented as text files in the .json format may also be passed. For the purposes of this documentation, the YAML format is used to exemplify Phenonaut capabilities. Phenonaut determines whether it is working with YAML or JSON files from the workflow filename extension.
The example YAML file for prediction of 1 year survival rates is as follows:
---
Example_1_predict_survival:
- load:
    dataset: TCGA
- predict:
    target: survives_1_yr
Here, the YAML file begins with three dashes (---), as is customary in the YAML file format, and a dictionary then begins with a key denoting the name of the job, 'Example_1_predict_survival'. The value under this dictionary key is a list of key:value pairs denoting the workflow functions being called. The first item uses the load key, with a dictionary of arguments and their values for the load function given as its value.
The second item in the 'Example_1_predict_survival' task is a dictionary with the key 'predict', denoting that Phenonaut will call its prediction capabilities; beneath this, another key:value pair denotes the target column/feature within the dataset.
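The same job may also be expressed in the JSON format; the snippet below is an illustrative rendering of the YAML above (saved with a .json extension so that Phenonaut detects the format from the filename):

{
  "Example_1_predict_survival": [
    {"load": {"dataset": "TCGA"}},
    {"predict": {"target": "survives_1_yr"}}
  ]
}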
This page documents all workflow-enabled functions, such as 'load' and 'predict' given above.
The publication SI material contains a worked example using workflow mode which can also be found here: Example 1 - TCGA.
Documentation on the Python API is useful for understanding the implementation details of workflow mode.
I/O
load
Load a dataset in the format denoted by a given file extension, or a PackagedDataset known to Phenonaut. There are two main routes to loading a dataset in workflow mode:
Loading a packaged dataset (using the dataset key)
Loading a user supplied file (using the file key) - normally a CSV file, but can be any file format given below, with filetype inferred through the filename extension.
- dataset
The name of the packaged dataset known to Phenonaut that should be loaded. The currently available datasets are listed below:
Available packaged datasets (case insensitive):

Name          Description
TCGA          The Cancer Genome Atlas (Weinstein 2013), prepared using methods described by Lee (Lee 2021).
CMAP          Connectivity Map (Lamb 2006).
iris          The Iris dataset (Yann).
BreastCancer  The Breast Cancer Wisconsin (Diagnostic) dataset (Wisconsin).
References:
Weinstein, John N., et al. "The Cancer Genome Atlas pan-cancer analysis project." Nature Genetics 45.10 (2013): 1113-1120.
Lee, Changhee, and Mihaela van der Schaar. "A variational information bottleneck approach to multi-omics data integration." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
Lamb, Justin, et al. "The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease." Science 313.5795 (2006): 1929-1935.
Yann, http://yann.lecun.com/exdb/mnist/, (visited 27/4/2022)
Wisconsin, https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic), (visited 27/4/2022)
Example dataset load usage:

---
- load:
    dataset: TCGA

- file
The path to the file should be supplied using the file key. This is not required when loading a packaged dataset known to Phenonaut, which is instead loaded using the dataset key described above.
Workflow file formats (* indicates that <ext>.gz may also be used, denoting gzip compression):

Extension  Description
csv*       Standard CSV (comma separated values) file.
tsv*       Tab separated values file.
h5         HDF5 data. A 'key' argument must also be passed in conjunction with this file format, denoting the key within the H5 file under which the data (normally a pd.DataFrame) resides.
Filetype is inferred from the file name extension. Note that .csv and .tsv files may be gzipped (.csv.gz and .tsv.gz).
- metadata
In addition to specifying the file location, a metadata dictionary may also be given, allowing special keywords for the reading of data in different formats. Special keys for the metadata dictionary are listed below (See Pandas documentation for read_csv for more in-depth information):
‘sep’: define separator for fields within the file (by default ‘,’, but if none is supplied and a TAB character is found in the first line of the file, then it is assumed sep should be a tab character)
‘skiprows’: define a number of rows to skip at the beginning of a file
‘header_row_number’ : may be a single number, or list of numbers denoting header rows.
‘transpose’ : Set to True to transpose the table. In the case of some raw transcriptomics data, transposition is required so that samples are arranged row-wise.
‘index_col’ : Column to use as the row labels of the CSV given as either the column names or as numerical indexes. Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
‘key’ : If an h5 file is specified, then this is the key under which the pandas dataframe representing the dataset resides.
Example load usage:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
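For an HDF5 input, the 'key' metadata entry selects the DataFrame within the file. The sketch below assumes a hypothetical file profiles.h5 with the DataFrame stored under the key 'df':

---
- load:
    file: profiles.h5
    metadata:
      key: df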
write_csv
Writes a dataset to a CSV file. As well as the keys defined below, additional keys may be supplied which are passed to Pandas.DataFrame.to_csv as kwargs, allowing fully flexible output.
See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv for a full listing of additional arguments.
- target_dataset
Index or name of dataset which should be written to the file. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- file
The path to the file where output should be written.
Example write_csv usage:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- pca:
    ndims: 2
- write_csv:
    file: pca_cell_paint.csv
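As noted above, any additional keys are passed through to Pandas.DataFrame.to_csv as kwargs. A minimal sketch suppressing the DataFrame index in the output (index is a standard to_csv argument; the filename is illustrative):

- write_csv:
    file: pca_cell_paint_noindex.csv
    index: false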
write_multiple_csvs
Often it is useful to write a CSV file per plate within the dataset, or group the data by some other identifier. This write_multiple_csvs workflow function provides this functionality.
Additional keys may be supplied which are passed to Pandas.DataFrame.to_csv as kwargs, allowing fully flexible output. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv for a full listing of additional arguments.
- split_by_column
The name of the column to split on.
- output_dir
The target output directory.
- file_prefix
Optional prefix for each file.
- file_suffix
Optional suffix for each file.
- file_extension
Optional file extension, by default '.csv'.
Example write_multiple_csvs usage:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- pca:
    ndims: 2
- write_multiple_csvs:
    split_by_column: plate_id
    output_dir: split_output
    file_prefix: split
Transforms
add_well_id
Often, we would like to use row and column numbers to resolve a more traditional alphanumeric well ID notation, such as A1, A2, etc. This can be achieved by calling this workflow function.
If a dataset contains numerical row and column names, then they may be translated into standard letter-number well IDs. The arguments dictionary may contain the following keys, with their values denoted as below:
- numerical_column_name
Name of the column containing the numeric column number. If not supplied, then behaves as if "COLUMN" was given.
- numerical_row_name
Name of the column containing the numeric row number. If not supplied, then behaves as if "ROW" was given.
- plate_type
Plate type. Note: at present, only the 384-well plate format is supported. If not supplied, then behaves as if 384 was given.
- new_well_column_name
Name of new column containing letter-number well ID, if not supplied, then behaves as if default “Well”.
- add_empty_wells
Should all wells from a plate be inserted, even when missing from the data, if not supplied, then behaves as if False.
- plate_barcode_column
Multiple plates may be present in a dataset; this column contains their unique IDs. If not supplied, then behaves as if None.
- no_sort
Do not re-sort the dataset by well ID. If not supplied, then behaves as if False.
Example add_well_id:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- add_well_id:
    numerical_column_name: COLUMN_X
    numerical_row_name: ROW_Y
    plate_type: 384
    add_empty_wells: true
copy_column
Copies the values of one column within a dataset to another.
Supply either ‘to’ and ‘from’ keys, or alternatively, simply from:to key-value pairs denoting how to perform the copy operation.
Example copying a column:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- copy_column:
    - from: feat_1
    - to: new_feat_1
- copy_column:
    - feat_2: new_feat_2
rename_column and rename_columns
Renames a single column, or multiple columns. The arguments dictionary should contain key:value pairs, where the key is the old column name and the value is the new column name.
Alternatively, supply 'name_from' and 'name_to' keys to denote how to perform the rename operation. Multiple columns can be renamed in a single call using multiple dictionary entries.
Example rename columns:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- rename_column:
    - feat_1: new_feat_1
    - feat_2: new_feat_2
    - name_from: feat_3
    - name_to: new_feat_3
Filtering
filter_columns
Datasets may have columns defined for keeping or removal. This function also provides a convenient way to reorder dataframe columns.
- keep
Only matching columns are kept if true. If false, they are removed. By default, True.
- column_names
Can be a string, or list of strings to match (or regular expressions). Can be used interchangeably with column_name.
- column_name
Singular column to keep (or regular expression to match). Can be used interchangeably with column_names.
- regex
If true, perform regular expression matching. By default false.
Example filtering columns:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- filter_columns:
    column_name: feat_

The above example removes all metadata, keeping only features (columns beginning with feat_).
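A variant using the regex and keep keys is sketched below (the pattern and column names are illustrative); here, matching columns are removed rather than kept:

- filter_columns:
    column_names:
      - ^unwanted_.*
    regex: true
    keep: false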
filter_rows
Allows keeping only rows with a certain value in a given column. Takes as arguments a dictionary containing a query_column key:value pair and one of query_value, query_values or values key:value pairs.
- query_column
Name of the column that should match the value below
- query_value
Value to match. Note: query_value, query_values and values can be used interchangeably.
- query_values
Values to match (as a list). Note: query_value, query_values and values can be used interchangeably.
- values
Values to match (as a list). Note: query_value, query_values and values can be used interchangeably.
- keep
If True then rows matching the query are kept, if False, then rows matching are discarded, and non-matching rows kept.
Example filtering rows:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- filter_rows:
    query_column: valid_well
    query_value:
      - 1
      - true
      - yes

The above example only keeps rows where the value in the valid_well column is 1, true, or yes.
VIF_filter_features
Performs variance inflation factor (VIF) filtering on a dataset, removing highly correlated features whose removal is not detrimental to capturing the variance within the data. More information is available at https://en.wikipedia.org/wiki/Variance_inflation_factor
This can be a computationally expensive process, as the number of linear regressions required scales approximately with the square of the number of features.
- target_dataset
Index or name of dataset which should have features filtered. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- vif_cutoff
The VIF cutoff value, above which features are removed. Features with VIF scores above 5.0 are considered highly correlated. If not supplied, then behaviour is as if a value of 5.0 was supplied.
- min_features
Removal of too many features can be detrimental. Setting this value sets a lower limit on the number of features which must remain. If absent, then behaviour is as if a value of 2 was given.
- drop_columns
If drop columns is True, then not only will features be removed from the dataset features list, but the columns for these features will be removed from the dataframe. If absent, then behaviour is as if False was supplied.
Example VIF filtering:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- VIF_filter_features:
    min_features: 3
if_blank_also_blank
Often it is required to clean or remove rows not needed for inclusion in further established pipelines/workflows. This workflow function removes values from a column on the condition that another column is empty.
- query_column
Value is the name of the column to perform the query on.
- regex_query
Value is a boolean denoting if the query column value should be matched using a regular expression. If omitted, then behaves as if present and False.
- target_column
Value is a string, denoting the name of the column which should be blanked.
- target_columns
Value is a list of strings, denoting the names of columns which should be blanked.
- regex_targets
Value is a boolean denoting if the target column or multiple target columns defined in target_columns should be matched using a regular expression. If absent, then behaves as if False was supplied.
Example conditionally removing columns:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- if_blank_also_blank:
    query_column: dispense_volume
    target_column: valid_well
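When several columns should be blanked, target_columns accepts a list, optionally matched as regular expressions via regex_targets. The column names and pattern below are illustrative:

- if_blank_also_blank:
    query_column: dispense_volume
    target_columns:
      - valid_well
      - feat_.*
    regex_targets: true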
Dimensionality reduction
pca
Perform the PCA dimensionality reduction technique. If no arguments are given, then 2D PCA is applied to the dataset with the highest index, which is usually the last inserted dataset.
- target_dataset
Index or name of dataset which should have the dimensionality reduction applied. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- ndims
Number of dimensions to which the PCA should reduce the features. If absent, then defaults to 2.
- center_on_perturbation_id
If supplied, the PCA is recentered on the perturbation with this ID. If absent, then defaults to None, and no centering is performed.
- center_by_median
If true, then median of center_on_perturbation is used, if False, then the mean is used.
- fit_perturbation
PCA may be fit to only the included IDs, before the transform is applied to the whole dataset.
Example PCA:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- pca:
    ndims: 2
tsne
Perform the t-SNE dimensionality reduction technique. If no arguments are given then 2D t-SNE is applied to the dataset with the highest index which is usually the last inserted dataset.
- target_dataset
Index or name of dataset which should have the dimensionality reduction applied. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- ndims
Number of dimensions to which the t-SNE should reduce the features. If absent, then defaults to 2.
- center_on_perturbation_id
If supplied, the t-SNE embedding is recentered on the perturbation with this ID. If absent, then defaults to None, and no centering is performed.
- center_by_median
If true, then median of center_on_perturbation is used, if False, then the mean is used.
Example t-SNE:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- tsne:
    ndims: 2
umap
Perform the UMAP dimensionality reduction technique. If no arguments are given then 2D UMAP is applied to the dataset with the highest index which is usually the last inserted dataset.
- target_dataset
Index or name of dataset which should have the dimensionality reduction applied. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- ndims
Number of dimensions to which the UMAP should reduce the features. If absent, then defaults to 2.
- center_on_perturbation_id
If supplied, the UMAP embedding is recentered on the perturbation with this ID. If absent, then defaults to None, and no centering is performed.
- center_by_median
If true, then median of center_on_perturbation is used, if False, then the mean is used.
Example UMAP:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- umap:
    ndims: 2
Measurements
scalar_projection
Add a column for the scalar projection. Calculates the scalar projection and scalar rejection, quantifying on and off target phenotypes, as used in: Heiser, Katie, et al. “Identification of potential treatments for COVID-19 through artificial intelligence-enabled phenomic analysis of human cells infected with SARS-CoV-2.” BioRxiv (2020).
- target_dataset
Index or name of dataset which should have features filtered. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- target_perturbation_column_name
Column name to match the target_perturbation_column value defined below, usually a column capturing control names.
- target_perturbation_column_value
value to be found in the column defined in target_perturbation_column_name
- output_column_label
Output columns from the scalar projection will have the form:
  on_target_<output_column_label>
  off_target_<output_column_label>
If this is missing, then it is set to target_perturbation_column_value.
Example scalar projection:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- pca:
    ndims: 2
- scalar_projection:
    target_perturbation_column_name: control_name
    target_perturbation_column_value: positive_control
    output_column_label: sproj
euclidean_distance
Add a column for the Euclidean distance between all perturbations and a target perturbation.
- target_dataset
Index or name of dataset which should have features filtered. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- target_perturbation_column_name
Column name to match the target_perturbation_column value defined below, usually a column capturing control names.
- target_perturbation_column_value
value to be found in the column defined in target_perturbation_column_name
- output_column_label
Output column name for the distance measurements. If this is missing, then it is set to target_perturbation_column_value
Example euclidean_distance:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- pca:
    ndims: 2
- euclidean_distance:
    target_perturbation_column_name: control_name
    target_perturbation_column_value: positive_control
    output_column_label: dist
manhattan_distance
Add a column for the Manhattan distance between all perturbations and a target perturbation (also known as the cityblock distance).
- target_dataset
Index or name of dataset which should have features filtered. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- target_perturbation_column_name
Column name to match the target_perturbation_column value defined below, usually a column capturing control names.
- target_perturbation_column_value
value to be found in the column defined in target_perturbation_column_name
- output_column_label
Output column name for the distance measurements. If this is missing, then it is set to target_perturbation_column_value
Example manhattan_distance:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- pca:
    ndims: 2
- manhattan_distance:
    target_perturbation_column_name: control_name
    target_perturbation_column_value: positive_control
    output_column_label: dist
cityblock_distance
Add a column for the cityblock distance between all perturbations and a target perturbation (also known as the Manhattan distance).
- target_dataset
Index or name of dataset which should have features filtered. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- target_perturbation_column_name
Column name to match the target_perturbation_column value defined below, usually a column capturing control names.
- target_perturbation_column_value
value to be found in the column defined in target_perturbation_column_name
- output_column_label
Output column name for the distance measurements. If this is missing, then it is set to target_perturbation_column_value
Example cityblock_distance:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- pca:
    ndims: 2
- cityblock_distance:
    target_perturbation_column_name: control_name
    target_perturbation_column_value: positive_control
    output_column_label: dist
mahalanobis_distance
Add a column for the Mahalanobis distance between all perturbations and a target perturbation.
- target_dataset
Index or name of dataset which should have features filtered. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- target_perturbation_column_name
Column name to match the target_perturbation_column value defined below, usually a column capturing control names.
- target_perturbation_column_value
value to be found in the column defined in target_perturbation_column_name
- output_column_label
Output column name for the distance measurements. If this is missing, then it is set to target_perturbation_column_value
Example mahalanobis_distance:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- pca:
    ndims: 2
- mahalanobis_distance:
    target_perturbation_column_name: control_name
    target_perturbation_column_value: positive_control
    output_column_label: dist
Plotting
set_perturbation_column
Within a dataset, a special flag may be set on a column indicating that it is the perturbation column. This is useful in plotting, where repeats of the same perturbation should share a colour.
- target_dataset
Index or name of dataset within which we wish to set the perturbation column. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- column
Column name to mark as containing perturbation IDs.
Example set_perturbation_column:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- set_perturbation_column:
    column: compound_id
scatter
Produce a scatter plot from a dataset. Used in conjunction with set_perturbation_column, similar perturbations will be grouped together.
- target_dataset
Index or name of dataset which should have features plotted. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- figsize
A tuple denoting the target output size in inches. If absent, then the default of (8, 6) is used.
- title
Title for the plot, if absent, then the default “2D scatter” is used.
- perturbations
Can be a list of perturbations - as denoted in the perturbations column of the dataframe - to include in the plot. If absent, then all perturbations are included.
- destination
Output location for the PNG - required field. An error will be thrown if omitted.
Example scatter:

---
- load:
    file: cell_paint.csv
    metadata:
      sep: ','
      transpose: false
      features_prefix:
        - feat_
- pca:
    ndims: 2
- set_perturbation_column:
    column: compound_id
- scatter:
    destination: scatter_pca_cell_paint.png
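The optional keys documented above may also be supplied; a minimal sketch with illustrative title, figure size and output filename:

- scatter:
    destination: scatter_controls.png
    title: PCA of control compounds
    figsize:
      - 8
      - 6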
Prediction
predict
Profile predictors in their ability to predict a given target.
Phenonaut provides functionality to profile the performance of multiple predictors against multiple views of data. This is exemplified in the TCGA example used in the Phenonaut paper - see Example 1 - TCGA for a full walkthrough of applying this functionality to The Cancer Genome Atlas. With a given 'target' for prediction which is present in the dataset, predict selects all appropriate predictors (classifiers for classification, regressors for regression, and multiregressors for multi regression/view targets). Then, enumerating all views of the data and all predictors, hyperparameter optimisation coupled with 5-fold cross validation using Optuna is employed, before finally testing the best hyperparameter sets with retained test sets.
This process is automatic and requires only the data and a prediction target. Output from the process is extensive, and it may take a long time to complete depending on the characteristics of your input data. Written output from the profiling process consists of performance heatmaps highlighting the best view/predictor combinations in bold, boxplots for each view combination, and a PPTX presentation file allowing easy sharing of results, along with machine-readable CSV and JSON results.
For each unique view combination and predictor, perform the following:
Merge views and remove samples which do not have features across currently needed views.
Shuffle the samples.
Withhold 20% of the data as a test set, to be tested against the trained and hyperparameter optimised predictor.
Split the data using 5-fold cross validation into train and validation sets.
For each fold, perform Optuna hyperparameter optimisation for the given predictor using the training sets, exploring the hyperparameters described by the default predictors for classification, regression and multiregression.
- output_directory
Directory into which profiling output (boxplots, heatmaps, CSV, JSON and PPTX) should be written.
- dataset_combinations
If multiple datasets are already loaded, then lists of ‘views’ may be specified for exploration. If None, or this argument is absent, then all combinations of available views/Datasets are enumerated and used.
- target
The prediction target, denoted by a column name given here which exists in loaded datasets.
- n_splits
Number of splits to use in the N-fold cross validation, if absent, then the default of 5 is used.
- n_optuna_trials
Number of Optuna trials for hyperparameter optimisation, by default 20. This drastically impacts runtime, so if things are taking too long, you may wish to lower this number. For a more thorough exploration of hyperparameter space, increase this number.
- optuna_merge_folds
By default, hyperparameters are optimised for each fold, and the trained predictor with its parameters is reported per fold. If optuna_merge_folds is true, then hyperparameters are optimised across folds (not per-fold). Setting this to False may be useful depending on the intended use of the predictor; it is believed that when False, and parameters are not optimised across folds, more accurate estimates of prediction variance/accuracy are produced. If absent, behaves as if False.
- test_set_fraction
When optimising a predictor, by default a fraction of the total data is held back for testing, separate from the train-validation splits. This test_set_fraction controls the size of this split. If absent, then the default value of 0.2 is used.
---
Example_1_predict_survival:
- load:
    dataset: TCGA
- predict:
    target: survives_1_yr
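A fuller job showing some of the optional predict arguments documented above is sketched below; the job name, output directory and numeric values are illustrative:

---
Example_predict_survival_with_options:
- load:
    dataset: TCGA
- predict:
    target: survives_1_yr
    output_directory: tcga_prediction_output
    n_splits: 5
    n_optuna_trials: 20
    optuna_merge_folds: false
    test_set_fraction: 0.2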