phenonaut package
Subpackages
- phenonaut.data package
- Submodules
- phenonaut.data.dataset module
Dataset
Dataset.add_well_id()
Dataset.copy()
Dataset.data
Dataset.df_to_csv()
Dataset.df_to_multiple_csvs()
Dataset.distance_df()
Dataset.divide_mean()
Dataset.divide_median()
Dataset.drop_absent_features()
Dataset.drop_columns()
Dataset.drop_nans_with_cutoff()
Dataset.drop_rows()
Dataset.features
Dataset.filter_columns()
Dataset.filter_columns_with_prefix()
Dataset.filter_inplace()
Dataset.filter_on_identifiers()
Dataset.filter_rows()
Dataset.get_df_features_perturbation_column()
Dataset.get_ds_from_query()
Dataset.get_feature_ranges()
Dataset.get_history()
Dataset.get_non_feature_columns()
Dataset.get_unique_perturbations()
Dataset.get_unique_treatments()
Dataset.groupby()
Dataset.history
Dataset.impute_nans()
Dataset.new_aggregated_dataset()
Dataset.num_features
Dataset.pcol
Dataset.perturbation_column
Dataset.pivot()
Dataset.remove_blocklist_features()
Dataset.remove_features_with_outliers()
Dataset.remove_low_variance_features()
Dataset.rename_column()
Dataset.rename_columns()
Dataset.replace_str()
Dataset.shrink()
Dataset.split_column()
Dataset.subtract_func_results_on_features()
Dataset.subtract_mean()
Dataset.subtract_median()
Dataset.subtract_median_perturbation()
Dataset.transpose()
TransformationHistory
- phenonaut.data.platemap_querier module
- phenonaut.data.recipes module
- Module contents
- phenonaut.integration package
- phenonaut.metrics package
- Subpackages
- Submodules
- phenonaut.metrics.distances module
- phenonaut.metrics.measures module
- phenonaut.metrics.non_ds_phenotypic_metrics module
- phenonaut.metrics.utils module
- Module contents
auroc()
euclidean()
feature_correlation_to_target()
get_cdu_performance_df()
mahalanobis()
manhattan()
mp_value_score()
percent_compact()
percent_replicating()
pertmutation_test_distinct_from_query_group()
pertmutation_test_type2_distinct_from_query_group()
run_cdu_benchmarks()
scalar_projection()
silhouette_score()
treatment_spread_euclidean()
write_cdu_json()
- phenonaut.output package
- phenonaut.packaged_datasets package
- Submodules
- phenonaut.packaged_datasets.base module
- phenonaut.packaged_datasets.breast_cancer module
- phenonaut.packaged_datasets.cmap module
- phenonaut.packaged_datasets.iris module
- phenonaut.packaged_datasets.lincs module
- phenonaut.packaged_datasets.metadata_moa module
- phenonaut.packaged_datasets.tcga module
- Module contents
- phenonaut.predict package
- Subpackages
- Submodules
- phenonaut.predict.optuna_functions module
- phenonaut.predict.predict_utils module
- phenonaut.predict.predictor_dataclasses module
HyperparameterCategorical
HyperparameterFloat
HyperparameterInt
HyperparameterLog
OptunaHyperparameter
OptunaHyperparameterNumber
PhenonautPredictionMetric
PhenonautPredictor
PhenonautPredictor.conditional_hyperparameter_generator
PhenonautPredictor.conditional_hyperparameter_generator_constructor_keyword
PhenonautPredictor.constructor_kwargs
PhenonautPredictor.dataset_size_cutoff
PhenonautPredictor.embed_in_results
PhenonautPredictor.max_classes
PhenonautPredictor.max_optuna_trials
PhenonautPredictor.name
PhenonautPredictor.num_views
PhenonautPredictor.optuna
PhenonautPredictor.predictor
- phenonaut.predict.profile module
- Module contents
- phenonaut.transforms package
- Submodules
- phenonaut.transforms.dimensionality_reduction module
- phenonaut.transforms.generic_transformations module
- phenonaut.transforms.imputers module
- phenonaut.transforms.preparative module
- phenonaut.transforms.supervised_transformer module
SupervisedTransformer
SupervisedTransformer.callable_args
SupervisedTransformer.fit()
SupervisedTransformer.fit_transform()
SupervisedTransformer.has_fit
SupervisedTransformer.has_fit_transform
SupervisedTransformer.has_transform
SupervisedTransformer.is_callable
SupervisedTransformer.method
SupervisedTransformer.method_kwargs
SupervisedTransformer.new_feature_names
SupervisedTransformer.transform()
- phenonaut.transforms.transformer module
- Module contents
Submodules
phenonaut.errors module
- exception phenonaut.errors.NotEnoughRowsError
Bases:
Exception
phenonaut.phenonaut module
- class phenonaut.phenonaut.Phenonaut(dataset: Dataset | list[Dataset] | PackagedDataset | Bunch | DataFrame | Path | str | None = None, name: str = 'Phenonaut object', kind: str | None = None, packaged_dataset_name_filter: str | list[str] | None = None, metadata: dict | list[dict] | None = {}, features: list[str] | None = None, dataframe_name: str | list[str] | None = None, init_hash: str | bytes | None = None)
Bases:
object
Phenonaut object constructor
Holds multiple datasets of different types, applies transforms, and performs load and tracking operations.
May be initialised with:
Phenonaut Datasets
Phenonaut PackagedDataset
Scikit Bunch
pd.DataFrame
by passing the object as an optional dataset argument.
- Parameters:
dataset (Optional[Union[Dataset, list[Dataset], PackagedDataset, Bunch, pd.DataFrame, Path, str]], optional) – Initialise the Phenonaut object with a Dataset, list of Datasets, or PackagedDataset, by default None.
name (str) – A name may be given to the Phenonaut object. This is useful in naming collections of datasets. For example, The Cancer Genome Atlas contains 4 different views on tumors - mRNA, miRNA, methylation and RPPA; collectively, these 4 datasets loaded into a Phenonaut object may be named ‘TCGA’ - or ‘The Cancer Genome Atlas dataset’. If set to None, the Phenonaut object takes the name “Phenonaut data”, unless it is constructed from a Phenonaut packaged dataset or an already named Phenonaut object, in which case it takes the name of the passed object/dataset.
kind (Optional[str]) – Instead of providing metadata, some presets are available, which make reading in things like DRUG-Seq easier. This argument only has an effect when reading in a raw data file, like CSV or H5 and directs Phenonaut to use a predefind set of parameters/transforms. If used as well as metadata, then the preset metadata dictionary from the kind argument is first loaded, then updated with anything in the metadata dictionary, this therefore allows overriding specific presets present in kind dictionaries. Available ‘kind’ dictionaries may be listed by examining: phenonaut.data.recipes.recipes.keys()
packaged_dataset_name_filter (Optional[Union[list[str], str]], optional) – If a PackagedDataset is supplied for the data argument, then import only datasets from it named in the name_filter argument. If None, then all PackagedDataset datasets are imported. Can be a single string or list of strings. If None, and PackagedDataset is supplied, then all Datasets are loaded. Has no effect if data is not a PackagedDataset, by default None.
metadata (Optional[Union[dict, list[dict]]]) – Used when a pandas DataFrame is passed to the constructor of the Phenonaut object. Metadata typically contains features or feature_prefix keys telling Phenonaut which columns should be treated as Dataset features. Can also be a list of metadata dictionaries if a list of pandas DataFrames is supplied to the constructor. Has no effect if the type of dataset passed is not a pandas DataFrame or list of pandas DataFrames. If a list of pandas DataFrames is passed to data but only one metadata dictionary is given, then this dictionary is applied to all DataFrames. By default {}.
features (Optional[list[str]] = None) – May be used as a shortcut to including features in the metadata dictionary. Only used if the metadata is a dict and does not contain a features key.
dataframe_name (Optional[Union[str, list[str]]]) – Used when a pandas DataFrame, str, or Path to a CSV file is passed to the constructor of the Phenonaut object. Optional name to give to the Dataset object constructed from the pandas DataFrame. If multiple DataFrames are given in a list, then this dataframe_name argument can be a list of strings used as names for the new Dataset objects.
init_hash (Optional[Union[str, bytes]]) – Cryptographic hashing within Phenonaut can be initialised with a starting/seed hash. This is useful in the creation of blockchain-like chains of hashes. In environments where timestamping is unavailable, hashes may be published and then used as input to subsequent experiments, building up a provable chain along the way. By default None, implying an empty bytes array.
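A minimal construction sketch is shown below; the DataFrame contents, feature names, and object name are illustrative and not part of the API:

    import pandas as pd
    import phenonaut

    # Two feature columns and one metadata column (illustrative data)
    df = pd.DataFrame(
        {"feat_1": [1.2, 1.3], "feat_2": [0.4, 0.5], "Well": ["A1", "A2"]}
    )

    # Features may be passed directly (as here), or discovered from metadata
    # such as {"features_prefix": ["feat_"]}
    phe = phenonaut.Phenonaut(df, name="Example", features=["feat_1", "feat_2"])
    print(phe.keys())  # names of the Datasets held by the object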
- add_well_id(numerical_column_name: str = 'COLUMN', numerical_row_name: str = 'ROW', plate_type: int = 384, new_well_column_name: str = 'Well', add_empty_wells: bool = False, plate_barcode_column: str | None = None, no_sort: bool = False)
Add standard well IDs - such as A1, A2, etc to ALL loaded Datasets.
If a dataset contains numerical row and column names, then they may be translated into standard letter-number well IDs. This is applied to all loaded Datasets. If you wish only one to be annotated, then call add_well_id on that individual dataset.
- Parameters:
numerical_column_name (str, optional) – Name of column containing numeric column number, by default “COLUMN”.
numerical_row_name (str, optional) – Name of column containing the numeric row number, by default “ROW”.
plate_type (int, optional) – Plate type - note, at present, only 384 well plate format is supported, by default 384.
new_well_column_name (str, optional) – Name of new column containing letter-number well ID, by default “Well”.
add_empty_wells (bool, optional) – Should all wells from a plate be inserted, even when missing from the data, by default False.
plate_barcode_column (str, optional) – Multiple plates may be in a dataset, this column contains their unique ID, by default None.
no_sort (bool, optional) – Do not re-sort the dataset by well ID, by default False.
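A brief sketch of a call using the documented defaults, assuming phe is a Phenonaut object whose Datasets contain numeric ROW and COLUMN columns:

    # Annotate every loaded Dataset with letter-number well IDs such as "A1"
    phe.add_well_id(
        numerical_column_name="COLUMN",
        numerical_row_name="ROW",
        plate_type=384,
        new_well_column_name="Well",
    )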
- aggregate_dataset(composite_identifier_columns: list[str], datasets: Iterable[int] | Iterable[str] | int | str = -1, new_names_or_prefix: list[str] | tuple[str] | str = 'Aggregated_', inplace: bool = False, transformation_lookup: dict[str, Callable | str] | None = None, tranformation_lookup_default_value: str | Callable = 'mean')
Aggregate multiple or single phenonaut dataset rows
If we have a Phenonaut object containing data derived from 2 fields of view from a microscopy image, a sensible approach is averaging features. If we have the DataFrame below, we may merge FOV 1 and FOV 2, taking the mean of all features. As strings such as filenames should be kept, they are concatenated together, separated by a comma, unless the strings are the same, in which case just one is used.
Here we test a df as follows:
ROW  COLUMN  BARCODE  feat_1  feat_2  feat_3  filename   FOV
---  ------  -------  ------  ------  ------  ---------  ---
1    1       Plate1   1.2     1.2     1.3     FileA.png  1
1    1       Plate1   1.3     1.4     1.5     FileB.png  2
1    1       Plate2   5.2     5.1     5       FileC.png  1
1    1       Plate2   6.2     6.1     6.8     FileD.png  2
1    2       Plate1   0.1     0.2     0.3     FileE.png  1
1    2       Plate1   0.2     0.2     0.38    FileF.png  2
With just this loaded into a phenonaut object, we can call:
phe.aggregate_dataset(['ROW', 'COLUMN', 'BARCODE'])
This will merge and produce a second, aggregated dataset in the phe object containing:
ROW  COLUMN  BARCODE  feat_1  feat_2  feat_3  filename             FOV
---  ------  -------  ------  ------  ------  -------------------  ---
1    1       Plate1   1.25    1.3     1.40    fileA.png,FileB.png  1.5
1    1       Plate2   5.70    5.6     5.90    FileC.png,FileD.png  1.5
1    2       Plate1   0.15    0.2     0.34    FileF.png,fileE.png  1.5
If inplace=True is passed in the call to aggregate_dataset, then the phenonaut object will contain just one dataset: the new, aggregated dataset.
- Parameters:
composite_identifier_columns (list[str]) – If a biochemical assay evaluated through imaging is identified by a row, column, and barcode (for the plate) but multiple images taken from a well, then these multiple fields of view can be merged, creating averaged features using row, column and barcode as the composite identifier on which to merge fields of view.
datasets (Union[list[int], list[str], int, str]) – Which datasets to apply the aggregation to. If int, then the dataset with that index undergoes aggregation. If a string, then the dataset with that name undergoes aggregation. It may also be a list or tuple of mixed int and string types, with ints specifying dataset indexes and strings indicating dataset names. By default, this value is -1, indicating that the last added dataset should undergo aggregation.
new_names_or_prefix (Union[list[str], tuple[str], str]) – If a list or tuple of strings is passed, then use them as the names for the new datasets after aggregation. If a single string is passed, then use this as a prefix for the new dataset. By default “Aggregated_”.
inplace (bool) – Perform the aggregation in place, overwriting the original dataframes. By default False.
transformation_lookup (dict[str, Union[Callable, str]]) – Dictionary mapping data types to aggregations. When None, it is as if the dictionary {np.dtype("O"): lambda x: ",".join([f"{item}" for item in set(x)])} was provided, concatenating strings together (separated by a comma) if they differ, and using just one if they are the same across rows. If a type not present in the dictionary is encountered (such as int or float in the above example), then the default specified by tranformation_lookup_default_value is used. By default, None.
tranformation_lookup_default_value (Union[str, Callable]) – Transformation to apply if the data type is not found in the transformation_lookup_dictionary, can be a callable or string to pandas defined string to function shortcut mappings. By default “mean”.
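A sketch of the fields-of-view example described above, assuming phe is a Phenonaut object holding the illustrative table:

    # Average fields of view sharing the same ROW, COLUMN and BARCODE values,
    # creating a new aggregated dataset alongside the original
    phe.aggregate_dataset(["ROW", "COLUMN", "BARCODE"])

    # Passing inplace=True replaces the original dataset instead
    phe.aggregate_dataset(["ROW", "COLUMN", "BARCODE"], inplace=True)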
- append(dataset: Dataset) None
Add a dataset to the Phenonaut object
- Parameters:
dataset (Dataset) – Dataset to be added.
- clone_dataset(existing_dataset: Dataset | str | int, new_dataset_name: str, overwrite_existing: bool = False) None
Clone a dataset into a new dataset
- Parameters:
existing_dataset (Union[Dataset, str, int]) – The name or index of an existing Phenonaut Dataset held in the Phenonaut object. Can also be a Phenonaut.data.Dataset object passed directly.
new_dataset_name (str) – A name for the new cloned Dataset.
overwrite_existing (bool, optional) – If a dataset by this name exists, then overwrite it, otherwise, an exception is raised, by default False.
- Raises:
ValueError – Dataset by the name given already exists and overwrite_existing was False.
ValueError – The existing_dataset argument should be a str, int or Phenonaut.data.Dataset.
- combine_datasets(dataset_ids_to_combine: list[str] | list[int] | None = None, new_name: str | None = None, features: list | None = None)
Combine multiple datasets into a single dataset
Often, large datasets are split across multiple CSV files. For example, one CSV file per screening plate. In this instance, it is prudent to combine the datasets into one.
- Parameters:
dataset_ids_to_combine (Optional[Union[list[str], list[int]]]) – List of dataset indexes, or list of names of datasets to combine. For example, after loading in 2 datasets, the list [0,1] would be given, or a list of their names resulting in a new third dataset in datasets[2]. If None, then all present datasets are used for the merge. By default, None.
new_name (Optional[str]) – Name that should be given to the newly created dataset. If None, then it is assigned as: “Combined_dataset from datasets[DS_INDEX_LIST]”, where DS_INDEX_LIST is a list of the combined dataset indexes.
features (list, optional) – List of new features which should be used by the newly created dataset. If None, then the features of the combined datasets are used. By default None.
- Raises:
DataError – Error raised if the combined datasets do not have the same features.
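A short sketch, assuming phe is a Phenonaut object with at least two per-plate datasets already loaded; the new dataset name is illustrative:

    # Combine the first two loaded datasets into a third dataset
    phe.combine_datasets([0, 1], new_name="All plates")

    # Or combine every loaded dataset in a single call
    phe.combine_datasets()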
- property data: Dataset
Return the data of the highest index in phenonaut.data.datasets
Calling phe.data is the same as calling phe.ds.data
- Returns:
Last added/highest indexed Dataset features
- Return type:
np.ndarray
- Raises:
DataError – No datasets loaded
- describe() None
- property df: DataFrame
Return the pd.DataFrame of the last added/highest indexed Dataset
Returns the internal pd.Dataframe of the Dataset contained within the Phenonaut instance’s datasets list.
- Returns:
The pd.DataFrame of the last added/highest indexed Dataset.
- Return type:
pd.DataFrame
- property ds: Dataset
Return the dataset with the highest index in phenonaut.data.datasets
- Returns:
Last added/highest indexed Dataset
- Return type:
Dataset
- Raises:
DataError – No datasets loaded
- filter_datasets_on_identifiers(filter_field: str, filter_data: list[str] | dict | Path, dict_path: str | None = None, additional_items: list[str] | None = None) None
- get_dataset_combinations(min_datasets: int | None = None, max_datasets: int | None = None, return_indexes: bool = False)
Get tuple of all dataset name combinations, picking 1 to n datasets
This function returns all combinations of 1 to n dataset names, where n is the number of loaded datasets. This is useful in multiomics settings where we test A, B, and C alone, then A&B, A&C, B&C, and finally A&B&C.
A limit on the number of datasets in a combination can be imposed using the max_datasets argument. In the example above with datasets A, B and C, passing max_datasets=2 would return the following tuple: ((A), (B), (C), (A, B), (A, C), (B, C)), leaving out the triple length combination (A, B, C).
Similarly, the argument min_datasets can specify a lower limit on the number of dataset combinations.
Using the example with datasets A, B, and C, and setting min_datasets=2 with no limit on max_datasets on the above example would return the following tuple: ((A, B), (A, C), (B, C), (A, B, C))
If return_indexes is True, then the indexes of Datasets are returned. As directly above, datasets A, B, and C, setting min_datasets=2 with no limit on max_datasets and passing return_indexes=True would return the following tuple: ((0, 1), (0, 2), (1, 2), (0, 1, 2))
- Parameters:
min_datasets (Optional[int], optional) – Minimum number of datasets to return in a combination. If None, then it behaves as if 1 is given, by default None.
max_datasets (Optional[int], optional) – Maximum number of datasets to return in a combination. If None, then it behaves as if len(datasets) is given, by default None.
return_indexes (bool) – Return indexes of Datasets, instead of their names, by default False.
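A sketch using the A, B, C example from the description above, assuming phe is a Phenonaut object holding three datasets with those (illustrative) names:

    # Names of all combinations of at least two datasets
    combos = phe.get_dataset_combinations(min_datasets=2)
    # e.g. (("A", "B"), ("A", "C"), ("B", "C"), ("A", "B", "C"))

    # The same combinations expressed as dataset indexes
    index_combos = phe.get_dataset_combinations(min_datasets=2, return_indexes=True)
    # e.g. ((0, 1), (0, 2), (1, 2), (0, 1, 2))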
- get_dataset_index_from_name(name: str | list[str] | tuple[str]) int | list[int]
Get dataset index from name
Given the name of a dataset, return the index of it in datasets list. Accepts single string query, or a list/tuple of names to return lists of indices.
- Parameters:
name (Union[str, list[str], tuple[str]]) – If string, then this is the dataset name being searched for. Its index in the datasets list will be returned. If a list or tuple of names, then the index of each is searched and an index list returned.
- Returns:
If name argument is a string, then the dataset index is returned. If name argument is a list or tuple, then a list of indexes for each dataset name index is returned.
- Return type:
Union[int, list[int]]
- Raises:
ValueError – Error raised if no datasets were found to match a requested name.
- get_dataset_names() list[str]
Get a list of dataset names
- Returns:
List containing the names of datasets within this Phenonaut object.
- Return type:
list[str]
- get_df_features_perturbation_column(ds_index=-1, quiet: bool = False) tuple[DataFrame, list[str], str | None]
Helper function to obtain DataFrame, features and perturbation column name.
Some Phenonaut functions allow passing of a Phenonaut object or Dataset. They then access the underlying pd.DataFrame for calculations. This helper function is present on Phenonaut objects and Dataset objects, allowing more concise code and less replication when obtaining the underlying data. If multiple Datasets are present, then the last added Dataset is used to obtain data, but this behaviour can be changed by passing the ds_index argument.
- Parameters:
ds_index (int, optional) – Index of the Dataset to be returned. By default -1, which uses the last added Dataset.
quiet (bool) – When checking if the perturbation column is set, do so without inducing a warning if it is None.
- Returns:
Tuple containing the Dataframe, a list of features and the perturbation column name.
- Return type:
tuple[pd.DataFrame, list[str], str]
- get_hash_dictionary() dict
Returns dictionary containing SHA256 hashes
Returns a dictionary of base64 encoded UTF-8 strings representing the SHA256 hashes of datasets (along with names), combined datasets, and the Phenonaut object (including name).
- Returns:
Dictionary of base64 encoded SHA256 representing datasets and the Phenonaut object which created them.
- Return type:
dict
- groupby_datasets(by: str | List[str], ds_index=-1, remove_original=True)
Perform a groupby operation on a dataset
Akin to performing a groupby operation on a pd.DataFrame, this splits a dataset by column(s), optionally keeping or removing (by default) the original.
- Parameters:
by (Union[str, list]) – Columns in the dataset’s DataFrames which should be used for grouping
ds_index (int, optional) – Index of the Dataset to be returned. By default -1, which uses the last added Dataset.
remove_original (bool, optional) – If True, then the original split dataset is deleted after splitting
- keys() list[str]
Return a list of all dataset names
- Returns:
List of dataset names, empty list if no datasets are loaded.
- Return type:
list(str)
- classmethod load(filepath: str | Path) Phenonaut
Class method to load a compressed Phenonaut object
Loads a gzipped Python pickle containing a Phenonaut object
- Parameters:
filepath (Union[str, Path]) – Location of gzipped Phenonaut object pickle
- Returns:
Loaded Phenonaut object.
- Return type:
Phenonaut
- Raises:
FileNotFoundError – File not found, unable to load pickled Phenonaut object.
- load_dataset(dataset_name: str, input_file_path: Path | str, metadata: dict | None = None, h5_key: str | None = None, features: list[str] | None = None)
Load a dataset from a CSV, optionally supplying metadata and a name
- Parameters:
dataset_name (str) – Name to be assigned to the dataset
input_file_path (Union[Path, str]) – CSV/TSV/H5 file location
metadata (dict, optional) – Metadata dictionary describing the CSV data format, by default None
h5_key (Optional[str]) – If input_file_path is an h5 file, then a key to access the target DataFrame must be supplied.
features (Optional[list[str]]) – Optionally supply a list of features here. If None, then the features/feature finding related keys in metadata are used. You may also supply an empty list to explicitly specify that the dataset has no features, although this is not recommended.
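A sketch of loading a CSV and an HDF5 file, assuming phe is a Phenonaut object; the file names, dataset names, feature prefix, and h5 key are illustrative:

    # CSV with feature columns prefixed "feat_"
    phe.load_dataset(
        "screen_plate_1", "screening_data.csv", metadata={"features_prefix": ["feat_"]}
    )

    # H5 files additionally require the key of the stored DataFrame
    phe.load_dataset(
        "screen_plate_2", "screening_data.h5", h5_key="df", features=["feat_1", "feat_2"]
    )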
- merge_datasets(datasets: List[Dataset] | List[int] | Literal['all'] = 'all', new_dataset_name: str | None = 'Merged Dataset', return_merged: bool = False, remove_merged: bool | None = True)
Merge datasets
After performing a groupby operation on Phenonaut.data.Dataset objects, a list of datasets may be merged into a single Dataset using this method.
- Parameters:
datasets (Union[List[Dataset], List[int], List[str]]) – Datasets which should be grouped together. May be a list of Datasets, in which case these are merged together and inserted into the Phenonaut object, or a list of integers or dataset names which will be used to look up datasets in the current Phenonaut object. Mixing of ints and dataset string identifiers is acceptable, but mixing of any identifier and Datasets is not supported. If ‘all’, then all datasets in the Phenonaut object are merged. By default ‘all’.
return_merged (bool) – If True the merged dataset is returned by this function. If False, then the new merged dataset is added to the current Phenonaut object. By default False
remove_merged (bool) – If True, and return_merged is False (so the new merged dataset is added to the current Phenonaut object), then source datasets which are in the current object (addressed by index or name in the datasets list) are removed from the Phenonaut object. By default True.
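A sketch of the groupby/merge round trip described above, assuming phe is a Phenonaut object; the BARCODE column and new dataset name are illustrative:

    # Split the last added dataset into one Dataset per plate barcode
    phe.groupby_datasets("BARCODE")

    # ...operate on the per-plate Datasets, then merge them back into one
    phe.merge_datasets(datasets="all", new_dataset_name="Merged plates")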
- new_dataset_from_query(name, query: str, query_dataset_name_or_index: int | str = -1, raise_error_on_empty: bool = True, overwrite_existing: bool = False)
Add new dataset through a pandas query of existing dataset
- Parameters:
query (str) – The pandas query used to select the new dataset
name (str) – A name for the new dataset
query_dataset_name_or_index (Union[int, str], optional) – The dataset to be queried, can be an int index, or the name of an existing dataset. List indexing can also be used, such that -1 uses the last dataset in Phenonaut.data.datasets list, by default -1.
raise_error_on_empty (bool) – Raise a ValueError if the query returns an empty dataset. By default True.
overwrite_existing (bool) – If a dataset already exists with the name given in the name argument, then this argument can be used to overwrite it, by default False.
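A sketch of a pandas-style query, assuming phe is a Phenonaut object; the column name, value, and new dataset name are illustrative:

    # New dataset holding only positive control rows from the last added dataset
    phe.new_dataset_from_query("positive controls", "control == 'pos'")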
- revert() None
Revert a recently saved Phenonaut object to its previous state
Upon calling save on a Phenonaut object, the output file location is recorded, allowing a quick way to revert changes made to the object by calling .revert(). This returns the object to its saved state.
- Raises:
FileNotFoundError – File not found, Phenonaut object has never been written out.
- save(output_filename: str | Path, overwrite_existing: bool = False) None
Save Phenonaut object and contained Data to a pickle
Writes a gzipped Python pickle file. If no compression, or a different compression format, is required, then the user should use a custom pickle.dump and not rely on this helper function.
- Parameters:
output_filename (Union[str, Path]) – Output filename for the gzipped pickle
overwrite_existing (bool, optional) – If True and the file exists, overwrite it. By default False.
- shrink(keep_prefix: str | list[str] | None = 'Metadata_')
Reduce the size of all datasets by removing unused columns from the internal DataFrames
Often datasets contain intermediate features or unused columns which can be removed. This function removes every column from a Dataset internal DataFrame that is not the perturbation_column, or has a given prefix. By default, this prefix is “Metadata_”, however this can be removed, changed, or a new list of prefixes supplied using the keep_prefix argument.
- Parameters:
keep_prefix (str | list[str] | None, optional) – Prefix for columns which should be kept during shrinking of the dataset. This prefix applies to columns which are not features (which are kept automatically). Can be a list of prefixes, or None, by default “Metadata_”.
- subtract_median_perturbation(perturbation_label: str, per_column_name: str | None = None, new_features_prefix: str = 'SMP_')
Subtract the median perturbation from all features for all datasets.
Useful for normalisation within a well/plate format. The median perturbation may be identified through the per_column_name variable and the perturbation label. Newly generated features may have their prefixes controlled via the new_features_prefix argument.
- Parameters:
perturbation_label (str) – The perturbation label which should be used to calculate the median
per_column_name (Optional[str], optional) – The perturbation column name. This is optional and can be None, as the Dataset may already have perturbation column set. By default, None.
new_features_prefix (str) – Prefix for new features, each with the median perturbation subtracted. By default ‘SMP_’ (for subtracted median perturbation).
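A sketch assuming phe is a Phenonaut object with a perturbation column in which negative control wells are labelled "DMSO"; the label and column name are illustrative:

    # Subtract the median DMSO profile from every feature in all datasets,
    # writing new features prefixed with "SMP_"
    phe.subtract_median_perturbation("DMSO", per_column_name="perturbation")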
- phenonaut.phenonaut.load(input_file: Path | str, shrink=False, shrink_keep_prefix: str | list[str] | None = 'Metadata_') Phenonaut
Convenience function allowing phenonaut.load
Allows calling of phenonaut.load() rather than phenonaut.Phenonaut.load()
- Parameters:
input_file (Path | str) – Pickle file path of the Phenonaut object which is to be loaded
shrink (bool) – If True then datasets are shrunk after loading to remove unused columns, by default False
shrink_keep_prefix (str | list[str] | None, optional) – Prefix for columns which should be kept during shrinking of the dataset. This prefix applies to columns which are not features (which are kept automatically). Can be a list of prefixes, or None, by default “Metadata_”.
- Returns:
Loaded Phenonaut object
- Return type:
Phenonaut
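A save/load round trip sketch using this convenience function, assuming phe is an existing Phenonaut object; the file name is illustrative:

    import phenonaut

    phe.save("phe_object.pkl.gz")  # gzipped pickle of the Phenonaut object
    phe2 = phenonaut.load("phe_object.pkl.gz", shrink=True)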
- phenonaut.phenonaut.match_perturbation_columns(*args) None
Make all dataset perturbation_columns match that of the first supplied dataset
Two or more datasets supplied as args to this function will have the second dataset (and any subsequent dataset) perturbation_column set to that of the first - also renaming underlying DataFrame columns as appropriate. If Phenonaut objects are given, then everything matches the first Dataset of the first given argument.
- Raises:
ValueError – 2 or more datasets must be supplied
phenonaut.utils module
- phenonaut.utils.check_path(p: Path | str, is_dir: bool = False, make_parents: bool = True, make_dir_if_dir: bool = True) Path
Check a user supplied path (str or Path), ensuring parents exist etc
- Parameters:
p (Union[Path, str]) – File or directory path supplied by user
is_dir (bool, optional) – If the path supplied by the user should be a directory, then set it as such by assigning is_dir to true, by default False
make_parents (bool, optional) – If the parent directories of the supplied path do not exist, then make them, by default True
make_dir_if_dir (bool, optional) – If the supplied path is a directory, but it does not exist, then make it, by default True
- Returns:
Path object pointing to the user supplied path, with parents made (if requested), and the directory itself made (if a directory and requested)
- Return type:
Path
- Raises:
ValueError – Parent does not exist, and make_parents was False
ValueError – Passed path was not a string or pathlib.Path
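A brief usage sketch; the paths are illustrative:

    from phenonaut.utils import check_path

    # Ensure an output directory exists, creating parent directories if needed
    results_dir = check_path("results/run_1", is_dir=True)

    # Ensure the parent directory of an output file exists before writing to it
    output_csv = check_path("results/run_1/scores.csv")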
- phenonaut.utils.load_dict(file_path: str | Path | None, cast_none_to_dict=False)
phenonaut.workflow module
- class phenonaut.workflow.Workflow(workflow_path: Path | str | dict)
Bases:
object
Phenonaut Workflows allow operation through simple YAML workflows.
Workflows may be defined in Phenonaut, and the module executed directly, rather than imported and used by a Python program. Workflows are defined using the simple YAML file format. However, due to the way in which they are read in, JSON files may also be used. As YAML files can contain multiple YAML entries, we build on this concept, allowing multiple workflows to be defined in a single YAML (or JSON) file. Once read in, workflows are dictionaries. From Python 3.6 onwards, dictionaries are ordered. We can therefore define our workflows in order and guarantee that they will be executed in the defined order. A dictionary defining workflows has the following structure: {job_name: task_list}, where job_name is a string, and task_list is a list defining callable functions, or tasks, required to complete the job.
The job list takes the form of a list of dictionaries, each containing only one key which is the name of the task to be performed. The value indexed by this key is a dictionary of argument:value pairs to be passed to the function responsible for performing the task. The structure is best understood with an example. Here, we see a simple workflow contained within a YAML file for calculation of the scalar projection phenotypic metric. YAML files start with 3 dashes.
---
scalar_projection_example:
  - load:
      file: screening_data.csv
      metadata:
        features_prefix:
          - feat_
  - scalar_projection:
      target_treatment_column_name: control
      target_treatment_column_value: pos
      output_column_label: target_phenotype
  - write_multiple_csvs:
      split_by_column: PlateID
      output_dir: scalar_projection_output
The equivalent JSON with clearer (for Python programmers) formatting for the above is:
{ "scalar_projection_example": [ { "load": { "file": "screening_data.csv", "metadata": { "features_prefix": ["feat_"]} } }, { "scalar_projection": { "target_treatment_column_name": "control", "target_treatment_column_value": "pos", "output_column_label": "target_phenotype", } }, {"write_multiple_csvs":{ "split_by_column": "PlateID", "output_dir": "scalar_projection_output/" } }, ] }
The workflow defined above in the example YAML and JSON formats has the name “scalar_projection_example”, and consists of 3 commands:
load
scalar_projection
write_multiple_csvs
See the user guide for a full listing of commands.
- Parameters:
workflow_path (Union[Path, str, dict]) – Workflows can be defined in YML or JSON files with their locations supplied or jobs passed as dictionaries. Dictionary keys denote the job names. Values under these keys should be lists of dictionaries. Each dictionary should have one key, denoting the name of the task and values under this key contain options for the called functions/ tasks.
- Raises:
TypeError – Supplied Path or str to file location does not appear to be a YAML or JSON file.
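Workflows may also be passed to the constructor as dictionaries rather than file paths. The sketch below mirrors the scalar projection example above; run_workflow() is the documented way to execute the defined job, and the file and column names are illustrative:

    from phenonaut.workflow import Workflow

    job = {
        "scalar_projection_example": [
            {"load": {"file": "screening_data.csv",
                      "metadata": {"features_prefix": ["feat_"]}}},
            {"scalar_projection": {"target_treatment_column_name": "control",
                                   "target_treatment_column_value": "pos",
                                   "output_column_label": "target_phenotype"}},
            {"write_multiple_csvs": {"split_by_column": "PlateID",
                                     "output_dir": "scalar_projection_output"}},
        ]
    }
    wf = Workflow(job)
    wf.run_workflow()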
- VIF_filter_features(arguments: dict)
Workflow function: Perform VIF feature filter
Designed to be called from a workflow, performs variance inflation factor (VIF) filtering on a dataset, removing redundant features whose removal is not detrimental to capturing the variance of the data. More information available: https://en.wikipedia.org/wiki/Variance_inflation_factor
This can be a computationally expensive process, as the number of linear regressions required grows almost as N^2 with the number of features.
- Parameters:
arguments (dict) –
- target_dataset:
Index or name of dataset which should have variance inflation filter applied. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- vif_cutoff:
float or int indicating the VIF cutoff to apply. A good balance and value often used is 5.0. If this key:value pair is absent, then behaviour is as if 5.0 was supplied.
- min_features:
removal of too many features can be detrimental. Setting this value sets a lower limit on the number of features which must remain. If absent, then behaviour is as if a value of 2 was given.
- drop_columns:
value is a boolean, denoting if columns should be dropped from the data table, as well as being removed from features. If not supplied, then the behaviour is as if False was supplied.
- add_well_id(arguments: dict)
Workflow function: Add well IDs
Designed to be called from a workflow. Often, we would like to use well and column numbers to resolve a more traditional alpha-numeric WellID notation, such as A1, A2, etc. This can be achieved through calling this workflow function.
If a dataset contains numerical row and column names, then they may be translated into standard letter-number well IDs. The arguments dictionary may contain the following keys, with their values denoted below:
- numerical_column_name : str, optional
Name of column containing the numeric column number; if not supplied, then behaves as if “COLUMN”.
- numerical_row_name : str, optional
Name of column containing the numeric row number; if not supplied, then behaves as if “ROW”.
- plate_type : int, optional
Plate type - note, at present, only 384 well plate format is supported; if not supplied, then behaves as if 384.
- new_well_column_name : str, optional
Name of new column containing the letter-number well ID; if not supplied, then behaves as if “Well”.
- add_empty_wells : bool, optional
Should all wells from a plate be inserted, even when missing from the data; if not supplied, then behaves as if False.
- plate_barcode_column : str, optional
Multiple plates may be in a dataset; this column contains their unique ID. If not supplied, then behaves as if None.
- no_sort : bool, optional
Do not re-sort the dataset by well ID; if not supplied, then behaves as if False.
- Parameters:
arguments (dict) – Dictionary containing arguments to the Dataset.add_well_id function, see API documentation for further details, or function help.
- cityblock_distance(arguments: dict)
Workflow function: Add a column for the cityblock distance to a target perturbation.
Designed to be called from a workflow, calculates the cityblock distance in feature space. Also known as the Manhattan distance.
- Parameters:
arguments (dict, should contain:) –
- target_dataset
Index or name of dataset which should be used in the measurement. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- target_perturbation_column_name
normally a ‘control’ column
- target_perturbation_column_value:
value to be found in the column defined previously.
- output_column_label:
Output column for the measurement. If this is missing, then it is set to target_perturbation_column_value.
- copy_column(arguments: dict)
Workflow function: copy a dataset column
Designed to be called from a workflow, copies the values of one column within a dataset to another. The arguments dictionary can contain ‘to’ and ‘from’ keys with values for column names, or alternatively, simply from:to key-value pairs denoting how to perform the copy operation.
- Parameters:
arguments (dict) –
- Options for the command. Should include either:
1. dictionary with keys “to” and “from”, with item names related to the columns that should be used.
2. A dictionary of the form {from_column: to_column}, which will copy the column titled from_column to to_column.
Note, if any dictionary items (to) are lists, then multiple copies will be made.
- Raises:
KeyError – Column was not found in the Pandas DataFrame.
- euclidean_distance(arguments: dict)
Workflow function: Add a column for the euclidean distance to a target perturbation.
Designed to be called from a workflow, calculates the euclidean distance in feature space.
- Parameters:
arguments (dict, should contain:) –
- target_dataset
Index or name of dataset which should be used in the measurement. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- target_perturbation_column_name
normally a ‘control’ column
- target_perturbation_column_value:
value to be found in the column defined previously.
- output_column_label:
Output column for the measurement. If this is missing, then it is set to target_perturbation_column_value.
- filter_columns(arguments: dict)
Workflow function: filter columns
Designed to be called from a workflow, Datasets may have columns defined for keeping or removal. This function also provides a convenient way to reorder dataframe columns.
- Parameters:
arguments (dict) –
- Dictionary of options, can include the following keys:
- keep: bool, optional, by default True
Only matching columns are kept if true. If false, they are removed.
- column_names: [list, str]
List of columns to keep (or regular expressions to match)
- column_name: str
Singular column to keep (or regular expressions to match)
- regex: bool, optional, by default False.
perform regular expression matching
Workflow function: Filter features by highly correlated then VIF.
Designed to be called from a workflow. Ideally, VIF would be applied to very large datasets; however, due to the almost n^2 number of linear regressions required as feature counts increase, this is not possible on datasets with a large number of features - such as methylation datasets. We therefore must use other methods to reduce the features to a comfortable level before VIF can be performed. This function calculates correlations (Pearson correlation coefficient) between all features and iteratively removes the features with the highest R^2 against another feature. Once the number of features is reduced to a level suitable for VIF, VIF is performed.
More information available: https://en.wikipedia.org/wiki/Variance_inflation_factor
- Parameters:
arguments (dict) –
- target_dataset:
Index or name of dataset which should have features filtered. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- n_before_vif :
Number of features to remove before applying VIF. This is required when dealing with large datasets which would be too time consuming to process entirely with VIF. Features are removed iteratively, selecting the most correlated features and removing them. If this key:value pair is absent, then it is as if the value of 1000 has been supplied.
- vif_cutoff :
The VIF cutoff value, above which features are removed. Features with VIF scores above 5.0 are considered highly correlated. If not supplied, then behaviour is as if a value of 5.0 was supplied.
- drop_columns :
If drop columns is True, then not only will features be removed from the dataset features list, but the columns for these features will be removed from the dataframe. If absent, then behaviour is as if False was supplied.
Workflow function: Perform filter of highly correlated features
Designed to be called from a workflow, performs filtering of highly correlated features (as calculated by Pearson correlation coefficient), either by removal of features correlated above a given threshold, or by iterative removal of the features with the highest R^2 against another feature. The arguments dictionary should contain a threshold or an n key:value pair, not both. A key of threshold with a float value defines the correlation above which features should be removed. If the n key is present, then features are iteratively removed until n features remain.
- Parameters:
arguments (dict) –
- target_dataset
Index or name of dataset which should have features filtered. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- threshold
If this key is present, then it activates threshold mode, whereby features correlated above this value are removed. A good value for this threshold is 0.9.
- n
If this key is present, then the number of features to keep is defined this way. The process works through iteratively removing features ordered by the most correlated until the number of features is equal to n. If threshold is also present, then n acts as a minimum number of features and feature removal will stop, no matter the correlations present in the dataset.
- drop_columns : bool, optional
If drop columns is True, then not only will features be removed from the dataset features list, but the columns for these features will be removed from the dataframe. If absent, then the behaviour is as if False was supplied as a value to this key:value pair.
- filter_rows(arguments: dict)
Workflow function: Filter rows
Designed to be called from a workflow, filter_rows allows keeping of only rows with a certain value in a certain column. Takes as arguments a dictionary containing a query_column key:value pair and one of query_value, query_values or values key:value pairs:
- query_column
name of the column that should match the value below
- query_value
value to match
- query_values
values to match (as a list)
- values
values to match (as a list)
Additionally, a key “keep” with a boolean value may be included. If True then rows matching the query are kept, if False, then rows matching are discarded, and non-matching rows kept.
- Parameters:
arguments (dict) – Dictionary containing query_column key and value defining the column name, and one of the following keys: query_value, query_values, values. If plural, then values under the key should be a list containing values to perform matching on, otherwise, singular value.
- Raises:
DataError – [description]
- if_blank_also_blank(arguments: dict)
Workflow function: if column is empty, also blank
Designed to be called from a workflow. Often it is required to clean or remove rows not needed for inclusion into further established pipelines/workflows. This workflow function allows values to be removed from a column on the condition that another column is empty.
- Parameters:
arguments (dict) –
Dictionary containing the following key:value pairs:
- query_column
value is the name of the column to perform the query on.
- regex_query
value is a boolean denoting if the query column value should be matched using a regular expression. If omitted, then behaves as if present and False.
- target_column
value is a string, denoting the name of the column which should be blanked.
- target_columns
value is a list of strings, denoting the names of columns which should be blanked.
- regex_targets
value is a boolean denoting if the target column or multiple target columns defined in target_columns should be matched using a regular expression. If absent, then behaves as if False was supplied.
- Raises:
KeyError – ‘query_column’ not found in arguments dictionary
IndexError – Multiple columns matched query_column using the regex
KeyError – No target columns found for if_blank_also_blank, use target_column keys
- load(arguments: dict)
Workflow function: load a dataset (CSV or PackagedDataset)
Workflow runnable function allowing loading of a dataset from CSV or a PackagedDataset. As with all workflow runnable functions, this is designed to be called from a workflow.
There are 2 possible options for loading in a dataset.
Firstly, loading a user supplied CSV file. This option is initiated through inclusion of the ‘file’ key within arguments. The value under the ‘file’ key should be a string or Path to the CSV file. In addition, a ‘metadata’ key is also required to be present in arguments, with a dictionary as the value. Within this dictionary under ‘metadata’, special keywords allow the reading of data in different formats. Special keys for the metadata dictionary are listed below (See Pandas documentation for read_csv for more in-depth information):
- sep
define separator for fields within the file (by default ‘,’)
- skiprows
define a number of rows to skip at the beginning of a file
- header_row_number
may be a single number, or list of numbers denoting header rows.
- transpose
In the case of some transcriptomics raw data, transposition is required to have samples row-wise. Therefore the table must be transposed. Set to True to transpose.
- index_col
Column to use as the row labels of the CSV given as either the column names or as numerical indexes. Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
- key
If an h5 file is supplied, then this is the key to the underlying pandas dataframe.
Secondly, we may load a packaged dataset, by including the ‘dataset’ key within the arguments dictionary. The value under this key should be one of the current packaged datasets supported by workflow mode - currently TCGA, CMAP, Iris, Iris_2_views, BreastCancer. An additional key in the dictionary can be ‘working_dir’ with the value being a string denoting the location of the file data on a local filesystem, or the location that it should be downloaded to and stored.
- Parameters:
arguments (dict) – Dictionary containing file and metadata keys, see function description/help/docstring.
- mahalanobis_distance(arguments: dict)
Workflow function: Add a column for the Mahalanobis distance to target perturbations.
Designed to be called from a workflow, calculates the Mahalanobis distance in feature space.
- Parameters:
arguments (dict, should contain:) –
- target_dataset
Index or name of dataset which should be used in the measurement. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- target_perturbation_column_name
normally a ‘control’ column
- target_perturbation_column_value:
value to be found in the column defined previously.
- output_column_label:
Output column for the measurement. If this is missing, then it is set to target_perturbation_column_value.
- manhattan_distance(arguments: dict)
Workflow function: Add a column for the Manhattan distance to target perturbation.
Designed to be called from a workflow, calculates the Manhattan distance in feature space. Also known as the cityblock distance.
- Parameters:
arguments (dict, should contain:) –
- target_dataset
Index or name of dataset which should be used in the measurement. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- target_perturbation_column_name
normally a ‘control’ column
- target_perturbation_column_value:
value to be found in the column defined previously.
- output_column_label:
Output column for the measurement. If this is missing, then it is set to target_perturbation_column_value.
- pca(arguments: dict)
Workflow function: Perform PCA dimensionality reduction technique.
Designed to be called from a workflow, performs the principal component dimensionality reduction technique. If no arguments are given or the arguments dictionary is empty, then 2D PCA is applied to the dataset with the highest index, equivalent of phe[-1], which is usually the last inserted dataset.
- Parameters:
arguments (dict) –
Dictionary of arguments to used to direct the PCA process, can contain the following keys and values
- target_dataset
Index or name of dataset which should have the dimensionality reduction applied. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- ndims
Number of dimensions to which the PCA should reduce the features. If absent, then defaults to 2.
- center_on_perturbation_id
PCA should be recentered on the perturbation with ID. If absent, then defaults to None, and no centering is performed.
- center_by_median
If true, then median of center_on_perturbation is used, if False, then the mean is used.
- fit_perturbation_ids
PCA may be fit to only the included IDs, before the transform is applied to the whole dataset.
- rename_column(arguments: dict)
Workflow function: Rename column
Designed to be called from a workflow, renames a single, or multiple, columns. The arguments dictionary should contain key:value pairs, where the key is the old column name and the value is the new column name.
- Parameters:
arguments (dict) – Dictionary containing name_from:name_to key value pairs, which will cause the column named ‘name_from’ to be renamed ‘name_to’. Multiple columns can be renamed in a single call, using multiple dictionary entries.
- Raises:
ValueError – ‘arguments’ was not a dictionary of type: str:str
- rename_columns(arguments: dict)
Workflow function: Rename columns
Designed to be called from a workflow, renames a single, or multiple, columns. The arguments dictionary should contain key:value pairs, where the key is the old column name and the value is the new column name.
- Parameters:
arguments (dict) – Dictionary containing name_from:name_to key value pairs, which will cause the column named ‘name_from’ to be renamed ‘name_to’. Multiple columns can be renamed in a single call, using multiple dictionary entries.
- Raises:
ValueError – ‘arguments’ was not a dictionary of type: str:str
- run_workflow()
Run the workflow defined in the workflow object instance
- scalar_projection(arguments: dict)
Workflow function: Add a column for the scalar projection to a target perturbation.
Designed to be called from a workflow, calculates the scalar projection and scalar rejection, quantifying on and off target phenotypes, as used in: Heiser, Katie, et al. “Identification of potential treatments for COVID-19 through artificial intelligence-enabled phenomic analysis of human cells infected with SARS-CoV-2.” BioRxiv (2020).
- Parameters:
arguments (dict, should contain:) –
- target_dataset
Index or name of dataset which should be used in the measurement. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- target_perturbation_column_name
normally a ‘control’ column
- target_perturbation_column_value:
value to be found in the column defined previously.
- output_column_label:
Output from the scalar projection will have the form: on_target_<output_column_label> and off_target_<output_column_label> if this is missing, then it is set to target_perturbation_column_value
- scatter(arguments: dict)
Workflow function: Make scatter plot.
Designed to be called from a workflow, produce a scatter plot from a dataframe.
- Parameters:
arguments (dict) –
- target_dataset
Index or name of dataset which should have features plotted. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- figsize
A tuple denoting the target output size in inches (!), if absent, then the default of (8,6) is used.
- title
Title for the plot, if absent, then the default “2D scatter” is used.
- peturbations
Can be a list of perturbations - as denoted in the perturbations column of the dataframe - to include in the plot. If absent, then all perturbations are included.
- destination
Output location for the PNG - required field. An error will be thrown if omitted.
- set_perturbation_column(arguments: dict)
Workflow function: Set the perturbation column
Designed to be called from a workflow, the perturbation column can be set on a dataset to help with plotting/scatters.
- Parameters:
arguments (dict) –
- target_dataset:
Index or name of dataset within which we wish to set the perturbation column. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- column:
str type giving the new column name which will be set to mark perturbations.
- tsne(arguments: dict)
Workflow function: Perform t-SNE dimensionality reduction technique.
Designed to be called from a workflow, performs the t-SNE dimensionality reduction technique. If no arguments are given or the arguments dictionary is empty, then 2D t-SNE is applied to the dataset with the highest index, equivalent of phe[-1], which is usually the last inserted dataset.
- Parameters:
arguments (dict) –
Dictionary of arguments to used to direct the t-SNE process, can contain the following keys and values:
- target_dataset
Index or name of dataset which should have the dimensionality reduction applied. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- ndims
number of dimensions to which the t-SNE should reduce the features. If absent, then defaults to 2.
- center_on_perturbation_id
t-SNE should be recentered on the perturbation with ID. If absent, then defaults to None, and no centering is performed.
- center_by_median
If true, then median of center_on_perturbation is used, if False, then the mean is used.
- umap(arguments: dict)
Workflow function: Perform UMAP dimensionality reduction technique.
Designed to be called from a workflow, performs the UMAP dimensionality reduction technique. If no arguments are given or the arguments dictionary is empty, then 2D UMAP is applied to the dataset with the highest index, equivalent of phe[-1], which is usually the last inserted dataset.
- Parameters:
arguments (dict) –
Dictionary of arguments used to direct the UMAP transform function. Can contain the following keys and values.
- target_dataset
Index or name of dataset which should have the dimensionality reduction applied. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.
- ndims
Number of dimensions to which the UMAP should reduce the features. If absent, then defaults to 2.
- center_on_perturbation_id
UMAP should be recentered on the perturbation with ID. If absent, then defaults to None, and no centering is performed.
- center_by_median
If true, then median of center_on_perturbation is used, if False, then the mean is used.
- write_csv(arguments: dict)
Workflow function: Write dataframe to CSV file.
Designed to be called from a workflow, writes a CSV file using the Pandas.DataFrame.to_csv function. Expects a dictionary as arguments containing a ‘path’ key, with a string value pointing at the destination location of the CSV file. Additional keys within the supplied dictionary are supplied to Pandas.DataFrame.to_csv as kwargs, allowing fully flexible output.
- Parameters:
arguments (dict) – Dictionary should contain a ‘path’ key, may also contain a target_dataset key which if absent, defaults to -1 (usually the last added dataset).
- write_multiple_csvs(arguments: dict)
Workflow function: Write multiple CSV files
Designed to be called from a workflow. Often it is useful to write a CSV file per plate within the dataset, or group the data by some other identifier.
- Parameters:
arguments (dict) –
Dictionary, should contain:
- split_by_column: str
the column to be split on
- output_dir: str
the target output directory
- file_prefix: str
optional prefix for each file.
- file suffix: str
optional suffix for each file.
- file_extension: str
optional file extension, by default ‘.csv’
- phenonaut.workflow.predict(self, arguments: dict)
Workflow function: predict
Profile predictors in their ability to predict a given target.
Phenonaut provides functionality to profile the performance of multiple predictors against multiple views of data. This is exemplified in the TCGA example used in the Phenonaut paper - see Example 1 - TCGA for a full walkthrough of applying this functionality to The Cancer Genome Atlas. With a given ‘target’ for prediction which is in the dataset, predict selects all appropriate predictors (classifiers for classification, regressors for regression and multiregressors for multi regression/view targets). Then, enumerating all views of the data and all predictors, hyperparameter optimisation coupled with 5-fold cross validation using Optuna is employed, before finally testing the best hyperparameter sets with retained test sets. This process is automatic and requires the data, and a prediction target. Output from this process is extensive and it may take a long time to complete, depending on the characteristics of your input data. Written output from the profiling process consists of performance heatmaps highlighting the best view/predictor combinations in bold, boxplots for each view combination, and a PPTX presentation file allowing easy sharing of data, along with machine readable CSV and JSON results.
For each unique view combination and predictor, perform the following:
- Merge views and remove samples which do not have features across currently needed views.
- Shuffle the samples.
- Withhold 20% of the data as a test set, to be tested against the trained and hyperparameter-optimised predictor.
- Split the data using 5-fold cross validation into train and validation sets.
- For each fold, perform Optuna hyperparameter optimisation for the given predictor using the train sets, using hyperparameters described by the default predictors for classification, regression and multiregression.
- Parameters:
arguments (dict) –
- output_directory
Directory into which profiling output (boxplots, heatmaps, CSV, JSON and PPTX) should be written.
- dataset_combinations
If multiple datasets are already loaded, then lists of ‘views’ may be specified for exploration. If None, or this argument is absent, then all combinations of available views/Datasets are enumerated and used.
- target
The prediction target, denoted by a column name given here which exists in loaded datasets.
- n_splits
Number of splits to use in the N-fold cross validation, if absent, then the default of 5 is used.
- n_optuna_trials
Number of Optuna trials for hyperparameter optimisation, by default 20. This drastically impacts runtime, so if things are taking too long, you may wish to lower this number. For a more thorough exploration of hyperparameter space, increase this number.
- optuna_merge_folds
By default, each fold has its hyperparameters optimised and the trained predictor, along with its parameters, reported per fold. If optuna_merge_folds is True, then hyperparameters are optimised across folds (not per fold). Setting this to False may be useful depending on the intended use of the predictor; it is believed that when False, with parameters not optimised across folds, more accurate prediction variance/accuracy estimates are produced. If absent, behaves as if False.
- test_set_fraction
When optimising a predictor, by default a fraction of the total data is held back for testing, separate from the train-validation splits. This test_set_fraction controls the size of this split. If absent, then the default value of 0.2 is used.
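As a sketch, an arguments dictionary for this profiling step might look like the following; the target column name and output directory are illustrative:

```python
# Illustrative arguments dictionary for the predict workflow step.
# 'target' must name a column present in the loaded datasets.
predict_arguments = {
    "output_directory": "output/prediction_profiling",
    "target": "tumour_type",   # hypothetical prediction target column
    "n_splits": 5,             # 5-fold cross validation (the default)
    "n_optuna_trials": 20,     # default number of Optuna trials
    "optuna_merge_folds": False,
    "test_set_fraction": 0.2,
}
```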
Module contents
- class phenonaut.Phenonaut(dataset: Dataset | list[Dataset] | PackagedDataset | Bunch | DataFrame | Path | str | None = None, name: str = 'Phenonaut object', kind: str | None = None, packaged_dataset_name_filter: str | list[str] | None = None, metadata: dict | list[dict] | None = {}, features: list[str] | None = None, dataframe_name: str | list[str] | None = None, init_hash: str | bytes | None = None)
Bases:
object
Phenonaut object constructor
Holds multiple datasets of different types; applies transforms and performs load and tracking operations.
May be initialised with:
Phenonaut Datasets
Phenonaut PackagedDataset
scikit-learn Bunch
pd.DataFrame
by passing the object as an optional dataset argument.
- Parameters:
dataset (Optional[Union[Dataset, list[Dataset], PackagedDataset, Bunch, pd.DataFrame, Path, str]], optional) – Initialise the Phenonaut object with a Dataset, list of Datasets, or PackagedDataset, by default None.
name (str) – A name may be given to the Phenonaut object. This is useful in naming collections of datasets. For example, The Cancer Genome Atlas contains 4 different views on tumours - mRNA, miRNA, methylation and RPPA; collectively, these 4 datasets loaded into a Phenonaut object may be named 'TCGA', or 'The Cancer Genome Atlas dataset'. If set to None, then the Phenonaut object takes the name "Phenonaut data", except when the object is constructed from a Phenonaut packaged dataset or an already named Phenonaut object, in which case it takes the name of the passed object/dataset.
kind (Optional[str]) – Instead of providing metadata, some presets are available which make reading in things like DRUG-Seq easier. This argument only has an effect when reading in a raw data file, like CSV or H5, and directs Phenonaut to use a predefined set of parameters/transforms. If used as well as metadata, then the preset metadata dictionary from the kind argument is loaded first and then updated with anything in the metadata dictionary, allowing specific presets from kind dictionaries to be overridden. Available 'kind' dictionaries may be listed by examining: phenonaut.data.recipes.recipes.keys()
packaged_dataset_name_filter (Optional[Union[list[str], str]], optional) – If a PackagedDataset is supplied for the dataset argument, then import only the datasets from it named in this filter. Can be a single string or a list of strings. If None and a PackagedDataset is supplied, then all of its Datasets are loaded. Has no effect if dataset is not a PackagedDataset, by default None.
metadata (Optional[Union[dict, list[dict]]]) – Used when a pandas DataFrame is passed to the constructor of the Phenonaut object. Metadata typically contains features or feature_prefix keys telling Phenonaut which columns should be treated as Dataset features. Can also be a list of metadata dictionaries if a list of pandas DataFrames is supplied to the constructor. Has no effect if the type of dataset passed is not a pandas DataFrame or list of pandas DataFrames. If a list of pandas DataFrames is passed to dataset but only one metadata dictionary is given, then this dictionary is applied to all DataFrames. By default {}.
features (Optional[list[str]] = None) – May be used as a shortcut to including features in the metadata dictionary. Only used if the metadata is a dict and does not contain a features key.
dataframe_name (Optional[Union[str, list[str]]]) – Used when a pandas DataFrame, or a str or Path to a CSV file, is passed to the constructor of the Phenonaut object. Optional name to give to the Dataset object constructed from the pandas DataFrame. If multiple DataFrames are given in a list, then this dataframe_name argument can be a list of strings providing names for the new Dataset objects.
init_hash (Optional[Union[str, bytes]]) – Cryptographic hashing within Phenonaut can be initialised with a starting/seed hash. This is useful in the creation of blockchain-like chains of hashes. In environments where timestamping is unavailable, hashes may be published and then used as input to subsequent experiments, building up a provable chain along the way. By default None, implying an empty bytes array.
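A minimal sketch of constructing a Phenonaut object from a pandas DataFrame, assuming the metadata 'features' key is used as described above; the data and names are invented for illustration:

```python
import pandas as pd
from phenonaut import Phenonaut

# Toy DataFrame with two feature columns and a perturbation/metadata column.
df = pd.DataFrame(
    {
        "feat_1": [0.1, 0.4, 0.9],
        "feat_2": [1.2, 0.8, 1.1],
        "cpd": ["DMSO", "cpd_A", "cpd_B"],
    }
)

# Tell Phenonaut which columns are features via the metadata dictionary.
phe = Phenonaut(
    df,
    name="Example collection",
    metadata={"features": ["feat_1", "feat_2"]},
    dataframe_name="toy_dataset",
)
print(phe.keys())  # should list the single dataset name given above
```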
- add_well_id(numerical_column_name: str = 'COLUMN', numerical_row_name: str = 'ROW', plate_type: int = 384, new_well_column_name: str = 'Well', add_empty_wells: bool = False, plate_barcode_column: str | None = None, no_sort: bool = False)
Add standard well IDs - such as A1, A2, etc to ALL loaded Datasets.
If a dataset contains numerical row and column names, then they may be translated into standard letter-number well IDs. This is applied to all loaded Datasets. If you wish only one to be annotated, then call add_well_id on that individual dataset.
- Parameters:
numerical_column_name (str, optional) – Name of column containing numeric column number, by default “COLUMN”.
numerical_row_name (str, optional) – Name of column containing the numeric row number, by default "ROW".
plate_type (int, optional) – Plate type - note, at present, only 384 well plate format is supported, by default 384.
new_well_column_name (str, optional) – Name of new column containing letter-number well ID, by default “Well”.
add_empty_wells (bool, optional) – Should all wells from a plate be inserted, even when missing from the data, by default False.
plate_barcode_column (str, optional) – Multiple plates may be in a dataset, this column contains their unique ID, by default None.
no_sort (bool, optional) – Do not resort the dataset by well ID, by default False
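For example, a dataset holding numeric 'ROW' and 'COLUMN' columns and a hypothetical 'BARCODE' plate column might be annotated as follows:

```python
# Add letter-number well IDs (A1, A2, ...) to every loaded dataset.
# "BARCODE" is a hypothetical column distinguishing plates.
phe.add_well_id(
    numerical_column_name="COLUMN",
    numerical_row_name="ROW",
    plate_type=384,
    new_well_column_name="Well",
    plate_barcode_column="BARCODE",
)
```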
- aggregate_dataset(composite_identifier_columns: list[str], datasets: Iterable[int] | Iterable[str] | int | str = -1, new_names_or_prefix: list[str] | tuple[str] | str = 'Aggregated_', inplace: bool = False, transformation_lookup: dict[str, Callable | str] | None = None, tranformation_lookup_default_value: str | Callable = 'mean')
Aggregate multiple or single phenonaut dataset rows
If we have a Phenonaut object containing data derived from 2 fields of view from a microscopy image, a sensible approach is averaging features. If we have the DataFrame below, we may merge FOV 1 and FOV 2, taking the mean of all features. As strings such as filenames should be kept, they are concatenated together, separated by a comma, unless the strings are the same, in which case just one is used.
Consider a DataFrame as follows:
| ROW | COLUMN | BARCODE | feat_1 | feat_2 | feat_3 | filename  | FOV |
|-----|--------|---------|--------|--------|--------|-----------|-----|
| 1   | 1      | Plate1  | 1.2    | 1.2    | 1.3    | FileA.png | 1   |
| 1   | 1      | Plate1  | 1.3    | 1.4    | 1.5    | FileB.png | 2   |
| 1   | 1      | Plate2  | 5.2    | 5.1    | 5      | FileC.png | 1   |
| 1   | 1      | Plate2  | 6.2    | 6.1    | 6.8    | FileD.png | 2   |
| 1   | 2      | Plate1  | 0.1    | 0.2    | 0.3    | FileE.png | 1   |
| 1   | 2      | Plate1  | 0.2    | 0.2    | 0.38   | FileF.png | 2   |
With just this loaded into a Phenonaut object, we can call:
phe.aggregate_dataset(['ROW', 'COLUMN', 'BARCODE'])
This will merge the fields of view and produce another, secondary dataset in the phe object containing:
| ROW | COLUMN | BARCODE | feat_1 | feat_2 | feat_3 | filename            | FOV |
|-----|--------|---------|--------|--------|--------|---------------------|-----|
| 1   | 1      | Plate1  | 1.25   | 1.3    | 1.40   | fileA.png,FileB.png | 1.5 |
| 1   | 1      | Plate2  | 5.70   | 5.6    | 5.90   | FileC.png,FileD.png | 1.5 |
| 1   | 2      | Plate1  | 0.15   | 0.2    | 0.34   | FileF.png,fileE.png | 1.5 |
If inplace=True is passed in the call to aggregate_dataset, then the Phenonaut object will contain just one dataset: the new aggregated dataset.
- Parameters:
composite_identifier_columns (list[str]) – If a biochemical assay evaluated through imaging is identified by a row, column, and barcode (for the plate) but multiple images taken from a well, then these multiple fields of view can be merged, creating averaged features using row, column and barcode as the composite identifier on which to merge fields of view.
datasets (Union[list[int], list[str], int, str]) – Which datasets to apply the aggregation to. If int, then the dataset with that index undergoes aggregation. If a string, then the dataset with that name undergoes aggregation. It may also be a list or tuple of mixed int and string types, with ints specifying dataset indexes and strings indicating dataset names. By default, this value is -1, indicating that the last added dataset should undergo aggregation.
new_names_or_prefix (Union[list[str], tuple[str], str]) – If a list or tuple of strings is passed, then use them as the names for the new datasets after aggregation. If a single string is passed, then use this as a prefix for the new dataset. By default “Aggregated_”.
inplace (bool) – Perform the aggregation in place, overwriting the original dataframes. By default False.
transformation_lookup (dict[str, Union[Callable, str]]) – Dictionary mapping data types to aggregations. When None, it is as if the dictionary {np.dtype("O"): lambda x: ",".join([f"{item}" for item in set(x)])} was provided, concatenating strings together (separated by a comma) if they are different and just using one if they are the same across rows. If a type not present in the dictionary is encountered (such as int or float in the above example), then the default specified by tranformation_lookup_default_value is used. By default, None.
tranformation_lookup_default_value (Union[str, Callable]) – Transformation to apply if the data type is not found in the transformation_lookup_dictionary, can be a callable or string to pandas defined string to function shortcut mappings. By default “mean”.
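As a sketch of the optional aggregation arguments, the fields of view above could be aggregated using the median (rather than the default mean) for numeric columns, while keeping comma-concatenation for strings; note that the keyword spelling follows the signature above:

```python
import numpy as np

# Aggregate fields of view on ROW/COLUMN/BARCODE, using the median for numeric
# columns and joining distinct strings with a comma for object columns.
phe.aggregate_dataset(
    composite_identifier_columns=["ROW", "COLUMN", "BARCODE"],
    transformation_lookup={
        np.dtype("O"): lambda x: ",".join(sorted({f"{item}" for item in set(x)}))
    },
    tranformation_lookup_default_value="median",  # spelling as in the signature
)
```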
- append(dataset: Dataset) None
Add a dataset to the Phenonaut object
- Parameters:
dataset (Dataset) – Dataset to be added
- clone_dataset(existing_dataset: Dataset | str | int, new_dataset_name: str, overwrite_existing: bool = False) None
Clone a dataset into a new dataset
- Parameters:
existing_dataset (Union[Dataset, str, int]) – The name or index of an existing Phenonaut Dataset held in the Phenonaut object. Can also be a Phenonaut.data.Dataset object passed directly.
new_dataset_name (str) – A name for the new cloned Dataset.
overwrite_existing (bool, optional) – If a dataset by this name exists, then overwrite it, otherwise, an exception is raised, by default False.
- Raises:
ValueError – Dataset by the name given already exists and overwrite_existing was False.
ValueError – The existing_dataset argument should be a str, int or Phenonaut.data.Dataset.
- combine_datasets(dataset_ids_to_combine: list[str] | list[int] | None = None, new_name: str | None = None, features: list | None = None)
Combine multiple datasets into a single dataset
Often, large datasets are split across multiple CSV files. For example, one CSV file per screening plate. In this instance, it is prudent to combine the datasets into one.
- Parameters:
dataset_ids_to_combine (Optional[Union[list[str], list[int]]]) – List of dataset indexes, or list of names of datasets to combine. For example, after loading in 2 datasets, the list [0,1] would be given, or a list of their names resulting in a new third dataset in datasets[2]. If None, then all present datasets are used for the merge. By default, None.
new_name (Optional[str]) – Name that should be given to the newly created dataset. If None, then it is assigned as: "Combined_dataset from datasets[DS_INDEX_LIST]", where DS_INDEX_LIST is a list of the combined dataset indexes.
features (list, optional) – List of features which should be used by the newly created dataset. If None, then the features of the combined datasets are used. By default None.
- Raises:
DataError – Error raised if the combined datasets do not have the same features.
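A minimal sketch, assuming two per-plate datasets have already been loaded (for example via load_dataset, described below):

```python
# Combine the first two loaded datasets into a new third dataset.
phe.combine_datasets(
    dataset_ids_to_combine=[0, 1],
    new_name="Plates 1 and 2 combined",
)
```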
- property data: Dataset
Return the data of the highest index in phenonaut.data.datasets
Calling phe.data is the same as calling phe.ds.data
- Returns:
Last added/highest indexed Dataset features
- Return type:
np.ndarray
- Raises:
DataError – No datasets loaded
- describe() None
- property df: DataFrame
Return the pd.DataFrame of the last added/highest indexed Dataset
Returns the internal pd.Dataframe of the Dataset contained within the Phenonaut instance’s datasets list.
- Returns:
The pd.DataFrame of the last added/highest indexed Dataset.
- Return type:
pd.DataFrame
- property ds: Dataset
Return the dataset with the highest index in phenonaut.data.datasets
- Returns:
Last added/highest indexed Dataset
- Return type:
Dataset
- Raises:
DataError – No datasets loaded
- filter_datasets_on_identifiers(filter_field: str, filter_data: list[str] | dict | Path, dict_path: str | None = None, additional_items: list[str] | None = None) None
- get_dataset_combinations(min_datasets: int | None = None, max_datasets: int | None = None, return_indexes: bool = False)
Get tuple of all dataset name combinations, picking 1 to n datasets
The function returns all combinations of 1 to n dataset names, where n is the number of loaded datasets. This is useful in multiomics settings where we test A, B, and C alone, then A&B, A&C, B&C, and finally A&B&C.
A limit on the number of datasets in a combination can be imposed using the max_datasets argument. In the example above with datasets A, B and C, passing max_datasets=2 would return the following tuple: ((A), (B), (C), (A, B), (A, C), (B, C)), leaving out the triple-length combination (A, B, C).
Similarly, the argument min_datasets can specify a lower limit on the number of dataset combinations.
Using the example with datasets A, B, and C, and setting min_datasets=2 with no limit on max_datasets on the above example would return the following tuple: ((A, B), (A, C), (B, C), (A, B, C))
If return_indexes is True, then the indexes of Datasets are returned. As directly above, datasets A, B, and C, setting min_datasets=2 with no limit on max_datasets and passing return_indexes=True would return the following tuple: ((0, 1), (0, 2), (1, 2), (0, 1, 2))
- Parameters:
min_datasets (Optional[int], optional) – Minimum number of datasets to return in a combination. If None, then it behaves as if 1 is given, by default None.
max_datasets (Optional[int], optional) – Maximum number of datasets to return in a combination. If None, then it behaves as if len(datasets) is given, by default None.
return_indexes (bool) – Return indexes of Datasets, instead of their names, by default False.
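For example, with three loaded datasets named 'A', 'B' and 'C' (as in the description above):

```python
# Enumerate all view combinations containing at least two datasets.
combos = phe.get_dataset_combinations(min_datasets=2)
# combos == (('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'B', 'C'))

# The same combinations as dataset indexes rather than names.
index_combos = phe.get_dataset_combinations(min_datasets=2, return_indexes=True)
# index_combos == ((0, 1), (0, 2), (1, 2), (0, 1, 2))
```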
- get_dataset_index_from_name(name: str | list[str] | tuple[str]) int | list[int]
Get dataset index from name
Given the name of a dataset, return the index of it in datasets list. Accepts single string query, or a list/tuple of names to return lists of indices.
- Parameters:
name (Union[str, list[str], tuple[str]]) – If string, then this is the dataset name being searched for. Its index in the datasets list will be returned. If a list or tuple of names, then the index of each is searched and an index list returned.
- Returns:
If the name argument is a string, then the dataset index is returned. If the name argument is a list or tuple, then a list of indexes, one for each dataset name, is returned.
- Return type:
Union[int, list[int]]
- Raises:
ValueError – Error raised if no datasets were found to match a requested name.
- get_dataset_names() list[str]
Get a list of dataset names
- Returns:
List containing the names of datasets within this Phenonaut object.
- Return type:
list[str]
- get_df_features_perturbation_column(ds_index=-1, quiet: bool = False) tuple[DataFrame, list[str], str | None]
Helper function to obtain DataFrame, features and perturbation column name.
Some Phenonaut functions allow passing of a Phenonaut object or a Dataset; they then access the underlying pd.DataFrame for calculations. This helper function is present on both Phenonaut and Dataset objects, allowing more concise code and less replication when obtaining the underlying data. If multiple Datasets are present, then the last added Dataset is used to obtain data; this behaviour can be changed by passing the ds_index argument.
- Parameters:
ds_index (int, optional) – Index of the Dataset to be returned. By default -1, which uses the last added Dataset.
quiet (bool) – When checking whether the perturbation column is set, do so without emitting a warning if it is None.
- Returns:
Tuple containing the Dataframe, a list of features and the perturbation column name.
- Return type:
tuple[pd.DataFrame, list[str], str]
- get_hash_dictionary() dict
Returns dictionary containing SHA256 hashes
Returns a dictionary of base64 encoded UTF-8 strings representing the SHA256 hashes of datasets (along with names), combined datasets, and the Phenonaut object (including name).
- Returns:
Dictionary of base64 encoded SHA256 representing datasets and the Phenonaut object which created them.
- Return type:
dict
- groupby_datasets(by: str | List[str], ds_index=-1, remove_original=True)
Perform a groupby operation on a dataset
Akin to performing a groupby operation on a pd.DataFrame, this splits a dataset by column(s) into multiple datasets, removing the original by default (it may optionally be kept).
- Parameters:
by (Union[str, list]) – Columns in the dataset’s DataFrames which should be used for grouping
ds_index (int, optional) – Index of the Dataset to be returned. By default -1, which uses the last added Dataset.
remove_original (bool, optional) – If True, then the original dataset is removed after splitting. By default True.
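A minimal sketch, assuming the last added dataset has a hypothetical 'BARCODE' column identifying plates:

```python
# Split the last added dataset into one dataset per plate barcode,
# removing the original dataset (the default behaviour).
phe.groupby_datasets("BARCODE")
print(phe.keys())  # inspect the resulting dataset names
```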
- keys() list[str]
Return a list of all dataset names
- Returns:
List of dataset names, empty list if no datasets are loaded.
- Return type:
list(str)
- classmethod load(filepath: str | Path) Phenonaut
Class method to load a compressed Phenonaut object
Loads a gzipped Python pickle containing a Phenonaut object
- Parameters:
filepath (Union[str, Path]) – Location of gzipped Phenonaut object pickle
- Returns:
Loaded Phenonaut object.
- Return type:
Phenonaut
- Raises:
FileNotFoundError – File not found, unable to load pickled Phenonaut object.
- load_dataset(dataset_name: str, input_file_path: Path | str, metadata: dict | None = None, h5_key: str | None = None, features: list[str] | None = None)
Load a dataset from a CSV, TSV or H5 file, optionally supplying metadata and features
- Parameters:
dataset_name (str) – Name to be assigned to the dataset
input_file_path (Union[Path, str]) – CSV/TSV/H5 file location
metadata (dict, optional) – Metadata dictionary describing the CSV data format, by default None
h5_key (Optional[str]) – If input_file_path is an h5 file, then a key to access the target DataFrame must be supplied.
features (Optional[list[str]]) – Optionally supply a list of features here. If None, then the features/feature-finding related keys in metadata are used. An empty list may also be supplied to explicitly specify that the dataset has no features, although this is not recommended.
- merge_datasets(datasets: List[Dataset] | List[int] | Literal['all'] = 'all', new_dataset_name: str | None = 'Merged Dataset', return_merged: bool = False, remove_merged: bool | None = True)
Merge datasets
After performing a groupby operation on Phenonaut.data.Dataset objects, a list of datasets may be merged into a single Dataset using this method.
- Parameters:
datasets (Union[List[Dataset], List[int], List[str]]) – Datasets which should be merged together. May be a list of Datasets, in which case these are merged together and inserted into the Phenonaut object, or a list of integers or dataset names which will be used to look up datasets in the current Phenonaut object. Mixing of ints and dataset string identifiers is acceptable, but mixing of any identifier and Datasets is not supported. If 'all', then all datasets in the Phenonaut object are merged. By default 'all'.
return_merged (bool) – If True the merged dataset is returned by this function. If False, then the new merged dataset is added to the current Phenonaut object. By default False
remove_merged (bool) – If True and return_merged is False (causing the new merged dataset to be added to the Phenonaut object), then source datasets present in the current object (addressed by index or name in the datasets list) are removed from the Phenonaut object. By default True.
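Continuing the groupby sketch above, the per-plate datasets could be merged back into a single dataset as follows:

```python
# Merge all datasets currently held in the Phenonaut object back into one,
# removing the per-plate source datasets (the default behaviour).
phe.merge_datasets("all", new_dataset_name="Merged Dataset")
```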
- new_dataset_from_query(name, query: str, query_dataset_name_or_index: int | str = -1, raise_error_on_empty: bool = True, overwrite_existing: bool = False)
Add new dataset through a pandas query of existing dataset
- Parameters:
query (str) – The pandas query used to select the new dataset
name (str) – A name for the new dataset
query_dataset_name_or_index (Union[int, str], optional) – The dataset to be queried, can be an int index, or the name of an existing dataset. List indexing can also be used, such that -1 uses the last dataset in Phenonaut.data.datasets list, by default -1.
raise_error_on_empty (bool) – Raise a ValueError if the query returns an empty dataset. By default True.
overwrite_existing (bool) – If a dataset already exists with the name given in the name argument, then this argument can be used to overwrite it, by default False.
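A minimal sketch, assuming the last added dataset has a hypothetical 'BARCODE' column:

```python
# Create a new dataset containing only rows from plate "Plate1",
# querying the last added dataset.
phe.new_dataset_from_query("Plate1 only", "BARCODE == 'Plate1'")
```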
- revert() None
Revert a recently saved Phenonaut object to its previous state
Upon calling save on a Phenonaut object, the output file location is recorded, allowing a quick way to revert changes by calling .revert(). This returns the object to its saved state.
- Raises:
FileNotFoundError – File not found, Phenonaut object has never been written out.
- save(output_filename: str | Path, overwrite_existing: bool = False) None
Save Phenonaut object and contained Data to a pickle
Writes a gzipped Python pickle file. If no compression, or another compression format is required, then the user should use a custom pickle.dump and not rely on this helper function.
- Parameters:
output_filename (Union[str, Path]) – Output filename for the gzipped pickle
overwrite_existing (bool, optional) – If True and the file exists, overwrite it. By default False.
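A minimal sketch of saving and later reloading a Phenonaut object; the file path is illustrative:

```python
from phenonaut import Phenonaut

# Write the object (and all contained datasets) as a gzipped pickle.
phe.save("phe_analysis.pkl.gz", overwrite_existing=True)

# Later, restore it via the class method (or the phenonaut.load convenience function).
phe_reloaded = Phenonaut.load("phe_analysis.pkl.gz")
```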
- shrink(keep_prefix: str | list[str] | None = 'Metadata_')
Reduce the size of all datasets by removing unused columns from the internal DataFrames
Often datasets contain intermediate features or unused columns which can be removed. This function removes every column from a Dataset internal DataFrame that is not the perturbation_column, or has a given prefix. By default, this prefix is “Metadata_”, however this can be removed, changed, or a new list of prefixes supplied using the keep_prefix argument.
- Parameters:
keep_prefix (str | list[str] | None, optional) – Prefix for columns which should be kept during shrinking of the dataset. This prefix applies to columns which are not features (features are kept automatically). Can be a list of prefixes, or None, by default "Metadata_".
- subtract_median_perturbation(perturbation_label: str, per_column_name: str | None = None, new_features_prefix: str = 'SMP_')
Subtract the median perturbation from all features for all datasets.
Useful for normalisation within a well/plate format. The median perturbation is identified through the per_column_name argument and the perturbation label. Newly generated features may have their prefixes controlled via the new_features_prefix argument.
- Parameters:
perturbation_label (str) – The perturbation label which should be used to calculate the median
per_column_name (Optional[str], optional) – The perturbation column name. This is optional and can be None, as the Dataset may already have perturbation column set. By default, None.
new_features_prefix (str) – Prefix for new features, each with the median perturbation subtracted. By default ‘SMP_’ (for subtracted median perturbation).
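A minimal sketch, assuming a hypothetical 'cpd' perturbation column with a 'DMSO' control label:

```python
# Subtract the median DMSO profile from every feature, across all loaded datasets.
# New features are written with the default 'SMP_' prefix.
phe.subtract_median_perturbation("DMSO", per_column_name="cpd")
```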
- class phenonaut.PlatemapQuerier(platemap_directory: str | Path, platemap_csv_files: list | str | Path | None = None, plate_name_before_underscore_in_filename=True)
Bases:
object
- get_compound_locations(cpd, plates: str | list | None = None)
- plate_to_cpd_to_well_dict = {}
- platemap_files = None
- phenonaut.dataset_intersection(datasets: list[Dataset], groupby: str | list[str], inplace=False)
Perform intersection of datasets on common column values
This is useful to match experimental data across views; for example, using a groupby of ['cpd', 'conc'], datasets can be filtered to contain only compounds at concentrations present in all DataFrames. This is particularly useful in integration work where each dataset represents a different view/assay technology.
- Parameters:
datasets (list[phenonaut.data.Dataset]) – List of datasets to perform treatment intersection filtering
groupby (str | list[str]) – Columns present in each dataset on which rows must match across all datasets
- Returns:
List of filtered datasets if inplace is False, else datasets are altered in place and None is returned
- Return type:
list[phenonaut.data.Dataset] | None
- Raises:
ValueError – Error if groupby fields are not found in all Dataset DataFrame columns
RuntimeError – Temporary column exists in the dataframe. This error should never occur.
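A minimal sketch, assuming each Dataset has hypothetical 'cpd' and 'conc' columns and that the Phenonaut object exposes its Datasets via a datasets list (as the docstrings above suggest):

```python
import phenonaut

# Keep only compound/concentration pairs present in every view.
filtered = phenonaut.dataset_intersection(phe.datasets, groupby=["cpd", "conc"])

# Or filter the datasets in place, in which case None is returned.
phenonaut.dataset_intersection(phe.datasets, groupby=["cpd", "conc"], inplace=True)
```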
- phenonaut.load(input_file: Path | str, shrink=False, shrink_keep_prefix: str | list[str] | None = 'Metadata_') Phenonaut
Convenience function allowing phenonaut.load
Allows calling of phenonaut.load() rather than phenonaut.Phenonaut.load()
- Parameters:
input_file (Path | str) – Pickle file path of the Phenonaut object which is to be loaded
shrink (bool) – If True then datasets are shrunk after loading to remove unused columns, by default False
shrink_keep_prefix (str | list[str] | None, optional) – Prefix for columns which should be kept during shrinking of the dataset. This prefix applies to columns which are not features (features are kept automatically). Can be a list of prefixes, or None, by default "Metadata_".
- Returns:
Loaded Phenonaut object
- Return type:
Phenonaut
- phenonaut.match_perturbation_columns(*args) None
Make all dataset perturbation_columns match that of the first supplied dataset
Two or more datasets supplied as arguments to this function will have the second (and any subsequent) dataset's perturbation_column set to that of the first, also renaming the underlying DataFrame columns as appropriate. If Phenonaut objects are given, then everything matches the first Dataset of the first given argument.
- Raises:
ValueError – 2 or more datasets must be supplied