phenonaut package

Subpackages

Submodules

phenonaut.errors module

exception phenonaut.errors.NotEnoughRowsError

Bases: Exception

phenonaut.phenonaut module

class phenonaut.phenonaut.Phenonaut(dataset: Dataset | list[Dataset] | PackagedDataset | Bunch | DataFrame | Path | str | None = None, name: str = 'Phenonaut object', kind: str | None = None, packaged_dataset_name_filter: str | list[str] | None = None, metadata: dict | list[dict] | None = {}, features: list[str] | None = None, dataframe_name: str | list[str] | None = None, init_hash: str | bytes | None = None)

Bases: object

Phenonaut object constructor

Holds multiple datasets of different types, and applies transforms, loading, and tracking operations.

May be initialised with:

  • Phenonaut Datasets

  • Phenonaut PackagedDataset

  • scikit-learn Bunch

  • pd.DataFrame

by passing the object as an optional dataset argument.

Parameters:
  • dataset (Optional[Union[Dataset, list[Dataset], PackagedDataset, Bunch, pd.DataFrame, Path, str]], optional) – Initialise the Phenonaut object with a Dataset, list of Datasets, or PackagedDataset, by default None.

  • name (str) – A name may be given to the Phenonaut object. This is useful in naming collections of datasets. For example, The Cancer Genome Atlas contains 4 different views on tumors - mRNA, miRNA, methylation and RPPA; collectively, these 4 datasets loaded into a Phenonaut object may be named 'TCGA', or 'The Cancer Genome Atlas dataset'. If set to None, then the Phenonaut object takes the name "Phenonaut data", except when constructed from a Phenonaut packaged dataset or an already named Phenonaut object, in which case it takes the name of the passed object/dataset.

  • kind (Optional[str]) – Instead of providing metadata, some presets are available which make reading in things like DRUG-Seq easier. This argument only has an effect when reading in a raw data file, like CSV or H5, and directs Phenonaut to use a predefined set of parameters/transforms. If used as well as metadata, then the preset metadata dictionary from the kind argument is first loaded, then updated with anything in the metadata dictionary; this allows overriding of specific presets present in kind dictionaries. Available 'kind' dictionaries may be listed by examining: phenonaut.data.recipes.recipes.keys()

  • packaged_dataset_name_filter (Optional[Union[list[str], str]], optional) – If a PackagedDataset is supplied as the dataset argument, then import only the datasets from it named in this filter, which may be a single string or a list of strings. If None, then all datasets within the PackagedDataset are loaded. Has no effect if dataset is not a PackagedDataset. By default None.

  • metadata (Optional[Union[dict, list[dict]]]) – Used when a pandas DataFrame is passed to the constructor of the Phenonaut object. Metadata typically contains features or feature_prefix keys telling Phenonaut which columns should be treated as Dataset features. Can also be a list of metadata dictionaries if a list of pandas DataFrames is supplied to the constructor. Has no effect if the type of dataset passed is not a pandas DataFrame or list of pandas DataFrames. If a list of pandas DataFrames is passed but only one metadata dictionary is given, then this dictionary is applied to all DataFrames. By default {}.

  • features (Optional[list[str]] = None) – May be used as a shortcut to including features in the metadata dictionary. Only used if the metadata is a dict and does not contain a features key.

  • dataframe_name (Optional[Union[str, list[str]]]) – Used when a pandas DataFrame, str, or Path to a CSV file is passed to the constructor of the Phenonaut object. Optional name to give to the Dataset object constructed from the pandas DataFrame. If multiple DataFrames are given in a list, then this dataframe_name argument may be a list of strings used as names for the new Dataset objects.

  • init_hash (Optional[Union[str, bytes]]) – Cryptographic hashing within Phenonaut can be initialised with a starting/seed hash. This is useful in the creation of blockchain-like chains of hashes. In environments where timestamping is unavailable, hashes may be published and then used as input to subsequent experiments, building up a provable chain along the way. By default None, implying an empty bytes array.
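As a minimal usage sketch (not part of the API reference; the DataFrame contents and column names below are purely illustrative), a Phenonaut object may be constructed directly from an in-memory pandas DataFrame:

import pandas as pd
import phenonaut

# Illustrative screening data with two feature columns
df = pd.DataFrame(
    {
        "ROW": [1, 1],
        "COLUMN": [1, 2],
        "feat_1": [0.1, 0.5],
        "feat_2": [1.2, 0.9],
    }
)

# The features argument is a shortcut to supplying features via metadata
phe = phenonaut.Phenonaut(df, name="Example", features=["feat_1", "feat_2"])
print(phe.keys())  # dataset names held by the Phenonaut object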

add_well_id(numerical_column_name: str = 'COLUMN', numerical_row_name: str = 'ROW', plate_type: int = 384, new_well_column_name: str = 'Well', add_empty_wells: bool = False, plate_barcode_column: str | None = None, no_sort: bool = False)

Add standard well IDs - such as A1, A2, etc to ALL loaded Datasets.

If a dataset contains numerical row and column names, then they may be translated into standard letter-number well IDs. This is applied to all loaded Datasets. If you wish only one to be annotated, then call add_well_id on that individual dataset.

Parameters:
  • numerical_column_name (str, optional) – Name of column containing numeric column number, by default “COLUMN”.

  • numerical_row_name (str, optional) – Name of the column containing the numeric row number, by default "ROW".

  • plate_type (int, optional) – Plate type - note, at present, only 384 well plate format is supported, by default 384.

  • new_well_column_name (str, optional) – Name of new column containing letter-number well ID, by default “Well”.

  • add_empty_wells (bool, optional) – Should all wells from a plate be inserted, even when missing from the data, by default False.

  • plate_barcode_column (str, optional) – Multiple plates may be in a dataset, this column contains their unique ID, by default None.

  • no_sort (bool, optional) – Do not resort the dataset by well ID, by default False
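A minimal sketch of the call (assuming a dataset with numeric ROW and COLUMN columns, as used in the aggregation example below):

import pandas as pd
import phenonaut

df = pd.DataFrame({"ROW": [1, 1], "COLUMN": [1, 2], "feat_1": [0.2, 0.3]})
phe = phenonaut.Phenonaut(df, features=["feat_1"])

# Adds a 'Well' column containing letter-number IDs such as A1, A2, ...
phe.add_well_id()
print(phe.df[["ROW", "COLUMN", "Well"]])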

aggregate_dataset(composite_identifier_columns: list[str], datasets: Iterable[int] | Iterable[str] | int | str = -1, new_names_or_prefix: list[str] | tuple[str] | str = 'Aggregated_', inplace: bool = False, transformation_lookup: dict[str, Callable | str] | None = None, tranformation_lookup_default_value: str | Callable = 'mean')

Aggregate multiple or single phenonaut dataset rows

If we have a Phenonaut object containing data derived from 2 fields of view from a microscopy image, a sensible approach is averaging features. If we have the DataFrame below, we may merge FOV 1 and FOV 2, taking the mean of all features. As strings such as filenames should be kept, they are concatenated together, separated by a comma, unless the strings are the same, in which case just one is used.

Consider a DataFrame as follows:

ROW | COLUMN | BARCODE | feat_1 | feat_2 | feat_3 | filename  | FOV
1   | 1      | Plate1  | 1.2    | 1.2    | 1.3    | FileA.png | 1
1   | 1      | Plate1  | 1.3    | 1.4    | 1.5    | FileB.png | 2
1   | 1      | Plate2  | 5.2    | 5.1    | 5      | FileC.png | 1
1   | 1      | Plate2  | 6.2    | 6.1    | 6.8    | FileD.png | 2
1   | 2      | Plate1  | 0.1    | 0.2    | 0.3    | FileE.png | 1
1   | 2      | Plate1  | 0.2    | 0.2    | 0.38   | FileF.png | 2

With just this loaded into a Phenonaut object, we can call:

phe.aggregate_dataset(['ROW', 'COLUMN', 'BARCODE'])

which will merge fields of view and produce a second dataset in the phe object containing:

ROW | COLUMN | BARCODE | feat_1 | feat_2 | feat_3 | filename            | FOV
1   | 1      | Plate1  | 1.25   | 1.3    | 1.40   | fileA.png,FileB.png | 1.5
1   | 1      | Plate2  | 5.70   | 5.6    | 5.90   | FileC.png,FileD.png | 1.5
1   | 2      | Plate1  | 0.15   | 0.2    | 0.34   | FileF.png,fileE.png | 1.5

If inplace=True is passed in the call to aggregate_dataset, then the Phenonaut object will contain just one dataset: the new, aggregated dataset.

Parameters:
  • composite_identifier_columns (list[str]) – If a biochemical assay evaluated through imaging is identified by a row, column, and barcode (for the plate) but multiple images taken from a well, then these multiple fields of view can be merged, creating averaged features using row, column and barcode as the composite identifier on which to merge fields of view.

  • datasets (Union[list[int], list[str], int, str]) – Which datasets to apply the aggregation to. If int, then the dataset with that index undergoes aggregation. If a string, then the dataset with that name undergoes aggregation. It may also be a list or tuple of mixed int and string types, with ints specifying dataset indexes and strings indicating dataset names. By default, this value is -1, indicating that the last added dataset should undergo aggregation.

  • new_names_or_prefix (Union[list[str], tuple[str], str]) – If a list or tuple of strings is passed, then use them as the names for the new datasets after aggregation. If a single string is passed, then use this as a prefix for the new dataset. By default “Aggregated_”.

  • inplace (bool) – Perform the aggregation in place, overwriting the original dataframes. By default False.

  • transformation_lookup (dict[str, Union[Callable, str]]) – Dictionary mapping data types to aggregations. When None, it is as if the dictionary: {np.dtype("O"): lambda x: ",".join([f"{item}" for item in set(x)])} was provided, concatenating strings together (separated by a comma) if they differ and using just one if they are the same across rows. If a type not present in the dictionary is encountered (such as int or float in the above example), then the default specified by tranformation_lookup_default_value is used. By default, None.

  • tranformation_lookup_default_value (Union[str, Callable]) – Transformation to apply if the data type is not found in the transformation_lookup dictionary. Can be a callable, or a string mapping to one of pandas' built-in aggregation functions. By default "mean".
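The worked example above may be reproduced with a short sketch along these lines (column and feature names as shown in the tables above):

import pandas as pd
import phenonaut

df = pd.DataFrame(
    {
        "ROW": [1, 1, 1, 1, 1, 1],
        "COLUMN": [1, 1, 1, 1, 2, 2],
        "BARCODE": ["Plate1", "Plate1", "Plate2", "Plate2", "Plate1", "Plate1"],
        "feat_1": [1.2, 1.3, 5.2, 6.2, 0.1, 0.2],
        "feat_2": [1.2, 1.4, 5.1, 6.1, 0.2, 0.2],
        "feat_3": [1.3, 1.5, 5.0, 6.8, 0.3, 0.38],
        "filename": ["FileA.png", "FileB.png", "FileC.png", "FileD.png", "FileE.png", "FileF.png"],
        "FOV": [1, 2, 1, 2, 1, 2],
    }
)
phe = phenonaut.Phenonaut(df, features=["feat_1", "feat_2", "feat_3"])

# Mean-aggregate fields of view sharing the same ROW, COLUMN and BARCODE
phe.aggregate_dataset(["ROW", "COLUMN", "BARCODE"])
print(phe.df)  # the aggregated dataset is the last added dataset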

clone_dataset(existing_dataset: Dataset | str | int, new_dataset_name: str, overwrite_existing: bool = False) None

Clone a dataset into a new dataset

Parameters:
  • existing_dataset (Union[Dataset, str, int]) – The name or index of an existing Phenonaut Dataset held in the Phenonaut object. Can also be a Phenonaut.Dataset object passed directly.

  • new_dataset_name (str) – A name for the new cloned Dataset.

  • overwrite_existing (bool, optional) – If a dataset by this name exists, then overwrite it, otherwise, an exception is raised, by default False.

Raises:
  • ValueError – Dataset by the name given already exists and overwrite_existing was False.

  • ValueError – The existing_dataset argument should be a str, int or Phenonaut.Dataset.

combine_datasets(dataset_ids_to_combine: list[str] | list[int] | None = None, new_name: str | None = None, features: list | None = None)

Combine multiple datasets into a single dataset

Often, large datasets are split across multiple CSV files. For example, one CSV file per screening plate. In this instance, it is prudent to combine the datasets into one.

Parameters:
  • dataset_ids_to_combine (Optional[Union[list[str], list[int]]]) – List of dataset indexes, or list of names of datasets to combine. For example, after loading in 2 datasets, the list [0,1] would be given, or a list of their names resulting in a new third dataset in datasets[2]. If None, then all present datasets are used for the merge. By default, None.

  • new_name (Optional[str]) – Name that should be given to the newly created dataset. If None, then it is assigned as: "Combined_dataset from datasets[DS_INDEX_LIST]", where DS_INDEX_LIST is a list of the combined dataset indexes.

  • features (list, optional) – List of features which should be used by the newly created dataset. If None, then the features of the combined datasets are used. By default None.

Raises:

DataError – Error raised if the combined datasets do not have the same features.
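A brief sketch of combining per-plate datasets (the CSV filenames here are hypothetical placeholders; both files are assumed to share the same feature columns):

import phenonaut

phe = phenonaut.Phenonaut(name="screen")

# Hypothetical per-plate CSV files sharing the same feature columns
phe.load_dataset("plate 1", "plate1.csv", features=["feat_1", "feat_2"])
phe.load_dataset("plate 2", "plate2.csv", features=["feat_1", "feat_2"])

# Combine both into a third dataset; passing None would also combine all loaded datasets
phe.combine_datasets([0, 1], new_name="all plates")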

property data: Dataset

Return the data of the highest index in phenonaut.datasets

Calling phe.data is the same as calling phe.ds.data

Returns:

Last added/highest indexed Dataset features

Return type:

np.ndarray

Raises:

DataError – No datasets loaded

property df: DataFrame

Return the pd.DataFrame of the last added/highest indexed Dataset

Returns the internal pd.Dataframe of the Dataset contained within the Phenonaut instance’s datasets list.

Returns:

pd.DataFrame of the last added/highest indexed Dataset.

Return type:

pd.DataFrame

property ds: Dataset

Return the dataset with the highest index in phenonaut.datasets

Returns:

Last added/highest indexed Dataset

Return type:

Dataset

Raises:

DataError – No datasets loaded

get_dataset_combinations(min_datasets: int | None = None, max_datasets: int | None = None, return_indexes: bool = False)

Get tuple of all dataset name combinations, picking 1 to n datasets

Returns all combinations of between 1 and n dataset names, where n is the number of loaded datasets. This is useful in multiomics settings where we test A, B, and C alone, A&B, A&C, B&C, and finally A&B&C.

A limit on the number of datasets in a combination can be imposed using the max_datasets argument. In the example above with datasets A, B and C, passing max_datasets=2 would return the following tuple: ((A), (B), (C), (A, B), (A, C), (B, C)), leaving out the triple-length combination (A, B, C).

Similarly, the argument min_datasets can specify a lower limit on the number of dataset combinations.

Using the example with datasets A, B, and C, and setting min_datasets=2 with no limit on max_datasets on the above example would return the following tuple: ((A, B), (A, C), (B, C), (A, B, C))

If return_indexes is True, then the indexes of Datasets are returned. As directly above, datasets A, B, and C, setting min_datasets=2 with no limit on max_datasets and passing return_indexes=True would return the following tuple: ((0, 1), (0, 2), (1, 2), (0, 1, 2))

Parameters:
  • min_datasets (Optional[int], optional) – Minimum number of datasets to return in a combination. If None, then it behaves as if 1 is given, by default None.

  • max_datasets (Optional[int], optional) – Maximum number of datasets to return in a combination. If None, then it behaves as if len(datasets) is given, by default None.

  • return_indexes (bool) – Return indexes of Datasets, instead of their names, by default False.
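For example, a small sketch demonstrating the combinations returned for three loaded datasets (named A, B and C as in the description above; the data itself is illustrative):

import pandas as pd
import phenonaut

dfs = [pd.DataFrame({"feat_1": [0.1, 0.2]}) for _ in range(3)]
phe = phenonaut.Phenonaut(
    dfs, metadata={"features": ["feat_1"]}, dataframe_name=["A", "B", "C"]
)

# All name combinations of two or more of the three datasets
print(phe.get_dataset_combinations(min_datasets=2))
# Expected, per the description above:
# (('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'B', 'C'))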

get_dataset_index_from_name(name: str | list[str] | tuple[str]) int | list[int]

Get dataset index from name

Given the name of a dataset, return the index of it in datasets list. Accepts single string query, or a list/tuple of names to return lists of indices.

Parameters:

name (Union[str, list[str], tuple[str]]) – If string, then this is the dataset name being searched for. Its index in the datasets list will be returned. If a list or tuple of names, then the index of each is searched and an index list returned.

Returns:

If name argument is a string, then the dataset index is returned. If name argument is a list or tuple, then a list of indexes for each dataset name index is returned.

Return type:

Union[int, list[int]]

Raises:

ValueError – Error raised if no datasets were found to match a requested name.

get_dataset_names() list[str]

Get a list of dataset names

Returns:

List containing the names of datasets within this Phenonaut object.

Return type:

list[str]

get_df_features_perturbation_column(ds_index=-1, quiet: bool = False) tuple[DataFrame, list[str], str | None]

Helper function to obtain DataFrame, features and perturbation column name.

Some Phenonaut functions allow passing of a Phenonaut object, or DataSet. They then access the underlying pd.DataFrame for calculations. This helper function is present on Phenonaut objects and Dataset objects, allowing more concise code and less replication when obtaining the underlying data. If multiple Datasets are present, then the last added Dataset is used to obtain data, but this behaviour can be changed by passing the ds_index argument.

Parameters:
  • ds_index (int, optional) – Index of the Dataset to be returned. By default -1, which uses the last added Dataset.

  • quiet (bool) – When checking if the perturbation column is set, check without raising a warning if it is None.

Returns:

Tuple containing the Dataframe, a list of features and the perturbation column name.

Return type:

tuple[pd.DataFrame, list[str], str]
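A short sketch of typical use (column names are illustrative; the perturbation column is unset here, so None is returned for it):

import pandas as pd
import phenonaut

phe = phenonaut.Phenonaut(
    pd.DataFrame({"compound": ["DMSO", "drugA"], "feat_1": [0.1, 0.8]}),
    features=["feat_1"],
)

# Unpack the DataFrame, feature list and perturbation column name of the
# last added Dataset (quiet=True avoids a warning when the column is unset)
df, features, perturbation_column = phe.get_df_features_perturbation_column(quiet=True)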

get_hash_dictionary() dict

Returns dictionary containing SHA256 hashes

Returns a dictionary of base64 encoded UTF-8 strings representing the SHA256 hashes of datasets (along with names), combined datasets, and the Phenonaut object (including name).

Returns:

Dictionary of base64 encoded SHA256 representing datasets and the Phenonaut object which created them.

Return type:

dict

groupby_datasets(by: str | List[str], ds_index=-1, remove_original=True)

Perform a groupby operation on a dataset

Akin to performing a groupby operation on a pd.DataFrame, this splits a dataset by the unique values of the given column(s), optionally keeping or (by default) removing the original dataset.

Parameters:
  • by (Union[str, list]) – Columns in the dataset’s DataFrames which should be used for grouping

  • ds_index (int, optional) – Index of the Dataset to be returned. By default -1, which uses the last added Dataset.

  • remove_original (bool, optional) – If True, then the original split dataset is deleted after splitting
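As a sketch, splitting a dataset per plate barcode and then recombining it with merge_datasets (documented below); the column name and values are illustrative:

import pandas as pd
import phenonaut

df = pd.DataFrame(
    {"BARCODE": ["Plate1", "Plate1", "Plate2"], "feat_1": [0.1, 0.2, 0.9]}
)
phe = phenonaut.Phenonaut(df, features=["feat_1"])

# Split the single dataset into one dataset per unique BARCODE value
phe.groupby_datasets("BARCODE")
print(phe.keys())

# Recombine all per-plate datasets into a single dataset
phe.merge_datasets("all")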

keys() list[str]

Return a list of all dataset names

Returns:

List of dataset names, empty list if no datasets are loaded.

Return type:

list(str)

classmethod load(filepath: str | Path) Phenonaut

Class method to load a compressed Phenonaut object

Loads a gzipped Python pickle containing a Phenonaut object

Parameters:

filepath (Union[str, Path]) – Location of gzipped Phenonaut object pickle

Returns:

Loaded Phenonaut object.

Return type:

phenonaut.Phenonaut

Raises:

FileNotFoundError – File not found, unable to load pickled Phenonaut object.

load_dataset(dataset_name: str, input_file_path: Path | str, metadata: dict | None = None, h5_key: str | None = None, features: list[str] | None = None)

Load a dataset from a CSV, optionally supplying metadata and a name

Parameters:
  • dataset_name (str) – Name to be assigned to the dataset

  • input_file_path (Union[Path, str]) – CSV/TSV/H5 file location

  • metadata (dict, optional) – Metadata dictionary describing the CSV data format, by default None

  • h5_key (Optional[str]) – If input_file_path is an h5 file, then a key to access the target DataFrame must be supplied.

  • features (Optional[list[str]]) – Optionally supply a list of features here. If None, then the features/feature-finding related keys in metadata are used. You may also supply an empty list to explicitly specify that the dataset has no features, although this is not recommended.
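A minimal sketch (the CSV filename is a placeholder; screening_data.csv is assumed to contain feat_1 and feat_2 columns):

import phenonaut

phe = phenonaut.Phenonaut(name="screen")
phe.load_dataset(
    "screen plate 1",
    "screening_data.csv",
    features=["feat_1", "feat_2"],
)
print(phe.get_dataset_names())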

merge_datasets(datasets: List[Dataset] | List[int] | Literal['all'] = 'all', new_dataset_name: str | None = 'Merged Dataset', return_merged: bool = False, remove_merged: bool | None = True)

Merge datasets

After performing a groupby operation on Phenonaut.Dataset objects, a list of datasets may be merged into a single Dataset using this method.

Parameters:
  • datasets (Union[List[Dataset], List[int], List[str]]) – Datasets which should be grouped together. May be a list of Datasets, in which case these are merged together and inserted into the Phenonaut object, or a list of integers or dataset names which will be used to look up datasets in the current Phenonaut object. Mixing of ints and dataset string identifiers is acceptable, but mixing of any identifier and Datasets is not supported. If 'all', then all datasets in the Phenonaut object are merged. By default 'all'.

  • return_merged (bool) – If True the merged dataset is returned by this function. If False, then the new merged dataset is added to the current Phenonaut object. By default False

  • remove_merged (bool) – If True, and return_merged is False (meaning the new merged dataset is added to the Phenonaut object), then the source datasets held in the current object (addressed by index or name in the datasets list) are removed from the Phenonaut object. By default True.

new_dataset_from_query(name, query: str, query_dataset_name_or_index: int | str = -1, raise_error_on_empty: bool = True, overwrite_existing: bool = False)

Add new dataset through a pandas query of existing dataset

Parameters:
  • query (str) – The pandas query used to select the new dataset

  • name (str) – A name for the new dataset

  • query_dataset_name_or_index (Union[int, str], optional) – The dataset to be queried, can be an int index, or the name of an existing dataset. List indexing can also be used, such that -1 uses the last dataset in Phenonaut.datasets list, by default -1.

  • raise_error_on_empty (bool) – Raise a ValueError if the query returns an empty dataset. By default True.

  • overwrite_existing (bool) – If a dataset already exists with the name given in the name argument, then this argument can be used to overwrite it, by default False.
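For example, a sketch creating a new dataset containing only rows matching a pandas query (column names and values are illustrative):

import pandas as pd
import phenonaut

df = pd.DataFrame(
    {"compound": ["DMSO", "drugA", "drugA"], "feat_1": [0.1, 0.8, 0.9]}
)
phe = phenonaut.Phenonaut(df, features=["feat_1"])

# New dataset containing only 'drugA' rows, selected via a pandas query
phe.new_dataset_from_query("drugA only", "compound == 'drugA'")
print(phe.get_dataset_names())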

revert() None

Revert a Phenonaut object that was recently saved to its previous state

Upon calling save on a Phenonaut object, the record stores the output file location, allowing a quick way to revert changes by calling .revert(). This returns the object to its saved state.

Raises:

FileNotFoundError – File not found, Phenonaut object has never been written out.

save(output_filename: str | Path, overwrite_existing: bool = False) None

Save Phenonaut object and contained Data to a pickle

Writes a gzipped Python pickle file. If no compression, or another compression format is required, then the user should use a custom pickle.dump and not rely on this helper function.

Parameters:
  • output_filename (Union[str, Path]) – Output filename for the gzipped pickle

  • overwrite_existing (bool, optional) – If True and the file exists, overwrite it. By default False.
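A save/load round trip might look like the following sketch (the output filename is arbitrary):

import pandas as pd
import phenonaut

phe = phenonaut.Phenonaut(
    pd.DataFrame({"feat_1": [0.1, 0.2]}), features=["feat_1"]
)

# Write a gzipped pickle, then load it back as a new Phenonaut object
phe.save("phe.pkl.gz", overwrite_existing=True)
phe_reloaded = phenonaut.Phenonaut.load("phe.pkl.gz")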

subtract_median_perturbation(perturbation_label: str, per_column_name: str | None = None, new_features_prefix: str = 'SMP_')

Subtract the median perturbation from all features for all datasets.

Useful for normalisation within a well/plate format. The median perturbation may be identified through the per_column_name variable and the perturbation label. Newly generated features may have their prefixes controlled via the new_features_prefix argument.

Parameters:
  • perturbation_label (str) – The perturbation label which should be used to calculate the median

  • per_column_name (Optional[str], optional) – The perturbation column name. This is optional and can be None, as the Dataset may already have perturbation column set. By default, None.

  • new_features_prefix (str) – Prefix for new features, each with the median perturbation subtracted. By default ‘SMP_’ (for subtracted median perturbation).
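As a sketch, subtracting the median DMSO profile from all rows (the perturbation column name and label are illustrative):

import pandas as pd
import phenonaut

df = pd.DataFrame(
    {"compound": ["DMSO", "DMSO", "drugA"], "feat_1": [0.1, 0.3, 0.9]}
)
phe = phenonaut.Phenonaut(df, features=["feat_1"])

# Subtract the median 'DMSO' profile from every row, writing new SMP_-prefixed features
phe.subtract_median_perturbation("DMSO", per_column_name="compound")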

phenonaut.phenonaut.random() → x in the interval [0, 1).

phenonaut.utils module

phenonaut.utils.check_path(p: Path | str, is_dir: bool = False, make_parents: bool = True, make_dir_if_dir: bool = True) Path

Check a user supplied path (str or Path), ensuring parent directories exist, etc.

Parameters:
  • p (Union[Path, str]) – File or directory path supplied by user

  • is_dir (bool, optional) – If the path supplied by the user should be a directory, then set it as such by assigning is_dir to true, by default False

  • make_parents (bool, optional) – If the parent directories of the supplied path do not exist, then make them, by default True

  • make_dir_if_dir (bool, optional) – If the supplied path is a directory, but it does not exist, then make it, by default True

Returns:

Path object pointing to the user supplied path, with parents made (if requested), and the directory itself made (if a directory and requested)

Return type:

Path

Raises:
  • ValueError – Parent does not exist, and make_parents was False

  • ValueError – Passed path was not a string or pathlib.Path
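For example (the paths below are illustrative):

from phenonaut.utils import check_path

# Ensure an output directory exists, creating it and any missing parents
output_dir = check_path("results/run_1", is_dir=True)

# Ensure the parent directory of an output file exists before writing to it
output_csv = check_path("results/run_1/scores.csv")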

phenonaut.utils.load_dict(file_path: str | Path | None, cast_none_to_dict=False)

phenonaut.workflow module

class phenonaut.workflow.Workflow(workflow_path: Path | str | dict)

Bases: object

Phenonaut Workflows allow operation through simple YAML workflows.

Workflows may be defined in Phenonaut, and the module executed directly, rather than imported and used by a Python program. Workflows are defined using the simple YAML file format. However, due to the way in which they are read in, JSON files may also be used. As YAML files can contain multiple YAML entries, we build on this concept, allowing multiple workflows to be defined in a single YAML (or JSON) file. Once read in, workflows are dictionaries. From Python 3.6 onwards, dictionaries are ordered. We can therefore define our workflows in order and guarantee that they will be executed in the defined order. A dictionary defining workflows has the following structure: {job_name: task_list}, where job_name is a string and task_list is a list defining callable functions, or tasks, required to complete the job.

The job list takes the form of a list of dictionaries, each containing only one key, which is the name of the task to be performed. The value indexed by this key is a dictionary of argument:value pairs to be passed to the function responsible for performing the task. The structure is best understood with an example. Here, we see a simple workflow contained within a YAML file for calculation of the scalar projection phenotypic metric. YAML files start with 3 dashes.

---
scalar_projection_example:
- load:
    file: screening_data.csv
    metadata:
        features_prefix:
            - feat_
- scalar_projection:
    target_treatment_column_name: control
    target_treatment_column_value: pos
    output_column_label: target_phenotype
- write_multiple_csvs:
    split_by_column: PlateID
    output_dir: scalar_projection_output

The equivalent JSON with clearer (for Python programmers) formatting for the above is:

{
    "scalar_projection_example": [
        {
            "load": {
            "file": "screening_data.csv",
            "metadata": {
            "features_prefix": ["feat_"]}
            }
        },
        {
            "scalar_projection": {
                "target_treatment_column_name": "control",
                "target_treatment_column_value": "pos",
                "output_column_label": "target_phenotype",
            }
        },
        {"write_multiple_csvs":{
            "split_by_column": "PlateID",
            "output_dir": "scalar_projection_output/"
            }
        }
    ]
}

The workflow defined above in the example YAML and JSON formats has the name "scalar_projection_example" and consists of 3 commands:

  1. load

  2. scalar_projection

  3. write_multiple_csvs

See the user guide for a full listing of commands.

Parameters:

workflow_path (Union[Path, str, dict]) – Workflows can be defined in YML or JSON files, with their locations supplied as a Path or str, or jobs may be passed directly as dictionaries. Dictionary keys denote the job names. Values under these keys should be lists of dictionaries. Each dictionary should have one key, denoting the name of the task, and values under this key contain options for the called functions/tasks.

Raises:

TypeError – Supplied Path or str to file location does not appear to be a YAML or JSON file.
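Workflows may also be constructed and run from Python rather than via the command line. A minimal sketch, assuming the YAML example above has been saved as workflow.yml (a hypothetical filename) and that the workflow is executed explicitly via run_workflow:

from phenonaut.workflow import Workflow

# Run the scalar projection workflow shown above
wf = Workflow("workflow.yml")
wf.run_workflow()

# Equivalently, a workflow may be supplied as a dictionary
wf = Workflow(
    {
        "example_job": [
            {"load": {"file": "screening_data.csv",
                      "metadata": {"features_prefix": ["feat_"]}}},
            {"write_csv": {"path": "output.csv"}},
        ]
    }
)
wf.run_workflow()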

VIF_filter_features(arguments: dict)

Workflow function: Perform VIF feature filter

Designed to be called from a workflow, performs variance inflation factor (VIF) filtering on a dataset, removing redundant features whose removal is not detrimental to capturing the variance of the dataset. More information available: https://en.wikipedia.org/wiki/Variance_inflation_factor

This can be a computationally expensive process, as the number of linear regressions required scales almost quadratically (N^2) with the number of features.

Parameters:

arguments (dict) –

  • target_dataset:

    Index or name of dataset which should have variance inflation filter applied. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

  • vif_cutoff:

    float or int indicating the VIF cutoff to apply. A good balance and value often used is 5.0. If this key:value pair is absent, then behaviour is as if 5.0 was supplied.

  • min_features:

    removal of too many features can be detrimental. Setting this value sets a lower limit on the number of features which must remain. If absent, then behaviour is as if a value of 2 was given.

  • drop_columns:

    value is a boolean, denoting if columns should be dropped from the data table, as well as being removed from features. If not supplied, then the behaviour is as if False was supplied.

add_well_id(arguments: dict)

Workflow function: Add well IDs

Designed to be called from a workflow. Often, we would like to use well and column numbers to resolve a more traditional alpha-numeric WellID notation, such as A1, A2, etc. This can be achieved through calling this workflow function.

If a dataset contains numerical row and column names, then they may be translated into standard letter-number well IDs. The arguments dictionary may contain the following keys, with their values denoted as below:

numerical_column_name : str, optional

Name of the column containing the numeric column number. If not supplied, then behaves as if "COLUMN".

numerical_row_name : str, optional

Name of the column containing the numeric row number. If not supplied, then behaves as if "ROW".

plate_type : int, optional

Plate type - note, at present, only 384 well plate format is supported. If not supplied, then behaves as if 384.

new_well_column_name : str, optional

Name of the new column containing the letter-number well ID. If not supplied, then behaves as if "Well".

add_empty_wells : bool, optional

Should all wells from a plate be inserted, even when missing from the data. If not supplied, then behaves as if False.

plate_barcode_column : str, optional

Multiple plates may be in a dataset; this column contains their unique ID. If not supplied, then behaves as if None.

no_sort : bool, optional

Do not resort the dataset by well ID. If not supplied, then behaves as if False.

Parameters:

arguments (dict) – Dictionary containing arguments to the Dataset.add_well_id function, see API documentation for further details, or function help.

cityblock_distance(arguments: dict)

Workflow function: Add a column for the cityblock distance to a target perturbation.

Designed to be called from a workflow, calculates the cityblock distance in feature space. Also known as the Manhattan distance.

Parameters:

arguments (dict, should contain:) –

target_dataset

Index or name of dataset which should be used in the measurement. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

target_perturbation_column_name

normally a ‘control’ column

target_perturbation_column_value:

value to be found in the column defined previously.

output_column_label:

Output column for the measurement. If this is missing, then it is set to target_perturbation_column_value.

copy_column(arguments: dict)

Workflow function: copy a dataset column

Designed to be called from a workflow, copies the values of one column within a dataset to another. The arguments dictionary can contain 'to' and 'from' keys with values for column names, or alternatively, simply from:to key-value pairs denoting how to perform the copy operation.

Parameters:

arguments (dict) –

Options for the command. Should include either:

1. A dictionary with keys "to" and "from", whose values name the columns to copy to and from.

2. A dictionary of the form {from_column: to_column}, which will copy the column titled from_column to to_column.

Note, if any dictionary items (to) are lists, then multiple copies will be made.

Raises:

KeyError – Column was not found in the Pandas DataFrame.

euclidean_distance(arguments: dict)

Workflow function: Add a column for the euclidean distance to a target perturbation.

Designed to be called from a workflow, calculates the euclidean distance in feature space.

Parameters:

arguments (dict, should contain:) –

target_dataset

Index or name of dataset which should be used in the measurement. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

target_perturbation_column_name

normally a ‘control’ column

target_perturbation_column_value:

value to be found in the column defined previously.

output_column_label:

Output column for the measurement. If this is missing, then it is set to target_perturbation_column_value.

filter_columns(arguments: dict)

Workflow function: filter columns

Designed to be called from a workflow. Datasets may have columns defined for keeping or removal. This function also provides a convenient way to reorder dataframe columns.

Parameters:

arguments (dict) –

Dictionary of options, can include the following keys:
keep: bool, optional, by default True

Only matching columns are kept if true. If false, they are removed.

column_names: [list, str]

List of columns to keep (or regular expressions to match)

column_name: str

Singular column to keep (or regular expressions to match)

regex: bool, optional, by default False.

perform regular expression matching

filter_correlated_and_VIF_features(arguments: dict)

Workflow function: Filter features by highly correlated then VIF.

Designed to be called from a workflow. Ideally, VIF would be applied to very large datasets. However, due to the almost N^2 number of linear regressions required as features increase, this is not possible on datasets with a large number of features - such as methylation datasets. We therefore must use other methods to reduce the features to a level at which VIF can be performed. This function calculates correlations between all features (Pearson correlation coefficient) and iteratively removes features with the highest R^2 against another feature. Once the number of features is reduced to a level suitable for VIF, VIF is performed.

More information available: https://en.wikipedia.org/wiki/Variance_inflation_factor

Parameters:

arguments (dict) –

  • target_dataset:

    Index or name of dataset which should have features filtered. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

  • n_before_vif :

    Number of features to remove before applying VIF. This is required when dealing with large datasets which would be too time consuming to process entirely with VIF. Features are removed iteratively, selecting the most correlated features and removing them. If this key:value pair is absent, then it is as if the value of 1000 has been supplied.

  • vif_cutoff :

    The VIF cutoff value, above which features are removed. Features with VIF scores above 5.0 are considered highly correlated. If not supplied, then behaviour is as if a value of 5.0 was supplied.

  • drop_columns :

    If drop columns is True, then not only will features be removed from the dataset features list, but the columns for these features will be removed from the dataframe. If absent, then behaviour is as if False was supplied.

filter_correlated_features(arguments: dict)

Workflow function: Perform filter of highly correlated features

Designed to be called from a workflow, performs filtering of highly correlated features (as calculated by the Pearson correlation coefficient), either by removal of features correlated above a given threshold, or by iterative removal of features with the highest R^2 against another feature. The arguments dictionary should contain a threshold or an n key:value pair, not both. A threshold key with a float value defines the correlation above which features should be removed. If the n key is present, then features are iteratively removed until n features remain.

Parameters:

arguments (dict) –

target_dataset

Index or name of dataset which should have features filtered. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

threshold

If this key is present, then it activates threshold mode, whereby features with calculated correlations above the threshold are removed. A good value for this threshold is 0.9.

n

If this key is present, then the number of features to keep is defined this way. The process works through iteratively removing features ordered by the most correlated until the number of features is equal to n. If threshold is also present, then n acts as a minimum number of features and feature removal will stop, no matter the correlations present in the dataset.

drop_columns : bool, optional

If drop columns is True, then not only will features be removed from the dataset features list, but the columns for these features will be removed from the dataframe. If absent, then the behaviour is as if False was supplied as a value to this key:value pair.

filter_rows(arguments: dict)

Workflow function: Filter rows

Designed to be called from a workflow, filter_rows allows keeping only rows with a certain value in a certain column. Takes as arguments a dictionary containing a query_column key:value pair and one of query_value, query_values or values key:value pairs:

query_column

name of the column that should match the value below

query_value

value to match

query_values

values to match (as a list)

values

values to match (as a list)

Additionally, a key “keep” with a boolean value may be included. If True then rows matching the query are kept, if False, then rows matching are discarded, and non-matching rows kept.

Parameters:

arguments (dict) – Dictionary containing query_column key and value defining the column name, and one of the following keys: query_value, query_values, values. If plural, then values under the key should be a list containing values to perform matching on, otherwise, singular value.

Raises:

DataError – [description]

if_blank_also_blank(arguments: dict)

Workflow function: if column is empty, also blank

Designed to be called from a workflow. Often it is required to clean or remove rows not needed for inclusion in further established pipelines/workflows. This workflow function allows values to be removed from a column on the condition that another column is empty.

Parameters:

arguments (dict) –

Dictionary containing the following key:value pairs:

query_column

value is the name of the column to perform the query on.

regex_query

value is a boolean denoting if the query column value should be matched using a regular expression. If omitted, then behaves as if present and False.

target_column

value is a string, denoting the name of the column which should be blanked.

target_columns

value is a list of strings, denoting the names of columns which should be blanked.

regex_targets

value is a boolean denoting if the target column or multiple target columns defined in target_columns should be matched using a regular expression. If absent, then behaves as if False was supplied.

Raises:
  • KeyError – ‘query_column’ not found in arguments dictionary

  • IndexError – Multiple columns matched query_column using the regex

  • KeyError – No target columns found for if_blank_also_blank, use target_column keys

load(arguments: dict)

Workflow function: load a dataset (CSV or PackagedDataset)

Workflow runnable function allowing loading of a dataset from CSV or a PackagedDataset. As with all workflow runnable functions, this is designed to be called from a workflow.

There are 2 possible options for loading in a dataset.

Firstly, loading a user supplied CSV file. This option is initiated through inclusion of the ‘file’ key within arguments. The value under the ‘file’ key should be a string or Path to the CSV file. In addition, a ‘metadata’ key is also required to be present in arguments, with a dictionary as the value. Within this dictionary under ‘metadata’, special keywords allow the reading of data in different formats. Special keys for the metadata dictionary are listed below (See Pandas documentation for read_csv for more in-depth information):

sep

define separator for fields within the file (by default ‘,’)

skiprows

define a number of rows to skip at the beginning of a file

header_row_number

may be a single number, or list of numbers denoting header rows.

transpose

In the case of some transcriptomics raw data, transposition is required to have samples row-wise. Therefore the table must be transposed. Set to True to transpose.

index_col

Column to use as the row labels of the CSV given as either the column names or as numerical indexes. Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.

key

If an h5 file is supplied, then this is the key to the underlying pandas dataframe.

Secondly, we may load a packaged dataset, by including the ‘dataset’ key within the arguments dictionary. The value under this key should be one of the current packaged datasets supported by workflow mode - currently TCGA, CMAP, Iris, Iris_2_views, BreastCancer. An additional key in the dictionary can be ‘working_dir’ with the value being a string denoting the location of the file data on a local filesystem, or the location that it should be downloaded to and stored.

Parameters:

arguments (dict) – Dictionary containing file and metadata keys, see function description/help/docstring.

mahalanobis_distance(arguments: dict)

Workflow function: Add a column for the Mahalanobis distance to target perturbations.

Designed to be called from a workflow, calculates the Mahalanobis distance in feature space.

Parameters:

arguments (dict, should contain:) –

target_dataset

Index or name of dataset which should be used in the measurement. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

target_perturbation_column_name

normally a ‘control’ column

target_perturbation_column_value:

value to be found in the column defined previously.

output_column_label:

Output column for the measurement. If this is missing, then it is set to target_perturbation_column_value.

manhattan_distance(arguments: dict)

Workflow function: Add a column for the Manhattan distance to target perturbation.

Designed to be called from a workflow, calculates the Manhattan distance in feature space. Also known as the cityblock distance.

Parameters:

arguments (dict, should contain:) –

target_dataset

Index or name of dataset which should be used in the measurement. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

target_perturbation_column_name

normally a ‘control’ column

target_perturbation_column_value:

value to be found in the column defined previously.

output_column_label:

Output column for the measurement. If this is missing, then it is set to target_perturbation_column_value.

pca(arguments: dict)

Workflow function: Perform PCA dimensionality reduction technique.

Designed to be called from a workflow, performs the principal component dimensionality reduction technique. If no arguments are given or the arguments dictionary is empty, then 2D PCA is applied to the dataset with the highest index, equivalent of phe[-1], which is usually the last inserted dataset.

Parameters:

arguments (dict) –

Dictionary of arguments used to direct the PCA process; can contain the following keys and values:

target_dataset

Index or name of dataset which should have the dimensionality reduction applied. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

ndims

Number of dimensions to which the PCA should reduce the features. If absent, then defaults to 2.

center_on_perturbation_id

PCA should be recentered on the perturbation with ID. If absent, then defaults to None, and no centering is performed.

center_by_median

If true, then median of center_on_perturbation is used, if False, then the mean is used.

fit_perturbation_ids

PCA may be fit to only the included IDs, before the transform is applied to the whole dataset.

rename_column(arguments: dict)

Workflow function: Rename column

Designed to be called from a workflow, renames a single column or multiple columns. The arguments dictionary should contain key:value pairs, where the key is the old column name and the value is the new column name.

Parameters:

arguments (dict) – Dictionary containing name_from:name_to key value pairs, which will cause the column named ‘name_from’ to be renamed ‘name_to’. Multiple columns can be renamed in a single call, using multiple dictionary entries.

Raises:

ValueError – ‘arguments’ was not a dictionary of type: str:str

rename_columns(arguments: dict)

Workflow function: Rename columns

Designed to be called from a workflow, renames a single column or multiple columns. The arguments dictionary should contain key:value pairs, where the key is the old column name and the value is the new column name.

Parameters:

arguments (dict) – Dictionary containing name_from:name_to key value pairs, which will cause the column named ‘name_from’ to be renamed ‘name_to’. Multiple columns can be renamed in a single call, using multiple dictionary entries.

Raises:

ValueError – ‘arguments’ was not a dictionary of type: str:str

run_workflow()

Run the workflow defined in the workflow object instance

scalar_projection(arguments: dict)

Workflow function: Add a column for the scalar projection to a target perturbation.

Designed to be called from a workflow, calculates the scalar projection and scalar rejection, quantifying on and off target phenotypes, as used in: Heiser, Katie, et al. “Identification of potential treatments for COVID-19 through artificial intelligence-enabled phenomic analysis of human cells infected with SARS-CoV-2.” BioRxiv (2020).

Parameters:

arguments (dict, should contain:) –

target_dataset

Index or name of dataset which should be used in the measurement. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

target_perturbation_column_name

normally a ‘control’ column

target_perturbation_column_value:

value to be found in the column defined previously.

output_column_label:

Output from the scalar projection will have the form: on_target_<output_column_label> and off_target_<output_column_label>. If this is missing, then it is set to target_perturbation_column_value.

scatter(arguments: dict)

Workflow function: Make scatter plot.

Designed to be called from a workflow, produce a scatter plot from a dataframe.

Parameters:

arguments (dict) –

target_dataset

Index or name of dataset which should have features plotted. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

figsize

A tuple denoting the target output size in inches (!), if absent, then the default of (8,6) is used.

title

Title for the plot, if absent, then the default “2D scatter” is used.

peturbations

Can be a list of perturbations - as denoted in the perturbations column of the dataframe - to include in the plot. If absent, then all perturbations are included.

destination

Output location for the PNG - required field. An error will be thrown if omitted.

set_perturbation_column(arguments: dict)

Workflow function: Set the perturbation column

Designed to be called from a workflow, the perturbation column can be set on a dataset to help with plotting/scatters.

Parameters:

arguments (dict) –

target_dataset:

Index or name of dataset within which we wish to set the perturbation column. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

column:

str type giving the new column name which will be set to mark perturbations.

tsne(arguments: dict)

Workflow function: Perform t-SNE dimensionality reduction technique.

Designed to be called from a workflow, performs the t-SNE dimensionality reduction technique. If no arguments are given or the arguments dictionary is empty, then 2D t-SNE is applied to the dataset with the highest index, equivalent of phe[-1], which is usually the last inserted dataset.

Parameters:

arguments (dict) –

Dictionary of arguments used to direct the t-SNE process; can contain the following keys and values:

target_dataset

Index or name of dataset which should have the dimensionality reduction applied. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

ndims

number of dimensions to which the t-SNE should reduce the features. If absent, then defaults to 2.

center_on_perturbation_id

t-SNE should be recentered on the perturbation with ID. If absent, then defaults to None, and no centering is performed.

center_by_median

If true, then median of center_on_perturbation is used, if False, then the mean is used.

umap(arguments: dict)

Workflow function: Perform UMAP dimensionality reduction technique.

Designed to be called from a workflow, performs the UMAP dimensionality reduction technique. If no arguments are given or the arguments dictionary is empty, then 2D UMAP is applied to the dataset with the highest index, equivalent of phe[-1], which is usually the last inserted dataset.

Parameters:

arguments (dict) –

Dictionary of arguments used to direct the UMAP transform function. Can contain the following keys and values.

target_dataset

Index or name of dataset which should have the dimensionality reduction applied. If absent, then behaviour is as if -1 is supplied, indicating the last added dataset.

ndims

Number of dimensions to which the UMAP should reduce the features. If absent, then defaults to 2.

center_on_perturbation_id

UMAP should be recentered on the perturbation with ID. If absent, then defaults to None, and no centering is performed.

center_by_median

If true, then median of center_on_perturbation is used, if False, then the mean is used.

write_csv(arguments: dict)

Workflow function: Write dataframe to CSV file.

Designed to be called from a workflow, writes a CSV file using the Pandas.DataFrame.to_csv function. Expects a dictionary as arguments containing a ‘path’ key, with a string value pointing at the destination location of the CSV file. Additional keys within the supplied dictionary are supplied to Pandas.DataFrame.to_csv as kwargs, allowing fully flexible output.

Parameters:

arguments (dict) – Dictionary should contain a ‘path’ key, may also contain a target_dataset key which if absent, defaults to -1 (usually the last added dataset).

write_multiple_csvs(arguments: dict)

Workflow function: Write multiple CSV files

Designed to be called from a workflow. Often it is useful to write a CSV file per plate within the dataset, or group the data by some other identifier.

Parameters:

arguments (dict) –

Dictionary, should contain:

split_by_column: str

the column to be split on

output_dir: str

the target output directory

file_prefix: str

optional prefix for each file.

file suffix: str

optional suffix for each file.

file_extension: str

optional file extension, by default ‘.csv’

phenonaut.workflow.predict(self, arguments: dict)

Workflow function: predict

Profile predictors in their ability to predict a given target.

Phenonaut provides functionality to profile the performance of multiple predictors against multiple views of data. This is exemplified in the TCGA example used in the Phenonaut paper - see Example 1 - TCGA for a full walkthrough of applying this functionality to The Cancer Genome Atlas. With a given 'target' for prediction which is present in the dataset, predict selects all appropriate predictors (classifiers for classification, regressors for regression, and multiregressors for multi-regression/view targets). Then, enumerating all views of the data and all predictors, hyperparameter optimisation coupled with 5-fold cross validation using Optuna is employed, before finally testing the best hyperparameter sets with retained test sets. This process is automatic and requires only the data and a prediction target. Output from this process is extensive and it may take a long time to complete, depending on the characteristics of your input data. Written output from the profiling process consists of performance heatmaps highlighting the best view/predictor combinations in bold, boxplots for each view combination, and a PPTX presentation file allowing easy sharing of results, along with machine-readable CSV and JSON output.

For each unique view combination and predictor, perform the following:

  • Merge views and remove samples which do not have features across currently needed views.

  • Shuffle the samples.

  • Withhold 20% of the data as a test set, to be tested against the trained and hyperparameter optimised predictor.

  • Split the data using 5-fold cross validation into train and validation sets.

  • For each fold, perform Optuna hyperparameter optimisation for the given predictor using the train sets, using hyperparameters described by the default predictors for classification, regression and multiregression.

Parameters:

arguments (dict) –

output_directory

Directory into which profiling output (boxplots, heatmaps, CSV, JSON and PPTX should be written).

dataset_combinations

If multiple datasets are already loaded, then lists of ‘views’ may be specified for exploration. If None, or this argument is absent, then all combinations of available views/Datasets are enumerated and used.

target

The prediction target, denoted by a column name given here which exists in loaded datasets.

n_splits

Number of splits to use in the N-fold cross validation, if absent, then the default of 5 is used.

n_optuna_trials

Number of Optuna trials for hyperparameter optimisation, by default 20. This drastically impacts runtime, so if things are taking too long, you may wish to lower this number. For a more thorough exploration of hyperparameter space, increase this number.

optuna_merge_folds

By default, each fold has its hyperparameters optimised and the trained predictor reported along with those parameters. If optuna_merge_folds is True, then training is carried out on each fold but hyperparameters are optimised across folds (not per-fold). Setting this to False may be useful depending on the intended use of the predictor. It is believed that when False, and parameters are not optimised across folds, more accurate estimates of prediction variance/accuracy are produced. If absent, behaves as if False.

test_set_fraction

When optimising a predictor, by default a fraction of the total data is held back for testing, separate from the train-validation splits. This test_set_fraction controls the size of this split. If absent, then the default value of 0.2 is used.
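A hypothetical workflow dictionary invoking predict from Python might look like the sketch below (the input file, feature prefix and 'target' column are placeholders for real data):

from phenonaut.workflow import Workflow

prediction_job = {
    "prediction_example": [
        {"load": {"file": "screening_data.csv",
                  "metadata": {"features_prefix": ["feat_"]}}},
        {"predict": {"output_directory": "prediction_output",
                     "target": "target",
                     "n_splits": 5,
                     "n_optuna_trials": 20}},
    ]
}

Workflow(prediction_job).run_workflow()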

Module contents