phenonaut.metrics package

Submodules

phenonaut.metrics.distances module

phenonaut.metrics.distances.euclidean(point1: List[float] | ndarray | DataFrame, point2: List[List[float]] | ndarray | DataFrame)

Measure the euclidean distance between 2 points

Parameters:
  • point1 (Union[List[float], np.ndarray, pd.DataFrame]) – Multidimensional point; can be a simple list such as [x, y, z], or a 2D M*N array where M points, each of N features, are individually measured, returning an array of measurements, one for each point.

  • point2 (Union[List[List[float]], np.ndarray, pd.DataFrame]) – List of N features, must have 1 dimension.

Returns:

Array (if point is 2D) of distances. If 1D, then single float value is returned indicating the euclidean distance between point1 and point2.

Return type:

[np.ndarray, float]

Raises:

DataError – Point1 can be 2D, but not 3D, and point2 must be 1D
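
A minimal usage sketch based on the behaviour described above (illustrative values only, not taken from the package’s own examples):

import numpy as np
from phenonaut.metrics.distances import euclidean

# 1D point vs 1D point returns a single float
print(euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0

# 2D M*N array of M points vs a single 1D point returns an array of M distances
points = np.array([[0.0, 0.0], [3.0, 4.0]])
print(euclidean(points, [0.0, 0.0]))  # [0.0, 5.0]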

phenonaut.metrics.distances.mahalanobis(point: List[float] | ndarray | DataFrame, cloud: List[List[float]] | ndarray | DataFrame, pvals: bool = False, covariance: ndarray | EmpiricalCovariance | MinCovDet | None = EmpiricalCovariance())

Measure the Mahalanobis distance between a point and a cloud

The Mahalanobis distance calculation is particularly sensitive to outliers, which result in large changes to the calculated covariance matrix. For this reason, robust covariance estimators may be supplied to the method. Whilst a common recommendation is to take the square root of the Mahalanobis distance, this is an approximation to euclidean space, only correct when the covariance matrix is an identity matrix. As Phenonaut is concerned with operating on high-dimensional space, which very likely has covariances present, the returned distance is not square rooted, returning D2 as noted in:

https://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/euclid

Optionally, the p-value for the point or points belonging to the cloud can be returned.

Parameters:
  • point (Union[List[float], np.ndarray, pd.DataFrame]) – Multidimensional point; can be a simple list of features, or a 2D M*N array where M points, each of N features, are individually measured, returning an array of measurements, one for each point.

  • cloud (Union[List[List[float]], np.ndarray, pd.DataFrame]) – 2D M*N array-like set of M points, with N features from which the underlying target distribution will be measured.

  • pvals (bool) – If True, then p-value for the point (or points) belonging to cloud is returned. This is calculated using Chi2 and degrees of freedom calculated as N-1, where N is the number of features/dimensions. If point was one dimensional, then a single floating point value is returned. If it is 2D (MxN matrix), then an array of length M is returned. By default, False.

  • covariance (Optional[Union[np.ndarray, EmpiricalCovariance]], optional) – By default, the covariance matrix is calculated using scikit-learn’s EmpiricalCovariance. This is fairly robust to outliers, and much more robust than the standard approach of calculating a covariance matrix using numpy’s np.cov method. Robust estimators may be used by passing in an instantiated object which is a subclass of EmpiricalCovariance (like sklearn.covariance.MinCovDet, or EmpiricalCovariance itself). If None, then numpy’s cov method is used (sensitive to outliers). By default EmpiricalCovariance().

Returns:

Array (if point is 2D) of distances. If 1D, then single float value is returned indicating the Mahalanobis distance of the point to the cloud.

Return type:

[np.ndarray, float]
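
A usage sketch assuming the behaviour described above; the cloud is randomly generated purely for illustration:

import numpy as np
from sklearn.covariance import MinCovDet
from phenonaut.metrics.distances import mahalanobis

rng = np.random.default_rng(7)
cloud = rng.normal(size=(100, 5))  # 100 observations of 5 features
point = np.zeros(5)

# Default EmpiricalCovariance estimator, returning the squared distance D2
d2 = mahalanobis(point, cloud)

# Robust covariance estimator and p-value for membership of the cloud
p = mahalanobis(point, cloud, pvals=True, covariance=MinCovDet())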

phenonaut.metrics.distances.manhattan(point1: List[float] | ndarray | DataFrame, point2: List[List[float]] | ndarray | DataFrame)

Measure the Manhattan distance between 2 points

Parameters:
  • point1 (Union[List[float], np.ndarray, pd.DataFrame]) – Multidimensional point; can be a simple list such as [x, y, z], or a 2D M*N array where M points, each of N features, are individually measured, returning an array of measurements, one for each point.

  • point2 (Union[List[List[float]], np.ndarray, pd.DataFrame]) – List of N features, must have 1 dimension.

Returns:

Array (if point is 2D) of distances. If 1D, then single float value is returned indicating the Manhattan distance between point1 and point2.

Return type:

[np.ndarray, float]

Raises:

DataError – Point1 can be 2D, but not 3D, and point2 must be 1D

phenonaut.metrics.distances.treatment_spread_euclidean(data: Dataset, perturbation_column: str | None = None, perturbations: List[str] | None = None) dict[str, float]

Calculate the euclidean spread of perturbation repeats

Returns:

Dictionary with perturbations as keys and euclidean distances as values.

Return type:

dict[str, float]

Raises:

DataError – Perturbation column was not set for the given Dataset and was not supplied via the perturbation_column argument to treatment_spread_euclidean.
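
A minimal usage sketch; ds is assumed to be a phenonaut Dataset, and the column name "compound_name" is hypothetical:

from phenonaut.metrics.distances import treatment_spread_euclidean

# Uses the Dataset's perturbation_column if set; otherwise supply it explicitly
spread = treatment_spread_euclidean(ds, perturbation_column="compound_name")
for perturbation, distance in spread.items():
    print(perturbation, distance)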

phenonaut.metrics.measures module

phenonaut.metrics.measures.feature_correlation_to_target(dataset: Dataset | Phenonaut | DataFrame, target_feature: str, features: list[str] | None = None, method: str = 'pearson', return_dataframe: bool = True)

Calculate correlation coefficients for features to a column

Sometimes we may wish to identify highly correlated features with a given property.

In this example, we use a subset of the Iris dataset:

sepal length (cm)   sepal width (cm)   petal length (cm)   petal width (cm)   target
5.4                 3.4                1.7                 0.2                0
7.2                 3.0                5.8                 1.6                2
6.4                 2.8                5.6                 2.1                2
4.8                 3.1                1.60                2                  0
5.6                 2.5                3.9                 1.1                1

We may wish to determine which features are correlated with the “petal length (cm)” feature, which can be achieved by calling this feature_correlation_to_target function, allowing the return of a pd.DataFrame or a simple dictionary containing feature names as keys, and the coefficients as values.

import tempfile
from phenonaut import Phenonaut
with tempfile.NamedTemporaryFile(mode = "w") as tmp:
    tmp.write("sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target\n5.4,3.4,1.7,0.2,0\n7.2,3.0,5.8,1.6,2\n6.4,2.8,5.6,2.1,2\n4.8,3.1,1.60,2,0\n5.6,2.5,3.9,1.1,1\n")
    tmp.flush()
    phe=Phenonaut()
    phe.load_dataset("Flowers",tmp.name, {'features_regex':".*(width|length).*"})

from phenonaut.metrics import feature_correlation_to_target
print(feature_correlation_to_target(phe, 'petal length (cm)'))

Returns a pd.DataFrame containing correlation coefficients:

index                correlation_to_petal length (cm)
sepal length (cm)    0.914639
petal width (cm)     0.448289
sepal width (cm)     -0.544665

The optional dictionary, returned by calling the function with the additional return_dataframe parameter set to False:

print(feature_correlation_to_target(phe, 'petal length (cm)', return_dataframe=False))

has the form:

{'petal width (cm)': 0.448289248746271, 'sepal length (cm)': 0.9146393603234955, 'sepal width (cm)': -0.5446646166252519}
Parameters:
  • dataset (Union[Dataset, Phenonaut, DataFrame]) – The Phenonaut Dataset, pd.DataFrame, or Phenonaut object (containing only one dataset) on which to perform correlation calculations.

  • target_feature (str) – The feature, or metadata column that all correlations should be calculated against.

  • features (Optional[list[str]], optional) – List of features to include in the correlation calculations. If None and a Dataset is supplied, then that Dataset’s features are used. If a pd.DataFrame is supplied, then features must be supplied. By default None.

  • method (str, optional) – Method used to calculate the correlation coefficient. Can be ‘pearson’, ‘kendall’ for the Kendall Tau correlation coefficient, or ‘spearman’ for the Spearman rank correlation. By default “pearson”.

  • return_dataframe (bool, optional) – If True, then a pd.DataFrame containing correlations is returned. If False, then a dictionary is returned, containing feature names as keys, and the coefficients as values. By default True.

Returns:

Return a pd.DataFrame with calculated correlation coefficients, or alternatively, a dictionary containing feature names as keys, and the coefficients as values.

Return type:

Union[pd.DataFrame, dict]

Raises:
  • ValueError – Phenonaut objects must contain only one dataset if passed to this function.

  • ValueError – Target feature not found in the supplied dataset.

  • ValueError – DataFrame supplied, but no features.

  • ValueError – target_feature not found in the dataset.

  • TypeError – Supplied dataset was not of type Phenonaut, Dataset, or pd.DataFrame

phenonaut.metrics.measures.scalar_projection(dataset: Dataset, target_perturbation_column_name='control', target_perturbation_column_value='pos', output_column_label='pos', norm=True)
Calculates the scalar projection and scalar rejection, quantifying on- and off-target phenotypes, as used in: Heiser, Katie, et al. “Identification of potential treatments for COVID-19 through artificial intelligence-enabled phenomic analysis of human cells infected with SARS-CoV-2.” BioRxiv (2020).

Parameters:
  • dataset (Dataset) – Phenonaut dataset being queried

  • target_TreatmentID ([type]) – TreatmentID of target phenotype. The median of all features across wells containing this phenotype is used.
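
A minimal usage sketch, assuming ds is a phenonaut Dataset containing a 'control' metadata column in which positive-control wells are marked 'pos' (column and value names follow the defaults in the signature above):

from phenonaut.metrics.measures import scalar_projection

# Quantify on- and off-target phenotypes against the positive control wells;
# results are recorded under the supplied output_column_label
scalar_projection(
    ds,
    target_perturbation_column_name="control",
    target_perturbation_column_value="pos",
    output_column_label="pos",
)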

phenonaut.metrics.non_ds_phenotypic_metrics module

class phenonaut.metrics.non_ds_phenotypic_metrics.PhenotypicMetric(name: str, method: Callable | str, range: Tuple[float, float], higher_is_better: bool = True)

Bases: object

Metrics evaluate one profile/feature vector against another

SciPy and other libraries traditionally supply distance metrics, like Manhattan, Euclidean etc. These are typically unbounded in their maximum value, but not always; for example, cosine distance has a maximum dissimilarity of 1. Scientific literature is also full of similarity metrics, where a high value indicates most similarity - the opposite of a distance metric. This dataclass coerces metrics into a standard form, with .similarity and .distance functions to turn any metric into a similarity or distance metric.

This allows the definition of something like the Zhang similarity metric, which ranges from -1 to 1, indicating most dissimilarity and most similarity respectively. Calling the metric defined by this Zhang function will return the traditional Zhang metric value - ranging from -1 to 1.

The methods similarity and distance will also be added

Calling distance will return a value between 0 and 1, with 0 being most similar and 1 being most dissimilar.

Calling similarity will return a value between 0 and 1, with 1 being the most similar and 0 being the most different.

__call__(anchor, query)

Call self as a function.

distance(anchor, query)
scale(score)
similarity(anchor, query)
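
A sketch of wrapping an existing metric, assuming the constructor arguments and scaling behaviour described above; scipy’s cosine distance is used purely as an illustration:

import numpy as np
from scipy.spatial.distance import cosine
from phenonaut.metrics.non_ds_phenotypic_metrics import PhenotypicMetric

# Cosine distance ranges from 0 to 2, with lower values meaning more similar
cosine_metric = PhenotypicMetric("cosine", cosine, (0, 2), higher_is_better=False)

a = np.array([1.0, 0.5, 0.1])
b = np.array([0.9, 0.4, 0.2])
raw = cosine_metric(a, b)             # the traditional metric value
sim = cosine_metric.similarity(a, b)  # scaled to 0..1, 1 = most similar
dist = cosine_metric.distance(a, b)   # scaled to 0..1, 0 = most similar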
phenonaut.metrics.non_ds_phenotypic_metrics.calc_connectivity_scores(anchor: array, queries: array) float | ndarray
phenonaut.metrics.non_ds_phenotypic_metrics.calc_spearmansrank_scores(anchor: array, queries: array) float | ndarray
phenonaut.metrics.non_ds_phenotypic_metrics.calc_zhang_scores(anchor: array, queries: array) float | ndarray

Calculate Zhang scores between two np.ndarrays

Implementation of the Zhang method for comparing L1000/CMAP signatures. Zhang, Shu-Dong, and Timothy W. Gant. “A simple and robust method for connecting small-molecule drugs using gene-expression signatures.” BMC bioinformatics 9.1 (2008): 1-10. Implemented by Steven Shave, following above paper as a reference https://doi.org/10.1186/1471-2105-9-258

Parameters:
  • anchor (np.array) – Anchor profiles/features. Can be a MxN matrix, allowing M sequences to be queried against queries (using N features).

  • queries (np.array) – Candidate profiles/features. Can be a MxN matrix, allowing M candidate sequences to be evaluated against anchor sequences.

Returns:

If anchor and candidate array ndims are both 1, then a single float representing the Zhang score is returned. If one input array has ndims of 2 (and the other has ndims of 1), then a 1-D np.ndarray is returned. If both inputs are 2-D, then a 2D MxN array is returned, where M is the number of anchor profiles and N is the number of query profiles.

Return type:

Union[float, np.ndarray]
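
A minimal sketch using randomly generated profiles for illustration:

import numpy as np
from phenonaut.metrics.non_ds_phenotypic_metrics import calc_zhang_scores

rng = np.random.default_rng(1)
anchor = rng.normal(size=100)        # a single profile of 100 features
queries = rng.normal(size=(5, 100))  # 5 candidate profiles

scores = calc_zhang_scores(anchor, queries)  # 1-D array of 5 Zhang scores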

phenonaut.metrics.non_ds_phenotypic_metrics.calc_zhang_scores_all_v_all(anchor: array)

phenonaut.metrics.performance module

phenonaut.metrics.performance.mp_value_score(ds: Dataset | Phenonaut, ds_groupby: str | List[str], reference_perturbation_query: str, pca_explained_variance: float = 0.99, std_scaler_columns: bool = True, std_scaler_rows: bool = False, n_iters: int = 1000, random_state: int = 42, raise_error_for_low_count_groups: bool = True)

Get mp-value score performance DataFrame for a dataset

Implementation of the mp-value score from the paper:

Hutz JE, Nelson T, Wu H, et al. The Multidimensional Perturbation Value: A Single Metric to Measure Similarity and Activity of Treatments in High-Throughput Multidimensional Screens. Journal of Biomolecular Screening. 2013;18(4):367-377. doi:10.1177/1087057112469257.

The paper mentions normalising by rows as well as columns. This is not appropriate for some data types like DRUG-seq, and so this is not enabled by default. Additionally, a default fraction explained variance for the PCA operation has been set to 0.99 so that the PCA may explain 99 % of variance.

This implementation differs somewhat from the one in pycytominer_eval, which deviates from the paper definition and does not perform a mixin of the covariance matrices for treatment and control.

Parameters:
  • ds (Union[Dataset, Phenonaut]) – Phenonaut dataset or Phenonaut object upon which to perform the mp_value_score calculation. If a Phenonaut object is passed, then the dataset at position -1 (usually the last added) is used

  • ds_groupby (Union[str, List[str]]) – Pandas style groupby to apply on the ds. Normally this is the column name of a unique compound identifier. Can also be a list, containing the unique compound identifier column name, along with a concentration or timepoint column.

  • reference_perturbation_query (str) – Pandas style query which may be run on ds to extract the reference set of points in phenotypic space, against which all other grouped perturbations are compared.

  • pca_explained_variance (float) – This argument is passed to scikit’s PCA object and specifies the fraction of variance that the returned components should capture. The original paper aims for 90 % explained variance; we aim by default for 99 %. Should be expressed as a float between 0 and 1. By default 0.99

  • std_scaler_columns (bool) – Apply standard scaler to columns. By default True

  • std_scaler_rows (bool) – Apply standard scaler to rows. By default False

  • n_iters (int) – Number of iterations to perform in the statistical test to derive the p-value, by default 1000

  • n_jobs (int, optional) – Calculations will be run in parallel by providing the number of processors to use. If n_jobs is None, then this is autodetected by the system. By default None

  • random_state (int) – Random seed to use for initialisation of rng, enabling reproducible runs

  • raise_error_for_low_count_groups (bool) – Calculation of mp_value scores requires more than three samples to be in each group. If raise_error_for_low_count_groups is True, then an error is raised upon encountering such a group as no mp_value score can be calculated. If False, then a simple warning is printed and the returned p-value and mahalanobis distance in the results dataframe are both np.nan. By default True
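
A usage sketch, assuming a Phenonaut object phe whose last dataset has a hypothetical 'compound' identifier column and DMSO-treated reference wells marked in a hypothetical 'control' column:

from phenonaut.metrics.performance import mp_value_score

# Group by compound identifier and compare each group against DMSO reference wells
mp_scores = mp_value_score(
    phe,
    ds_groupby="compound",
    reference_perturbation_query="control == 'DMSO'",
    n_iters=1000,
    random_state=42,
)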

phenonaut.metrics.performance.percent_compact(ds: Dataset | Phenonaut | DataFrame, perturbation_column: str | None = None, replicate_criteria: str | list[str] | None = None, replicate_query: str | None = None, replicate_criteria_not: str | list[str] | None = None, null_query_or_df: str | DataFrame | None = None, null_criteria: str | list[str] | None = None, null_criteria_not: str | list[str] | None = None, restrict_evaluation_query: str | None = None, features: list[str] | None = None, n_iters: int = 1000, similarity_metric: str | Callable = 'spearman', similarity_metric_higher_is_better: bool = True, min_cardinality: int = 2, max_cardinality: int = 50, include_cardinality_violating_compounds_in_calculation: bool = False, return_full_performance_df: bool = False, additional_captured_params: dict | None = None, similarity_metric_name: str | None = None, performance_df_file: str | Path | None = None, percentile_cutoff: int | None = None, use_joblib_parallelisation: bool = True, n_jobs: int = -1)

Calculate percent compact

Compactness is defined by the spread of compound repeats compared to a randomly sampled background distribution. For a given compound, its cardinality (number of replicates), referred to as C, is determined. Then the median distance of all replicates is determined. This is then compared to a randomly sampled background. This background is obtained as follows: select C random compounds, calculate their median pairwise distances to each other, and store this. Repeat the process 1000 times and build a distribution of matched cardinality to the replicating compound. The replicate treatment is deemed compact if its score is less than the 5th percentile of the background distribution (for distance metrics), or greater than the 95th percentile (for similarity metrics). Percent compact is simply the percentage of compounds which pass this compactness test.

Matching distributions are created by matching perturbations. In Phenonaut, this is typically defined by the perturbation_column field. This function takes this field as an argument, although it is unused if found in the passed Dataset/Phenonaut object. Additional criteria to match on, such as concentration and well position, can be added using the replicate_criteria argument. Null distributions are composed of a pick of C unique compounds, where C is the cardinality of the matched repeats (how many), and their median ‘similarity’. By default, this similarity is the Spearman correlation coefficient between profiles. This process to generate median similarity for non-replicate compounds is repeated n_iters times (by default 1000).

As the calculation is demanding, the function makes use of the joblib library for parallel calculation of the null distribution.

Parameters:
  • ds (Union[Dataset, Phenonaut, pd.DataFrame],) – Input data in the form of a Phenonaut dataset, a Phenonaut object, or a pd.DataFrame containing profiles of perturbations. If a Phenonaut object is supplied, then the last added Dataset will be used.

  • perturbation_column (Optional[str]) – This argument sets the column name containing an identifier for the perturbation name (or identifier), usually the name of a compound or similar. If a Phenonaut object or Dataset is supplied as the ds argument and this perturbation_column argument is None, then this value is attempted to be discovered through interrogation of the perturbation_column property of the Dataset. In the case of the CMAP_Level4 PackagedDataset, a standard run would be achieved by passing ‘pert_iname’ as an argument here, or disregarding the value found in this argument by providing a dataset with the perturbation_column property already set. By default None.

  • replicate_criteria (Optional[Union[str, list[str]]]=None) – As noted above describing the impact of the perturbation_column argument, matching compounds are often defined by their perturbation name/identifier and dose. Whilst the perturbation column matches the compound name/identifier (something which must always be matched), passing a string here containing the title of a dose column (for example) also enforces matching on this property. A list of strings may also be passed. In the case of the PackagedDataset CMAP_Level4, this argument would take the value “pert_idose_uM” and would ensure that matched replicates share a common identifier/name (as default and enforced by perturbation_column) and concentration. The original perturbation_column may be included here but has no effect. By default None.

  • replicate_query (Optional[str]=None) – Optional pandas query to apply in selection of the matching replicates; this may be something like ensuring concentration is above a threshold, or that they are taken from certain timepoints. Please note, if information in rows is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running compactness on it. This parameter should not be used to restrict the compounds on which compactness is run, as this is inefficient. Instead, the restrict_evaluation_query should be used.

  • replicate_criteria_not (Optional[Union[str, list[str]]]=None) – Values in this list enforce that matching replicates do NOT share a common property. This is useful for exotic evaluations, like picking replicates from across cell lines and concentrations.

  • null_query_or_df (Optional[str, pd.DataFrame]=None) – Optional pandas query to apply in selection of the non-matching replicates comprising the null distribution. This can be things like ensuring only certain plates or cell lines are used in construction of the distribution. Alternatively, a pd.DataFrame may be supplied here, from which the non-matching compounds are drawn for creation of the null distribution. Note; if supplying a query to filter out information in rows that is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running compactness on it. This argument does not override null_criteria or null_criteria_not, as these take effect after this argument’s effects have been applied. Has no effect if None. By default, None.

  • null_criteria (Optional[Union[str, list[str]]]) – Whilst matching compounds are often defined by perturbation and dose, compounds comprising the null distribution must sometimes match the well position of the original compound. This argument captures the column name defining properties of non-matching replicates which must match the original query compound. In the case of the CMAP_Level4 PackagedDataset, this argument would take the value “well”. A list of strings may also be passed to enforce further fields within the matching distribution which must match in the null distribution. If None, then no requirements apart from a different name/compound identifier are enforced. By default None.

  • null_criteria_not (Optional[Union[str, list[str]]]) – Values in this list enforce that matching non-replicates do NOT share a common property with the chosen replicates. The opposite of the above described null_criteria, this allows exotic evaluations like picking replicates from different cell lines to the matching replicates. Has no effect if None. By default None.

  • restrict_evaluation_query (Optional[str], optional) – If only a few compounds in a Phenonaut Dataset are to be included in the compactness calculation, then this parameter may be used to efficiently select only the required compounds using a standard pandas style query which is run on groups defined by replicate_criteria. Excluded compounds are not removed from the Dataset, only from evaluation in the compactness calculation. If None, then has no effect. By default None.

  • features (Optional[list[str]]) – Features list which capture the phenotypic responses to perturbations. Only required and used if a pd.DataFrame is supplied in place of the ds argument. By default None.

  • n_iters (int, optional) – Number of times the non-matching compound replicates should be sampled to compose the null distribution. If less than n_iters are available, then take as many as possible. By default 1000.

  • similarity_metric (Union[str, Callable, PhenotypicMetric], optional) – Callable metric, or string which is passed to pdist. This should be a distance metric; that is, lower is better, higher is worse. Note, a special case exists whereby ‘spearman’ may be supplied here; if so, then a much faster NumPy method (np.corrcoef) is used, and the results are subtracted from 1 to turn the metric into a distance metric. By default ‘spearman’.

  • similarity_metric_higher_is_better (bool) – If True, then a high value from the supplied similarity metric is better. If False, then a lower value is better (as is the case for distance metrics like Euclidean/Manhattan etc). Note that if lower is better, then the percentile cutoff should be moved to the other end of the distribution. For example, to keep significance at the 5 % level, a metric where higher is better would use the 95th percentile, whereas a metric where lower is better would use the 5th percentile (percentile_cutoff = 5). By default True.

  • min_cardinality (int) – Cardinality is the number of times a treatment is repeated (treatment with matching well, dose and any other constraints imposed). This argument sets the minimum number of treatment repeats that should be present; if not, then the group is excluded from the calculation. The behaviour of cytominer-eval includes all single repeat measurements, marking them as non-replicating; this behaviour is replicated by setting this argument to 2. If 2, then only compounds with 2 or more repeats are included in the calculation and have the possibility of generating a score for comparison to a null distribution, and potentially passing the compactness test of being greater than the Nth percentile of the null distribution. By default 2.

  • max_cardinality (int) – If a dataset has thousands of matched repeats, then little is gained in finding pairwise all-to-all distances of non-matching compounds; this argument allows setting an upper bound cutoff, after which the repeats are shuffled and max_cardinality samples drawn to create a synthetic set of max_cardinality repeats. This is very useful when using a demanding similarity method, as it cuts the number of evaluations dramatically. By default 50.

  • include_cardinality_violating_compounds_in_calculation (bool) – If True, then compounds for which there are no matching replicates, or not enough as defined by min_cardinality (and are therefore deemed not compact) are included in the final reported compactness statistics. If False, then they are not included as non-compact and simply ignored as if they were not present in the dataset. By default False.

  • return_full_performance_df (bool) – If True, then a tuple is returned with the compactness score in the first position, and a pd.DataFrame containing full information on each repeat. By default False

  • performance_df_file (Optional[Union[str, Path, bool]]) – If return_full_performance_df is True and a Path or str is given as an argument to this parameter, then the performance DataFrame is written out to a CSV file using this filename. If True is passed here, then a filename will be constructed from function arguments, attempting to capture the run details. If an auto-generated file with this name exists, then an error is raised and no calculations are performed. In addition to the output CSV, a json file is also written capturing arguments that the function was called with. So if ‘compactness_results.csv’ is passed here, then a file named compactness_results.json will be written out. If a filename is autogenerated, then the autogenerated filename is adapted to have the ‘.json’ file extension. If the argument does not end in ‘.csv’, then .json is appended to the end of the filename to define the name of the json file. By default, None.

  • additional_captured_params (Optional[dict]) – If writing out full details, also include this dictionary in the output json file, useful to add metadata to runs. By default None.

  • similarity_metric_name (Optional[str]) – If relying on the function to make a nice performance CSV file name, then a nice succinct similarity metric name may be passed here, rather than relying upon calling __repr__ on the function, which may return long names such as: ‘bound method Phenotypic_Metric.similarity of Manhattan’. By default None.

  • percentile_cutoff (Optional[int]) – Percentile of the null distribution over which the matching replicates must score to be considered compact. Should range from 0 to 100. Normally this is 95 (when using a similarity metric where higher is better), but if using a metric where lower is better, then it should be set to 5. To make things easier, this parameter defaults to None, in which case it takes the value 95 if similarity_metric_higher_is_better==True, and 5 if similarity_metric_higher_is_better==False. By default None.

  • use_joblib_parallelisation (bool) – If True, then use joblib to parallelise evaluation of compounds. By default True.

  • n_jobs (int, optional) – The n_jobs argument is passed to joblib for parallel execution and defines the number of threads to use. A value of -1 denotes that the system should determine how many jobs to run. By default -1.

Returns:

If return_full_performance_df is False, then only the percent compact statistic is returned. If True, then a tuple is returned, with percent compact in the first position, and a pd.DataFrame in the second position containing the median repeat scores, as well as median null distribution scores in an easy to analyse format.

Return type:

Union[float, tuple[float, pd.DataFrame]]
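
A usage sketch following the CMAP_Level4-style example referenced in the parameter descriptions above; ds is assumed to be a Dataset whose perturbation_column is already set:

from phenonaut.metrics.performance import percent_compact

# Match replicates on compound identifier (perturbation_column) and dose,
# returning the full per-compound performance DataFrame alongside the score
pc, pc_df = percent_compact(
    ds,
    replicate_criteria="pert_idose_uM",
    similarity_metric="spearman",
    n_iters=1000,
    return_full_performance_df=True,
)
print(f"Percent compact: {pc:.2f}")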

phenonaut.metrics.performance.percent_replicating(ds: Dataset | Phenonaut | DataFrame, perturbation_column: str | None = None, replicate_query: str | None = None, replicate_criteria: str | list[str] | None = None, replicate_criteria_not: str | list[str] | None = None, null_query_or_df: str | DataFrame | None = None, null_criteria: str | list[str] | None = None, null_criteria_not: str | list[str] | None = None, restrict_evaluation_query: str | None = None, features: list[str] | None = None, n_iters: int = 1000, similarity_metric: str | Callable = 'spearman', similarity_metric_higher_is_better: bool = True, min_cardinality: int = 2, max_cardinality: int = 50, include_cardinality_violating_compounds_in_calculation: bool = False, return_full_performance_df: bool = False, include_replicate_pairwise_distances_in_df: bool = False, additional_captured_params: dict | None = None, similarity_metric_name: str | None = None, performance_df_file: str | Path | None = None, percentile_cutoff: int | None = None, use_joblib_parallelisation: bool = True, n_jobs: int = -1, random_state: int | Generator = 42)

Calculate percent replicating

Percent replicating is defined by Way et al. in: Way, Gregory P., et al. “Morphology and gene expression profiling provide complementary information for mapping cell state.” Cell systems 13.11 (2022): 911-923. or on bioRxiv: https://www.biorxiv.org/content/10.1101/2021.10.21.465335v2

Helpful descriptions also exist in https://github.com/cytomining/cytominer-eval/issues/21#issuecomment-902934931

This implementation is designed to work with a variety of phenotypic similarity methods, not just a Spearman correlation coefficient between observations.

Matching distributions are created by matching perturbations. In Phenonaut, this is typically defined by the perturbation_column field. This function takes this field as an argument, although it is unused if found in the passed Dataset/Phenonaut object. Additional criteria to match on, such as concentration and well position, can be added using the replicate_criteria argument. Null distributions are composed of a pick of C unique compounds, where C is the cardinality of the matched repeats (how many), and their median ‘similarity’. By default, this similarity is the Spearman correlation coefficient between profiles. This process to generate median similarity for non-replicate compounds is repeated n_iters times (by default 1000). Once the null distribution has been collected (median pairwise similarities), the median similarity of matched replicates is compared to the 95th percentile of this null distribution. If it is greater, then the compound (or compound and dose) is deemed replicating. Null distributions may not contain the matched compound. The percent replicating is calculated from the number of matched repeats which were replicating versus the number which were not.

As the calculation is demanding, the function makes use of the joblib library for parallel calculation of the null distribution.

Parameters:
  • ds (Union[Dataset, Phenonaut, pd.DataFrame],) – Input data in the form of a Phenonaut dataset, a Phenonaut object, or a pd.DataFrame containing profiles of perturbations. If a Phenonaut object is supplied, then the last added Dataset will be used.

  • perturbation_column (Optional[str]) – In the standard % replicating calculation, compounds are matched by name (or identifier) and dose, although this can be relaxed. This argument sets the column name containing an identifier for the perturbation name (or identifier), usually the name of a compound or similar. If a Phenonaut object or Dataset is supplied as the ds argument and this perturbation_column argument is None, then this value is attempted to be discovered through interrogation of the perturbation_column property of the Dataset. In the case of the CMAP_Level4 PackagedDataset, a standard run would be achieved by passing ‘pert_iname’ as an argument here, or disregarding the value found in this argument by providing a dataset with the perturbation_column property already set. By default None.

  • replicate_query (Optional[str]=None) – Optional pandas query to apply in selection of the matching replicates; this may be something like ensuring concentration is above a threshold, or that they are taken from certain timepoints. Please note, if information in rows is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running percent_replicating on it. This parameter should not be used to restrict the compounds on which percent replicating is run, as this is inefficient. Instead, the restrict_evaluation_query should be used.

  • replicate_criteria (Optional[Union[str, list[str]]]=None) – As noted above describing the impact of the perturbation_column argument, matching compounds are often defined by their perturbation name/identifier and dose. Whilst the perturbation column matches the compound name/identifier (something which must always be matched), passing a string here containing the title of a dose column (for example) also enforces matching on this property. A list of strings may also be passed. In the case of the PackagedDataset CMAP_Level4, this argument would take the value “pert_idose_uM” and would ensure that matched replicates share a common identifier/name (as default and enforced by perturbation_column) and concentration. The original perturbation_column may be included here but has no effect. By default None.

  • replicate_criteria_not (Optional[Union[str, list[str]]]=None) – Values in this list enforce that matching replicates do NOT share a common property. This is useful for exotic evaluations, like picking replicates from across cell lines and concentrations.

  • null_query_or_df (Optional[str, pd.DataFrame]=None) – Optional pandas query to apply in selection of the non-matching replicates comprising the null distribution. This can be things like ensuring only certain plates or cell lines are used in construction of the distribution. Alternatively, a pd.DataFrame may be supplied here, from which the non-matching compounds are drawn for creation of the null distribution. Note; if supplying a query to filter out information in rows that is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running percent_replicating on it. This argument does not override null_criteria or null_criteria_not, as these take effect after this argument’s effects have been applied. Has no effect if None. By default, None.

  • null_criteria (Optional[Union[str, list[str]]]) – Whilst matching compounds are often defined by perturbation and dose, compounds comprising the null distribution must sometimes match the well position of the original compound. This argument captures the column name defining properties of non-matching replicates which must match the original query compound. In the case of the CMAP_Level4 PackagedDataset, this argument would take the value “well”. A list of strings may also be passed to enforce further fields within the matching distribution which must match in the null distribution. If None, then no requirements apart from a different name/compound identifier are enforced. By default None.

  • null_criteria_not (Optional[Union[str, list[str]]]) – Values in this list enforce that matching non-replicates do NOT share a common property with the chosen replicates. The opposite of the above described null_criteria, this allows exotic evaluations like picking replicates from different cell lines to the matching replicates. Has no effect if None. By default None.

  • restrict_evaluation_query (Optional[str], optional) – If only a few compounds in a Phenonaut Dataset are to be included in the percent replicating calculation, then this parameter may be used to efficiently select only the required compounds using a standard pandas style query which is run on groups defined by replicate_criteria. Excluded compounds are not removed from the Dataset, only from evaluation in the percent replicating calculation. If None, then has no effect. By default None.

  • features (Optional[list[str]]) – Features list which capture the phenotypic responses to perturbations. Only required and used if a pd.DataFrame is supplied in place of the ds argument. By default None.

  • n_iters (int, optional) – Number of times the non-matching compound replicates should be sampled to compose the null distribution. If less than n_iters are available, then take as many as possible. By default 1000.

  • similarity_metric (Union[str, Callable, PhenotypicMetric], optional) – Callable metric, or string which is passed to pdist. This should be a distance metric; that is, lower is better, higher is worse. Note, a special case exists whereby ‘spearman’ may be supplied here; if so, then a much faster NumPy method (np.corrcoef) is used, and the results are subtracted from 1 to turn the metric into a distance metric. By default ‘spearman’.

  • similarity_metric_higher_is_better (bool) – If True, then a high value from the supplied similarity metric is better. If False, then a lower value is better (as is the case for distance metrics like Euclidean/Manhattan etc). Note that if lower is better, then the percentile cutoff should be moved to the other end of the distribution. For example, to keep significance at the 5 % level, a metric where higher is better would use the 95th percentile, whereas a metric where lower is better would use the 5th percentile (percentile_cutoff = 5). By default True.

  • min_cardinality (int) – Cardinality is the number of times a treatment is repeated (treatment with matching well, dose and any other constraints imposed). This argument sets the minimum number of treatment repeats that should be present; if not, then the group is excluded from the calculation. The behaviour of cytominer-eval includes all single repeat measurements, marking them as non-replicating; this behaviour is replicated by setting this argument to 2. If 2, then only compounds with 2 or more repeats are included in the calculation and have the possibility of generating a score for comparison to a null distribution, and potentially passing the replicating test of being greater than the Nth percentile of the null distribution. By default 2.

  • max_cardinality (int) – If a dataset has thousands of matched repeats, then little is gained in finding pairwise all-to-all distances of non-matching compounds; this argument allows setting an upper bound cutoff, after which the repeats are shuffled and max_cardinality samples drawn to create a synthetic set of max_cardinality repeats. This is very useful when using a demanding similarity method, as it cuts the number of evaluations dramatically. By default 50.

  • include_cardinality_violating_compounds_in_calculation (bool) – If True, then compounds for which there are no matching replicates, or not enough as defined by min_cardinality (and are therefore deemed not replicating) are included in the final reported percent replicating statistics. If False, then they are not included as non-replicating and simply ignored as if they were not present in the dataset. By default False.

  • return_full_performance_df (bool) – If True, then a tuple is returned with the percent replicating score in the first position, and a pd.DataFrame containing full information on each repeat. By default False

  • include_replicate_pairwise_distances_in_df (bool) – If True, then pairwise replicate distances are included in the full performance dataframe. Has no effect if return_full_performance_df is False

  • performance_df_file (Optional[Union[str, Path, bool]]) – If return_full_performance_df is True and a Path or str is given as an argument to this parameter, then the performance DataFrame is written out to a CSV file using this filename. If True is passed here, then a filename will be constructed from function arguments, attempting to capture the run details. If an auto-generated file with this name exists, then an error is raised and no calculations are performed. In addition to the output CSV, a json file is also written capturing arguments that the function was called with. So if ‘pr_results.csv’ is passed here, then a file named pr_results.json will be written out. If a filename is autogenerated, then the autogenerated filename is adapted to have the ‘.json’ file extension. If the argument does not end in ‘.csv’, then .json is appended to the end of the filename to define the name of the json file. By default, None.

  • additional_captured_params (Optional[dict]) – If writing out full details, also include this dictionary in the output json file, useful to add metadata to runs. By default None.

  • similarity_metric_name (Optional[str]) – If relying on the function to make a nice performance CSV file name, then a nice succinct similarity metric name may be passed here, rather than relying upon calling __repr__ on the function, which may return long names such as: ‘bound method Phenotypic_Metric.similarity of Manhattan’. By default None.

  • percentile_cutoff (Optional[int]) – Percentile of the null distribution over which the matching replicates must score to be considered replicating. Should range from 0 to 100. Normally this is 95 (when using a similarity metric where higher is better), but if using a metric where lower is better, then it should be set to 5. To make things easier, this parameter defaults to None, in which case it takes the value 95 if similarity_metric_higher_is_better==True, and 5 if similarity_metric_higher_is_better==False. By default None.

  • use_joblib_parallelisation (bool) – If True, then use joblib to parallelise evaluation of compounds. By default True.

  • n_jobs (int, optional) – The n_jobs argument is passed to joblib for parallel execution and defines the number of threads to use. A value of -1 denotes that the system should determine how many jobs to run. By default -1.

  • random_state (Union[int, np.random.Generator]) – Random state which should be used when performing sampling operations. Can be a np.random.Generator, or an int (in which case a np.random.Generator is instantiated with it). If attempting reproducible results, run without parallelisation by setting the use_joblib_parallelisation argument to False. By default 42

Returns:

If return_full_performance_df is False, then only the percent replicating is returned. If True, then a tuple is returned, with percent replicating in the first position, and a pd.DataFrame in the second position containing the median repeat scores, as well as median null distribution scores in an easy to analyse format.

Return type:

Union[float, tuple[float, pd.DataFrame]]
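
A usage sketch of the CMAP_Level4-style run described in the parameter descriptions above, matching on compound name and dose, and requiring null-distribution compounds to share the query compound’s well:

from phenonaut.metrics.performance import percent_replicating

pr, pr_df = percent_replicating(
    ds,
    perturbation_column="pert_iname",
    replicate_criteria="pert_idose_uM",
    null_criteria="well",
    return_full_performance_df=True,
    performance_df_file="pr_results.csv",  # also writes pr_results.json
)
print(f"Percent replicating: {pr:.2f}")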

phenonaut.metrics.performance.silhouette_score(ds: Dataset | Phenonaut | DataFrame, perturbation_column: str | None, replicate_criteria: str | list[str] | None = None, features: list[str] | None = None, similarity_metric: str | Callable = 'euclidean', similarity_metric_higher_is_better: bool = True, return_full_performance_df: bool = False)

phenonaut.metrics.utils module

phenonaut.metrics.utils.percent_compact_summarise_results(file_or_dir: Path | str, dir_glob: str = 'pc_*.csv', get_percent_compact: bool = True) None | DataFrame

Summarise percent compact results

Parameters:
  • file_or_dir (Union[Path, str]) – Percent compact results file, or directory containing results files.

  • dir_glob (str, optional) – Glob to use in searching directories for percent compact results files, by default “pc_*.csv”.

  • get_percent_compact (bool) – If True, then calculate and return percent compact in the table. Has no effect if run on a file (not a directory). By default True.

Returns:

Either a tuple containing the percent compact and number of records contributing to that score (if run on a single file), or a pd.DataFrame summarising results (if run on a directory).

Return type:

Union[Tuple[float, int], pd.DataFrame]

phenonaut.metrics.utils.percent_replicating_results_dataframe_to_95pct_confidence_interval(df: DataFrame, percentile_cutoff: int | float | list[int | float] = 95, n_resamples: int = 1000, similarity_metric_higher_is_better: bool = True, n_jobs: int = -1) list[tuple[float, float]] | tuple[float, float]

Get confidence interval at given percentile cutoff for percent replicating results

Reads a DataFrame from phenonaut.metrics.performance.percent_replicating and performs bootstrapping, sampling from the null distribution to assign a confidence interval at the given cutoff, or list of cutoffs. Returns a tuple containing upper and lower 95 % confidence interval bounds. If multiple percentile cutoffs are supplied, then a list containing tuples for each is returned.

Parameters:
  • df (pd.DataFrame) – DataFrame supplied by phenonaut.metrics.performance.percent_replicating

  • percentile_cutoff (Union[int, float, list[Union[int, float]]]) – Percentile cutoff at which to calculate the confidence interval. Can also be a list, which results in a list of high and low confidence interval tuples being returned. Should be between 0 and 100, with a value of 95 denoting the 95th percentile as a cutoff. By default 95.

  • n_resamples (int) – Number of times to resample the null distribution. By default 1000.

  • similarity_metric_higher_is_better (bool) – If True, then consider treatment replicating if score is greater than the percentile cutoff. If False, then consider treatment replicating if score is less than percentile cutoff. By default True.

  • n_jobs (int, optional) – The n_jobs argument is passed to joblib for parallel execution and defines the number of threads to use. A value of -1 denotes that the system should determine how many jobs to run. By default -1.

Returns:

Tuple containing 2 values, the first being the lower confidence interval and the second being the higher confidence interval. If multiple percentile cutoffs are given, then a list of tuples at each percentile cutoff will be returned.

Return type:

Union[list[tuple[float, float]], tuple[float, float]]

phenonaut.metrics.utils.percent_replicating_results_dataframe_to_percentile_vs_percent_replicating(df: DataFrame, percentile_range: tuple[int, int] = (0, 101), percentile_step_size: int = 1, return_counts: bool = False, n_jobs: int = -1, similarity_metric_higher_is_better: bool = True) tuple[ndarray, ndarray]

Get x,y arrays for cutoff vs % replicating plots

Reads a DataFrame from phenonaut.metrics.performance.percent_replicating when run with return_full_performance_df = True, and generates a tuple of x and y coordinates, allowing plotting of percentile cutoff vs percent replicating.

Parameters:
  • df (pd.DataFrame) – DataFrame supplied by phenonaut.metrics.performance.percent_replicating

  • percentile_range (tuple[int, int], optional) – The range of percentiles to cover, by default (0, 101)

  • percentile_step_size (int, optional) – By default, every value in percentile_range is explored (the maximum value is exclusive, in keeping with Python range function operation); the step size may be changed here. By default 1.

  • return_counts (bool, optional) – If True, then y values denote the counts of replicates which were deemed replicating. If False, then percent replicating is returned. By default False

  • similarity_metric_higher_is_better (bool) – If True, then consider treatment replicating if score is greater than the percentile cutoff. If False, then consider treatment replicating if score is less than percentile cutoff. By default True.

  • n_jobs (int, optional) – The n_jobs argument is passed to joblib for parallel execution and defines number of threads to use. A value of -1 denotes that the system should determine how many jobs to run. By default -1.

Returns:

Tuple containing 2 np.ndarrays, the first being percentile cutoff, and the second being matching % replicating values (or count of replicating compounds if return_counts=True)

Return type:

tuple[np.ndarray, np.ndarray]
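
A sketch of plotting the cutoff vs percent replicating curve; pr_df is assumed to come from percent_replicating run with return_full_performance_df=True, and matplotlib is used purely for illustration:

import matplotlib.pyplot as plt
from phenonaut.metrics.utils import (
    percent_replicating_results_dataframe_to_percentile_vs_percent_replicating,
)

x, y = percent_replicating_results_dataframe_to_percentile_vs_percent_replicating(pr_df)
plt.plot(x, y)
plt.xlabel("Percentile cutoff")
plt.ylabel("% replicating")
plt.show()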

phenonaut.metrics.utils.percent_replicating_summarise_results(file_or_dir: Path | str, dir_glob: str = 'pr_*.csv', if_no_json_use_filename_to_derive_info: bool = True, get_percent_replicating: bool = True) None | DataFrame

Summarise percent replicating results

Parameters:
  • file_or_dir (Union[Path, str]) – Percent replicating results file, or directory containing results files.

  • dir_glob (str, optional) – Glob to use in searching directories for percent replicating results files, by default “pr_*.csv”.

  • if_no_json_use_filename_to_derive_info (bool, optional) – Legacy runs of percent replicating did not produce a json information file and therefore run information is attempted to be derived from filenames if this parameter is True. By default True.

  • get_percent_replicating (bool) – If True, then calculate and return PR in the table. Has no effect if run on a file (not a directory). By default True.

Returns:

Either a tuple containing the percent replicating and number of records contributing to that score (if run on a single file), or a pd.DataFrame summarising results (if run on a directory).

Return type:

Union[Tuple[float, int], pd.DataFrame]

Module contents