phenonaut.metrics.distinctness package
Submodules
phenonaut.metrics.distinctness.distinctness_measures module
- phenonaut.metrics.distinctness.distinctness_measures.mp_value_score(ds: Dataset | Phenonaut, ds_groupby: str | List[str], reference_perturbation_query: str, pca_explained_variance: float = 0.99, std_scaler_columns: bool = True, std_scaler_rows: bool = False, n_iters: int = 1000, random_state: int = 42, raise_error_for_low_count_groups: bool = True)
Get mp-value score performance DataFrame for a dataset
- Implementation of the mp-value score from the paper:
Hutz JE, Nelson T, Wu H, et al. The Multidimensional Perturbation Value: A Single Metric to Measure Similarity and Activity of Treatments in High-Throughput Multidimensional Screens. Journal of Biomolecular Screening. 2013;18(4):367-377. doi:10.1177/1087057112469257.
The paper describes normalising by rows as well as columns. Row normalisation is not appropriate for some data types, such as DRUG-seq, so it is not enabled by default. Additionally, the default explained-variance fraction for the PCA operation is set to 0.99, so that the retained components explain 99 % of the variance.
This implementation differs somewhat from the one in pycytominer_eval, which deviates from the paper's definition and does not mix the covariance matrices for treatment and control.
- Parameters:
ds (Union[Dataset, Phenonaut]) – Phenonaut dataset or Phenonaut object upon which to perform the mp_value_score calculation. If a Phenonaut object is passed, then the dataset at position -1 (usually the last added) is used.
ds_groupby (Union[str, List[str]]) – Pandas style groupby to apply on the ds. Normally this is the column name of a unique compound identifier. Can also be a list containing the unique compound identifier column name along with a concentration or timepoint column.
reference_perturbation_query (str) – Pandas style query which may be run on ds to extract the reference set of points in phenotypic space, against which all other grouped perturbations are compared.
pca_explained_variance (float) – This argument is passed to scikit-learn’s PCA object and specifies the fraction of variance that the returned components should capture. The original paper aims for 90 % explained variance; the default here is 99 %. Should be expressed as a float between 0 and 1. By default 0.99
std_scaler_columns (bool) – Apply standard scaler to columns. By default True
std_scaler_rows (bool) – Apply standard scaler to rows. By default False
n_iters (int) – Number of iterations to perform in the statistical test to derive the p-value. By default 1000
n_jobs (int, optional) – Calculations will be run in parallel by providing the number of processors to use. If n_jobs is None, then this is autodetected by the system. By default None
random_state (int) – Random seed used to initialise the random number generator, enabling reproducible runs. By default 42
raise_error_for_low_count_groups (bool) – Calculation of mp_value scores requires more than three samples to be in each group. If raise_error_for_low_count_groups is True, then an error is raised upon encountering such a group, as no mp_value score can be calculated. If False, then a simple warning is printed and the returned p-value and Mahalanobis distance in the results DataFrame are both np.nan. By default True
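The steps described above (column scaling, PCA to a target explained-variance fraction, a Mahalanobis distance under mixed treatment and control covariances, and a permutation test) can be sketched in plain NumPy. This is a simplified illustration, not Phenonaut's implementation: the function names are hypothetical, the covariance mix is assumed to be weighted by sample count, and the PCA is done by SVD rather than through scikit-learn.

```python
import numpy as np


def standardise_columns(x):
    """Scale each feature column to zero mean and unit variance."""
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (x - x.mean(axis=0)) / std


def pca_keep_variance(x, explained_variance=0.99):
    """Project x onto the fewest leading PCs reaching the requested fraction."""
    centred = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    n_keep = min(int(np.searchsorted(ratio, explained_variance)) + 1, len(s))
    return centred @ vt[:n_keep].T


def mahalanobis_between_groups(treatment, control):
    """Distance between group means under a mix of the two covariances.

    Assumption: the mix is weighted by the number of samples in each group.
    """
    n_t, n_c = len(treatment), len(control)
    cov_t = np.atleast_2d(np.cov(treatment, rowvar=False))
    cov_c = np.atleast_2d(np.cov(control, rowvar=False))
    mixed = (n_t * cov_t + n_c * cov_c) / (n_t + n_c)
    diff = treatment.mean(axis=0) - control.mean(axis=0)
    squared = float(diff @ np.linalg.pinv(mixed) @ diff)
    return float(np.sqrt(max(squared, 0.0)))


def mp_value_sketch(treatment, control, explained_variance=0.99,
                    n_iters=1000, random_state=42):
    """Return (Mahalanobis distance, permutation p-value) for one treatment."""
    rng = np.random.default_rng(random_state)
    combined = standardise_columns(np.vstack([treatment, control]))
    scores = pca_keep_variance(combined, explained_variance)
    n_t = len(treatment)
    observed = mahalanobis_between_groups(scores[:n_t], scores[n_t:])
    exceed = 0
    for _ in range(n_iters):
        # Shuffle the treatment/control labels and recompute the distance.
        perm = rng.permutation(len(scores))
        d = mahalanobis_between_groups(scores[perm[:n_t]], scores[perm[n_t:]])
        exceed += d >= observed
    return observed, exceed / n_iters
```

In this sketch the p-value is the fraction of label permutations whose distance is at least as large as the observed one, so a well-separated treatment yields a value near zero, and a fixed random_state makes repeated calls return the same result.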
- phenonaut.metrics.distinctness.distinctness_measures.silhouette_score(ds: Dataset | Phenonaut | DataFrame, perturbation_column: str | None, replicate_criteria: str | list[str] | None = None, features: list[str] | None = None, similarity_metric: str | Callable = 'euclidean', similarity_metric_higher_is_better: bool = True, return_full_performance_df: bool = False)
Module contents
- phenonaut.metrics.distinctness.mp_value_score(ds: Dataset | Phenonaut, ds_groupby: str | List[str], reference_perturbation_query: str, pca_explained_variance: float = 0.99, std_scaler_columns: bool = True, std_scaler_rows: bool = False, n_iters: int = 1000, random_state: int = 42, raise_error_for_low_count_groups: bool = True)
Get mp-value score performance DataFrame for a dataset
- Implementation of the mp-value score from the paper:
Hutz JE, Nelson T, Wu H, et al. The Multidimensional Perturbation Value: A Single Metric to Measure Similarity and Activity of Treatments in High-Throughput Multidimensional Screens. Journal of Biomolecular Screening. 2013;18(4):367-377. doi:10.1177/1087057112469257.
The paper describes normalising by rows as well as columns. Row normalisation is not appropriate for some data types, such as DRUG-seq, so it is not enabled by default. Additionally, the default explained-variance fraction for the PCA operation is set to 0.99, so that the retained components explain 99 % of the variance.
This implementation differs somewhat from the one in pycytominer_eval, which deviates from the paper's definition and does not mix the covariance matrices for treatment and control.
- Parameters:
ds (Union[Dataset, Phenonaut]) – Phenonaut dataset or Phenonaut object upon which to perform the mp_value_score calculation. If a Phenonaut object is passed, then the dataset at position -1 (usually the last added) is used.
ds_groupby (Union[str, List[str]]) – Pandas style groupby to apply on the ds. Normally this is the column name of a unique compound identifier. Can also be a list containing the unique compound identifier column name along with a concentration or timepoint column.
reference_perturbation_query (str) – Pandas style query which may be run on ds to extract the reference set of points in phenotypic space, against which all other grouped perturbations are compared.
pca_explained_variance (float) – This argument is passed to scikit-learn’s PCA object and specifies the fraction of variance that the returned components should capture. The original paper aims for 90 % explained variance; the default here is 99 %. Should be expressed as a float between 0 and 1. By default 0.99
std_scaler_columns (bool) – Apply standard scaler to columns. By default True
std_scaler_rows (bool) – Apply standard scaler to rows. By default False
n_iters (int) – Number of iterations to perform in the statistical test to derive the p-value. By default 1000
n_jobs (int, optional) – Calculations will be run in parallel by providing the number of processors to use. If n_jobs is None, then this is autodetected by the system. By default None
random_state (int) – Random seed used to initialise the random number generator, enabling reproducible runs. By default 42
raise_error_for_low_count_groups (bool) – Calculation of mp_value scores requires more than three samples to be in each group. If raise_error_for_low_count_groups is True, then an error is raised upon encountering such a group, as no mp_value score can be calculated. If False, then a simple warning is printed and the returned p-value and Mahalanobis distance in the results DataFrame are both np.nan. By default True
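Because ds_groupby and reference_perturbation_query follow pandas groupby and query conventions, their effect can be previewed on a plain DataFrame before calling mp_value_score. The sketch below does not call Phenonaut; the column names (cpd, conc, feat_1), the example values, and the min_samples threshold of 4 (one reading of "more than three samples") are assumptions made for illustration, as is the exclusion of the reference rows from the grouped perturbations.

```python
import pandas as pd

# Hypothetical screening table; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "cpd": ["DMSO"] * 4 + ["drugA"] * 4 + ["drugB"] * 2,
        "conc": [0.0] * 4 + [1.0] * 4 + [1.0] * 2,
        "feat_1": [0.1, 0.0, -0.1, 0.2, 2.1, 1.9, 2.0, 2.2, 0.9, 1.1],
    }
)

reference_perturbation_query = "cpd == 'DMSO'"
ds_groupby = ["cpd", "conc"]  # compound identifier plus concentration

# Rows selected by the query form the reference set.
reference = df.query(reference_perturbation_query)

# Assumption: the remaining rows are grouped and compared to the reference.
treatments = df.drop(index=reference.index)
group_sizes = treatments.groupby(ds_groupby).size()

# Groups below the assumed threshold are the ones that would trigger an error
# (or a warning and NaN results) depending on raise_error_for_low_count_groups.
min_samples = 4
low_count = group_sizes[group_sizes < min_samples]
print(low_count)
```

Checking group sizes this way shows in advance which perturbations would be affected by the raise_error_for_low_count_groups setting.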
- phenonaut.metrics.distinctness.pertmutation_test_distinct_from_query_group(ds: Phenonaut | Dataset | list[DataFrame], query_group_query: str, groupby: str | list[str] | None, phenotypic_metric: PhenotypicMetric, n_iters=10000, return_full_results_df: bool = True, random_state: int | Generator = 42, max_samples_in_a_group=50, quiet: bool = False, no_error_on_empty_query: bool = True) → tuple[float, DataFrame | None]
- phenonaut.metrics.distinctness.silhouette_score(ds: Dataset | Phenonaut | DataFrame, perturbation_column: str | None, replicate_criteria: str | list[str] | None = None, features: list[str] | None = None, similarity_metric: str | Callable = 'euclidean', similarity_metric_higher_is_better: bool = True, return_full_performance_df: bool = False)