phenonaut.metrics.compactness package

Submodules

phenonaut.metrics.compactness.pr module

phenonaut.metrics.compactness.pr.percent_compact(ds: Dataset | Phenonaut | DataFrame, perturbation_column: str | None = None, replicate_criteria: str | list[str] | None = None, replicate_query: str | None = None, replicate_criteria_not: str | list[str] | None = None, null_query_or_df: str | DataFrame | None = None, null_criteria: str | list[str] | None = None, null_criteria_not: str | list[str] | None = None, restrict_evaluation_query: str | None = None, features: list[str] | None = None, n_iters: int = 1000, similarity_metric: str | Callable = 'spearman', similarity_metric_higher_is_better: bool = True, min_cardinality: int = 2, max_cardinality: int = 50, include_cardinality_violating_compounds_in_calculation: bool = False, return_full_performance_df: bool = False, additional_captured_params: dict | None = None, similarity_metric_name: str | None = None, performance_df_file: str | Path | None = None, percentile_cutoff: int | None = None, parallel: bool = True, n_jobs: int = -1)

Calculate percent compact

Compactness is defined by the spread of compound repeats compared to a randomly sampled background distribution. For a given compound, its cardinality (number of replicates), referred to as C, is determined. Then the median pairwise distance between all of its replicates is determined. This is then compared to a randomly sampled background, obtained as follows: select C random compounds, calculate the median of their pairwise distances to each other, and store this. The process is repeated n_iters times (by default 1000) to build a distribution of matched cardinality to the replicating compound. The replicate treatments are deemed compact if their score is less than the 5th percentile of the background distribution (for distance metrics), or greater than the 95th percentile (for similarity metrics). Percent compact is simply the percentage of compounds which pass this compactness test.

Matching distributions are created by matching perturbations. In Phenonaut, this is typically defined by the perturbation_column field. This function takes this field as an argument, although it is unused if found in the passed Dataset/Phenonaut object. Additional criteria to match on, such as concentration and well position, can be added using the replicate_criteria argument. Null distributions are composed by picking C unique compounds, where C is the cardinality of the matched repeats, and taking their median ‘similarity’. By default, this similarity is the Spearman correlation coefficient between profiles. This process of generating the median similarity for non-replicate compounds is repeated n_iters times (by default 1000).

As the calculation is demanding, the function makes use of the joblib library for parallel calculation of the null distribution.
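The test described above can be sketched for a single compound as follows. This is a minimal standard-library illustration, not Phenonaut's implementation: the helper names (median_pairwise_distance, percentile, is_compact) are hypothetical, Euclidean distance stands in for the metric, and the background is assumed to be a list holding one profile per non-matching compound.

```python
import itertools
import math
import random
import statistics


def median_pairwise_distance(profiles):
    """Median Euclidean distance over all pairs of profiles."""
    return statistics.median(
        math.dist(a, b) for a, b in itertools.combinations(profiles, 2)
    )


def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a list of numbers."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def is_compact(replicates, background, n_iters=1000, cutoff=5, seed=0):
    """Distance-metric compactness test for one compound (lower is better)."""
    rng = random.Random(seed)
    c = len(replicates)  # cardinality C of the compound
    # Null distribution: median pairwise distance of C randomly drawn
    # background profiles, repeated n_iters times.
    null = [
        median_pairwise_distance(rng.sample(background, c))
        for _ in range(n_iters)
    ]
    return median_pairwise_distance(replicates) < percentile(null, cutoff)
```

For a similarity metric the comparison would flip to "greater than the 95th percentile", as stated in the description above.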

Parameters:

ds (Union[Dataset, Phenonaut, pd.DataFrame],) – Input data in the form of a Phenonaut dataset, a Phenonaut object, or a pd.DataFrame containing profiles of perturbations. If a Phenonaut object is supplied, then the last added Dataset will be used.
perturbation_column (Optional[str]) – This argument sets the column name containing an identifier for the perturbation name (or identifier), usually the name of a compound or similar. If a Phenonaut object or Dataset is supplied as the ds argument and this perturbation_column argument is None, then an attempt is made to discover this value through interrogation of the perturbation_column property of the Dataset. In the case of the CMAP_Level4 PackagedDataset, a standard run would be achieved by passing ‘pert_iname’ as an argument here, or by providing a dataset with the perturbation_column property already set, in which case the value of this argument is disregarded. By default None.
replicate_criteria (Optional[Union[str, list[str]]]=None) – As noted above describing the impact of the perturbation_column argument, matching compounds are often defined by their perturbation name/identifier and dose. Whilst the perturbation column matches the compound name/identifier (something which must always be matched), passing a string here containing the title of a dose column (for example) also enforces matching on this property. A list of strings may also be passed. In the case of the PackagedDataset CMAP_Level4, this argument would take the value “pert_idose_uM” and would ensure that matched replicates share a common identifier/name (as default and enforced by perturbation_column) and concentration. The original perturbation_column may be included here but has no effect. By default None.
replicate_query (Optional[str]=None) – Optional pandas query to apply in selection of the matching replicates. This may be something like ensuring concentration is above a threshold, or that they are taken from certain timepoints. Please note, if information in rows is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running compactness on it. This parameter should not be used to restrict the compounds on which compactness is run, as this is inefficient. Instead, the restrict_evaluation_query should be used.
replicate_criteria_not (Optional[Union[str, list[str]]]=None) – Values in this list enforce that matching replicates do NOT share a common property. This is useful for exotic evaluations, like picking replicates from across cell lines and concentrations.
null_query_or_df (Optional[str, pd.DataFrame]=None) – Optional pandas query to apply in selection of the non-matching replicates comprising the null distribution. This can be things like ensuring only certain plates or cell lines are used in construction of the distribution. Alternatively, a pd.DataFrame may be supplied here, from which the non-matching compounds are drawn for creation of the null distribution. Note: if supplying a query to filter out information in rows that is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running compactness on it. This argument does not override null_criteria or null_criteria_not, as these take effect after this argument's effects have been applied. Has no effect if None. By default None.
null_criteria (Optional[Union[str, list[str]]]) – Whilst matching compounds are often defined by perturbation and dose, compounds comprising the null distribution must sometimes match the well position of the original compound. This argument captures the column name defining properties of non-matching replicates which must match the original query compound. In the case of the CMAP_Level4 PackagedDataset, this argument would take the value “well”. A list of strings may also be passed to enforce further fields within the matching distribution which must match in the null distribution. If None, then no requirements apart from a different name/compound identifier are enforced. By default None.
null_criteria_not (Optional[Union[str, list[str]]]) – Values in this list enforce that matching non-replicates do NOT share a common property with the chosen replicates. The opposite of the above described null_criteria, this allows exotic evaluations like picking replicates from different cell lines to the matching replicates. Has no effect if None. By default None.
restrict_evaluation_query (Optional[str], optional) – If only a few compounds in a Phenonaut Dataset are to be included in the compactness calculation, then this parameter may be used to efficiently select only the required compounds using a standard pandas style query which is run on groups defined by replicate_criteria. Excluded compounds are not removed from the compactness calculation. If None, then has no effect. By default None.
features (Optional[list[str]]) – List of features which capture the phenotypic responses to perturbations. Only required and used if a pd.DataFrame is supplied in place of the ds argument. By default None.
n_iters (int, optional) – Number of times the non-matching compound replicates should be sampled to compose the null distribution. If fewer than n_iters are available, then take as many as possible. By default 1000.
similarity_metric (Union[str, Callable, PhenotypicMetric], optional) – Callable metric, or string which is passed to pdist. This should be a distance metric; that is, lower is better, higher is worse. Note that a special case exists whereby ‘spearman’ may be supplied here; if so, then a much faster NumPy method, np.corrcoef, is used, and the results are subtracted from 1 to turn the metric into a distance metric. By default ‘spearman’.
similarity_metric_higher_is_better (bool) – If True, then a high value from the supplied similarity metric is better. If False, then a lower value is better (as is the case for distance metrics like Euclidean/Manhattan etc). Note that if lower is better, then the percentile should be changed to the other end of the distribution. For example, if keeping with significance at the 5 % level for a metric for which higher is better, then a metric where lower is better would use the 5th percentile, and percentile_cutoff = 5 should be used. By default True.
min_cardinality (int) – Cardinality is the number of times a treatment is repeated (treatment with matching well, dose and any other constraints imposed). This argument sets the minimum number of treatment repeats that should be present; if there are fewer, then the group is excluded from the calculation. The behaviour of cytominer-eval includes all single repeat measurements, marking them as non-replicating; this behaviour is replicated by setting this argument to 2. If 2, then only compounds with 2 or more repeats are included in the calculation and have the possibility of generating a score for comparison to a null distribution and potentially passing the compactness test of being greater than the Nth percentile of the null distribution. By default 2.
max_cardinality (int) – If a dataset has thousands of matched repeats, then little is gained in finding pairwise all-to-all distances of non-matching compounds. This argument allows setting an upper bound cutoff, after which the repeats are shuffled and max_cardinality samples drawn to create a synthetic set of max_cardinality repeats. This is very useful when using a demanding similarity method, as it cuts the number of evaluations dramatically. By default 50.
include_cardinality_violating_compounds_in_calculation (bool) – If True, then compounds for which there are no matching replicates, or not enough as defined by min_cardinality (and are therefore deemed not compact) are included in the final reported compactness statistics. If False, then they are not included as non-compact and simply ignored as if they were not present in the dataset. By default False.
return_full_performance_df (bool) – If True, then a tuple is returned with the compactness score in the first position, and a pd.DataFrame containing full information on each repeat in the second position. By default False.
performance_df_file (Optional[Union[str, Path, bool]]) – If return_full_performance_df is True and a Path or str is given as an argument to this parameter, then the performance DataFrame is written out to a CSV file using this filename. If True is passed here, then a filename will be constructed from function arguments, attempting to capture the run details. If an auto-generated file with this name exists, then an error is raised and no calculations are performed. In addition to the output CSV, a json file is also written capturing arguments that the function was called with. So if ‘compactness_results.csv’ is passed here, then a file named compactness_results.json will be written out. If a filename is autogenerated, then the autogenerated filename is adapted to have the ‘.json’ file extension. If the argument does not end in ‘.csv’, then .json is appended to the end of the filename to define the name of the json file. By default None.
additional_captured_params (Optional[dict]) – If writing out full details, also include this dictionary in the output json file, useful to add metadata to runs. By default None.
similarity_metric_name (Optional[str]) – If relying on the function to make a nice performance CSV file name, then a nice succinct similarity metric name may be passed here, rather than relying upon calling __repr__ on the function, which may return long names such as: ‘bound method Phenotypic_Metric.similarity of Manhattan’. By default None.
percentile_cutoff (Optional[int]) – Percentile of the null distribution over which the matching replicates must score to be considered compact. Should range from 0 to 100. Normally, this can be 95 (when using a similarity metric where higher is better), but if using a metric where lower is better, then it should be set to 5. To make things easier, this parameter defaults to None, in which case it takes the value 95 if similarity_metric_higher_is_better==True, and 5 if similarity_metric_higher_is_better==False. By default None.
parallel (bool) – If True, then use joblib to parallelise evaluation of compounds. By default True.
n_jobs (int, optional) – The n_jobs argument is passed to joblib for parallel execution and defines the number of threads to use. A value of -1 denotes that the system should determine how many jobs to run. By default -1.
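
The interaction between percentile_cutoff and similarity_metric_higher_is_better described above can be sketched as below. The helper names resolve_percentile_cutoff and passes_test are hypothetical and only mirror the documented defaulting rule; they are not part of Phenonaut, and the sketch assumes an integer cutoff between 1 and 99.

```python
import statistics


def resolve_percentile_cutoff(percentile_cutoff, higher_is_better):
    """Apply the documented default: 95 if higher is better, otherwise 5."""
    if percentile_cutoff is not None:
        return percentile_cutoff
    return 95 if higher_is_better else 5


def passes_test(score, null_scores, percentile_cutoff=None, higher_is_better=True):
    """Compare a replicate score against the chosen percentile of the null."""
    cutoff = resolve_percentile_cutoff(percentile_cutoff, higher_is_better)
    # statistics.quantiles with n=100 returns the 1st..99th percentiles,
    # so the k-th percentile sits at index k - 1.
    threshold = statistics.quantiles(null_scores, n=100, method="inclusive")[cutoff - 1]
    return score > threshold if higher_is_better else score < threshold
```

With a similarity metric the score must exceed the upper tail of the null distribution; with a distance metric it must fall below the lower tail.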

Returns:

If return_full_performance_df is False, then only the percent compact statistic is returned. If True, then a tuple is returned, with percent compact in the first position, and a pd.DataFrame in the second position containing the median repeat scores, as well as median null distribution scores in an easy to analyse format.

Return type:

Union[float, tuple[float, pd.DataFrame]]
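
Because the return type depends on return_full_performance_df, calling code may want to normalise it. A minimal sketch, assuming a hypothetical helper unpack_result that is not part of Phenonaut:

```python
def unpack_result(result):
    """Split a float-or-tuple return value into (score, performance_df_or_None)."""
    if isinstance(result, tuple):
        score, performance_df = result
        return score, performance_df
    return result, None
```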

phenonaut.metrics.compactness.pr.percent_replicating(ds: Dataset | Phenonaut | DataFrame, perturbation_column: str | None = None, replicate_query: str | None = None, replicate_criteria: str | list[str] | None = None, replicate_criteria_not: str | list[str] | None = None, null_query_or_df: str | DataFrame | None = None, null_criteria: str | list[str] | None = None, null_criteria_not: str | list[str] | None = None, restrict_evaluation_query: str | None = None, features: list[str] | None = None, n_iters: int = 1000, phenotypic_metric: str | Callable = 'spearman', phenotypic_metric_higher_is_better: bool = True, min_cardinality: int = 2, max_cardinality: int = 50, include_cardinality_violating_compounds_in_calculation: bool = False, return_full_performance_df: bool = False, include_replicate_pairwise_distances_in_df: bool = False, additional_captured_params: dict | None = None, similarity_metric_name: str | None = None, performance_df_file: str | Path | None = None, percentile_cutoff: int | None = None, parallel: bool = True, n_jobs: int | None = None, random_state: int | Generator = 42, quiet: bool = False)

Calculate percent replicating

Percent replicating is defined by Way et al. in: Way, Gregory P., et al. “Morphology and gene expression profiling provide complementary information for mapping cell state.” Cell Systems 13.11 (2022): 911-923, or on bioRxiv: https://www.biorxiv.org/content/10.1101/2021.10.21.465335v2

Helpful descriptions also exist in https://github.com/cytomining/cytominer-eval/issues/21#issuecomment-902934931

This implementation is designed to work with a variety of phenotypic similarity methods, not just a Spearman correlation coefficient between observations.

Matching distributions are created by matching perturbations. In Phenonaut, this is typically defined by the perturbation_column field. This function takes this field as an argument, although it is unused if found in the passed Dataset/Phenonaut object. Additional criteria to match on, such as concentration and well position, can be added using the replicate_criteria argument. Null distributions are composed by picking C unique compounds, where C is the cardinality of the matched repeats, and taking their median ‘similarity’. By default, this similarity is the Spearman correlation coefficient between profiles. This process of generating the median similarity for non-replicate compounds is repeated n_iters times (by default 1000). Once the null distribution has been collected (median pairwise similarities), the median similarity of matched replicates is compared to the 95th percentile of this null distribution. If it is greater, then the compound (or compound and dose) is deemed replicating. Null distributions may not contain the matched compound. The percent replicating is calculated from the number of matched repeats which were replicating versus the number which were not.

As the calculation is demanding, the function makes use of parallel calculation of the null distribution.
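
The procedure above can be sketched end to end over several compounds. This is an illustrative standard-library sketch under stated assumptions, not Phenonaut's implementation: the function names are hypothetical, a toy similarity (negative Euclidean distance) replaces the Spearman correlation, and each null sample takes one randomly chosen profile from each of C other compounds.

```python
import itertools
import math
import random
import statistics


def neg_euclidean(a, b):
    """Toy similarity: values closer to 0 mean more similar."""
    return -math.dist(a, b)


def median_pairwise(profiles, similarity):
    """Median similarity over all pairs of profiles."""
    return statistics.median(
        similarity(a, b) for a, b in itertools.combinations(profiles, 2)
    )


def percent_replicating_sketch(groups, similarity=neg_euclidean, n_iters=1000, seed=42):
    """groups maps each compound name to its list of replicate profiles."""
    rng = random.Random(seed)
    results = {}
    for name, replicates in groups.items():
        c = len(replicates)
        if c < 2:
            continue  # too few replicates to score
        # The null distribution may not contain the matched compound.
        others = [n for n in groups if n != name]
        if len(others) < c:
            continue
        null = []
        for _ in range(n_iters):
            picked = rng.sample(others, c)  # C unique non-matching compounds
            profiles = [rng.choice(groups[n]) for n in picked]
            null.append(median_pairwise(profiles, similarity))
        # Index 94 of the 99 cut points is the 95th percentile.
        threshold = statistics.quantiles(null, n=100, method="inclusive")[94]
        results[name] = median_pairwise(replicates, similarity) > threshold
    percent = 100 * sum(results.values()) / len(results) if results else float("nan")
    return percent, results
```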

Parameters:

ds (Union[Dataset, Phenonaut, pd.DataFrame],) – Input data in the form of a Phenonaut dataset, a Phenonaut object, or a pd.DataFrame containing profiles of perturbations. If a Phenonaut object is supplied, then the last added Dataset will be used.
perturbation_column (Optional[str]) – In the standard % replicating calculation, compounds are matched by name (or identifier) and dose, although this can be relaxed. This argument sets the column name containing an identifier for the perturbation name (or identifier), usually the name of a compound or similar. If a Phenonaut object or Dataset is supplied as the ds argument and this perturbation_column argument is None, then an attempt is made to discover this value through interrogation of the perturbation_column property of the Dataset. In the case of the CMAP_Level4 PackagedDataset, a standard run would be achieved by passing ‘pert_iname’ as an argument here, or by providing a dataset with the perturbation_column property already set, in which case the value of this argument is disregarded. By default None.
replicate_query (Optional[str]=None) – Optional pandas query to apply in selection of the matching replicates. This may be something like ensuring concentration is above a threshold, or that they are taken from certain timepoints. Please note, if information in rows is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running percent_replicating on it. This parameter should not be used to restrict the compounds on which percent replicating is run, as this is inefficient. Instead, the restrict_evaluation_query should be used.
replicate_criteria (Optional[Union[str, list[str]]]=None) – As noted above describing the impact of the perturbation_column argument, matching compounds are often defined by their perturbation name/identifier and dose. Whilst the perturbation column matches the compound name/identifier (something which must always be matched), passing a string here containing the title of a dose column (for example) also enforces matching on this property. A list of strings may also be passed. In the case of the PackagedDataset CMAP_Level4, this argument would take the value “pert_idose_uM” and would ensure that matched replicates share a common identifier/name (as default and enforced by perturbation_column) and concentration. The original perturbation_column may be included here but has no effect. By default None.
replicate_criteria_not (Optional[Union[str, list[str]]]=None) – Values in this list enforce that matching replicates do NOT share a common property. This is useful for exotic evaluations, like picking replicates from across cell lines and concentrations.
null_query_or_df (Optional[str, pd.DataFrame]=None) – Optional pandas query to apply in selection of the non-matching replicates comprising the null distribution. This can be things like ensuring only certain plates or cell lines are used in construction of the distribution. Alternatively, a pd.DataFrame may be supplied here, from which the non-matching compounds are drawn for creation of the null distribution. Note: if supplying a query to filter out information in rows that is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running percent_replicating on it. This argument does not override null_criteria or null_criteria_not, as these take effect after this argument's effects have been applied. Has no effect if None. By default None.
null_criteria (Optional[Union[str, list[str]]]) – Whilst matching compounds are often defined by perturbation and dose, compounds comprising the null distribution must sometimes match the well position of the original compound. This argument captures the column name defining properties of non-matching replicates which must match the original query compound. In the case of the CMAP_Level4 PackagedDataset, this argument would take the value “well”. A list of strings may also be passed to enforce further fields within the matching distribution which must match in the null distribution. If None, then no requirements apart from a different name/compound identifier are enforced. By default None.
null_criteria_not (Optional[Union[str, list[str]]]) – Values in this list enforce that matching non-replicates do NOT share a common property with the chosen replicates. The opposite of the above described null_criteria, this allows exotic evaluations like picking replicates from different cell lines to the matching replicates. Has no effect if None. By default None.
restrict_evaluation_query (Optional[str], optional) – If only a few compounds in a Phenonaut Dataset are to be included in the percent replicating calculation, then this parameter may be used to efficiently select only the required compounds using a standard pandas style query which is run on groups defined by replicate_criteria. Excluded compounds are not removed from the percent replicating calculation. If None, then has no effect. By default None.
features (Optional[list[str]]) – List of features which capture the phenotypic responses to perturbations. Only required and used if a pd.DataFrame is supplied in place of the ds argument. By default None.
n_iters (int, optional) – Number of times the non-matching compound replicates should be sampled to compose the null distribution. If fewer than n_iters are available, then take as many as possible. By default 1000.
phenotypic_metric (Union[str, Callable, PhenotypicMetric], optional) – Callable metric, or string which is passed to pdist. This should be a distance metric; that is, lower is better, higher is worse. Note that a special case exists whereby ‘spearman’ may be supplied here; if so, then a much faster NumPy method, np.corrcoef, is used, and the results are subtracted from 1 to turn the metric into a distance metric. By default ‘spearman’.
phenotypic_metric_higher_is_better (bool) – If True, then a high value from the supplied phenotypic metric is better. If False, then a lower value is better (as is the case for distance metrics like Euclidean/Manhattan etc). Note that if lower is better, then the percentile should be changed to the other end of the distribution. For example, if keeping with significance at the 5 % level for a metric for which higher is better, then a metric where lower is better would use the 5th percentile, and percentile_cutoff = 5 should be used. By default True.
min_cardinality (int) – Cardinality is the number of times a treatment is repeated (treatment with matching well, dose and any other constraints imposed). This argument sets the minimum number of treatment repeats that should be present; if there are fewer, then the group is excluded from the calculation. The behaviour of cytominer-eval includes all single repeat measurements, marking them as non-replicating; this behaviour is replicated by setting this argument to 2. If 2, then only compounds with 2 or more repeats are included in the calculation and have the possibility of generating a score for comparison to a null distribution and potentially passing the replicating test of being greater than the Nth percentile of the null distribution. By default 2.
max_cardinality (int) – If a dataset has thousands of matched repeats, then little is gained in finding pairwise all-to-all distances of non-matching compounds. This argument allows setting an upper bound cutoff, after which the repeats are shuffled and max_cardinality samples drawn to create a synthetic set of max_cardinality repeats. This is very useful when using a demanding similarity method, as it cuts the number of evaluations dramatically. By default 50.
include_cardinality_violating_compounds_in_calculation (bool) – If True, then compounds for which there are no matching replicates, or not enough as defined by min_cardinality (and are therefore deemed not replicating) are included in the final reported percent replicating statistics. If False, then they are not included as non-replicating and simply ignored as if they were not present in the dataset. By default False.
return_full_performance_df (bool) – If True, then a tuple is returned with the percent replicating score in the first position, and a pd.DataFrame containing full information on each repeat in the second position. By default False.
include_replicate_pairwise_distances_in_df (bool) – If True, then pairwise replicate distances are included in the full performance dataframe. Has no effect if return_full_performance_df is False. By default False.
performance_df_file (Optional[Union[str, Path, bool]]) – If return_full_performance_df is True and a Path or str is given as an argument to this parameter, then the performance DataFrame is written out to a CSV file using this filename. If True is passed here, then a filename will be constructed from function arguments, attempting to capture the run details. If an auto-generated file with this name exists, then an error is raised and no calculations are performed. In addition to the output CSV, a json file is also written capturing arguments that the function was called with. So if ‘pr_results.csv’ is passed here, then a file named pr_results.json will be written out. If a filename is autogenerated, then the autogenerated filename is adapted to have the ‘.json’ file extension. If the argument does not end in ‘.csv’, then .json is appended to the end of the filename to define the name of the json file. By default None.
additional_captured_params (Optional[dict]) – If writing out full details, also include this dictionary in the output json file, useful to add metadata to runs. By default None.
similarity_metric_name (Optional[str]) – If relying on the function to make a nice performance CSV file name, then a nice succinct similarity metric name may be passed here, rather than relying upon calling __repr__ on the function, which may return long names such as: ‘bound method Phenotypic_Metric.similarity of Manhattan’. By default None.
percentile_cutoff (Optional[int]) – Percentile of the null distribution over which the matching replicates must score to be considered replicating. Should range from 0 to 100. Normally, this can be 95 (when using a similarity metric where higher is better), but if using a metric where lower is better, then it should be set to 5. To make things easier, this parameter defaults to None, in which case it takes the value 95 if phenotypic_metric_higher_is_better==True, and 5 if phenotypic_metric_higher_is_better==False. By default None.
parallel (bool) – If True, then use multiprocessing to parallelise evaluation of compounds. By default True.
n_jobs (int, optional) – The n_jobs argument is passed to multiprocessing for parallel execution and defines the number of threads to use. A value of -1 denotes that the system should determine how many jobs to run. By default None.
random_state (Union[int, np.random.Generator]) – Random state which should be used when performing sampling operations. Can be a np.random.Generator, or an int (in which case a np.random.Generator is instantiated with it). If attempting reproducible results, run without parallelisation by setting the parallel argument to False. By default 42.
quiet (bool) – If True, then do not display a progress bar. By default False.
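
The naming rule for the json file that accompanies performance_df_file can be sketched as follows. The helper json_sidecar_path is hypothetical and only mirrors the rule as documented above; how Phenonaut derives the name internally is not shown in this reference.

```python
from pathlib import Path


def json_sidecar_path(performance_df_file):
    """Derive the json filename that accompanies a performance CSV file."""
    path = Path(performance_df_file)
    if path.suffix == ".csv":
        # 'pr_results.csv' -> 'pr_results.json'
        return path.with_suffix(".json")
    # Any other name has '.json' appended to the end.
    return path.with_name(path.name + ".json")
```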

Returns:

If return_full_performance_df is False, then only the percent replicating is returned. If True, then a tuple is returned, with percent replicating in the first position, and a pd.DataFrame in the second position containing the median repeat scores, as well as median null distribution scores in an easy to analyse format.

Return type:

Union[float, tuple[float, pd.DataFrame]]
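
The effect of min_cardinality and include_cardinality_violating_compounds_in_calculation on the returned percentage can be illustrated with a small sketch. The helper percent_passing and its input layout are assumptions made for illustration only; they show how the denominator changes, not how Phenonaut computes it.

```python
def percent_passing(groups, min_cardinality=2, include_violating=False):
    """groups maps a compound name to (cardinality, passed_test)."""
    counted = 0
    passed = 0
    for cardinality, ok in groups.values():
        if cardinality < min_cardinality:
            if include_violating:
                counted += 1  # counted as non-replicating, can never pass
            continue
        counted += 1
        passed += bool(ok)
    return 100 * passed / counted if counted else float("nan")
```

With one passing compound, one failing compound, and one single-repeat compound, ignoring the violating compound gives 50 %, while including it gives one in three.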

Module contents

phenonaut.metrics.compactness.percent_compact(ds: Dataset | Phenonaut | DataFrame, perturbation_column: str | None = None, replicate_criteria: str | list[str] | None = None, replicate_query: str | None = None, replicate_criteria_not: str | list[str] | None = None, null_query_or_df: str | DataFrame | None = None, null_criteria: str | list[str] | None = None, null_criteria_not: str | list[str] | None = None, restrict_evaluation_query: str | None = None, features: list[str] | None = None, n_iters: int = 1000, similarity_metric: str | Callable = 'spearman', similarity_metric_higher_is_better: bool = True, min_cardinality: int = 2, max_cardinality: int = 50, include_cardinality_violating_compounds_in_calculation: bool = False, return_full_performance_df: bool = False, additional_captured_params: dict | None = None, similarity_metric_name: str | None = None, performance_df_file: str | Path | None = None, percentile_cutoff: int | None = None, parallel: bool = True, n_jobs: int = -1)

Calculate percent compact

Compactness is defined by the spread of compound repeats compared to a randomly sampled background distribution. For a given compound, its cardinality (number of replicates), referred to as C, is determined. Then the median pairwise distance between all of its replicates is determined. This is then compared to a randomly sampled background, obtained as follows: select C random compounds, calculate the median of their pairwise distances to each other, and store this. The process is repeated n_iters times (by default 1000) to build a distribution of matched cardinality to the replicating compound. The replicate treatments are deemed compact if their score is less than the 5th percentile of the background distribution (for distance metrics), or greater than the 95th percentile (for similarity metrics). Percent compact is simply the percentage of compounds which pass this compactness test.

Matching distributions are created by matching perturbations. In Phenonaut, this is typically defined by the perturbation_column field. This function takes this field as an argument, although it is unused if found in the passed Dataset/Phenonaut object. Additional criteria to match on, such as concentration and well position, can be added using the replicate_criteria argument. Null distributions are composed by picking C unique compounds, where C is the cardinality of the matched repeats, and taking their median ‘similarity’. By default, this similarity is the Spearman correlation coefficient between profiles. This process of generating the median similarity for non-replicate compounds is repeated n_iters times (by default 1000).

As the calculation is demanding, the function makes use of the joblib library for parallel calculation of the null distribution.

Parameters:

ds (Union[Dataset, Phenonaut, pd.DataFrame],) – Input data in the form of a Phenonaut dataset, a Phenonaut object, or a pd.DataFrame containing profiles of perturbations. If a Phenonaut object is supplied, then the last added Dataset will be used.
perturbation_column (Optional[str]) – This argument sets the column name containing an identifier for the perturbation name (or identifier), usually the name of a compound or similar. If a Phenonaut object or Dataset is supplied as the ds argument and this perturbation_column argument is None, then an attempt is made to discover this value through interrogation of the perturbation_column property of the Dataset. In the case of the CMAP_Level4 PackagedDataset, a standard run would be achieved by passing ‘pert_iname’ as an argument here, or by providing a dataset with the perturbation_column property already set, in which case the value of this argument is disregarded. By default None.
replicate_criteria (Optional[Union[str, list[str]]]=None) – As noted above describing the impact of the perturbation_column argument, matching compounds are often defined by their perturbation name/identifier and dose. Whilst the perturbation column matches the compound name/identifier (something which must always be matched), passing a string here containing the title of a dose column (for example) also enforces matching on this property. A list of strings may also be passed. In the case of the PackagedDataset CMAP_Level4, this argument would take the value “pert_idose_uM” and would ensure that matched replicates share a common identifier/name (as default and enforced by perturbation_column) and concentration. The original perturbation_column may be included here but has no effect. By default None.
replicate_query (Optional[str]=None) – Optional pandas query to apply in selection of the matching replicates. This may be something like ensuring concentration is above a threshold, or that replicates are taken from certain timepoints. Please note, if information in rows is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running compactness on it. This parameter should not be used to restrict the compounds on which compactness is run, as this is inefficient. Instead, restrict_evaluation_query should be used.
replicate_criteria_not (Optional[Union[str, list[str]]]=None) – Values in this list enforce that matching replicates do NOT share a common property. This is useful for exotic evaluations, like picking replicates from across cell lines and concentrations.
null_query_or_df (Optional[Union[str, pd.DataFrame]]=None) – Optional pandas query to apply in selection of the non-matching replicates comprising the null distribution. This can be things like ensuring only certain plates or cell lines are used in construction of the distribution. Alternatively, a pd.DataFrame may be supplied here, from which the non-matching compounds are drawn for creation of the null distribution. Note: if supplying a query to filter out information in rows that is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running compactness on it. This argument does not override null_criteria or null_criteria_not, as these take effect after this argument's effects have been applied. Has no effect if None. By default None.
null_criteria (Optional[Union[str, list[str]]]) – Whilst matching compounds are often defined by perturbation and dose, compounds comprising the null distribution must sometimes match the well position of the original compound. This argument captures the column name defining properties of non-matching replicates which must match the original query compound. In the case of the CMAP_Level4 PackagedDataset, this argument would take the value “well”. A list of strings may also be passed to enforce further fields within the matching distribution which must match in the null distribution. If None, then no requirements apart from a different name/compound identifier are enforced. By default None.
null_criteria_not (Optional[Union[str, list[str]]]) – Values in this list enforce that matching non-replicates do NOT share a common property with the chosen replicates. The opposite of the above described null_criteria, this allows exotic evaluations like picking replicates from different cell lines to the matching replicates. Has no effect if None. By default None.
restrict_evaluation_query (Optional[str], optional) – If only a few compounds in a Phenonaut Dataset are to be included in the compactness calculation, then this parameter may be used to efficiently select only the required compounds using a standard pandas style query which is run on groups defined by replicate_criteria. Excluded compounds are not removed from the compactness calculation. If None, then has no effect. By default None.
features (Optional[list[str]]) – List of features which capture the phenotypic responses to perturbations. Only required and used if a pd.DataFrame is supplied in place of the ds argument. By default None.
n_iters (int, optional) – Number of times the non-matching compound replicates should be sampled to compose the null distribution. If fewer than n_iters are available, then as many as possible are taken. By default 1000.
similarity_metric (Union[str, Callable, PhenotypicMetric], optional) – Callable metric, or string which is passed to pdist. This should be a distance metric; that is, lower is better, higher is worse. Note that a special case exists, whereby ‘spearman’ may be supplied here. If so, then a much faster NumPy method, np.corrcoef, is used, and the results are subtracted from 1 to turn the metric into a distance metric. By default ‘spearman’.
similarity_metric_higher_is_better (bool) – If True, then a high value from the supplied similarity metric is better. If False, then a lower value is better (as is the case for distance metrics like Euclidean, Manhattan, etc.). Note that if lower is better, then the percentile should be changed to the other end of the distribution. For example, to keep significance at the 5% level, a metric for which higher is better uses the 95th percentile, whereas a metric for which lower is better uses the 5th percentile, in which case percentile_cutoff = 5 should be used. By default True.
min_cardinality (int) – Cardinality is the number of times a treatment is repeated (treatment with matching well, dose and any other constraints imposed). This argument sets the minimum number of treatment repeats that should be present; if there are fewer, then the group is excluded from the calculation. cytominer-eval includes all single repeat measurements, marking them as non-replicating; this behaviour is replicated by setting this argument to 2. If 2, then only compounds with 2 or more repeats are included in the calculation and have the possibility of generating a score for comparison to a null distribution and potentially passing the compactness test of being greater than the Nth percentile of the null distribution. By default 2.
max_cardinality (int) – If a dataset has thousands of matched repeats, then little is gained in finding pairwise all-to-all distances of non-matching compounds. This argument sets an upper bound, above which the repeats are shuffled and max_cardinality samples drawn to create a synthetic set of max_cardinality repeats. This is very useful when using a demanding similarity method, as it cuts the number of evaluations dramatically. By default 50.
include_cardinality_violating_compounds_in_calculation (bool) – If True, then compounds for which there are no matching replicates, or not enough as defined by min_cardinality (and are therefore deemed not compact) are included in the final reported compactness statistics. If False, then they are not included as non-compact and simply ignored as if they were not present in the dataset. By default False.
return_full_performance_df (bool) – If True, then a tuple is returned with the compactness score in the first position, and a pd.DataFrame containing full information on each repeat in the second position. By default False.
performance_df_file (Optional[Union[str, Path, bool]]) – If return_full_performance_df is True and a Path or str is given as an argument to this parameter, then the performance DataFrame is written out to a CSV file using this filename. If True is passed here, then a filename will be constructed from function arguments, attempting to capture the run details. If an auto-generated file with this name exists, then an error is raised and no calculations are performed. In addition to the output CSV, a json file is also written, capturing the arguments that the function was called with. So if ‘compactness_results.csv’ is passed here, then a file named compactness_results.json will be written out. If a filename is autogenerated, then the autogenerated filename is adapted to have the ‘.json’ file extension. If the argument does not end in ‘.csv’, then .json is appended to the end of the filename to define the name of the json file. By default None.
additional_captured_params (Optional[dict]) – If writing out full details, also include this dictionary in the output json file, useful to add metadata to runs. By default None.
similarity_metric_name (Optional[str]) – If relying on the function to make a nice performance CSV file name, then a nice succinct similarity metric name may be passed here, rather than relying upon calling __repr__ on the function, which may return long names such as: ‘bound method Phenotypic_Metric.similarity of Manhattan’. By default None.
percentile_cutoff (Optional[int]) – Percentile of the null distribution over which the matching replicates must score to be considered compact. Should range from 0 to 100. Normally, this can be 95 (when using a similarity metric where higher is better), but if using a metric where lower is better, then it should be set to 5. To make things easier, this parameter defaults to None, in which case it takes the value 95 if similarity_metric_higher_is_better==True, and 5 if similarity_metric_higher_is_better==False. By default None.
parallel (bool) – If True, then use joblib to parallelise evaluation of compounds. By default True.
n_jobs (int, optional) – The n_jobs argument is passed to joblib for parallel execution and defines the number of threads to use. A value of -1 denotes that the system should determine how many jobs to run. By default -1.
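The ‘spearman’ special case described for similarity_metric (np.corrcoef followed by subtraction from 1) can be sketched as below. This is an illustration rather than Phenonaut's code: the function name spearman_distance_matrix is hypothetical, and ranks are computed with a double argsort, which assumes there are no tied feature values within a profile.

```python
import numpy as np


def spearman_distance_matrix(profiles: np.ndarray) -> np.ndarray:
    """1 - Spearman correlation between every pair of rows."""
    # Rank each row's features; double argsort gives ranks when there are no ties.
    ranks = np.argsort(np.argsort(profiles, axis=1), axis=1).astype(float)
    # Pearson correlation of the ranks is the Spearman correlation.
    return 1.0 - np.corrcoef(ranks)


profiles = np.array(
    [
        [0.1, 0.5, 0.9, 1.4],  # reference profile
        [1.0, 2.0, 3.0, 4.0],  # same feature ordering as the reference
        [4.0, 3.0, 2.0, 1.0],  # reversed feature ordering
    ]
)
dist = spearman_distance_matrix(profiles)
```

Profiles with identical feature ordering end up at distance 0 and fully reversed profiles at distance 2, so lower values mean more similar profiles.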

Returns:

If return_full_performance_df is False, then only the percent compact statistic is returned. If True, then a tuple is returned, with percent compact in the first position, and a pd.DataFrame in the second position containing the median repeat scores, as well as median null distribution scores in an easy to analyse format.

Return type:

Union[float, tuple[float, pd.DataFrame]]

phenonaut.metrics.compactness.percent_replicating(ds: Dataset | Phenonaut | DataFrame, perturbation_column: str | None = None, replicate_query: str | None = None, replicate_criteria: str | list[str] | None = None, replicate_criteria_not: str | list[str] | None = None, null_query_or_df: str | DataFrame | None = None, null_criteria: str | list[str] | None = None, null_criteria_not: str | list[str] | None = None, restrict_evaluation_query: str | None = None, features: list[str] | None = None, n_iters: int = 1000, phenotypic_metric: str | Callable = 'spearman', phenotypic_metric_higher_is_better: bool = True, min_cardinality: int = 2, max_cardinality: int = 50, include_cardinality_violating_compounds_in_calculation: bool = False, return_full_performance_df: bool = False, include_replicate_pairwise_distances_in_df: bool = False, additional_captured_params: dict | None = None, similarity_metric_name: str | None = None, performance_df_file: str | Path | None = None, percentile_cutoff: int | None = None, parallel: bool = True, n_jobs: int | None = None, random_state: int | Generator = 42, quiet: bool = False)

Calculate percent replicating

Percent replicating is defined by Way et al. in: Way, Gregory P., et al. “Morphology and gene expression profiling provide complementary information for mapping cell state.” Cell Systems 13.11 (2022): 911-923, or on bioRxiv: https://www.biorxiv.org/content/10.1101/2021.10.21.465335v2

Helpful descriptions also exist in https://github.com/cytomining/cytominer-eval/issues/21#issuecomment-902934931

This implementation is designed to work with a variety of phenotypic similarity methods, not just a Spearman correlation coefficient between observations.

Matching distributions are created by matching perturbations. In Phenonaut, this is typically defined by the perturbation_column field. This function takes this field as an argument, although it is unused if found in the passed Dataset/Phenonaut object. Additional criteria to match on, such as concentration and well position, can be added using the replicate_criteria argument. Null distributions are composed by picking C unique compounds, where C is the cardinality of the matched repeats (how many there are), and calculating their median ‘similarity’. By default, this similarity is the Spearman correlation coefficient between profiles. This process of generating a median similarity for non-replicate compounds is repeated n_iters times (by default 1000). Once the null distribution (of median pairwise similarities) has been collected, the median similarity of matched replicates is compared to the 95th percentile of this null distribution. If it is greater, then the compound (or compound and dose) is deemed replicating. Null distributions may not contain the matched compound. The percent replicating is calculated from the number of matched repeats which were replicating versus the number which were not.
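The final comparison and aggregation step described above can be sketched as follows. This is a hedged illustration, not the library's implementation: the function percent_replicating_from_scores and its input layout (a list of replicate-median and null-medians pairs) are hypothetical, the null values are made up, and the cutoff default mirrors the percentile_cutoff behaviour described in the parameter list.

```python
import numpy as np


def percent_replicating_from_scores(
    scored_groups, percentile_cutoff=None, higher_is_better=True
):
    """scored_groups: list of (replicate_median, null_medians) pairs."""
    if percentile_cutoff is None:
        # 95 when higher is better, 5 when lower is better.
        percentile_cutoff = 95 if higher_is_better else 5
    n_pass = 0
    for replicate_median, null_medians in scored_groups:
        threshold = np.percentile(null_medians, percentile_cutoff)
        if higher_is_better:
            n_pass += int(replicate_median > threshold)
        else:
            n_pass += int(replicate_median < threshold)
    return 100.0 * n_pass / len(scored_groups)


null = np.linspace(-0.2, 0.2, 1000)  # made-up null medians centred on zero
scored = [(0.8, null), (0.5, null), (0.05, null), (-0.1, null)]
pct = percent_replicating_from_scores(scored)
pct_lower = percent_replicating_from_scores(scored, higher_is_better=False)
```

Here two of the four groups exceed the 95th percentile of their null distribution, so pct is 50.0. Treating the same scores as distances, none falls below the 5th percentile, so pct_lower is 0.0.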

As the calculation is demanding, the function makes use of parallel calculation of the null distribution.

Parameters:

ds (Union[Dataset, Phenonaut, pd.DataFrame]) – Input data in the form of a Phenonaut dataset, a Phenonaut object, or a pd.DataFrame containing profiles of perturbations. If a Phenonaut object is supplied, then the last added Dataset will be used.
perturbation_column (Optional[str]) – In the standard % replicating calculation, compounds are matched by name (or identifier) and dose, although this can be relaxed. This argument sets the name of the column containing an identifier for the perturbation, usually the name of a compound or similar. If a Phenonaut object or Dataset is supplied as the ds argument and this perturbation_column argument is None, then an attempt is made to discover the value from the perturbation_column property of the Dataset. In the case of the CMAP_Level4 PackagedDataset, a standard run would be achieved by passing ‘pert_iname’ as an argument here, or disregarding the value found in this argument by providing a dataset with the perturbation_column property already set. By default None.
replicate_query (Optional[str]=None) – Optional pandas query to apply in selection of the matching replicates. This may be something like ensuring concentration is above a threshold, or that replicates are taken from certain timepoints. Please note, if information in rows is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running percent_replicating on it. This parameter should not be used to restrict the compounds on which percent replicating is run, as this is inefficient. Instead, restrict_evaluation_query should be used.
replicate_criteria (Optional[Union[str, list[str]]]=None) – As noted above describing the impact of the perturbation_column argument, matching compounds are often defined by their perturbation name/identifier and dose. Whilst the perturbation column matches the compound name/identifier (something which must always be matched), passing a string here containing the title of a dose column (for example) also enforces matching on this property. A list of strings may also be passed. In the case of the PackagedDataset CMAP_Level4, this argument would take the value “pert_idose_uM” and would ensure that matched replicates share a common identifier/name (as default and enforced by perturbation_column) and concentration. The original perturbation_column may be included here but has no effect. By default None.
replicate_criteria_not (Optional[Union[str, list[str]]]=None) – Values in this list enforce that matching replicates do NOT share a common property. This is useful for exotic evaluations, like picking replicates from across cell lines and concentrations.
null_query_or_df (Optional[Union[str, pd.DataFrame]]=None) – Optional pandas query to apply in selection of the non-matching replicates comprising the null distribution. This can be things like ensuring only certain plates or cell lines are used in construction of the distribution. Alternatively, a pd.DataFrame may be supplied here, from which the non-matching compounds are drawn for creation of the null distribution. Note: if supplying a query to filter out information in rows that is never going to be included in matching or null distributions, then it is more efficient to prefilter the dataframe before running percent_replicating on it. This argument does not override null_criteria or null_criteria_not, as these take effect after this argument's effects have been applied. Has no effect if None. By default None.
null_criteria (Optional[Union[str, list[str]]]) – Whilst matching compounds are often defined by perturbation and dose, compounds comprising the null distribution must sometimes match the well position of the original compound. This argument captures the column name defining properties of non-matching replicates which must match the original query compound. In the case of the CMAP_Level4 PackagedDataset, this argument would take the value “well”. A list of strings may also be passed to enforce further fields within the matching distribution which must match in the null distribution. If None, then no requirements apart from a different name/compound identifier are enforced. By default None.
null_criteria_not (Optional[Union[str, list[str]]]) – Values in this list enforce that matching non-replicates do NOT share a common property with the chosen replicates. The opposite of the above described null_criteria, this allows exotic evaluations like picking replicates from different cell lines to the matching replicates. Has no effect if None. By default None.
restrict_evaluation_query (Optional[str], optional) – If only a few compounds in a Phenonaut Dataset are to be included in the percent replicating calculation, then this parameter may be used to efficiently select only the required compounds using a standard pandas style query which is run on groups defined by replicate_criteria. Excluded compounds are not removed from the percent replicating calculation. If None, then has no effect. By default None.
features (Optional[list[str]]) – List of features which capture the phenotypic responses to perturbations. Only required and used if a pd.DataFrame is supplied in place of the ds argument. By default None.
n_iters (int, optional) – Number of times the non-matching compound replicates should be sampled to compose the null distribution. If fewer than n_iters are available, then as many as possible are taken. By default 1000.
phenotypic_metric (Union[str, Callable, PhenotypicMetric], optional) – Callable metric, or string which is passed to pdist. This should be a distance metric; that is, lower is better, higher is worse. Note that a special case exists, whereby ‘spearman’ may be supplied here. If so, then a much faster NumPy method, np.corrcoef, is used, and the results are subtracted from 1 to turn the metric into a distance metric. By default ‘spearman’.
phenotypic_metric_higher_is_better (bool) – If True, then a high value from the supplied phenotypic metric is better. If False, then a lower value is better (as is the case for distance metrics like Euclidean, Manhattan, etc.). Note that if lower is better, then the percentile should be changed to the other end of the distribution. For example, to keep significance at the 5% level, a metric for which higher is better uses the 95th percentile, whereas a metric for which lower is better uses the 5th percentile, in which case percentile_cutoff = 5 should be used. By default True.
min_cardinality (int) – Cardinality is the number of times a treatment is repeated (treatment with matching well, dose and any other constraints imposed). This argument sets the minimum number of treatment repeats that should be present; if there are fewer, then the group is excluded from the calculation. cytominer-eval includes all single repeat measurements, marking them as non-replicating; this behaviour is replicated by setting this argument to 2. If 2, then only compounds with 2 or more repeats are included in the calculation and have the possibility of generating a score for comparison to a null distribution and potentially passing the replicating test of being greater than the Nth percentile of the null distribution. By default 2.
max_cardinality (int) – If a dataset has thousands of matched repeats, then little is gained in finding pairwise all-to-all distances of non-matching compounds. This argument sets an upper bound, above which the repeats are shuffled and max_cardinality samples drawn to create a synthetic set of max_cardinality repeats. This is very useful when using a demanding similarity method, as it cuts the number of evaluations dramatically. By default 50.
include_cardinality_violating_compounds_in_calculation (bool) – If True, then compounds for which there are no matching replicates, or not enough as defined by min_cardinality (and are therefore deemed not replicating) are included in the final reported percent replicating statistics. If False, then they are not included as non-replicating and simply ignored as if they were not present in the dataset. By default False.
return_full_performance_df (bool) – If True, then a tuple is returned with the percent replicating score in the first position, and a pd.DataFrame containing full information on each repeat in the second position. By default False.
include_replicate_pairwise_distances_in_df (bool) – If True, then pairwise replicate distances are included in the full performance dataframe. Has no effect if return_full_performance_df is False. By default False.
performance_df_file (Optional[Union[str, Path, bool]]) – If return_full_performance_df is True and a Path or str is given as an argument to this parameter, then the performance DataFrame is written out to a CSV file using this filename. If True is passed here, then a filename will be constructed from function arguments, attempting to capture the run details. If an auto-generated file with this name exists, then an error is raised and no calculations are performed. In addition to the output CSV, a json file is also written, capturing the arguments that the function was called with. So if ‘pr_results.csv’ is passed here, then a file named pr_results.json will be written out. If a filename is autogenerated, then the autogenerated filename is adapted to have the ‘.json’ file extension. If the argument does not end in ‘.csv’, then .json is appended to the end of the filename to define the name of the json file. By default None.
additional_captured_params (Optional[dict]) – If writing out full details, also include this dictionary in the output json file, useful to add metadata to runs. By default None.
similarity_metric_name (Optional[str]) – If relying on the function to make a nice performance CSV file name, then a nice succinct similarity metric name may be passed here, rather than relying upon calling __repr__ on the function, which may return long names such as: ‘bound method Phenotypic_Metric.similarity of Manhattan’. By default None.
percentile_cutoff (Optional[int]) – Percentile of the null distribution over which the matching replicates must score to be considered replicating. Should range from 0 to 100. Normally, this can be 95 (when using a similarity metric where higher is better), but if using a metric where lower is better, then it should be set to 5. To make things easier, this parameter defaults to None, in which case it takes the value 95 if phenotypic_metric_higher_is_better==True, and 5 if phenotypic_metric_higher_is_better==False. By default None.
parallel (bool) – If True, then use multiprocessing to parallelise evaluation of compounds. By default True.
n_jobs (int, optional) – The n_jobs argument is passed to multiprocessing for parallel execution and defines the number of threads to use. A value of -1 denotes that the system should determine how many jobs to run. By default None.
random_state (Union[int, np.random.Generator]) – Random state which should be used when performing sampling operations. Can be a np.random.Generator, or an int (in which case a np.random.Generator is instantiated with it). If attempting to obtain reproducible results, run without parallelisation by setting the parallel argument to False. By default 42.
quiet (bool) – If True, then do not display a progress bar. By default False.
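The interplay of max_cardinality and random_state described in the parameter list can be sketched as below. This is an assumption-laden illustration rather than the library's code: cap_cardinality is a hypothetical helper and the repeat data is made up. It relies on np.random.default_rng accepting either an int seed or an existing Generator.

```python
import numpy as np


def cap_cardinality(profiles, max_cardinality=50, random_state=42):
    """Shuffle and keep at most max_cardinality rows."""
    # default_rng accepts an int seed or an existing np.random.Generator.
    rng = np.random.default_rng(random_state)
    if len(profiles) <= max_cardinality:
        return profiles  # already within the bound, nothing to subsample
    keep = rng.permutation(len(profiles))[:max_cardinality]
    return profiles[keep]


many = np.arange(2000, dtype=float).reshape(1000, 2)  # 1000 made-up repeats
capped_a = cap_cardinality(many, max_cardinality=50, random_state=42)
capped_b = cap_cardinality(many, max_cardinality=50, random_state=42)
few = cap_cardinality(many[:10], max_cardinality=50)
from_generator = cap_cardinality(many, 50, np.random.default_rng(0))
```

Using the same integer seed gives the same synthetic set of repeats on each call in this single-process sketch, which is the kind of reproducibility the random_state description refers to when parallel is False.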

Returns:

If return_full_performance_df is False, then only the percent replicating is returned. If True, then a tuple is returned, with percent replicating in the first position, and a pd.DataFrame in the second position containing the median repeat scores, as well as median null distribution scores in an easy to analyse format.

Return type:

Union[float, tuple[float, pd.DataFrame]]