phenonaut.data package

Submodules

phenonaut.data.dataset module

class phenonaut.data.dataset.Dataset(dataset_name: str, input_file_path_or_df: Path | str | DataFrame | None = None, metadata: dict | Path | str = {}, kind: str | None = None, init_hash: str | bytes | None = None, h5_key: str | None = None, features: list[str] | None = None)

Bases: object

Dataset constructor

Dataset holds source agnostic datasets, read in using hints from a user configurable YAML file describing the input CSV file format and indicating key columns.

Parameters:

dataset_name (str) – Dataset name
input_file_path (Union[Path, str, pd.DataFrame]) – Location of the input CSV/TSV/H5 file to be read, or a pd.DataFrame to use. If None, then an empty DataFrame object is returned. As well as CSV/TSV files, the location of an h5 file may also be given which contains a pandas dataframe. If h5 is given, then it is expected that a h5_key argument be passed.
metadata (Union[dict, Path, str]) – Dictionary or path to yml file describing CSV file format and key columns. a ‘sep’ key:value pair may be supplied, but if absent, then the file is examined and if a TAB character is found present in the first line of the file, then it is assumed that the TAB character should be used to delimit values. This check is not performed if a ‘sep’ key is found in metadata, allowing a simple way to override this check. By default {}.
kind (Optional[str]) – Instead of providing metadata, some presets are available, which make reading in things like DRUG-Seq easier, without the need to explicitly set all required transforms. If used as well as metadata, then the preset metadata dictionary from the kind argument is first loaded, then updated with anything in the metadata dictionary, this therefore allows overriding specific presets present in kind dictionaries. Available ‘kind’ dictionaries may be listed by examining: phenonaut.data.recipes.recipes.keys()
init_hash (Optional[Union[str, bytes]]) – Cryptographic hashing within Phenonaut Datasets can be initialised with a starting/seed hash. This is useful in the creation of blockchain-like chains of hashes. In environments where timestamping is unavailable, hashes may be published and then used as input to subsequent experiments. Building up a provable chain along the way. By default None, implying an empty bytes array.
h5_key (Optional[str]) – If input_file_path is an h5 file, then a key to access the target DataFrame must be supplied.
features (Optional[list[str]]) – Features may be supplied here which are then added to the metadata dict if supplied.

Raises:

FileNotFoundError – Input CSV file not found
DataError – Metadata could not be used to parse input CSV

add_well_id(numerical_column_name: str = 'COLUMN', numerical_row_name: str = 'ROW', plate_type: int = 384, new_well_column_name: str = 'Well', add_empty_wells: bool = False, plate_barcode_column: str | None = None, no_sort: bool = False)

Add standard well IDs - such as A1, A2, etc.

If a dataset contains numerical row and column names, then they may be translated into standard letter-number well IDs.

Parameters:

numerical_column_name (str, optional) – Name of column containing numeric column number, by default “COLUMN”.
numerical_row_name (str, optional) – Name of column containing numeric column number, by default “ROW”.
plate_type (int, optional) – Plate type - note, at present, only 384 well plate format is supported, by default 384.
new_well_column_name (str, optional) – Name of new column containing letter-number well ID, by default “Well”.
add_empty_wells (bool, optional) – Should all wells from a plate be inserted, even when missing from the data, by default False.
plate_barcode_column (str, optional) – Multiple plates may be in a dataset, this column contains their unique ID, by default None.
no_sort (bool, optional) – Do not resort the dataset by well ID, by default False.

copy(name: str | None = None)

Return a deep copy of the Dataset object

Parameters:: name (str | None) – Name to be given to the new dataset, if None, then the name is taken from the dataset being copied. By default None
Returns:: Copy of the input object.
Return type:: PhenonautData

property data

Return self.df[self.features]

Returns:: DataFrame containing only features and index
Return type:: pd.DataFrame

df_to_csv(output_path: Path | str, **kwargs)

Write DataFrame to CSV

Convenience function to write the underlying DataFrame to a CSV. Additional arguments will be passed to the Pandas.DataFrame.to_csv function.

Parameters:: output_path (Union[Path, str]) – Target output file

df_to_multiple_csvs(split_by_column: str, output_dir: str | Path | None = None, file_prefix: str = '', file_suffix='', file_extension='.csv', **kwargs)

Wite multiple CSV files from a dataset DataFrame.

In the case where one output CSV is required per plate, then splitting the underlying DataFrame on something like a PlateID serves the purpose of generating one output CSV file per plate. This can be achieved with this function and providing the column to split on.

Parameters:

split_by_column (str) – Column containing unique values within a split output CSV file
output_dir (Optional[Union[str, Path]], optional) – Target output directory for split CSV files, by default None
file_prefix (str, optional) – Prefix for split CSV files, by default “”
file_suffix (str, optional) – Suffix for split CSV files, by default “”
file_extension (str, optional) – File extension for split CSV files, by default “.csv”

distance_df(candidate_dataset: Dataset, metric: str | Callable = 'euclidean', return_best_n_indexes_and_score: int | None = None, lower_is_better=True) → DataFrame

Generate a distance DataFrame

Distance DataFrames allow simple generation of pd.DataFrames where the index take the form of perturbations and the columns other perturbations. The values at the intersections are therfore the distances between these perturbations in feature space. Many different metrics both inbuilt and custom/user defined may be used.

Parameters:

candidate_dataset (Dataset) – The dataset to which the query (this) should be compared.
metric (Union[str, Callable], optional) – Metric which should be used for the distance calculation. May be a simple string understood by scipy.spatial.distance.cdist, or a callable, like a function or lambda accepting two vectors representing query and candidate features. By default “euclidean”.
return_best_n_indexes_and_score (Optional[int], optional) – If an integer is given, then just that number of best pairs/measures are returned. By default None.
lower_is_better (bool, optional) – If using the above ‘return_best_n_indexes_and_score’ then it needs to be flagged if lower is better (default), or higher is better. By default True

Returns:

Returns a distance Dataframe, unless ‘return_best_n_indexes_and_score’ is an int, in which case a list of the top scoring pairs are returned in the form of a nested tuple: ((from, to), score)

Return type:

Union [pd.DataFrame, tuple(tuple(int, int), float)]

Raises:

ValueError – Error raised if this Dataset and the given candidate Dataset do not share common features.

divide_mean(query: str) → None

Divide dataset features by the mean of rows identified in the query

Useful function for normalising to controls.

Parameters:: query (str) – Pandas style query to retrieve rows from which means are calculated.

divide_median(query: str) → None

Divide dataset features by the median of rows identified in the query

Useful function for normalising to controls.

Parameters:: query (str) – Pandas style query to retrieve rows from which medians are calculated.

drop_absent_features() → None: Remove features if they are not present in the internal DataFrame

drop_columns(column_labels: str | list[str] | tuple[str], reason: str | None = None) → None

Drop columns inplace, update features if needed and set new history.

Intelligently drop columns from the dataset (inplace). If any of those columns were listed as features, then remove them from the features list and set a new new history. Updating features and new history only happens if it needs to (removed column was a feature). Updaing of features will cause hash update.

Parameters:

column_labels (Union[str, list[str], tuple[str]]) – List of column labels which should be removed. Can also be a str to remove just one column.
reason (Optional[str]) – A reason may be given for dropping the column. If not None and the column was a feature, then this reason is recorded along with the history. If None, and the column was a feature, then the history entry will state: “Droped columns ({column_labels})” Where column_labels contains the dropped columns. If reason is not None and the column is a feature, then the history entry will state: “Droped columns ({column_labels}), reason:{reason}” where {reason} is the given reason. Has no effect if the dropped column is not a feature, or the list of dropped columns do not contain a feature. By default None.

drop_nans_with_cutoff(axis: int | None = None, nan_cutoff: float = 0.1) → None

Drop rows or columns containing NaN or Inf values above a specified cutoff percentage.

Parameters:

axis: Optional[int], default=None: Axis along which to drop NaN or Inf values. If None, both rows and columns are dropped.
nan_cutoff: float, default=0.1: Cutoff percentage for NaN or Inf values. Rows or columns with NaN or Inf percentages greater than this value will be dropped.

drop_rows(row_indices: Index) → None

Drop rows inplace given a set of indices.

Intelligently drop rows from the dataset (inplace). Updating of rows will not cause hash update as features are unchanged.

Parameters:: row_indices (pd.Index) – List of row indexes which should be removed. Can also be an int to remove just one row.
Raises:: KeyError – Error raised if the index is missing from dataframe index:

property features

Return current dataset features

Returns:: List of strings containing current features
Return type:: list

filter_columns(column_names: list, keep=True, regex=False)

Filter dataframe columns

Parameters:

column_names (list) – Column names
keep (bool, optional) – Keep columns listed in column_names, if false, then the opposite happens and these columns are removed, by default True.

filter_columns_with_prefix(column_prefix: str | list, keep: bool = False)

Filter columns based on prefix

Parameters:

column_prefix (Union[str, list]) – Prefix for columns as a string, or alternatively, a list of string prefixes
keep (bool, optional) – If true, only columns matching the prefix are kept, if false, these columns are removed, by default False

filter_inplace(query: str) → None

Apply a pandas query style filter, keeping all that pass

Parameters:: query (str) – Pandas style query which when applied to rows, will keep all those which return True.

Filter dataframe rows on identifiers

When training and evaluating AI/ML models, it is useful to define training validation and test sets, as well as folds and filter datasets to include only those rows which match some criteria. For example, in a set of splits defined on compound MOAs, where each row has an associated Metadata_moa field, we can use a dictionary to define a set of test splits. The dictionary may be stored in a JSON file which will be read if filter_data is an instance of the pathlib.Path object. If a dictionary is given, then the dict_path can be used to traverse the dictionary and find the appropriate list of identifiers. Multiple dictionary levels can be traversed by separating them with forwardslash characters, such as ‘train_val/2/train’. A list of filtering identifiers may also be passed as an alternative to a dictionary.

Parameters:

filter_field (str) – Dataset DataFrame field which contains the identifiers which must match those in filter_data
filter_data (list[str] | dict | Path) – Identifiers for use in the filtering of rows. Can be a simple list of strings, a dictionary which may then be traversed (using the keys in the dict_path argument) to arive at a list of identifiers, or if a pathlib.Path object, then JSON.load is used to read the file, which can contain a list or dictionary.
dict_path (str | None, optional) – _description_, by default None
inplace (bool) – Perform the action in place. If True, then None is returned from the function and the underlying dataset is modified. If False, then the underlying dataset is untouched and a copy with the requested filtering returned, by default True.
additional_items (list[str] | None) – List of additional items to add to the filter_data list. This is useful when for example, DMSO is required to be in each dataset for downstream processing but is absent from splits/test sets. Here, DMSO can be added by passing it as a list or string using this parameter. If None, then nothing is added, by default None

filter_rows(query_column: str, values: list | str, keep: bool = True)

Filter dataframe rows

Parameters:

query_column (str) – Column name which is being filtered on
values (Union[list, str]) – List or string of values to be filtered on
keep (bool, optional) – If true, then only rows containing listed values in query column are kept. If this argument is false, then the opposite occurs, and the rows matching are discarded, by default True

get_df_features_perturbation_column(quiet: bool = False) → tuple[DataFrame, list[str], str | None]

Helper function to obtain DataFrame, features and perturbation column name.

Some Phenonaut functions allow passing of a Phenonaut object, or DataSet. They then access the underlying pd.DataFrame for calculations. This helper function is present on Phenonaut objects and Dataset objects, allowing more concise code and less replication when obtaining the underlying data.

Parameters:: quiet (bool) – When checking if perturbation is set, check without inducing a warning if it is None.
Returns:: Tuple containing the Dataframe, a list of features and the perturbation column name.
Return type:: tuple[pd.DataFrame, list[str], str]

get_ds_from_query(name: str, query: str)

Make a new Dataset object from a pandas style query.

Parameters:

name (str) – Name of new dataset
query (str) – Pandas style query from which all rows returning true will be included into the new PhenonautGenericData set object.

Returns:

New dataset created from query

Return type:

Dataset

get_feature_ranges(pad_by_percent: float | int) → tuple

Get the ranges of feature columns

Parameters:: pad_by_percent (Union[float, int]) – Optionally, pad the ranges by a percent value
Returns:: Returns tuple of tuples with shape (features, 2), for example, with two features it would return: ((min_f1, max_f1), (min_f2, max_f2))
Return type:: tuple

get_history() → list[TransformationHistory]

Get dataset history

Returns:: List of TransformationHistory (named tuples) which contain a list of features as the first element and then a plain text description of what was applied to arrive at those features as the second element.
Return type:: list[TransformationHistory]

get_non_feature_columns() → list

Get columns which are not features

Returns:: Returns list of Dataset columns which are not currently features.
Return type:: list[str]

get_unique_perturbations() → list[str]

Return a list of unique perturbations

Return a uniquified list of perturbations from the perturbation_colum field

Returns:: _description_
Return type:: list[str]

get_unique_treatments() → list[str]

Return a list of unique perturbations/treatments

Return a uniquified list of perturbations from the perturbation_colum field. Acts as a shortcut/alternative to get_unique_perturbations

Returns:: List of unique treatments/perturbations
Return type:: list[str]

groupby(by: str | List[str])

Returns multiple new Dataset objects by splitting on columns

Akin to performing groupby on a pd.DataFrame, split a dataset on one or many columns and return a list of Phenonaut Datasets containing the information contained within each unique split.

Parameters:: by (Union[str, list[str]]) – If a string, then this is used as a column name upon which to group the dataset and return unique classes based on this column. A list of strings is also allowed, enabling grouping of datasets by multiple columns, such as [‘timepoint’, ‘concentration’]
Returns:: A list of new phenonaut.data.Dataset objects split on the value(s) of the by argument
Return type:: List[phenonaut.data.Dataset]

property history

Get dataset history

Returns the same as calling .get_history on the dataset

Returns:: List of TransformationHistory (named tuples) which contain a list of features as the first element and then a plain text description of what was applied to arrive at those features as the second element.
Return type:: list[TransformationHistory]

impute_nans(groupby_col: str | list[str] | None = None, impute_fn: Callable | str | None = 'median') → None

Impute missing values in the DataFrame.

Parameters:

groupby_col: str or list of str, default=None: The name(s) of the column(s) to group by when imputing missing values. If None, impute missing values across the entire DataFrame.
impute_fn: Union[Callable, str, None]: The callable to use for imputing missing values on the DataFrame or grouped DataFrame as defined by the groupby_col. Special cases exist for ‘median’ and ‘mean’, whereby pd.median and pd.mean are applied. If None, then no action is taken. By default ‘median’.

new_aggregated_dataset(identifier_columns: list[str], new_dataset_name: str = 'Merged rows dataset', transformation_lookup: dict[str, Callable | str] | None = None, tranformation_lookup_default_value: str | Callable = 'mean')

Merge dataset rows and make a new dataframe

If we have a pd.DataFrame containing data derived from 2 fields of view from a microscopy image, a sensible approach is averaging features. If we have the DataFrame below, we may merge FOV 1 and FOV 2, taking the mean of all features. As strings such as filenames should be kept, they are concatenated together, separated by a comma, unless the strings are the same, in which case just one is used.

Here we test a df as follows:

ROW	COLUMN	BARCODE	feat_1	feat_2	feat_3	filename	FOV
1	1	Plate1	1.2	1.2	1.3	fileA.png	1
1	1	Plate1	1.3	1.4	1.5	FileB.png	2
1	1	Plate2	5.2	5.1	5	FileC.png	1
1	1	Plate2	6.2	6.1	6.8	FileD.png	2
1	2	Plate1	0.1	0.2	0.3	fileE.png	1
1	2	Plate1	0.2	0.2	0.38	FileF.png	2

Merging produces:

ROW	COLUMN	BARCODE	feat_1	feat_2	feat_3	filename	FOV
1	1	Plate1	1.25	1.3	1.40	fileA.png,FileB.png	1.5
1	1	Plate2	5.70	5.6	5.90	FileC.png,FileD.png	1.5
1	2	Plate1	0.15	0.2	0.34	FileF.png,fileE.png	1.5

Note that the FOV column has also been averaged.

Parameters:

identifier_columns (list[str]) – If a biochemical assay evaluated through imaging is identified by a row, column, and barcode (for the plate) but multiple images taken from a well, then these multiple fields of view can be merged, creating averaged features.
new_dataset_name (str, optional) – Name for the new Dataset, by default “Merged rows dataset”
transformation_lookup (dict[str,Union[Callable, str]]) – Dictionary mapping data types to aggregations. When None, it is as if the dictionary: {np.dtype(“O”): lambda x: “,”.join([f”{item}” for item in set(x)])} was provided, concatenating strings together (separated by a comma) if they are different and just using one if they are the same across rows. If a type not present in the dictionary is encountered (such as int, or float) in the above example, then the default specified by transformation_lookup_default_value is returned. By default, None.
tranformation_lookup_default_value (Union[str, Callable]) – Transformation to apply if the data type is not found in the transformation_lookup_dictionary, can be a callable or string to pandas defined string to function shortcut mappings. By default “mean”.

Returns:

Dataset with samples merged.

Return type:

Dataset

property num_features: int

Return number of features

Returns:: The number of features in the dataset
Return type:: int

property pcol

Return the name of the treatment column (short form of .perturbation_column)

A treatement is an identifier relating to the peturbation. In many cases, it is the unique compound name or identifier. Many replicates may be present, with identifiers like ‘DMSO’ etc.

Returns:: Column name of dataframe containing the treatment.
Return type:: String

property perturbation_column

Return the name of the treatment column

A treatement is an identifier relating to the peturbation. In many cases, it is the unique compound name or identifier. Many replicates may be present, with identifiers like ‘DMSO’ etc.

Returns:: Column name of dataframe containing the treatment.
Return type:: String

pivot(feature_names_column: str, values_column: str)

remove_blocklist_features(blocklist: Path | str | list[str], skip_first_line_in_file: bool = True, erase_data: bool = True, apply_to_non_features: bool = True, remove_prefixed: bool = True)

Remove blocklisted features/columns from a Dataset

Allows removal of predefined feature blocklists. Featurisation may generate features which are to be excluded from analysis as standard. This is the case with cellular images featurised with cell profiler. As such, there are a set of blocklist feautures which are often applied. This function allows specification of a list of features for removal (in the form of a list), or a string or path object denoting the location of a file containing this information. A special string may also be passed to this function: “CellProfiler”, which instructs Phenonaut to download the standard blocklist located here: https://figshare.com/ndownloader/files/23661539. Whilst matching features are removed, by default features which have a prefix on a blocklist matched feature are also removed. See parameters.

Note: matching columns which are not features are also removed by default, see parameters.

Parameters:

blocklist (Union[Path, str, list[str]]) – A str or Path directing Phenonaut to where a text file of blocklisted features is stored. Alternatively, a list of blocklisted features may be supplied. A special value is also accepted, whereby a string of “CellProfiler” is passed in, causing Phenonaut to retrieve the commonly used CellProfiler blocklisted features from https://figshare.com/ndownloader/files/23661539 .
skip_first_line_in_file (bool, optional) – Commonly, blocklist files have a title line, which can be ignored before starting to list features. If True, then the first line is ignored. By default True.
erase_data (bool) – If False, then no removal of columns from the Dataset is performed, only ensuring that no features are set which match the blocklist. This means that blocklist columns could persist in the Dataset as non-features. If True, then features are removed, and matching columns deleted. If False, apply_to_non_features has no effect. By default, True.
apply_to_non_features (bool) – If True, then apply the filtering to columns as well as features. By default True.
removed_prefixed (bool) – If True, features/columns may still be matched with blocklist features if they have a prefix followed by an underscore character. This allows transformations to be performed and features still removed. For example, applying the RobustMAD trasform prefixes features with ‘RobustMAD_’, generating RobustMAD_FeatureA, RobustMAD_FeatureB etc. remove_blocklist_features will identify FeatureA (if in blocklist) and still remove that blocklisted feature. To deactivate this default behavior, set remove_prefixed_features to False. By default True.

Raises:

FileNotFoundError – Error raised if specified file is not found

remove_features_with_outliers(outlier_cutoff=15.0, remove_data: bool = False)

Removes feature columns containing values greater than given cutoff

By default, any feature containing a value greater than 15 is removed. This cutoff can be raised and lowered as appropriate.

Parameters:

outlier_cutoff (float, optional) – If a feature column contains a value greater than this cutoff, then the feature is removed. By default 15.
remove_data (bool, optional) – If True, then not only are feature columns with outliers removed from the Datasets list of features, but these columns are dropped from the DataFrames. If False, then only the Datasets list of features are changed. By default False.

remove_low_variance_features(freq_cutoff=0.05, unique_cutoff=0.01)

Exclude low information content features.

Adapted from pycytominer variance_threshold method https://github.com/cytomining/pycytominer/blob/master/pycytominer/operations/variance_threshold.py

Sometimes, features can vary very little, this allows definition of cutoffs (ratios) of unique values that can exist in a feature. See parameters for further description of cutoffs.

Parameters:

freq_cutoff (float, default 0.05) – Ratio as defined by 2nd most common feature value divided by the most common feature value). Must range between 0 and 1. Features below this cutoff have a large population with a unique value and will be removed.
unique_cutoff (float, default 0.01) – Remove features with little diversity in their measurements. Must range between 0 and 1. Dividing the number of unique values in a feature by the number of measurements returns a ‘unique’ ratio, values below this cutoff are removed.

rename_column(from_column_name: str, to_column_name: str)

Rename a single dataset column

Parameters:

from_column_name (str) – Name of column to rename
to_column_name (str) – New column name

rename_columns(from_to: dict)

Rename multiple columns

Parameters:: from_to (dict) – Dictionary of the form {‘old_name’:’new_name’}

replace_str(column: str | int, pat: str, repl: str)

Replace a string present in a column

Parameters:

column (Union[str, int]) – Name of the column(could be a feature), within which to search and replace instances of the string specified in the ‘pat’ argument.
pat (str) – The patter, (non-regex), just query substring to find and replace.
repl (str) – Replacement text for the substring identified in the ‘pat’ argument.

shrink(keep_prefix: str | list[str] | None = 'Metadata_')

Reduce the size of a dataset by removing unused columns from the internal DataFrame

Often datasets contain intermediate features or unused columns which can be removed. This function removes every column from a Dataset internal DataFrame that is not the perturbation_column, or has a given prefix. By default, this prefix is “Metadata_”, however this can be removed, changed, or a new list of prefixes supplied using the keep_prefix argument.

Parameters:: keep_prefix (str | list[str] | None, optional) – Prefix for columns which should be kept during shirinking of the dataset. This prefix applies to columns which are not features (which are kept automatically). Can be a list of prefixes, or None, by default “Metadata_”

split_column(column: str | int, pat: str, new_columns: list[str])

Split a column on a delimiter

If a column named ‘data’ contained:

idx	data
1	A2_CPD1_Plate1

Then calling:

split_column('data', '_', ['WellID', 'CpdID', 'PlateID'])

Would introduce the following new columns into the dataframe:

idx	WellID	CpdID	PlateID
1	A2	CPD1	Plate1

Parameters:

column (Union[str, int]) – Name of column to split, or the index of the column.
pat (str) – Pattern (non-regex), usually a delimiter to split on.
new_columns (list[str]) – List of new column names. Should be the correct size to absorb all produced splits.

Raises:

DataError – Inconsistent number of splits produced when splitting the column.
ValueError – Incorrect number of new column names given in new_columns.

subtract_func_results_on_features(query_or_perturbation_name: str, groupby: str | list[str] | None, func: Callable | str | None = 'median') → None

Subtract the result of a function applied to rows

Useful function for centering plates on DMSO or control perturbations. If called with no func, then median is taken as the required function. The median, or result of applied function to rows identified by the query string (query_or_perturbation_name parameter) are subtracted from all perturbations. The query_or_perturbation_name may also be an identifier present in the datasets perturbation column (if set). If a column name, or list of column names are given in the groupby argument, then the operation is carried out within these groups before being merged back to the original dataframe.

Parameters:

query_or_perturbation_name (str) – Pandas style query to retrieve rows from which quantities for substraction are calculated, or, if the dataset has perturbation_column set and the parameter value can be found it the perturbation column, then these samples are used and have the given function applied to them. In short, for a Dataset with perturbation_column set to “cpd_name”, then the same effect can be achied with this parameter being “DMSO” and “cpd_name==’DMSO’”.
groupby (Optional[str, list[str]]) – The name, or list of names of columns that the DataSet should be grouped by for application of the transformation on a group-by group basis. This is very useful if neededing to subtract median DMSO perturbation features on a plate-by-plate basis, whereby the column containing plateIDs would be supplied. Multiple column names may also be supplied.
func (Union[Callable, str, None]) – The callable to use in calculation of the quantity to subtract for each perturbation. Special cases exist for ‘median’ and ‘mean’ strings whereby pd.median and pd.mean are applied respectively. If None, then no action is taken. By default ‘median’.

subtract_mean(query_or_perturbation_name: str, groupby: str | list[str] | None) → None

Subtract the mean of rows identified in the query from features

Useful function for centering plates on DMSO or control perturbations. The mean of row features identified by the query string (query_or_perturbation_name parameter) are subtracted from all perturbations. If the query_or_perturbation_name may also be an identifier present in the datasets perturbation column (if set). If a column name, or list of column names aregiven in the groupby argument, then the operation is carried out within these groups before being merged back to the original dataframe.

Parameters:

query_or_perturbation_name (str) – Pandas style query to retrieve rows from which quantities for substraction are calculated, or, if the dataset has perturbation_column set and the parameter value can be found it the perturbation column, then these samples are used and have the given function applied to them. In short, for a Dataset with perturbation_column set to “cpd_name”, then the same effect can be achied with this parameter being “DMSO” and “cpd_name==’DMSO’”.
groupby (Optional[str, list[str]]) – The name, or list of names of columns that the DataSet should be grouped by for application of the transformation on a group-by group basis. This is very useful if neededing to subtract mean DMSO perturbation features on a plate-by-plate basis, whereby the column containing plateIDs would be supplied. Multiple column names may also be supplied.

subtract_median(query_or_perturbation_name: str, groupby: str | list[str] | None) → None

Subtract the median of rows identified in the query from features

Useful function for centering plates on DMSO or control perturbations. The median of row features identified by the query string (query_or_perturbation_name parameter) are subtracted from all perturbations. If the query_or_perturbation_name may also be an identifier present in the datasets perturbation column (if set). If a column name, or list of column names aregiven in the groupby argument, then the operation is carried out within these groups before being merged back to the original dataframe.

Parameters:

query_or_perturbation_name (str) – Pandas style query to retrieve rows from which quantities for substraction are calculated, or, if the dataset has perturbation_column set and the parameter value can be found it the perturbation column, then these samples are used and have the given function applied to them. In short, for a Dataset with perturbation_column set to “cpd_name”, then the same effect can be achied with this parameter being “DMSO” and “cpd_name==’DMSO’”.
groupby (Optional[str, list[str]]) – The name, or list of names of columns that the DataSet should be grouped by for application of the transformation on a group-by group basis. This is very useful if neededing to subtract median DMSO perturbation features on a plate-by-plate basis, whereby the column containing plateIDs would be supplied. Multiple column names may also be supplied.

subtract_median_perturbation(perturbation_label: str, per_column_name: str | None = None, new_features_prefix: str = 'SMP_')

Subtract the median perturbation from all features

Useful for normalisation within a well/plate format. The median feature may be identified through the per_column_name variable, and perturbation label. Newly generated features may have their prefixes controled via the new_features_prefix argument.

Parameters:

perturbation_label (str) – The perturbation label which should be used to calculate the median
per_column_name (Optional[str], optional) – The perturbation column name. This is optional and can be None, as the Dataset may already have perturbation column set. By default, None.
new_features_prefix (str) – Prefix for new features, each with the median perturbation subtracted. By default ‘SMP_’ (for subtracted median perturbation).

transpose(reset_index: bool = True, new_header_column: int | None = 0): Transpose internal DataFrame

class phenonaut.data.dataset.TransformationHistory(features, description)

Bases: tuple

description: Alias for field number 1

features: Alias for field number 0

phenonaut.data.platemap_querier module

class phenonaut.data.platemap_querier.PlatemapQuerier(platemap_directory: str | Path, platemap_csv_files: list | str | Path | None = None, plate_name_before_underscore_in_filename=True)

Bases: object

get_compound_locations(cpd, plates: str | list | None = None)

plate_to_cpd_to_well_dict = {}

platemap_files = None

phenonaut.data package

Submodules

phenonaut.data.dataset module

Parameters:

Parameters:

phenonaut.data.platemap_querier module

phenonaut.data.recipes module

Module contents