phenonaut.integration package
Subpackages
Submodules
phenonaut.integration.integration_control module
Module contents
- phenonaut.integration.concatenate_datasets_horizontally(datasets: list[Dataset], merge_field: str | list[str] | None = None, how: str = 'PerfectMatch', n_random_right: int = 4, random_state: int | Generator = 7, quiet: bool = True)
Concatenate datasets horizontally
Useful for merging two or more datasets and expanding their features by a factor of the number of datasets. This function expands columns, not rows.
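The how argument selects one of the concatenation methods listed below. Not all of them are necessarily implemented; requesting one that is not raises NotImplementedError.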
- ‘EnumerateAll’
For each treatment, all data combinations are enumerated for the left and right datasets (iterating through the list as successive left and right pairs when concatenating more than 2 datasets). This massively increases the number of samples in the final dataset: merging treatments from 2 datasets where the replicate cardinality is 4 in both gives the new dataset a replicate cardinality of 16 (4x4). Note that all non-essential columns in the dataframe are removed when using this approach. This is a design decision taken to control spiraling memory requirements when performing multiple concatenations.
- ‘EnumerateThenMatchCardinality’
For each treatment, all data combinations are enumerated for the left and right datasets (iterating through the list as successive left and right pairs when concatenating more than 2 datasets). A sample is then taken from these combinations to bring the cardinality down to the number of samples in the left dataset's treatment group. Warning: this downsampling smooths the dataset, removing outliers, and may result in artificially high benchmark scores.
- ‘PerfectMatch’
For each treatment group on the left, match with the treatment group on the right. If group cardinalities are not the same, then drop samples until they match.
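The three strategies described above can be sketched in plain Python to make the resulting replicate cardinalities concrete. This is an illustrative toy only, not phenonaut's implementation: the function names, the dict-of-lists data layout, and the choice to drop trailing replicates in the PerfectMatch case are all assumptions made for the example.

```python
import itertools
import random

# Hypothetical toy data: treatment -> list of replicate feature rows.
left = {
    "drugA": [[1.0], [2.0], [3.0], [4.0]],
    "drugB": [[5.0], [6.0]],
}
right = {
    "drugA": [[10.0], [20.0], [30.0], [40.0]],
    "drugB": [[50.0], [60.0], [70.0]],
}


def enumerate_all(left, right):
    """Pair every left replicate with every right replicate, per treatment."""
    return {
        treatment: [
            lrow + rrow
            for lrow, rrow in itertools.product(left[treatment], right[treatment])
        ]
        for treatment in left
        if treatment in right
    }


def enumerate_then_match_cardinality(left, right, seed=7):
    """Enumerate all pairs, then sample down to the left group's size."""
    rng = random.Random(seed)
    combined = enumerate_all(left, right)
    return {
        treatment: rng.sample(rows, len(left[treatment]))
        for treatment, rows in combined.items()
    }


def perfect_match(left, right):
    """Pair replicates one-to-one; zip drops extras from the larger group.

    Assumption: trailing replicates are the ones dropped.
    """
    return {
        treatment: [
            lrow + rrow for lrow, rrow in zip(left[treatment], right[treatment])
        ]
        for treatment in left
        if treatment in right
    }


for name, result in [
    ("EnumerateAll", enumerate_all(left, right)),
    ("EnumerateThenMatchCardinality", enumerate_then_match_cardinality(left, right)),
    ("PerfectMatch", perfect_match(left, right)),
]:
    print(name, {treatment: len(rows) for treatment, rows in result.items()})
```

In this toy, each output row carries the left features followed by the right features, so the column count grows while the row count per treatment depends on the strategy chosen.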
- Parameters:
datasets (list[phenonaut.data.Dataset]) – List of phenonaut Datasets which should be merged
merge_field (str | list[str] | None) – The column name which is used to match treatments. If None, then this is taken to be the perturbation_column from the first dataset, by default None
how (str, optional) – Concatenation method, see function description above, by default ‘PerfectMatch’
n_random_right (int, optional) – Number of sampled treatments to merge, see function description above, by default 4
random_state (int | np.random.Generator, optional) – Random state used for sampling, given either as an int seed or as a np.random.Generator, by default 7
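A parameter that accepts either an integer seed or an existing generator is commonly normalised into a single generator object before any sampling takes place. The sketch below shows that pattern using the standard library's random.Random as a stand-in for np.random.Generator; the helper name as_rng is hypothetical and is not part of phenonaut.

```python
import random


def as_rng(random_state=7):
    """Return a random.Random built from an int seed, or pass one through."""
    if isinstance(random_state, random.Random):
        # Caller supplied a generator: reuse it so its state keeps advancing.
        return random_state
    # Caller supplied a seed: build a fresh, reproducible generator.
    return random.Random(random_state)


rows = list(range(16))

# The same integer seed gives the same sample on every call.
first = as_rng(7).sample(rows, 4)
second = as_rng(7).sample(rows, 4)
print(first == second)

# A shared generator is returned unchanged, so successive draws
# continue from its current state.
shared = random.Random(7)
print(as_rng(shared) is shared)
```

Passing an int is the simpler choice when a single call must be reproducible; passing a generator is useful when several sampling steps should draw from one stream.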
- Returns:
Dataset made of horizontally concatenated datasets
- Return type:
phenonaut.data.Dataset
- Raises:
NotImplementedError – Requested concatenation method is not yet implemented