phenonaut.integration package

Subpackages

Submodules

phenonaut.integration.integration_control module

Module contents

phenonaut.integration.concatenate_datasets_horizontally(datasets: list[Dataset], merge_field: str | list[str] | None = None, how: str = 'PerfectMatch', n_random_right: int = 4, random_state: int | Generator = 7, quiet: bool = True)

Concatenate datasets horizontally

Useful for merging two or more datasets and expanding their features by a factor of the number of datasets. This function expands columns, not rows.

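The how argument selects the concatenation method. Three methods are described below; requesting a method that is not implemented raises NotImplementedError: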

  • ‘EnumerateAll’
    • For each treatment, all data combinations are enumerated for the left and right datasets (iterating through the list using left and right if concatenating more than 2 datasets). This massively increases the number of samples in the final dataset: merging treatments from 2 datasets where replicate cardinalities are 4 in both results in the new dataset having a replicate cardinality of 16 (4x4). Note that all non-essential columns in the dataframe are removed when using this approach. This is a design decision taken to control spiraling memory requirements when performing multiple concatenations.

  • ‘EnumerateThenMatchCardinality’
    • For each treatment, all data combinations are enumerated for the left and right datasets (iterating through the list using left and right if concatenating more than 2 datasets). Then, a sample is taken from all of these combinations to bring the cardinality down to that of the treatment groups in the left dataset. Warning: this downsampling smooths the dataset, removing outliers, and may result in artificially high benchmark scores.

  • ‘PerfectMatch’
    • For each treatment group on the left, match with the treatment group on the right. If group cardinalities are not the same, then drop samples until they match.
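
The three strategies above can be illustrated with plain pandas. This is a minimal sketch and not phenonaut's implementation: the toy frames, the column names (treatment, f_left, f_right) and the helper functions are hypothetical, and the choice to drop trailing replicates in the PerfectMatch sketch is an assumption, since the description does not say which samples are dropped.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one treatment with 4 replicates on each side.
left = pd.DataFrame({"treatment": ["A"] * 4, "f_left": [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({"treatment": ["A"] * 4, "f_right": [10.0, 20.0, 30.0, 40.0]})


def enumerate_all(left, right, on):
    # A many-to-many merge on the treatment column pairs every left
    # replicate with every right replicate of the same treatment.
    return left.merge(right, on=on, how="inner")


def enumerate_then_match_cardinality(left, right, on, rng):
    # Enumerate all pairs, then sample each treatment group back down to
    # the number of replicates that treatment has in the left frame.
    combined = enumerate_all(left, right, on)
    left_sizes = left.groupby(on).size()
    parts = []
    for key, group in combined.groupby(on):
        parts.append(group.sample(n=int(left_sizes[key]), random_state=rng))
    return pd.concat(parts, ignore_index=True)


def perfect_match(left, right, on):
    # Number the replicates within each treatment, then join on
    # (treatment, replicate number). Surplus rows on either side find no
    # partner and are dropped (assumption: trailing replicates go first).
    left_n = left.assign(_rep=left.groupby(on).cumcount())
    right_n = right.assign(_rep=right.groupby(on).cumcount())
    merged = left_n.merge(right_n, on=[on, "_rep"], how="inner")
    return merged.drop(columns="_rep")


rng = np.random.default_rng(7)
all_pairs = enumerate_all(left, right, "treatment")
sampled = enumerate_then_match_cardinality(left, right, "treatment", rng)
# Only 3 right-hand replicates here, so one left replicate is dropped.
matched = perfect_match(left, right.iloc[:3], "treatment")
print(len(all_pairs), len(sampled), len(matched))
```

With 4 replicates on each side, the enumerated frame has 16 rows, the sampled frame is brought back to 4, and the matched frame keeps 3 rows because the right-hand side was truncated to 3 replicates.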

Parameters:
  • datasets (list[phenonaut.data.Dataset]) – List of phenonaut Datasets which should be merged

  • merge_field (str | list[str] | None) – The column name (or list of column names) used to match treatments. If None, the perturbation_column of the first dataset is used, by default None

  • how (str, optional) – Concatenation method, see function description above, by default ‘PerfectMatch’

  • n_random_right (int, optional) – Number of sampled treatments to merge, see function description above, by default 4

  • random_state (int | np.random.Generator, optional) – Random state used for sampling, given as either an int seed or a np.random.Generator, by default 7

Returns:

Dataset made of horizontally concatenated datasets

Return type:

phenonaut.data.Dataset

Raises:

NotImplementedError – Requested concatenation method is not yet implemented
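
The random_state parameter accepts either an int or a np.random.Generator. One common way to support both is to pass the value through np.random.default_rng, as in the sketch below. The helper name as_generator is hypothetical, and whether phenonaut normalises the argument this way internally is an assumption.

```python
import numpy as np


def as_generator(random_state=7):
    # np.random.default_rng seeds a new Generator from an int and
    # returns an existing Generator unchanged.
    return np.random.default_rng(random_state)


# An existing Generator is passed through as the same object.
existing = np.random.default_rng(7)
same_object = as_generator(existing) is existing

# The same int seed yields the same draws, so sampling is reproducible.
first_a = as_generator(7).integers(0, 100, size=3)
first_b = as_generator(7).integers(0, 100, size=3)
reproducible = bool((first_a == first_b).all())
```

Passing an int gives reproducible sampling across calls, while passing a shared Generator lets the caller advance a single random stream across several operations.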