phenonaut.packaged_datasets package

Submodules

phenonaut.packaged_datasets.base module

class phenonaut.packaged_datasets.base.PackagedDataset(root: Path | str, raw_data_dir: str | Path | None = PosixPath('raw_data'), raw_data_dir_relative_to_root: bool = True, download: bool = True)

Bases: ABC

PackagedDataset base class for all downloaded Datasets

Inherited by Phenonaut classes which supply public datasets. In the same way that PyTorch allows easy access to MNIST, FashionMNIST, etc., Phenonaut offers classes which download and preprocess datasets such as TCGA (The Cancer Genome Atlas) and the Connectivity Map, which may include many different ‘views’ or omics-based measurements of the underlying cells.

Inheriting from this class gives easy access to commonly used functions for checking that datasets exist in directories and downloading them if not, along with other small helper functions. Inheriting grants the following:

  • Getters and setters for root and raw_data_dir, which properly handle the expected location of dataset files; listing of available Phenonaut Dataset objects via the .keys() or ds_keys() methods; and listing of supporting dataframes via the df_keys() method.

  • download and batch_download functions which simplify the download of remote public datasets.

  • processed_dataset_exists and raw_dataset_exists, which check for the presence of the processed dataset and the raw dataset, respectively.

Inheriting classes should do the following:

  • Call super().__init__() on initialisation

  • Check if they find a saved/processed version of the packaged dataset. The CMAP and TCGA classes which inherit from this PackagedDataset process and save the datasets in an h5 file. This is optional, and any store may be used.

  • If it does not exist, download the data, storing it in .raw_data_dir, then process the data and store it in a convenient format.

  • Register the available Phenonaut Datasets associated with this PackagedDataset. By convention, the default/main dataset should be named ‘ds’. Registration is completed by calling self.register_ds_key(‘ds_name’). Registered Phenonaut Datasets can be listed by calling self.keys() or self.ds_keys(), and accessed by calling get_ds or with the [‘ds_name’] notation on the PackagedDataset instance.

  • Register the available supporting dataframes associated with this PackagedDataset. Supporting dataframe names can be listed by calling df_keys() and accessed by calling get_df(‘df_name’).

  • Provide their own get_df and get_ds methods. This base class declares both as abstract methods (it inherits from ABC), so a subclass which does not implement them cannot be instantiated.
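
The checklist above can be sketched as follows. This is a minimal stand-in rather than Phenonaut's own code: SketchPackagedDataset only mimics the registration and lookup behaviour described in this section, and ToyDataset with its in-memory store is hypothetical. A real subclass would inherit from phenonaut.packaged_datasets.base.PackagedDataset and return Phenonaut Dataset and pd.DataFrame objects.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SketchPackagedDataset(ABC):
    """Stand-in base class (assumption: mirrors only the documented behaviour)."""

    def __init__(self, root):
        self.root = Path(root)
        self._ds_keys = []
        self._df_keys = []

    def register_ds_key(self, key):
        self._ds_keys.append(key)

    def register_df_key(self, key):
        self._df_keys.append(key)

    def keys(self):
        return list(self._ds_keys)

    def df_keys(self):
        return list(self._df_keys)

    def __getitem__(self, key):
        # Dictionary-like access delegates to get_ds.
        return self.get_ds(key)

    @abstractmethod
    def get_ds(self, key):
        ...

    @abstractmethod
    def get_df(self, key):
        ...


class ToyDataset(SketchPackagedDataset):
    """Hypothetical subclass following the checklist."""

    def __init__(self, root):
        super().__init__(root)  # call super().__init__() on initialisation
        # Hypothetical in-memory store standing in for a processed h5 file.
        self._store = {
            "ds": [[1.0, 2.0], [3.0, 4.0]],
            "sample_info": [{"id": "a"}, {"id": "b"}],
        }
        self.register_ds_key("ds")  # main dataset is named 'ds' by convention
        self.register_df_key("sample_info")

    def get_ds(self, key):
        if key not in self.keys():
            raise KeyError(key)
        return self._store[key]

    def get_df(self, key):
        if key not in self.df_keys():
            raise KeyError(key)
        return self._store[key]


pds = ToyDataset("toy_root")
print(pds.keys(), pds.df_keys())
```

Because get_ds and get_df are abstract in the stand-in base, instantiating SketchPackagedDataset directly raises a TypeError, which is the same mechanism Python's ABC machinery uses for any abstract base class.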

Parameters:
  • root (Union[Path, str]) – Root directory for the dataset. The root directory should contain processed files usable by Phenonaut, meaning that the data has been downloaded and usually transformed in some manner before being put here. By convention, processed files are put into this directory, and a subdirectory called “raw_data” holds downloaded (possibly compressed) files prior to preprocessing.

  • raw_data_dir (Optional[Union[Path, str]], optional) – Directory in which the raw, downloaded files should be saved; also the location of intermediate files generated in the processing step. By convention, this directory lies within the root directory and has the default name “raw_data”. It can be an absolute path in a different directory or filesystem if the raw_data_dir_relative_to_root argument is set to False. By default Path(“raw_data”).

  • raw_data_dir_relative_to_root (bool, optional) – As described for the raw_data_dir argument, by convention the “raw_data” directory is usually within the root directory for the dataset. If this argument is True, then the full path of raw_data_dir is generated with root as the parent directory. If it is False, then raw_data_dir is taken to be an absolute path, possibly on another filesystem. By default True.
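
A minimal sketch of how root, raw_data_dir and raw_data_dir_relative_to_root might combine, assuming plain path joining. resolve_raw_data_dir is a hypothetical helper, not part of Phenonaut, and PurePosixPath is used only so that the example behaves the same on every platform. The case raw_data_dir=None is not covered, since its behaviour is not described here.

```python
from pathlib import PurePosixPath


def resolve_raw_data_dir(root, raw_data_dir="raw_data", relative_to_root=True):
    """Return the directory that would hold raw downloads (illustrative only)."""
    root = PurePosixPath(root)
    raw = PurePosixPath(raw_data_dir)
    if relative_to_root:
        # raw_data_dir is generated with root as the parent directory.
        return root / raw
    # Otherwise raw_data_dir is used as given, e.g. a scratch disk.
    return raw


print(resolve_raw_data_dir("/data/tcga"))
print(resolve_raw_data_dir("/data/tcga", "/scratch/tcga_raw", relative_to_root=False))
```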

df_keys() list

Get a list of available DataFrames.

PackagedDatasets may include DataFrames, useful in the capture of metadata.

Returns:

List of keys which allow accessing metadata (typically) pd.DataFrames belonging to this PackagedDataset.

Return type:

List

ds_keys() list

Get a list of Datasets contained within this PackagedDataset.

Returns a list of Dataset names contained within this PackagedDataset. Each name allows access to the corresponding Dataset via dictionary-like notation: pds[‘dataset_name’].

Returns:

List of keys which allow accessing Phenonaut Datasets belonging to this PackagedDataset.

Return type:

List

abstract get_df(key: str)

Abstract method - Get DataFrame

Abstract method which all inheriting classes are required to implement for retrieval of DataFrames.

Parameters:

key (str) – Name of DataFrame

abstract get_ds(key: str)

Abstract method - Get Dataset

Abstract method which all inheriting classes are required to implement for retrieval of Datasets.

Parameters:

key (str) – Name of Dataset

keys() list

Get a list of Datasets contained within this PackagedDataset.

Returns a list of Dataset names contained within this PackagedDataset. Each name allows access to the corresponding Dataset via dictionary-like notation: pds[‘dataset_name’].

Returns:

List of keys which allow accessing Phenonaut Datasets belonging to this PackagedDataset.

Return type:

List

property raw_data_dir: Path

Get the raw, unprocessed data directory of the dataset

Returns:

Directory of the raw, unprocessed dataset.

Return type:

Path

register_df_key(key: str | List[str]) None

Register a dataframe key with the PackagedDataset

Packaged datasets may contain or have access to multiple pd.DataFrames which accompany the main dataset. In the case of the CMAP dataset, the main Phenonaut Dataset contains the L1000 values, along with features and metadata. The supporting dataframes contain information on the perturbation type, compound information, etc. Keys are typically the same as their HDF5 store key values, although this is up to the specific PackagedDataset implementation.

Parameters:

key (Union[str, List[str]]) – Short string which may be used to access the pd.DataFrame.

Raises:

TypeError – Given key must be a str or list of str.

register_ds_key(key: str | List[str]) None

Register a Phenonaut Dataset key with the PackagedDataset

Packaged datasets may contain or have access to multiple pd.DataFrames from which Phenonaut Datasets can be created. These Datasets contain not only the pd.DataFrame containing data, but also features and additional metadata such as history and origin. By convention, the main Phenonaut Dataset should be called “ds”, and additional Phenonaut Datasets given a descriptive name. In the case of the CMAP dataset, the main Dataset (ds) contains the L1000 values, features and metadata. Supporting dataframes are accessible using the .get_df(“name”) method and contain information such as the perturbation type, compound information, etc. Keys are typically the same as their HDF5 store key values, although this is up to the specific PackagedDataset implementation.

Parameters:

key (Union[str, List[str]]) – Short string (or list of strings) naming the Phenonaut Dataset. The given name is registered, after which the inheriting class should allow access to the Phenonaut Dataset via the __getitem__ method, for example pds[‘ds’].

Raises:

TypeError – Given key must be a str or list of str.
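
The accepted key types can be illustrated with a small stand-alone helper. register_key and the plain list used as a registry are hypothetical, and skipping names that are already registered is an assumption; only the str or list-of-str contract and the TypeError come from the description above.

```python
def register_key(registry, key):
    """Add one key or a list of keys to a registry list (illustrative only)."""
    if isinstance(key, str):
        keys = [key]
    elif isinstance(key, list) and all(isinstance(k, str) for k in key):
        keys = key
    else:
        raise TypeError("Given key must be a str or list of str.")
    for k in keys:
        if k not in registry:  # assumption: duplicates are ignored
            registry.append(k)
    return registry


ds_keys = []
register_key(ds_keys, "ds")
register_key(ds_keys, ["view_a", "view_b"])
print(ds_keys)
```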

property root: Path

Get the root directory of the dataset

Returns:

Root directory within which the processed dataset and raw dataset data can be found.

Return type:

Path

phenonaut.packaged_datasets.breast_cancer module

class phenonaut.packaged_datasets.breast_cancer.BreastCancer(root: Path | str, download: bool = False, raw_data_dir: str | Path | None = PosixPath('raw_data'), rm_downloaded_data: bool = True)

Bases: PackagedDataset

Breast Cancer Dataset from scikit-learn

This PackagedDataset provides the Breast Cancer dataset from scikit-learn. This is also known as the Breast cancer Wisconsin (diagnostic) dataset. See the scikit-learn user guide for more information: https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset Original dataset information available at: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

Contains 569 unique samples of breast cancer fine needle aspirates, each with 30 features, and one target of 0 or 1, denoting benign or malignant respectively.

Parameters:
  • root (Union[Path, str]) – Local directory containing the prepared dataset. If the dataset is not found here and the argument download=True is not given, then an error is raised. If download=True and the processed dataset is absent, then it is downloaded to the directory pointed at by the raw_data_dir argument detailed below. If raw_data_dir is a non-absolute path, such as a single directory, then it is created as a subdirectory of this root directory.

  • download (bool, optional) – If true and the processed dataset is not found in the root directory, then the dataset is downloaded and processed. By default False.

  • raw_data_dir (Optional[Union[Path, str]], optional) – If downloading and preparing the dataset, then a directory for the raw data may be specified. If a non-absolute location is given, then it is created in a subdirectory of the root directory specified as the first argument. Absolute paths may be used to place raw datafiles and intermediates in another location, such as scratch disks etc. By default Path(“raw_data”).
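
The interplay of root and download described above might look like the following sketch. ensure_processed, the file name toy.h5, and the build callback are hypothetical, and FileNotFoundError is an assumed choice, since the text only states that an error is raised.

```python
import tempfile
from pathlib import Path


def ensure_processed(root, processed_name, download, build):
    """Return the processed file path, building it first if allowed."""
    processed = Path(root) / processed_name
    if processed.exists():
        return processed  # already prepared, nothing to do
    if not download:
        # Assumed error type; the documentation only says an error is raised.
        raise FileNotFoundError(f"{processed} not found and download=False")
    build(processed)  # stand-in for the download and preprocessing steps
    return processed


with tempfile.TemporaryDirectory() as tmp:
    path = ensure_processed(
        tmp, "toy.h5", download=True, build=lambda p: p.write_text("processed")
    )
    print(path.name)
```

Once the processed file exists, a later call with download=False returns it without invoking the build step.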

get_df(key: str) DataFrame

Get supporting dataframe

Parameters:

key (str) – Key of pd.DataFrame.

Returns:

Requested pd.DataFrame from h5 store

Return type:

pd.DataFrame

get_ds(key: str) Dataset

Get Phenonaut Dataset

Parameters:

key (str) – Key of Phenonaut Dataset.

Returns:

Requested Phenonaut Dataset from h5 store, with the correctly set features and metadata

Return type:

Dataset

phenonaut.packaged_datasets.cmap module

class phenonaut.packaged_datasets.cmap.CMAP(root: Path | str, download: bool = False, raw_data_dir: str | Path | None = PosixPath('raw_data'), rm_downloaded_data: bool = True, landmark_only: bool = True)

Bases: PackagedDataset

CMAP (Level 5) - ConnectivityMap dataset - https://clue.io/

CMAP is a repository of L1000 profiles measured from small molecule and CRISPR perturbations. This CMAP packaged dataset supplies an interface for querying this data and allows access to the data through a Phenonaut Dataset object. This PackagedDataset is for level 5 data (merged profiles).

Data supplied by CMAP is in their own GCTX format files which are HDF5 files with data residing in specific paths. Rather than have Phenonaut rely on their supplied library, we simply read the file with standard HDF5 tools. Additionally, we merge sig_info data, assigning perturbation information, bringing in the following columns: ‘pert_id’, ‘pert_iname’, ‘pert_type’, ‘cell_id’, ‘pert_idose’, ‘pert_itime’, and ‘distil_id’.

The perturbation type is recorded in the pert_type column, which takes the following values:

ctl_vehicle : DMSO

trt_cp : compound treatment

trt_xpr : CRISPR treatment

If the pert_type is trt_cp, then pert_idose gives the compound concentration. In the supplied CMAP data, the field is a string such as “3.33 um”. This dataloader converts it to a float field without the “um” unit suffix; values are in µM. If the pert_type is ctl_vehicle or trt_xpr, then the CMAP-supplied data has -666 in the pert_idose field, which is changed to np.nan.
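
The pert_idose clean-up can be sketched in a few lines. parse_pert_idose is a hypothetical helper, not Phenonaut's implementation; it assumes every dose string uses the “um” unit and uses math.nan in place of np.nan to stay dependency-free.

```python
import math


def parse_pert_idose(raw):
    """Convert a CMAP pert_idose value such as '3.33 um' to a float in µM."""
    parts = str(raw).split()
    value = float(parts[0])
    if value == -666:
        # -666 marks ctl_vehicle and trt_xpr rows, which carry no dose.
        return math.nan
    if len(parts) > 1 and parts[1].lower() != "um":
        # Assumption: only micromolar strings are expected in this sketch.
        raise ValueError(f"unexpected unit in {raw!r}")
    return value


print(parse_pert_idose("3.33 um"))
print(parse_pert_idose("-666"))
```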

Further field information can be found here: https://clue.io/connectopedia/glossary

This PackagedDataset supplies the following pd.DataFrames (queryable by calling the inherited “.keys” method):

  • creation_date

    the date on which the h5 file was written.

  • df

    the main dataframe containing L1000 data, merged with sig_info to give a more complete view of the dataset.

  • gene_info

    Information on the genes - allows translation of L1000 gene number in df to gene name and more.

  • pert_info

    Perturbation info. Using the df, which contains the merged sig_info data, compound ids can be used to query this dataframe and retrieve molecule SMILES etc.

  • sig_metrics

    Additional metrics on the profiles recorded in df.

  • inst_info

    Information on plate barcodes etc. from which profiles were derived.

Parameters:
  • root (Union[Path, str]) – Local directory containing the prepared dataset. If the dataset is not found here and the argument download=True is not given, then an error is raised. If download=True and the processed dataset is absent, then it is downloaded to the directory pointed at by the raw_data_dir argument detailed below. If raw_data_dir is a non-absolute path, such as a single directory, then it is created as a subdirectory of this root directory.

  • download (bool, optional) – If true and the processed dataset is not found in the root directory, then the dataset is downloaded and processed. By default False.

  • raw_data_dir (Optional[Union[Path, str]], optional) – If downloading and preparing the dataset, then a directory for the raw data may be specified. If a non-absolute location is given, then it is created in a subdirectory of the root directory specified as the first argument. Absolute paths may be used to place raw datafiles and intermediates in another location, such as scratch disks etc. Only has an effect if downloading and rebuilding the PackagedDataset. By default Path(“raw_data”).

  • landmark_only (bool) – If True, then only return landmark genes, essentially removing all inferred gene abundances. This is likely the most useful setting for the majority of tasks, as highly correlated abundances simply add to collinearity. Only has an effect if rebuilding the PackagedDataset. By default True.

get_df(key: str) DataFrame

Get supporting dataframe

Parameters:

key (str) – Key of pd.DataFrame.

Returns:

Requested pd.DataFrame from h5 store

Return type:

pd.DataFrame

get_ds(key: str) Dataset

Get Phenonaut Dataset

Parameters:

key (str) – Key of Phenonaut Dataset.

Returns:

Requested Phenonaut Dataset from h5 store, with the correctly set features and metadata

Return type:

Dataset

class phenonaut.packaged_datasets.cmap.CMAP_Level4(root: Path | str, download: bool = False, raw_data_dir: str | Path | None = PosixPath('raw_data'), rm_downloaded_data: bool = True, landmark_only: bool = True, allowed_treatment_types: str | list[str] = ['trt_cp', 'ctl_vehicle'], allowed_treatment_times: str | list[str] = '24 h')

Bases: PackagedDataset

CMAP (Level 4) - ConnectivityMap dataset - https://clue.io/

CMAP is a repository of L1000 profiles measured from small molecule and CRISPR perturbations. This CMAP packaged dataset supplies an interface for querying this data and allows access to the data through a Phenonaut Dataset object.

Data supplied by CMAP is in their own GCTX format files which are HDF5 files with data residing in specific paths. Rather than have Phenonaut rely on their supplied library, we simply read the file with standard HDF5 tools. Additionally, we merge sig_info data, assigning perturbation information, bringing in the following columns: ‘pert_id’, ‘pert_iname’, ‘pert_type’, ‘cell_id’, ‘pert_idose’, ‘pert_itime’, and ‘distil_id’.

The perturbation type is recorded in the pert_type column, which takes the following values:

ctl_vehicle : DMSO

trt_cp : compound treatment

trt_xpr : CRISPR treatment

If the pert_type is trt_cp, then pert_idose gives the compound concentration. In the supplied CMAP data, the field is a string such as “3.33 um”. This dataloader converts it to a float field without the “um” unit suffix; values are in µM. If the pert_type is ctl_vehicle or trt_xpr, then the CMAP-supplied data has -666 in the pert_idose field, which is changed to np.nan.

Further field information can be found here: https://clue.io/connectopedia/glossary

This PackagedDataset supplies the following pd.DataFrames (queryable by calling the inherited “.keys” method):

  • creation_date

    the date on which the h5 file was written.

  • df

    the main dataframe containing L1000 data, merged with sig_info to give a more complete view of the dataset.

  • gene_info

    Information on the genes - allows translation of L1000 gene number in df to gene name and more.

  • pert_info

    Perturbation info. Using the df, which contains the merged sig_info data, compound ids can be used to query this dataframe and retrieve molecule SMILES etc.

  • sig_metrics

    Additional metrics on the profiles recorded in df.

  • inst_info

    Information on plate barcodes etc. from which profiles were derived.

Parameters:
  • root (Union[Path, str]) – Local directory containing the prepared dataset. If the dataset is not found here and the argument download=True is not given, then an error is raised. If download=True and the processed dataset is absent, then it is downloaded to the directory pointed at by the raw_data_dir argument detailed below. If raw_data_dir is a non-absolute path, such as a single directory, then it is created as a subdirectory of this root directory.

  • download (bool, optional) – If true and the processed dataset is not found in the root directory, then the dataset is downloaded and processed. By default False.

  • raw_data_dir (Optional[Union[Path, str]], optional) – If downloading and preparing the dataset, then a directory for the raw data may be specified. If a non-absolute location is given, then it is created in a subdirectory of the root directory specified as the first argument. Absolute paths may be used to place raw datafiles and intermediates in another location, such as scratch disks etc. Only has an effect if downloading and rebuilding the PackagedDataset. By default Path(“raw_data”).

  • landmark_only (bool) – If True, then only return landmark genes, essentially removing all inferred gene abundances. This is likely the most useful setting for the majority of tasks, as highly correlated abundances simply add to collinearity. Only has an effect if rebuilding the PackagedDataset. By default True.

  • allowed_treatment_types (Union[str, list[str]]) – Often, only compound treatments are needed for analysis, so by default only treatments with a pert_type of trt_cp or ctl_vehicle are included, allowing compounds and DMSO vehicle only. This can be expanded to include CRISPR treatments by adding “trt_xpr”; see https://clue.io/connectopedia/perturbagen_types_and_controls for further information on treatment types and possible values. Only has an effect if rebuilding the PackagedDataset. By default [‘trt_cp’, ‘ctl_vehicle’].

  • allowed_treatment_times (Union[str, list[str]]) – Often, we are only interested in examining compound treatments after 24 hours, as there is an abundance of these measurements in the CMAP database. 24 hour treatments commonly have the pert_itime value ‘24 h’. Only has an effect if rebuilding the PackagedDataset. By default ‘24 h’.
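
The effect of allowed_treatment_types and allowed_treatment_times can be pictured as a row filter. The sketch below is illustrative only: filter_profiles, as_list and the toy rows (including their pert_id values) are hypothetical, and applying both filters to every row, vehicle controls included, is an assumption.

```python
def as_list(value):
    """Accept a single string or a list of strings, as both arguments do."""
    return [value] if isinstance(value, str) else list(value)


def filter_profiles(
    rows,
    allowed_treatment_types=("trt_cp", "ctl_vehicle"),
    allowed_treatment_times="24 h",
):
    """Keep rows whose pert_type and pert_itime are both allowed."""
    types = set(as_list(allowed_treatment_types))
    times = set(as_list(allowed_treatment_times))
    return [
        row for row in rows
        if row["pert_type"] in types and row["pert_itime"] in times
    ]


rows = [
    {"pert_id": "A", "pert_type": "trt_cp", "pert_itime": "24 h"},
    {"pert_id": "B", "pert_type": "trt_cp", "pert_itime": "6 h"},
    {"pert_id": "C", "pert_type": "trt_xpr", "pert_itime": "24 h"},
    {"pert_id": "D", "pert_type": "ctl_vehicle", "pert_itime": "24 h"},
]
kept = filter_profiles(rows)
print([row["pert_id"] for row in kept])
```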

get_df(key: str) DataFrame

Get supporting dataframe

Parameters:

key (str) – Key of pd.DataFrame.

Returns:

Requested pd.DataFrame from h5 store

Return type:

pd.DataFrame

get_ds(key: str) Dataset

Get Phenonaut Dataset

Parameters:

key (str) – Key of Phenonaut Dataset.

Returns:

Requested Phenonaut Dataset from h5 store, with the correctly set features and metadata

Return type:

Dataset

phenonaut.packaged_datasets.iris module

class phenonaut.packaged_datasets.iris.Iris(root: Path | str, download: bool = False, raw_data_dir: str | Path | None = PosixPath('raw_data'), rm_downloaded_data: bool = True)

Bases: PackagedDataset

Iris dataset from scikit-learn

This PackagedDataset provides the Iris dataset from scikit-learn, with unique iris entries, each with four features, and a target column denoting class. Further information is available via scikit-learn here:

https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

or wikipedia:

https://en.wikipedia.org/wiki/Iris_flower_data_set

or from:

https://archive.ics.uci.edu/ml/datasets/Iris

Parameters:
  • root (Union[Path, str]) – Local directory containing the prepared dataset. If the dataset is not found here and the argument download=True is not given, then an error is raised. If download=True and the processed dataset is absent, then it is downloaded to the directory pointed at by the raw_data_dir argument detailed below. If raw_data_dir is a non-absolute path, such as a single directory, then it is created as a subdirectory of this root directory.

  • download (bool, optional) – If true and the processed dataset is not found in the root directory, then the dataset is downloaded and processed. By default False.

  • raw_data_dir (Optional[Union[Path, str]], optional) – If downloading and preparing the dataset, then a directory for the raw data may be specified. If a non-absolute location is given, then it is created in a subdirectory of the root directory specified as the first argument. Absolute paths may be used to place raw datafiles and intermediates in another location, such as scratch disks etc. By default Path(“raw_data”).

get_df(key: str) DataFrame

Get supporting dataframe

Parameters:

key (str) – Key of pd.DataFrame.

Returns:

Requested pd.DataFrame from h5 store

Return type:

pd.DataFrame

get_ds(key: str) Dataset

Get Phenonaut Dataset

Parameters:

key (str) – Key of Phenonaut Dataset.

Returns:

Requested Phenonaut Dataset from h5 store, with the correctly set features and metadata

Return type:

Dataset

class phenonaut.packaged_datasets.iris.Iris_2_views(root: Path | str, download: bool = False, raw_data_dir: str | Path | None = PosixPath('raw_data'), rm_downloaded_data: bool = True)

Bases: PackagedDataset

get_df(key: str) DataFrame

Get supporting dataframe

Parameters:

key (str) – Key of pd.DataFrame.

Returns:

Requested pd.DataFrame from h5 store

Return type:

pd.DataFrame

get_ds(key: str) Dataset

Get Phenonaut Dataset

Parameters:

key (str) – Key of Phenonaut Dataset.

Returns:

Requested Phenonaut Dataset from h5 store, with the correctly set features and metadata

Return type:

Dataset

phenonaut.packaged_datasets.lincs module

class phenonaut.packaged_datasets.lincs.LINCS_Cell_Painting(root: Path | str, download: bool = False, raw_data_dir: str | Path | None = PosixPath('raw_data'), rm_downloaded_data: bool = True)

Bases: PackagedDataset

LINCS Cell Painting Dataset - https://clue.io/

This PackagedDataset supplies the following pd.DataFrames (queryable by calling the inherited “.keys” method):

Parameters:
  • root (Union[Path, str]) – Local directory containing the prepared dataset. If the dataset is not found here and the argument download=True is not given, then an error is raised. If download=True and the processed dataset is absent, then it is downloaded to the directory pointed at by the raw_data_dir argument detailed below. If raw_data_dir is a non-absolute path, such as a single directory, then it is created as a subdirectory of this root directory.

  • download (bool, optional) – If true and the processed dataset is not found in the root directory, then the dataset is downloaded and processed. By default False.

  • raw_data_dir (Optional[Union[Path, str]], optional) – If downloading and preparing the dataset, then a directory for the raw data may be specified. If a non-absolute location is given, then it is created in a subdirectory of the root directory specified as the first argument. Absolute paths may be used to place raw data files and intermediates in another location, such as scratch disks etc. By default Path(“raw_data”).

get_df(key: str) DataFrame

Get supporting dataframe

Parameters:

key (str) – Key of pd.DataFrame.

Returns:

Requested pd.DataFrame from h5 store

Return type:

pd.DataFrame

get_ds(key: str) Dataset

Get Phenonaut Dataset

Parameters:

key (str) – Key of Phenonaut Dataset.

Returns:

Requested Phenonaut Dataset from h5 store, with the correctly set features and metadata

Return type:

Dataset

phenonaut.packaged_datasets.metadata_moa module

class phenonaut.packaged_datasets.metadata_moa.MetadataBROADLincsCellPaintingMOAs(root: Path | str, download: bool = False, raw_data_dir: str | Path | None = PosixPath('raw_data'), rm_downloaded_data: bool = True)

Bases: PackagedDataset

DataFrame supplier for BROAD Lincs Cell Painting assigned MOAs

This PackagedDataset provides access to a pd.DataFrame containing information on the Broad Institute's LINCS Cell Painting compound MOA assignment.

This data is located in the broadinstitute/lincs-cell-painting GitHub repository under metadata/moa/repurposing_simple.tsv.

https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/moa/repurposing_simple.tsv

Commentary on creation of this resource which may be useful is also available here: https://github.com/broadinstitute/lincs-cell-painting/issues/5

Parameters:
  • root (Union[Path, str]) – Local directory containing the prepared dataset. If the dataset is not found here and the argument download=True is not given, then an error is raised. If download=True and the processed dataset is absent, then it is downloaded to the directory pointed at by the raw_data_dir argument detailed below. If raw_data_dir is a non-absolute path, such as a single directory, then it is created as a subdirectory of this root directory.

  • download (bool, optional) – If true and the processed dataset is not found in the root directory, then the dataset is downloaded and processed. By default False.

  • raw_data_dir (Optional[Union[Path, str]], optional) – If downloading and preparing the dataset, then a directory for the raw data may be specified. If a non-absolute location is given, then it is created in a subdirectory of the root directory specified as the first argument. Absolute paths may be used to place raw datafiles and intermediates in another location, such as scratch disks etc. By default Path(“raw_data”).

__call__() DataFrame

Call self as a function.

get_df(key: str) DataFrame

Get supporting dataframe

Parameters:

key (str) – Key of pd.DataFrame.

Returns:

Requested pd.DataFrame from h5 store

Return type:

pd.DataFrame

get_ds(key: str) Dataset

Get Phenonaut Dataset

Parameters:

key (str) – Key of Phenonaut Dataset.

Returns:

Requested Phenonaut Dataset from h5 store, with the correctly set features and metadata

Return type:

Dataset

class phenonaut.packaged_datasets.metadata_moa.MetadataJUMPMOACompounds(root: Path | str, download: bool = False, raw_data_dir: str | Path | None = PosixPath('raw_data'), rm_downloaded_data: bool = True)

Bases: PackagedDataset

DataFrame supplier for JUMP consortium MOA compound set

This PackagedDataset provides access to a pd.DataFrame containing information on the JUMP MOA compound selection. Further information is available here:

https://github.com/jump-cellpainting/JUMP-MOA

Parameters:
  • root (Union[Path, str]) – Local directory containing the prepared dataset. If the dataset is not found here and the argument download=True is not given, then an error is raised. If download=True and the processed dataset is absent, then it is downloaded to the directory pointed at by the raw_data_dir argument detailed below. If raw_data_dir is a non-absolute path, such as a single directory, then it is created as a subdirectory of this root directory.

  • download (bool, optional) – If true and the processed dataset is not found in the root directory, then the dataset is downloaded and processed. By default False.

  • raw_data_dir (Optional[Union[Path, str]], optional) – If downloading and preparing the dataset, then a directory for the raw data may be specified. If a non-absolute location is given, then it is created in a subdirectory of the root directory specified as the first argument. Absolute paths may be used to place raw datafiles and intermediates in another location, such as scratch disks etc. By default Path(“raw_data”).

__call__() DataFrame

Call self as a function.

get_df(key: str) DataFrame

Get supporting dataframe

Parameters:

key (str) – Key of pd.DataFrame.

Returns:

Requested pd.DataFrame from h5 store

Return type:

pd.DataFrame

get_ds(key: str) Dataset

Get Phenonaut Dataset

Parameters:

key (str) – Key of Phenonaut Dataset.

Returns:

Requested Phenonaut Dataset from h5 store, with the correctly set features and metadata

Return type:

Dataset

phenonaut.packaged_datasets.tcga module

class phenonaut.packaged_datasets.tcga.TCGA(root: Path | str, download: bool = False, raw_data_dir: str | Path | None = PosixPath('raw_data'), rm_downloaded_data: bool = True, rm_intermediates: bool = True, prediction_target: str | None = None, num_pca_dims: int | None = 10, vif_filter_cutoff: float | None = None, custom_transformation_func_and_name_tuple: tuple[Callable, str] | None = None)

Bases: PackagedDataset

TCGA - The Cancer Genome Atlas, packaged dataset

The TCGA dataset captures a snapshot of The Cancer Genome Atlas, from the TCGA website:

https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga

“The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions.”

Processing of the dataset follows the approach described by Lee and van der Schaar in:

Lee, Changhee, and Mihaela van der Schaar. “A Variational Information Bottleneck Approach to Multi-Omics Data Integration.” International Conference on Artificial Intelligence and Statistics. PMLR, 2021.

We have, however, taken the decision to process the dataset into 10 PCA dimensions using linear PCA by default; the number of PCA dimensions can be changed using the num_pca_dims argument to this constructor.

Alternatively, PCA can be skipped and a custom transformation applied by passing a callable in custom_transformation_func_and_name_tuple.

Datasets representing clinical_decision, RPPA, miRNA, methylation, and mRNA are generated. These datasets are processed and saved in the HDF5 file format, writing files of the format “tcga_pca{num_pca_dims}_phenonaut.h5” and “tcga_pca{num_pca_dims}_metadata_phenonaut.h5” where {num_pca_dims} is the requested number of PCA dimensions.
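
The file naming scheme above is a plain string template. As a small illustration, with tcga_pca_filenames being a hypothetical helper rather than a Phenonaut function:

```python
def tcga_pca_filenames(num_pca_dims=10):
    """Return the (data, metadata) file names for a given number of PCA dimensions."""
    data_name = f"tcga_pca{num_pca_dims}_phenonaut.h5"
    metadata_name = f"tcga_pca{num_pca_dims}_metadata_phenonaut.h5"
    return data_name, metadata_name


print(tcga_pca_filenames())
print(tcga_pca_filenames(25))
```

Because the number of dimensions is part of the name, datasets prepared with different num_pca_dims values get distinct file names under this template.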

Steps undertaken in dataset preparation are as follows:

  1. Download the dataset. This downloads 186 .tar.gz files totalling ~20 GB. These are then extracted, taking ~32 GB.

  2. Files representing different tumour types for each view are merged, and empty columns removed. These files are then saved as intermediates.

On modest 2020 hardware, processing:

  • clinical_decisions takes ~2 secs, generating a 2.5 MB intermediate

  • RPPA takes ~4 secs, generating a 29 MB intermediate

  • miRNA takes ~5 secs, generating a 102 B intermediate

  • methylation takes ~12.5 mins, generating a 3.6 GB intermediate

  • mRNA takes 10 mins, generating 4 MB intermediate files

Parameters:
  • root (Union[Path, str]) – Local directory containing the prepared dataset. If the dataset is not found here and the argument download=True is not given, then an error is raised. If download=True and the processed dataset is absent, then it is downloaded to the directory pointed at by the raw_data_dir argument detailed below. If raw_data_dir is a non-absolute path, such as a single directory, then it is created as a subdirectory of this root directory.

  • download (bool, optional) – If true and the processed dataset is not found in the root directory, then the dataset is downloaded and processed. By default False.

  • raw_data_dir (Optional[Union[Path, str]], optional) – If downloading and preparing the dataset, then a directory for the raw data may be specified. If a non-absolute location is given, then it is created in a subdirectory of the root directory specified as the first argument. Absolute paths may be used to place raw datafiles and intermediates in another location, such as scratch disks etc. By default Path(“raw_data”).

  • rm_downloaded_data (bool) – If creating the dataset, and this is True, then the downloaded raw TCGA data (archives) will be deleted, by default True.

  • rm_intermediates (bool) – If creating the dataset, and this is True, then intermediate data generated from the extraction of TCGA archives will be deleted, by default True.

  • prediction_target (str, optional) – Often we want to make predictions on datasets using labels or targets captured by TCGA and placed in the clinical_decisions DataFrame. An example is the commonly used days_to_death column. This argument may be any column within that DataFrame (queryable by calling get_clinical_decisions_columns), such as days_to_death, tumor_tissue_site, gender etc. In addition to these column names, the string ‘years_to_death’ can also be used, which will operate on days_to_death divided by 365.25. By default None.

  • num_pca_dims (int, optional) – The number of linear principal components to use in dimensionality reduction. By default 10.

  • vif_filter_cutoff (float, optional) – Apply a VIF (variance inflation factor) cutoff, removing features with a VIF score greater than this value. This has the effect of removing features which have a high degree of collinearity. If vif_filter_cutoff is None, then no VIF filter is applied. A good default choice for this value is 5.0. If a custom_transformation_func_and_name_tuple as defined below is given, then the VIF filter is ignored; to include it, you may incorporate VIF filtering in the custom function.

  • custom_transformation_func_and_name_tuple (Optional[tuple[Callable, str]]) – If PCA is not the preferred transformation to be applied to the data, then the user may provide their own in a tuple. The first tuple element should be the callable function, and the second a unique name/identifier which will be used to uniquely identify the saved dataset. Whereas datasets using the default PCA dimensionality reduction technique are named “tcga_pca{num_pca_dims}_phenonaut.h5”, datasets transformed by custom callables are named “tcga_{custom_callable_id}_phenonaut.h5”, where custom_callable_id is the second element of the custom_transformation_func_and_name_tuple tuple. If None, then no custom transformation is performed, and the standard scaler followed by PCA approach described above is used instead. If a custom_transformation_func_and_name_tuple is given, then vif_filter_cutoff has no effect, and it is as if it were set to None. By default None.
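To make the vif_filter_cutoff parameter above more concrete, here is a minimal sketch of a VIF filter written with NumPy and pandas. It is not Phenonaut's implementation: the function names vif_scores and vif_filter are hypothetical, and removing the worst feature one at a time is an assumption, since the text above only states that features with a VIF above the cutoff are removed. The sketch relies on the standard identity that the VIF of each feature is the corresponding diagonal element of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd


def vif_scores(df: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix.

    Requires at least two columns that are not perfectly collinear.
    """
    corr = np.corrcoef(df.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)


def vif_filter(df: pd.DataFrame, cutoff: float = 5.0) -> pd.DataFrame:
    """Assumed strategy: repeatedly drop the column with the highest VIF."""
    kept = df.copy()
    while kept.shape[1] > 1:
        scores = vif_scores(kept)
        worst = scores.idxmax()
        if scores[worst] <= cutoff:
            break
        kept = kept.drop(columns=worst)
    return kept


# Synthetic data: "b" is almost a copy of "a", while "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
c = rng.normal(size=200)
b = a + 0.01 * rng.normal(size=200)
features = pd.DataFrame({"a": a, "b": b, "c": c})

filtered = vif_filter(features, cutoff=5.0)
print(list(filtered.columns))
```

In this synthetic example one of the two near-duplicate columns is dropped and the independent column is kept. A filter like this could also be called from inside a custom transformation function, which is how the text above suggests retaining VIF filtering when a custom_transformation_func_and_name_tuple is supplied.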

class TCGA_MetadataTuple(files, load_csv_kwargs, output_file_name, header_offset, treatment_id, treatment_lambda)

Bases: tuple

files

Alias for field number 0

header_offset

Alias for field number 3

load_csv_kwargs

Alias for field number 1

output_file_name

Alias for field number 2

treatment_id

Alias for field number 4

treatment_lambda

Alias for field number 5

add_clinical_decision_data_to_df(df: ~pandas.core.frame.DataFrame, clinical_decision_column: str, new_df_column_name: str | None = None, custom_func: ~collections.abc.Callable[[~pandas.core.series.Series], ~pandas.core.series.Series] | None = None, remove_incomplete_rows: bool = True, dtype: ~typing.Type = <class 'float'>) DataFrame

Merge a field from TCGA clinical_decisions.

The TCGA dataset comes with a clinical_decisions DataFrame specifying things like patient age, days to death etc. As years to death can be a prediction target, we need a way to add this to our multiomics DataSets/DataFrames. This function merges the clinical_decisions information based on the “Hybridization REF” index.

Parameters:
  • df (pd.DataFrame) – DataFrame to which the information will be added

  • clinical_decision_column (str) – Column in clinical_decisions that will be added

  • new_df_column_name (Optional[str], optional) – The new column may have a new name. If None, then the clinical_decision_column is used, by default None.

  • custom_func (Optional[Callable[[pd.Series], pd.Series]], optional) – Optionally apply a transformation to the newly added data. It can be useful to pass a lambda here, which offers a simple way to convert days to years, such as ‘lambda x: x/365.’. If None, then no transformation is applied to the new column, by default None.

  • remove_incomplete_rows (bool, optional) – If True, rows containing missing data in the new column (by virtue of a pd.NA being present) are removed, by default True.

Returns:

DataFrame containing the new column.

Return type:

pd.DataFrame
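The merge described above can be sketched with plain pandas. This is an illustration under stated assumptions, not the method's actual code: the helper add_clinical_column is hypothetical, it takes the clinical DataFrame as an explicit argument rather than reading it from the dataset, and the sample IDs and values are made up. Both frames are assumed to share a “Hybridization REF” index.

```python
import pandas as pd


def add_clinical_column(df, clinical_df, column, new_name=None,
                        custom_func=None, remove_incomplete_rows=True,
                        dtype=float):
    """Hypothetical stand-in: join one clinical column onto df by index."""
    new_name = new_name or column
    # Left join on the shared index so every row of df is retained at first.
    merged = df.join(clinical_df[column].rename(new_name), how="left")
    if remove_incomplete_rows:
        merged = merged.dropna(subset=[new_name])
    values = merged[new_name].astype(dtype)
    if custom_func is not None:
        values = custom_func(values)
    return merged.assign(**{new_name: values})


# Made-up example data sharing a "Hybridization REF" index.
index = pd.Index(["s1", "s2", "s3"], name="Hybridization REF")
clinical = pd.DataFrame({"days_to_death": [730.5, None, 365.25]}, index=index)
omics = pd.DataFrame({"feat_1": [0.1, 0.2, 0.3]}, index=index)

result = add_clinical_column(
    omics,
    clinical,
    "days_to_death",
    new_name="years_to_death",
    custom_func=lambda x: x / 365.25,  # days to years
)
print(result)
```

Here the row for s2 is dropped because its days_to_death value is missing, and the remaining values are converted from days to years by the lambda.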

get_df(key: str) DataFrame

Get supporting DataFrame

Parameters:

key (str) – Key of pd.DataFrame.

Returns:

Requested pd.DataFrame from h5 store

Return type:

pd.DataFrame

get_ds(key: str) Dataset

Get Dataset

Parameters:

key (str) – Key of Phenonaut Dataset.

Returns:

Requested Phenonaut Dataset from h5 store, with the correctly set features and metadata

Return type:

Dataset

Module contents