phenonaut.predict package

Submodules

phenonaut.predict.optuna_functions module

phenonaut.predict.optuna_functions.predictor_from_str(ob_str: str) PhenonautPredictor

Convert a base64 gzipped object string to PhenonautPredictor

Decode the base64 string, decompress it with gzip, and then deserialise the result with pickle.loads.

Parameters:

ob_str (str) – String representation of base64 encoded gzipped object

Returns:

PhenonautPredictor object

Return type:

PhenonautPredictor

phenonaut.predict.optuna_functions.predictor_to_str(ob: PhenonautPredictor) str

Encode a PhenonautPredictor (or any serialisable object) to str

First serialise the object using pickle.dumps, compress using gzip and return the base64 string representation. Works with any serialisable object.

Parameters:

ob (PhenonautPredictor) – Serialisable object to convert.

Returns:

Base64 gzip serialised object as a utf-8 string.

Return type:

str
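The encoding and decoding steps described for predictor_to_str and predictor_from_str can be sketched with the standard library alone. This is a minimal illustration of the pickle, gzip and base64 round trip, not Phenonaut's own implementation; the helper names to_str and from_str are hypothetical.

```python
import base64
import gzip
import pickle


def to_str(ob) -> str:
    """Pickle, gzip-compress, then base64-encode an object to a UTF-8 string."""
    return base64.b64encode(gzip.compress(pickle.dumps(ob))).decode("utf-8")


def from_str(ob_str: str):
    """Reverse of to_str: base64-decode, gzip-decompress, then unpickle."""
    return pickle.loads(gzip.decompress(base64.b64decode(ob_str)))


# Any picklable object works; a plain dict stands in for a predictor here.
original = {"name": "demo", "params": [1, 2, 3]}
encoded = to_str(original)
restored = from_str(encoded)
```

Because pickle.loads can execute arbitrary code while deserialising, strings of this kind should only be decoded when they come from a trusted source.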

phenonaut.predict.optuna_functions.run_optuna_opt(X: list[ndarray] | ndarray, X_test: list[ndarray] | ndarray, y: list[ndarray] | ndarray, y_test: list[ndarray] | ndarray, prediction_type: PredictionType, predictor: PhenonautPredictor, metric: PhenonautPredictionMetric, n_optuna_trials: int, phe_name: str, dataset_combination: list[str], optuna_db_path: Path | str, n_splits: int = 5, random_state: int | Generator | None = None, optuna_db_protocol: str = 'sqlite:///', target_dataset_name: str | None = None)

Run Optuna-led hyperparameter optimisation on predictor and data

Parameters:
  • X (Union[list[np.ndarray], np.ndarray]) – Training data

  • X_test (Union[list[np.ndarray], np.ndarray]) – Test data

  • y (Union[list[np.ndarray], np.ndarray]) – Training target

  • y_test (Union[list[np.ndarray], np.ndarray]) – Test target data

  • prediction_type (PredictionType) – Prediction type specifying classification, regression or view.

  • predictor (PhenonautPredictor) – The predictor, with its fit and predict functions, wrapped in a PhenonautPredictor dataclass which augments the predictor class with a specification of hyperparameters etc.

  • metric (PhenonautPredictionMetric) – The scoring metric to be used to assess performance.

  • n_optuna_trials (int) – Number of optuna trials to optimise across.

  • phe_name (str) – Name of the Phenonaut object from which the data comes

  • dataset_combination (list[str]) – Combination of dataset views to be used

  • optuna_db_path (Union[Path, str]) – Output file path for Optuna sqlite3 database file

  • n_splits (int, optional) – Number of splits to be used in cross fold validation, by default 5

  • random_state (Optional[Union[int, np.random.Generator]], optional) – Seed for use by the random number generator, allowing deterministic repeats. If None, then do not pre-seed the generator. By default None

  • optuna_db_protocol (str, optional) – Protocol for optuna to use to access storage. By default “sqlite:///”

  • target_dataset_name (Optional[str], optional) – If predicting a view, then the target Dataset may be given a name, by default None.
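As a rough illustration of how optuna_db_protocol and optuna_db_path might fit together, the sketch below builds a storage URL of the SQLAlchemy style that Optuna accepts for relational storage. It assumes that the two values are simply concatenated, which this documentation does not state; storage_url is a hypothetical helper.

```python
from pathlib import Path


def storage_url(optuna_db_path, optuna_db_protocol: str = "sqlite:///") -> str:
    """Join a protocol prefix and a database path (assumed behaviour)."""
    # as_posix() keeps forward slashes, which URL-style strings expect.
    return f"{optuna_db_protocol}{Path(optuna_db_path).as_posix()}"


url = storage_url("results/optuna.db")
```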

phenonaut.predict.optuna_functions.run_optuna_opt_merge_folds(X: list[ndarray] | ndarray, X_test: list[ndarray] | ndarray, y: ndarray, y_test: ndarray, prediction_type: PredictionType, predictor: PhenonautPredictor, metric: PhenonautPredictionMetric, n_optuna_trials: int, phe_name: str, dataset_combination: list[str], optuna_db_path: Path | str, n_splits: int = 5, random_state: int | None = None, optuna_db_protocol: str = 'sqlite:///', target_dataset_name: str | None = None)

phenonaut.predict.predict_utils module

class phenonaut.predict.predict_utils.PredictionType(value)

Bases: Enum

PredictionType Enum for classification, regression or view prediction.

Parameters:

Enum (int) – Enumerated type, captures if the prediction task is classification, regression, or view (multiregression)

classification = 1
regression = 2
view = 3

phenonaut.predict.predict_utils.get_X_y(phe: Phenonaut, dataset_combination: tuple | list, target, predictor: PhenonautPredictor, prediction_type: PredictionType) tuple

For a given set of views, and known y, get X and y for predictor training

Parameters:
  • phe (Phenonaut) – The Phenonaut object containing Datasets

  • dataset_combination (Union[tuple, list]) – Dataset combinations to be used in prediction task.

  • target (pd.Series, np.ndarray) – Prediction target

  • predictor (PhenonautPredictor) – The PhenonautPredictor being used.

  • prediction_type (PredictionType) – Enum classification type specifying classification, regression or view.

Returns:

X, y tuple for training of predictor.

Return type:

tuple

phenonaut.predict.predict_utils.get_best_predictor_dataset_df(df: DataFrame, column_containing_values: str = 'test_score') DataFrame

For a given Optuna hyperparameter scan dataframe, get the best predictor

Parameters:
  • df (pd.DataFrame) – Optuna hyperparameter scan pd.DataFrame, likely generated by get_df_from_optuna_db.

  • column_containing_values (str, optional) – Name of the column containing scores. By default “test_score”.

Returns:

DataFrame containing information on the best predictor.

Return type:

pd.DataFrame
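The idea of picking the best row from a table of scan results can be sketched without pandas, using a list of dictionaries as a stand-in for the DataFrame. How the real function decides whether a higher or lower score is better is not stated here, so the lower_is_better argument and the best_row helper are hypothetical, as are the example rows.

```python
def best_row(rows, column_containing_values="test_score", lower_is_better=False):
    """Return the row with the best value in the scoring column."""
    chooser = min if lower_is_better else max
    return chooser(rows, key=lambda row: row[column_containing_values])


# Made-up scan results, one dictionary per predictor and view combination.
rows = [
    {"predictor": "A", "views": "v1", "test_score": 0.81},
    {"predictor": "B", "views": "v1+v2", "test_score": 0.88},
    {"predictor": "C", "views": "v2", "test_score": 0.79},
]
best = best_row(rows)
```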

phenonaut.predict.predict_utils.get_common_indexes(dataframes_list: list[DataFrame]) list[str]

Get common indexes from list of DataFrames

Parameters:

dataframes_list (list[pd.DataFrame]) – List of pd.DataFrames from which common indexes should be extracted.

Returns:

List of common indexes between pd.DataFrames.

Return type:

list[str]
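Finding the indexes shared by several tables amounts to a set intersection. The sketch below shows this on plain lists of index labels rather than pd.DataFrames; common_labels is a hypothetical helper, and keeping the order of the first list is an assumption rather than documented behaviour.

```python
def common_labels(index_lists):
    """Return labels present in every list, in the order of the first list."""
    if not index_lists:
        return []
    shared = set(index_lists[0]).intersection(*index_lists[1:])
    return [label for label in index_lists[0] if label in shared]


# Each inner list stands in for the index of one DataFrame.
shared_labels = common_labels([["a", "b", "c"], ["b", "c", "d"], ["c", "b"]])
```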

phenonaut.predict.predict_utils.get_df_from_optuna_db(optuna_db_file: str | Path, csv_output_filename: str | Path | None = None, json_output_filename: str | Path | None = None, get_only_best_per_study: bool = False) DataFrame

After predict.profile, turn Optuna sqlite3 files into pd.DataFrame

Parameters:
  • optuna_db_file (Union[str, Path]) – Optuna hyperparameter optimisation database (sqlite3 file)

  • csv_output_filename (Union[Path, str], optional) – Target output CSV file. Can be None, in which case, no CSV file is written out. By default None.

  • json_output_filename (Union[Path, str], optional) – Target output JSON file. Can be None, in which case, no JSON file is written out. By default None.

  • get_only_best_per_study (bool, optional) – Boolean value stating if only the best hyperparameter set per study should be written out. By default False.

Returns:

DataFrame summarising Optuna hyperparameter scan results.

Return type:

pd.DataFrame

Raises:

FileNotFoundError – Database file (sqlite3) not found.

phenonaut.predict.predict_utils.get_metric(metric: str | dict | PhenonautPredictionMetric)

Get metric function from various options for metric definition

Helper function which allows metrics to be specified as strings giving common names, as dictionaries, or as PhenonautPredictionMetric objects.

Currently understands the shortcut strings:
  • accuracy, accuracy_score

  • mse, MSE, mean_squared_error

  • rmse, RMSE, root_mean_squared_error

  • AUROC, auroc, area_under_roc_curve

Parameters:

metric (Union[str, dict, PhenonautPredictionMetric]) – String, dict or PhenonautPredictionMetric to be used for scoring

Returns:

Prediction metric

Return type:

PhenonautPredictionMetric

Raises:
  • ValueError – No metrics found matching short string name.

  • KeyError – Given dictionary did not include all required fields.

  • ValueError – metric argument was not of a suitable type.
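The shortcut-string lookup described for get_metric can be pictured as an alias table that maps names to a small metric dataclass holding a function, a name and a direction. The sketch below is a standalone approximation using only the standard library: Metric, resolve_metric and the pure-Python scoring functions are hypothetical stand-ins, and AUROC is left out for brevity.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class Metric:
    """Stand-in for a metric dataclass: function, name and direction."""
    func: Callable
    name: str
    lower_is_better: bool

    def __call__(self, *args, **kwargs) -> float:
        return self.func(*args, **kwargs)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


# Each tuple of aliases points at one metric.
_ALIASES = {
    ("accuracy", "accuracy_score"): Metric(accuracy, "accuracy", False),
    ("mse", "MSE", "mean_squared_error"): Metric(mse, "mse", True),
    ("rmse", "RMSE", "root_mean_squared_error"): Metric(rmse, "rmse", True),
}


def resolve_metric(name: str) -> Metric:
    """Look up a metric by shortcut string, raising ValueError if unknown."""
    for aliases, metric in _ALIASES.items():
        if name in aliases:
            return metric
    raise ValueError(f"No metric found matching {name!r}")


score = resolve_metric("MSE")([1, 2, 3], [1, 2, 5])
```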

phenonaut.predict.predict_utils.get_prediction_type_from_y(y)

For a given target y - get prediction type

Looking at the data in y, return prediction type from classification, regression or multiregression (view prediction).

Parameters:

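y – Target data whose values are inspected to determine the prediction type.

Returns:

Prediction type inferred from y (classification, regression or view).

Return type:

PredictionType

Raises:

ValueError – Prediction type could not be determined from y.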
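To make the idea of inferring a prediction type concrete, the sketch below applies a simple heuristic to plain Python sequences: nested rows suggest view prediction, discrete labels suggest classification, and other numeric values suggest regression. These rules are assumptions chosen for illustration and are not taken from Phenonaut's source; the enum mirrors the documented PredictionType values, and guess_prediction_type is a hypothetical helper.

```python
from enum import Enum


class PredictionType(Enum):
    """Local mirror of the documented enum values."""
    classification = 1
    regression = 2
    view = 3


def guess_prediction_type(y) -> PredictionType:
    """Guess the task type from the values in y (illustrative heuristic only)."""
    rows = list(y)
    if not rows:
        raise ValueError("Cannot infer a prediction type from an empty target")
    # Every sample has several target values: treat as view prediction.
    if all(isinstance(row, (list, tuple)) for row in rows):
        return PredictionType.view
    # Only discrete labels (bools, ints, strings): treat as classification.
    if all(isinstance(value, (bool, int, str)) for value in rows):
        return PredictionType.classification
    # Remaining numeric targets containing floats: treat as regression.
    if all(isinstance(value, (int, float)) for value in rows):
        return PredictionType.regression
    raise ValueError("Could not infer a prediction type from the target values")


binary_task = guess_prediction_type([0, 1, 1, 0])
```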

phenonaut.predict.predict_utils.get_y_from_target(data: Dataset | DataFrame | list[Dataset] | list[DataFrame], target: str | Series | ndarray | tuple | None = None) list | Series | tuple | ndarray

Get target y from Dataset(s) or DataFrame

Parameters:
  • data (Union[Dataset, DataFrame, list[Dataset], list[DataFrame]]) – Dataset, DataFrame, list of Datasets or list of DataFrames where the target values are present.

  • target (Optional[Union[str, Series, np.ndarray, tuple]], optional) – Target column name containing target y values in the Dataset/DataFrame. If None, then an attempt is made to infer the target y by looking for Dataset columns which are not listed as features. Target y cannot be inferred from DataFrames. If an np.ndarray, tuple, or pd.Series is given, then it is returned directly as the prediction target y. By default None.

Returns:

Target y values for prediction

Return type:

Union[list, Series, tuple, np.ndarray]

Raises:
  • ValueError – Given target string (column title) was a feature of the dataset.

  • ValueError – Object should be phenonaut.Dataset or pd.DataFrame.

  • ValueError – Given target string (column title) was not found in any supplied Datasets/Dataframes.

  • ValueError – Target was not set, trying to guess it from a phenonaut.Dataset, but did not find this type.

  • ValueError – Could not guess the target from supplied Dataset(s). Pass target as string for a dataset.df column heading, or the prediction target directly.
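The resolution order described above (return array-like targets directly, otherwise look up a column name across the supplied tables) can be sketched as follows. Dictionaries of columns stand in for Datasets and DataFrames, the inference of a target from non-feature columns is deliberately left out, and resolve_target is a hypothetical helper rather than the library function.

```python
def resolve_target(tables, target):
    """Return target values given a column name or an explicit target."""
    if target is None:
        # The documented inference from non-feature columns is not sketched.
        raise ValueError("Target inference is not covered by this sketch")
    if not isinstance(target, str):
        # Explicit targets (lists, tuples, arrays) are passed straight through.
        return target
    for table in tables:
        if target in table:
            return table[target]
    raise ValueError(f"Target column {target!r} was not found in any table")


# Two made-up tables; only the second holds the target column.
tables = [{"f1": [1, 2]}, {"f2": [3, 4], "label": [0, 1]}]
y = resolve_target(tables, "label")
```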

phenonaut.predict.predictor_dataclasses module

class phenonaut.predict.predictor_dataclasses.HyperparameterCategorical(name: str, choices: list | tuple, needed: bool = True)

Bases: OptunaHyperparameter

Optuna hyperparameter dataclass for categorical lists

choices: list | tuple
needed: bool = True

class phenonaut.predict.predictor_dataclasses.HyperparameterFloat(name: str, lower_bound: int, upper_bound: int, needed: bool = True)

Bases: OptunaHyperparameterNumber

Optuna hyperparameter dataclass for floats

class phenonaut.predict.predictor_dataclasses.HyperparameterInt(name: str, lower_bound: int, upper_bound: int, needed: bool = True)

Bases: OptunaHyperparameterNumber

Optuna hyperparameter dataclass for ints

class phenonaut.predict.predictor_dataclasses.HyperparameterLog(name: str, lower_bound: int, upper_bound: int, needed: bool = True)

Bases: OptunaHyperparameterNumber

Optuna hyperparameter dataclass for loguniform distributions of floats

class phenonaut.predict.predictor_dataclasses.OptunaHyperparameter(name: str)

Bases: object

OptunaHyperparameter base dataclass from which the other hyperparameter dataclasses inherit

name: str

class phenonaut.predict.predictor_dataclasses.OptunaHyperparameterNumber(name: str, lower_bound: int, upper_bound: int, needed: bool = True)

Bases: OptunaHyperparameter

Optuna hyperparameter dataclass for numbers, inherited by the int, float and log dataclasses

Raises:

ValueError – Lower bound must be lower than upper bound.

lower_bound: int
needed: bool = True
upper_bound: int

class phenonaut.predict.predictor_dataclasses.PhenonautPredictionMetric(func: Callable, name: str, lower_is_better: bool)

Bases: object

PhenonautPredictionMetric dataclass to hold metric, name and direction.

__call__(*args: Any, **kwds: Any) float

Call self as a function.

func: Callable
lower_is_better: bool
name: str

class phenonaut.predict.predictor_dataclasses.PhenonautPredictor(name: str, predictor: ~sklearn.base.BaseEstimator | ~collections.abc.Callable, optuna: ~collections.abc.Iterable[~phenonaut.predict.predictor_dataclasses.OptunaHyperparameter] | ~phenonaut.predict.predictor_dataclasses.OptunaHyperparameter | None = None, num_views: int = 1, max_optuna_trials: int | None = None, dataset_size_cutoff: int | None = None, constructor_kwargs: dict = <factory>, max_classes: int | None = None, conditional_hyperparameter_generator_constructor_keyword: str | None = (None, ), conditional_hyperparameter_generator: ~collections.abc.Callable | None = None, embed_in_results: bool = True)

Bases: object

PhenonautPredictor dataclass

The PhenonautPredictor wraps classes with fit and predict methods, augmenting them with additional information like name, the number of views it may operate on at once, and hyperparameter lists which may be optimised using Optuna.

conditional_hyperparameter_generator: Callable | None = None
conditional_hyperparameter_generator_constructor_keyword: str | None = (None,)
constructor_kwargs: dict
dataset_size_cutoff: int | None = None
embed_in_results: bool = True
max_classes: int | None = None
max_optuna_trials: int | None = None
name: str
num_views: int = 1
optuna: Iterable[OptunaHyperparameter] | OptunaHyperparameter | None = None
predictor: BaseEstimator | Callable
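The way the dataclasses in this module fit together can be pictured with a standalone sketch: hyperparameter descriptions with a bounds check, and a wrapper that pairs a class exposing fit and predict with those descriptions. All class names below are simplified local stand-ins, not the Phenonaut classes, and MajorityClassifier is a toy estimator invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class Hyperparameter:
    name: str


@dataclass
class NumberHyperparameter(Hyperparameter):
    lower_bound: float
    upper_bound: float
    needed: bool = True

    def __post_init__(self):
        # Mirrors the documented rule that bounds must be ordered.
        if self.lower_bound >= self.upper_bound:
            raise ValueError("Lower bound must be lower than upper bound")


@dataclass
class CategoricalHyperparameter(Hyperparameter):
    choices: tuple
    needed: bool = True


@dataclass
class Predictor:
    """Wraps a class with fit/predict plus its tunable hyperparameters."""
    name: str
    predictor: Callable
    optuna: Optional[Iterable[Hyperparameter]] = None
    num_views: int = 1
    constructor_kwargs: dict = field(default_factory=dict)


class MajorityClassifier:
    """Toy estimator that always predicts the most common training label."""

    def __init__(self, fallback=0):
        self.label_ = fallback

    def fit(self, X, y):
        labels = list(y)
        if labels:
            self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


wrapped = Predictor(
    name="majority",
    predictor=MajorityClassifier,
    optuna=[CategoricalHyperparameter("fallback", (0, 1))],
)
model = wrapped.predictor(**wrapped.constructor_kwargs).fit([[0], [1], [2]], [1, 1, 0])
predictions = model.predict([[5], [6]])
```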

phenonaut.predict.profile module

phenonaut.predict.profile.profile(phe: Phenonaut, output_directory: str, dataset_combinations: None | list[int] | list[str] = None, target: str | Series | ndarray | Dataset | None = None, predictors: list[PhenonautPredictor] | None = None, prediction_type: str | PredictionType | None = None, n_splits: int = 5, random_state: int | Generator | None = None, optuna_db_path: str | Path | None = None, optuna_db_protocol: str = 'sqlite:///', n_optuna_trials=20, metric: PhenonautPredictionMetric | str | None = None, no_output: bool = False, write_pptx: bool = True, optuna_merge_folds: bool = False, test_set_fraction: float = 0.2)

Profile predictors in their ability to predict a given target.

This predict.profile function operates on a Phenonaut object (optionally containing multiple Datasets) and a given or indicated prediction target. The data within the prediction target is examined and the prediction type is determined from classification, regression, and multiregression/view prediction (prediction of one omics view from another). In Example 1 of the Phenonaut paper, on TCGA with the prediction target “survives_1_year”, the data types within the metadata are examined and only two values are found: 0 (no) or 1 (yes), so classification is chosen.

With no classifiers explicitly given as arguments to the profile function, Phenonaut selects all default classifiers. User-supplied classifiers and predictors may also be passed, including PyTorch neural networks and similar objects, by wrapping any class that implements the fit and predict methods in a PhenonautPredictor dataclass. See the PhenonautPredictor API documentation for further information on including user-defined and packaged predictors.

With no views explicitly listed, all possible view combinations are selected. For TCGA, the four omics views allow 15 view combinations (4 single, 6 double, 4 triple and 1 quadruple). For each unique view combination and predictor, the following steps are performed:

  • Merge views and remove samples which do not have features across currently needed views.

  • Shuffle the samples.

  • Withhold 20% of the data as a test set, to be tested against the trained and hyperparameter optimised predictor.

  • Split the data using 5-fold cross validation into train and validation sets.

  • For each fold, perform Optuna hyperparameter optimisation for the given predictor using the train sets, using hyperparameters described by the default predictors for classification, regression and multiregression.
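The data-handling part of the steps above (shuffle, withhold a test fraction, then split the remainder into train and validation folds) can be sketched with the standard library. This is a simplified illustration rather than Phenonaut's code: holdout_and_folds is a hypothetical helper, folds are assigned by a simple stride, and the per-fold Optuna optimisation is omitted.

```python
import random


def holdout_and_folds(sample_ids, test_set_fraction=0.2, n_splits=5, seed=7):
    """Shuffle, hold out a test set, then build k train/validation folds."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)

    # Withhold the test fraction before any cross-validation takes place.
    n_test = int(round(len(shuffled) * test_set_fraction))
    test, remainder = shuffled[:n_test], shuffled[n_test:]

    folds = []
    for k in range(n_splits):
        # Every n_splits-th sample, starting at k, forms the validation set.
        validation = remainder[k::n_splits]
        held = set(validation)
        train = [s for s in remainder if s not in held]
        folds.append((train, validation))
    return test, folds


test_ids, folds = holdout_and_folds(range(100))
```

With 100 samples and the defaults, 20 samples are withheld for testing and each of the 5 folds has 64 training and 16 validation samples.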

Parameters:
  • phe (phenonaut.Phenonaut) – A Phenonaut object containing Datasets for prediction on

  • output_directory (str) – Directory into which profiling output (boxplots, heatmaps, CSV, JSON and PPTX) should be written.

  • dataset_combinations (Optional[Union[None, list[int], list[str]]], optional) – If the Phenonaut object contains multiple datasets, then tuples of ‘views’ may be specified for exploration. If None, then all combinations of available views/Datasets are enumerated and used. By default None.

  • target (Optional[Union[str, pd.Series, np.ndarray, phenonaut.data.Dataset]], optional) – The prediction target. May be an array-like structure for prediction from aligned views, a string denoting the column in which to find the prediction target data within one of the Phenonaut Datasets, or a pd.Series, by default None.

  • predictors (Optional[list[PhenonautPredictor]], optional) – A list of PhenonautPredictors may be supplied to the function. If None, then all suitable predictors for the type of prediction problem are selected (through loading of default_classifiers, default_regressors, or default_multiregressors from phenonaut/predict/default_predictors/). By default None.

  • prediction_type (Optional[Union[str, PredictionType]], optional) – The type of prediction task like “classification”, “regression” and “view”, or the proper PredictionType enum. If None, then the prediction task is assigned through inspection of the target data types and values present. By default, None.

  • n_splits (int, optional) – Number of splits to use in the N-fold cross validation, by default 5.

  • random_state (Union[int, np.random.Generator], optional) – If an int, then use this to seed a np.random.Generator for reproducibility of random operations (like shuffling etc). If a numpy.random.Generator, then this is used as the source of randomness. Can also be None, in which case a random seed is used. By default, None.

  • optuna_db_path (Optional[Union[Path, str]], optional) – Path to Optuna sqlite3 database file. If None, then a default filename will be assigned by Phenonaut. By default None.

  • optuna_db_protocol (str, optional) – Protocol that Optuna should use for accessing its required persistent storage. By default “sqlite:///”

  • n_optuna_trials (int, optional) – Number of Optuna trials for hyperparameter optimisation, by default 20

  • metric (Optional[Union[PhenonautPredictionMetric, str]], optional) – Metric used for scoring. Currently understands the shortcut strings: accuracy, accuracy_score; mse, MSE, mean_squared_error; rmse, RMSE, root_mean_squared_error; AUROC, auroc, area_under_roc_curve. By default None.

  • no_output (bool, optional) – If True, then no output is written to disk. Hyperparameter optimisation is performed and the (usually) sqlite3 file written, without writing boxplot and heatmap images, CSV, JSON, and PPTX files. By default False.

  • write_pptx (bool, optional) – If True, then the output boxplots and heatmaps are written to a PPTX file for presentation/sharing of data. By default True.

  • optuna_merge_folds (bool, optional) – By default, hyperparameters are optimised for each fold separately and the trained predictor is reported with its parameters. If optuna_merge_folds is True, then all folds are trained on and hyperparameters are optimised across folds (not per fold). Leaving this as False may be useful depending on the intended use of the predictor; it is believed that when parameters are not optimised across folds, more accurate estimates of prediction variance/accuracy are produced. By default False.

  • test_set_fraction (float, optional) – When optimising a predictor, by default a fraction of the total data is held back for testing, separate from the train-validation splits. This test_set_fraction controls the size of this split. By default 0.2.

  • save_models (bool, optional) – Save the trained models to pickle files for later use. By default False.
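When dataset_combinations is None, all combinations of the available views are enumerated. The count of 15 quoted for four views can be checked with a short sketch using itertools; the view names are placeholders, and view_combinations is a hypothetical helper rather than part of the package.

```python
from itertools import combinations


def view_combinations(view_names):
    """List every non-empty combination of views, smallest first."""
    return [
        combo
        for size in range(1, len(view_names) + 1)
        for combo in combinations(view_names, size)
    ]


# Placeholder names standing in for four omics views.
views = ["view_a", "view_b", "view_c", "view_d"]
all_combos = view_combinations(views)
sizes = [len(combo) for combo in all_combos]
```

For four views this yields 4 single, 6 double, 4 triple and 1 quadruple combination, 15 in total.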

Module contents