phenonaut.predict package
Subpackages
Submodules
phenonaut.predict.optuna_functions module
- phenonaut.predict.optuna_functions.predictor_from_str(ob_str: str) PhenonautPredictor
Convert a base64 gzipped object string to PhenonautPredictor
Convert a str to object after applying gzip decompression and then pickle.loads.
- Parameters:
ob_str (str) – String representation of base64 encoded gzipped object
- Returns:
PhenonautPredictor object
- Return type:
PhenonautPredictor
- phenonaut.predict.optuna_functions.predictor_to_str(ob: PhenonautPredictor) str
Encode a PhenonautPredictor (or any serialisable object) to str
First serialise the object using pickle.dumps, compress using gzip and return the base64 string representation. Works with any serialisable object.
- Parameters:
ob (PhenonautPredictor) – Serializable object to convert.
- Returns:
Base64 gzip serialised object as a utf-8 string.
- Return type:
str
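The encoding steps described for predictor_to_str and predictor_from_str (pickle, gzip, base64) can be sketched with the standard library alone. This is a minimal illustration, not Phenonaut's own code; the names to_str_sketch and from_str_sketch are hypothetical, and a plain dict stands in for a PhenonautPredictor.

```python
import base64
import gzip
import pickle


def to_str_sketch(ob) -> str:
    """Pickle, gzip-compress, then base64-encode an object to a UTF-8 string."""
    return base64.b64encode(gzip.compress(pickle.dumps(ob))).decode("utf-8")


def from_str_sketch(ob_str: str):
    """Reverse the steps: base64-decode, gzip-decompress, then unpickle."""
    return pickle.loads(gzip.decompress(base64.b64decode(ob_str.encode("utf-8"))))


# A plain dict stands in for a predictor object (assumption for illustration).
settings = {"name": "example", "num_views": 2}
encoded = to_str_sketch(settings)
restored = from_str_sketch(encoded)
```

Because the final step is pickle.loads, strings of this kind should only be decoded when they come from a trusted source.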
- phenonaut.predict.optuna_functions.run_optuna_opt(X: list[ndarray] | ndarray, X_test: list[ndarray] | ndarray, y: list[ndarray] | ndarray, y_test: list[ndarray] | ndarray, prediction_type: PredictionType, predictor: PhenonautPredictor, metric: PhenonautPredictionMetric, n_optuna_trials: int, phe_name: str, dataset_combination: list[str], optuna_db_path: Path | str, n_splits: int = 5, random_state: int | Generator | None = None, optuna_db_protocol: str = 'sqlite:///', target_dataset_name: str | None = None)
Run Optuna-led hyperparameter optimisation on predictor and data
- Parameters:
X (Union[list[np.ndarray], np.ndarray]) – Training data
X_test (Union[list[np.ndarray], np.ndarray]) – Test data
y (Union[list[np.ndarray], np.ndarray]) – Training target
y_test (Union[list[np.ndarray], np.ndarray]) – Test target data
prediction_type (PredictionType) – Prediction type specifying classification, regression or view.
predictor (PhenonautPredictor) – The predictor with fit, predict functions packed into PhenonautPredictor dataclass which supports the predictor class with specification of hyperparameters etc.
metric (PhenonautPredictionMetric) – The scoring metric to be used to assess performance.
n_optuna_trials (int) – Number of optuna trials to optimise across.
phe_name (str) – Name of the Phenonaut object from which the data comes
dataset_combination (list[str]) – Combination views of datasets to be used
optuna_db_path (Union[Path, str]) – Output file path for Optuna sqlite3 database file
n_splits (int, optional) – Number of splits to be used in cross fold validation, by default 5
random_state (Optional[Union[int, np.random.Generator]], optional) – Seed for use by the random number generator, allowing deterministic repeats. If None, then do not pre-seed the generator. By default None.
optuna_db_protocol (str, optional) – Protocol for Optuna to use to access storage. By default “sqlite:///”.
target_dataset_name (Optional[str], optional) – If predicting a view, then the target Dataset may be given a name, by default None.
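The optuna_db_protocol and optuna_db_path arguments presumably combine into a SQLAlchemy-style storage URL of the kind Optuna accepts. The sketch below assumes simple concatenation of protocol and path; build_storage_url is a hypothetical helper, not part of Phenonaut.

```python
from pathlib import PurePosixPath


def build_storage_url(optuna_db_path, optuna_db_protocol="sqlite:///"):
    """Join a storage protocol prefix and a database path into one URL string.

    Assumption: the protocol and path are concatenated directly.
    PurePosixPath is used so the result is the same on every platform.
    """
    return f"{optuna_db_protocol}{PurePosixPath(optuna_db_path)}"


storage_url = build_storage_url("results/optuna.db")
```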
- phenonaut.predict.optuna_functions.run_optuna_opt_merge_folds(X: list[ndarray] | ndarray, X_test: list[ndarray] | ndarray, y: ndarray, y_test: ndarray, prediction_type: PredictionType, predictor: PhenonautPredictor, metric: PhenonautPredictionMetric, n_optuna_trials: int, phe_name: str, dataset_combination: list[str], optuna_db_path: Path | str, n_splits: int = 5, random_state: int | None = None, optuna_db_protocol: str = 'sqlite:///', target_dataset_name: str | None = None)
phenonaut.predict.predict_utils module
- class phenonaut.predict.predict_utils.PredictionType(value)
Bases:
Enum
PredictionType Enum for classification, regression or view prediction.
- Parameters:
Enum (int) – Enumerated type, captures if the prediction task is classification, regression, or view (multiregression)
- classification = 1
- regression = 2
- view = 3
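The three members above can be mirrored in a small stand-alone Enum to show how a string such as “view” might be resolved to the corresponding member. PredictionTypeSketch and parse_prediction_type are hypothetical names used only for this illustration.

```python
from enum import Enum


class PredictionTypeSketch(Enum):
    """Stand-in mirroring the documented members and values."""

    classification = 1
    regression = 2
    view = 3


def parse_prediction_type(value):
    """Accept either an enum member or its name as a string."""
    if isinstance(value, PredictionTypeSketch):
        return value
    # Lookup by name raises KeyError for unknown strings.
    return PredictionTypeSketch[value]


chosen = parse_prediction_type("view")
```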
- phenonaut.predict.predict_utils.get_X_y(phe: Phenonaut, dataset_combination: tuple | list, target, predictor: PhenonautPredictor, prediction_type: PredictionType) tuple
For a given set of views, and known y, get X and y for predictor training
- Parameters:
phe (Phenonaut) – The Phenonaut object containing Datasets
dataset_combination (Union[tuple, list]) – Dataset combinations to be used in prediction task.
target (pd.Series, np.ndarray) – Prediction target
predictor (PhenonautPredictor) – The PhenonautPredictor being used.
prediction_type (PredictionType) – Enum classification type specifying classification, regression or view.
- Returns:
X, y tuple for training of predictor.
- Return type:
tuple
- phenonaut.predict.predict_utils.get_best_predictor_dataset_df(df: DataFrame, column_containing_values: str = 'test_score') DataFrame
For a given Optuna hyperparameter scan dataframe, get the best predictor
- Parameters:
df (pd.DataFrame) – Optuna hyperparameter scan pd.DataFrame, likely generated by get_df_from_optuna_db.
column_containing_values (str, optional) – Name of the column containing scores. By default “test_score”.
- Returns:
DataFrame containing information on the best predictor.
- Return type:
pd.DataFrame
- phenonaut.predict.predict_utils.get_common_indexes(dataframes_list: list[DataFrame]) list[str]
Get common indexes from list of DataFrames
- Parameters:
dataframes_list (list[pd.DataFrame]) – List of pd.DataFrames from which common indexes should be extracted.
- Returns:
List of common indexes between pd.DataFrames.
- Return type:
list[str]
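The idea behind get_common_indexes is a set intersection over index labels. The sketch below uses plain lists of labels in place of DataFrame indexes so that it runs without pandas; common_indexes_sketch is a hypothetical name, and the ordering of the result (order of first appearance) is an assumption rather than documented behaviour.

```python
def common_indexes_sketch(index_lists):
    """Return labels present in every list, in the order of the first list."""
    if not index_lists:
        return []
    shared = set(index_lists[0])
    for labels in index_lists[1:]:
        shared &= set(labels)
    return [label for label in index_lists[0] if label in shared]


# Each inner list stands in for the index of one DataFrame.
common = common_indexes_sketch(
    [["s1", "s2", "s3"], ["s2", "s3", "s4"], ["s3", "s2"]]
)
```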
- phenonaut.predict.predict_utils.get_df_from_optuna_db(optuna_db_file: str | Path, csv_output_filename: Path | str | None = None, json_output_filename: Path | str | None = None, get_only_best_per_study: bool = False) DataFrame
After predict.profile, turn Optuna sqlite3 files into pd.DataFrame
- Parameters:
optuna_db_file (Union[str, Path]) – Optuna hyperparameter optimisation database (sqlite3 file)
csv_output_filename (Union[Path, str], optional) – Target output CSV file. Can be None, in which case, no CSV file is written out. By default None.
json_output_filename (Union[Path, str], optional) – Target output JSON file. Can be None, in which case, no JSON file is written out. By default None.
get_only_best_per_study (bool, optional) – Boolean value stating if only the best hyperparameter set per study should be written out. By default False.
- Returns:
DataFrame summarising Optuna hyperparameter scan results.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – Database file (sqlite3) not found.
- phenonaut.predict.predict_utils.get_metric(metric: str | dict | PhenonautPredictionMetric)
Get metric function from various options for metric definition
Helper function which allows metrics to be specified as strings naming common metrics, as dictionaries, or as PhenonautPredictionMetric objects.
Currently understands the shortcut strings:
accuracy, accuracy_score
mse, MSE, mean_squared_error
rmse, RMSE, root_mean_squared_error
AUROC, auroc, area_under_roc_curve
- Parameters:
metric (Union[str, dict, PhenonautPredictionMetric]) – String, dict or PhenonautPredictionMetric to be used for scoring
- Returns:
Prediction metric
- Return type:
- Raises:
ValueError – No metrics found matching short string name.
KeyError – Given dictionary did not include all required fields.
ValueError – metric argument was not of a suitable type.
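The lookup behaviour described for get_metric (strings, dictionaries, or ready-made metric objects, with the listed exceptions) can be sketched as follows. This is an illustration under stated assumptions: MetricSketch and get_metric_sketch are hypothetical names, the dictionary keys func, name and lower_is_better are assumed from the dataclass fields, the metric functions are simple pure-Python versions, and the AUROC shortcuts are left out for brevity.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class MetricSketch:
    """Stand-in holding a scoring function, its name and its direction."""

    func: Callable
    name: str
    lower_is_better: bool

    def __call__(self, y_true, y_pred) -> float:
        return self.func(y_true, y_pred)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


# Each tuple of aliases maps to one metric object.
_ALIASES = {
    ("accuracy", "accuracy_score"): MetricSketch(accuracy, "accuracy", False),
    ("mse", "MSE", "mean_squared_error"): MetricSketch(mse, "mse", True),
    ("rmse", "RMSE", "root_mean_squared_error"): MetricSketch(rmse, "rmse", True),
}


def get_metric_sketch(metric):
    """Resolve a string, dict or MetricSketch to a MetricSketch."""
    if isinstance(metric, MetricSketch):
        return metric
    if isinstance(metric, dict):
        # A missing key raises KeyError, as in the documented behaviour.
        return MetricSketch(metric["func"], metric["name"], metric["lower_is_better"])
    if isinstance(metric, str):
        for names, resolved in _ALIASES.items():
            if metric in names:
                return resolved
        raise ValueError(f"No metric found matching '{metric}'")
    raise ValueError("metric must be a str, dict or MetricSketch")


rmse_value = get_metric_sketch("RMSE")([1, 2], [1, 4])
accuracy_value = get_metric_sketch("accuracy_score")([1, 0, 1, 1], [1, 1, 1, 1])
```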
- phenonaut.predict.predict_utils.get_prediction_type_from_y(y: ndarray | DataFrame | Series | list | Dataset) PredictionType
For a given target y - get prediction type
Looking at the data in y, return prediction type from classification, regression or multiregression (view prediction).
- Parameters:
y (np.ndarray|pd.DataFrame|pd.Series) – Target
- Returns:
PredictionType enum of PredictionType.classification or PredictionType.regression
- Return type:
PredictionType
- Raises:
ValueError – y was not of type pd.Series, list, np.ndarray, pd.DataFrame, or phenonaut.data.Dataset
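The exact rules used to inspect y are not given here, so the following is only one plausible heuristic, written against plain Python lists: multi-column targets are treated as view prediction, targets made up entirely of labels or whole numbers as classification, and everything else as regression. The function name guess_prediction_type and all three rules are assumptions for illustration.

```python
def guess_prediction_type(y):
    """Guess a prediction type name from the values in y (heuristic sketch)."""
    rows = list(y)
    if not rows:
        raise ValueError("y is empty")
    # Assumption: a target with several columns per sample is a view.
    if all(isinstance(row, (list, tuple)) for row in rows):
        return "view"

    def is_label(value):
        if isinstance(value, (str, bool, int)):
            return True
        return isinstance(value, float) and value.is_integer()

    # Assumption: only labels or whole numbers means classification.
    if all(is_label(value) for value in rows):
        return "classification"
    return "regression"


binary_target = guess_prediction_type([0, 1, 1, 0])
continuous_target = guess_prediction_type([0.5, 1.2, 3.7])
multi_column_target = guess_prediction_type([[0.1, 0.2], [0.3, 0.4]])
```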
- phenonaut.predict.predict_utils.get_y_from_target(data: Dataset | DataFrame | list[Dataset] | list[DataFrame], target: str | Series | ndarray | tuple | None = None) list | Series | tuple | ndarray
Get target y from Dataset(s) or DataFrame
- Parameters:
data (Union[list[Dataset],list[DataFrame]]) – Dataset, DataFrame, list of Datasets or list of DataFrames where the target values are present.
target (Optional[Union[str, Series, np.ndarray, tuple]], optional) – Target column name containing target y values in Dataset/DataFrame. If None, then an attempt is made to infer the target y by looking for Dataset columns which are not listed as features. Target y cannot be inferred from DataFrames. If np.ndarray, tuple, or pd.Series, then this is directly returned as the prediction target y. By default None.
- Returns:
Target y values for prediction
- Return type:
Union[list, Series, tuple, np.ndarray]
- Raises:
ValueError – Given target string (column title) was a feature of the dataset.
ValueError – Object should be phenonaut.data.Dataset or pd.DataFrame.
ValueError – Given target string (column title) was not found in any supplied Datasets/Dataframes.
ValueError – Target was not set, trying to guess it from a phenonaut.data.Dataset, but did not find this type.
ValueError – Could not guess the target from supplied Dataset(s). Pass target as string for a dataset.df column heading, or the prediction target directly.
phenonaut.predict.predictor_dataclasses module
- class phenonaut.predict.predictor_dataclasses.HyperparameterCategorical(name: str, choices: list | tuple, needed: bool = True)
Bases:
OptunaHyperparameter
Optuna hyperparameter dataclass for categorical lists
- choices: list | tuple
- needed: bool = True
- class phenonaut.predict.predictor_dataclasses.HyperparameterFloat(name: str, lower_bound: int, upper_bound: int, needed: bool = True)
Bases:
OptunaHyperparameterNumber
Optuna hyperparameter dataclass for floats
- class phenonaut.predict.predictor_dataclasses.HyperparameterInt(name: str, lower_bound: int, upper_bound: int, needed: bool = True)
Bases:
OptunaHyperparameterNumber
Optuna hyperparameter dataclass for ints
- class phenonaut.predict.predictor_dataclasses.HyperparameterLog(name: str, lower_bound: int, upper_bound: int, needed: bool = True)
Bases:
OptunaHyperparameterNumber
Optuna hyperparameter dataclass for loguniform distributions of floats
- class phenonaut.predict.predictor_dataclasses.OptunaHyperparameter(name: str)
Bases:
object
OptunaHyperparameter base dataclass which is inherited from
- name: str
- class phenonaut.predict.predictor_dataclasses.OptunaHyperparameterNumber(name: str, lower_bound: int, upper_bound: int, needed: bool = True)
Bases:
OptunaHyperparameter
Optuna hyperparameter dataclass for numbers, inherited for int and float
- Raises:
ValueError – Lower bound must be lower than upper bound.
- lower_bound: int
- needed: bool = True
- upper_bound: int
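The bound check noted under Raises can be expressed in a dataclass with a __post_init__ hook. The sketch below is a stand-alone illustration of that pattern; NumberHyperparameterSketch is a hypothetical name, and whether Phenonaut performs the check in __post_init__ is an assumption.

```python
from dataclasses import dataclass


@dataclass
class NumberHyperparameterSketch:
    """Stand-in for a numeric hyperparameter with validated bounds."""

    name: str
    lower_bound: float
    upper_bound: float
    needed: bool = True

    def __post_init__(self):
        # Runs right after the generated __init__ has set the fields.
        if self.lower_bound >= self.upper_bound:
            raise ValueError("Lower bound must be lower than upper bound")


n_estimators = NumberHyperparameterSketch("n_estimators", 10, 500)
```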
- class phenonaut.predict.predictor_dataclasses.PhenonautPredictionMetric(func: Callable, name: str, lower_is_better: bool)
Bases:
object
PhenonautPredictionMetric dataclass to hold metric, name and direction.
- __call__(*args: Any, **kwds: Any) float
Call self as a function.
- func: Callable
- lower_is_better: bool
- name: str
- class phenonaut.predict.predictor_dataclasses.PhenonautPredictor(name: str, predictor: BaseEstimator | Callable, optuna: Iterable[OptunaHyperparameter] | OptunaHyperparameter | None = None, num_views: int = 1, max_optuna_trials: int | None = None, dataset_size_cutoff: int | None = None, constructor_kwargs: dict = <factory>, max_classes: int | None = None, conditional_hyperparameter_generator_constructor_keyword: str | None = (None, ), conditional_hyperparameter_generator: Callable | None = None, embed_in_results: bool = True)
Bases:
object
PhenonautPredictor dataclass
The PhenonautPredictor wraps classes with fit and predict methods, augmenting them with additional information like name, the number of views it may operate on at once, and hyperparameter lists which may be optimised using Optuna.
- conditional_hyperparameter_generator: Callable | None = None
- conditional_hyperparameter_generator_constructor_keyword: str | None = (None,)
- constructor_kwargs: dict
- dataset_size_cutoff: int | None = None
- embed_in_results: bool = True
- max_classes: int | None = None
- max_optuna_trials: int | None = None
- name: str
- num_views: int = 1
- optuna: Iterable[OptunaHyperparameter] | OptunaHyperparameter | None = None
- predictor: BaseEstimator | Callable
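Since PhenonautPredictor wraps classes with fit and predict methods, a user-defined predictor only needs that pair of methods. The class below is a deliberately trivial, pure-Python example of such an interface; MajorityClassPredictor is a hypothetical name, and the wrapping call in the final comment is an untested guess based on the signature above.

```python
from collections import Counter


class MajorityClassPredictor:
    """Always predicts the most frequent class seen during fit."""

    def __init__(self):
        self.majority_ = None

    def fit(self, X, y):
        # Remember the most common label in the training target.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        if self.majority_ is None:
            raise RuntimeError("fit must be called before predict")
        return [self.majority_ for _ in X]


model = MajorityClassPredictor().fit([[0.1], [0.2], [0.3]], [1, 1, 0])
predictions = model.predict([[0.5], [0.6]])

# Hypothetical wrapping (assumption, not executed here):
# PhenonautPredictor(name="majority", predictor=MajorityClassPredictor)
```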
phenonaut.predict.profile module
- phenonaut.predict.profile.profile(phe: Phenonaut, output_directory: str, dataset_combinations: None | list[int] | list[str] = None, target: str | Series | ndarray | Dataset | None = None, predictors: list[PhenonautPredictor] | None = None, prediction_type: str | PredictionType | None = None, n_splits: int = 5, random_state: int | Generator | None = None, optuna_db_path: Path | str | None = None, optuna_db_protocol: str = 'sqlite:///', n_optuna_trials=20, metric: PhenonautPredictionMetric | str | None = None, no_output: bool = False, write_pptx: bool = True, optuna_merge_folds: bool = False, test_set_fraction: float = 0.2)
Profile predictors in their ability to predict a given target.
This predict.profile function operates on a Phenonaut object (optionally containing multiple Datasets) and a given or indicated prediction target. The data within the prediction target is examined and the prediction type determined from classification, regression, and multiregression/view prediction (prediction of one omics view from another). In Example 1 of the Phenonaut paper, on TCGA, with the prediction target “survives_1_year”, the data types within the metadata are examined and only two values are found: 0 (no) or 1 (yes), so classification is chosen.
With no classifiers explicitly given as arguments to the profile function, Phenonaut selects all default classifiers. User-supplied classifiers and predictors may be passed, including PyTorch neural networks and similar objects, by wrapping any class that implements the fit and predict methods in a PhenonautPredictor dataclass. See the PhenonautPredictor API documentation for further information on including user-defined and packaged predictors.
With no views explicitly listed, all possible view combinations are selected. For TCGA, the four omics views allow 15 view combinations (4 single, 6 double, 4 triple and 1 quadruple). For each unique view combination and predictor, perform the following:
Merge views and remove samples which do not have features across currently needed views.
Shuffle the samples.
Withhold 20% of the data as a test set, to be tested against the trained and hyperparameter optimised predictor.
Split the data using 5-fold cross validation into train and validation sets.
For each fold, perform Optuna hyperparameter optimisation for the given predictor using the train sets, using hyperparameters described by the default predictors for classification, regression and multiregression.
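Two parts of the procedure above lend themselves to a short standard-library sketch: enumerating the view combinations (15 for four views) and the shuffle, 20% hold-out and 5-fold split. The view names, the helper holdout_and_folds and the fixed seed are assumptions for illustration; Phenonaut's own splitting code may differ in detail.

```python
import random
from itertools import combinations

# Hypothetical view names standing in for four omics views.
views = ["view_a", "view_b", "view_c", "view_d"]

# All non-empty combinations: 4 single, 6 double, 4 triple, 1 quadruple.
view_combinations = [
    combo
    for size in range(1, len(views) + 1)
    for combo in combinations(views, size)
]


def holdout_and_folds(n_samples, test_set_fraction=0.2, n_splits=5, seed=7):
    """Shuffle sample indices, hold out a test set, then build k folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_test = int(round(n_samples * test_set_fraction))
    test, remaining = indices[:n_test], indices[n_test:]

    folds = []
    for k in range(n_splits):
        # Every n_splits-th remaining sample forms the validation set.
        validation = remaining[k::n_splits]
        validation_set = set(validation)
        train = [i for i in remaining if i not in validation_set]
        folds.append((train, validation))
    return test, folds


test_indices, folds = holdout_and_folds(100)
```

With 100 samples this yields 20 test samples and five folds, each with 64 training and 16 validation samples, with no index shared between the test set and any fold.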
- Parameters:
phe (phenonaut.Phenonaut) – A Phenonaut object containing Datasets for prediction on
output_directory (str) – Directory into which profiling output (boxplots, heatmaps, CSV, JSON and PPTX) should be written.
dataset_combinations (Optional[Union[None, list[int], list[str]]], optional) – If the Phenonaut object contains multiple datasets, then tuples of ‘views’ may be specified for exploration. If None, then all combinations of available views/Datasets are enumerated and used. By default None.
target (Optional[Union[str, pd.Series, np.ndarray, phenonaut.data.Dataset]], optional) – The prediction target. May be an array-like structure for prediction from aligned views, a string denoting the column in which to find the prediction target data within one of the Phenonaut Datasets, or a pd.Series, by default None.
predictors (Optional[list[PhenonautPredictor]], optional) – A list of PhenonautPredictors may be supplied to the function. If None, then all suitable predictors for the type of prediction problem are selected (through loading of default_classifiers, default_regressors, or default_multiregressors from phenonaut/predict/default_predictors/). By default None.
prediction_type (Optional[Union[str, PredictionType]], optional) – The type of prediction task like “classification”, “regression” and “view”, or the proper PredictionType enum. If None, then the prediction task is assigned through inspection of the target data types and values present. By default, None.
n_splits (int, optional) – Number of splits to use in the N-fold cross validation, by default 5.
random_state (Union[int, np.random.Generator], optional) – If an int, then use this to seed a np.random.Generator for reproducibility of random operations (like shuffling etc). If a numpy.random.Generator, then this is used as the source of randomness. Can also be None, in which case a random seed is used. By default, None.
optuna_db_path (Optional[Union[Path, str]], optional) – Path to Optuna sqlite3 database file. If None, then a default filename will be assigned by Phenonaut. By default None.
optuna_db_protocol (str, optional) – Protocol that Optuna should use for accessing its required persistent storage. By default “sqlite:///”.
n_optuna_trials (int, optional) – Number of Optuna trials for hyperparameter optimisation, by default 20
metric (Optional[Union[PhenonautPredictionMetric, str]], optional) – Metric used for scoring. Currently understands the shortcut strings: accuracy, accuracy_score; mse, MSE, mean_squared_error; rmse, RMSE, root_mean_squared_error; AUROC, auroc, area_under_roc_curve. By default None.
no_output (bool, optional) – If True, then no output is written to disk. Hyperparameter optimisation is performed and the (usually) sqlite3 file written, without writing boxplot and heatmap images, CSVs, JSONs, and PPTX files. By default False.
write_pptx (bool, optional) – If True, then the output boxplots and heatmaps are written to a PPTX file for presentation/sharing of data. By default True.
optuna_merge_folds (bool, optional) – By default, each fold has hyperparameters optimised and the trained predictor with parameters reported. If optuna_merge_folds is True, then each fold is trained on and hyperparameters are optimised across folds (not per-fold). Setting this to False may be useful depending on the intended use of the predictor. It is believed that when False, and parameters are not optimised across folds, more accurate prediction variance/accuracy estimates are produced. By default False.
test_set_fraction (float, optional) – When optimising a predictor, by default a fraction of the total data is held back for testing, separate from the train-validation splits. This test_set_fraction controls the size of this split. By default 0.2.
save_models (bool, optional) – Save the trained models to pickle files for later use. By default False.