Tutorial 1: Built-in demonstration scripts

[1]:
# import dlim packages
from dlim.model import DLIM
from dlim.dataset import Data_model
from dlim.api import DLIM_API


import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from numpy.random import choice

Load the data and split it into training and validation sets

Load the data. The input must be a CSV file with the following layout:

  • the first n-1 columns hold the n-1 genes (or other variables); each gene can carry multiple mutations

  • the last column is the fitness value
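To illustrate this layout, the sketch below writes and re-reads a tiny CSV with two genes and one fitness column, using only the standard library. The mutation labels and fitness values are hypothetical placeholders, not taken from data_epis_1.csv, and whether DLIM expects string or integer labels is an assumption here.

```python
import csv
import io

# Hypothetical toy rows: (mutation in gene 1, mutation in gene 2, fitness).
rows = [
    ("g1_m0", "g2_m0", 1.0),
    ("g1_m0", "g2_m1", 0.82),
    ("g1_m1", "g2_m0", 0.91),
    ("g1_m1", "g2_m1", 0.4),
]

# Write the rows without a header line, comma-separated.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# Read the text back and check the layout: n-1 variable columns, then fitness.
parsed = list(csv.reader(io.StringIO(csv_text)))
n_variables = len(parsed[0]) - 1
fitness = [float(row[-1]) for row in parsed]
```

With this toy file, n_variables is 2, which is the value that would be passed to Data_model.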

[2]:
df_data = pd.read_csv("./data/data_epis_1.csv", sep = ',', header = None)

Convert the data to tensor format so that it can be fed to DLIM later on

Manually split the data into 70% for training and 30% for validation

[ ]:
data = Data_model(data=df_data, n_variables=2)
train_id = choice(range(data.data.shape[0]), int(data.data.shape[0]*0.7), replace=False)
val_id = [i for i in range(data.data.shape[0]) if i not in train_id]
train_data = data.subset(train_id)
val_data = data.subset(val_id)
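A possible variant of the split above, sketched with the standard library only: it fixes a seed so the split is reproducible, and it uses a set for the membership test so the training indices are not rescanned for every row. The helper name split_indices and its parameters are illustrative and not part of dlim.

```python
import random

def split_indices(n_rows, train_fraction=0.7, seed=0):
    """Return disjoint (train_id, val_id) index lists covering range(n_rows)."""
    rng = random.Random(seed)            # local generator, reproducible
    n_train = int(n_rows * train_fraction)
    train_id = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train_id)            # constant-time membership test
    val_id = [i for i in range(n_rows) if i not in train_set]
    return train_id, val_id

train_id, val_id = split_indices(10)
```

The two lists could then be passed to data.subset in the same way as train_id and val_id above.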

Construct the model and train it on your data

DLIM model:

  • n_variables: the number of variables (e.g., genes, environments)

  • hid_dim: the number of neurons in hidden layers

  • nb_layer: the number of hidden layers of your NN

DLIM_API:

This is the API used to train the model, make predictions, and plot the resulting landscape.

  • model: DLIM model

  • flag_spectral: bool; True to use spectral initialization, False to use Xavier-Glorot initialization

  • dlim_regressor.fit: trains the model on the training data

Hyperparameters

  • lr: learning rate

  • nb_epoch: number of epochs

  • batch_size: size of each batch

  • emb_regularization: weight of the regularization term on the embeddings

  • save_path: path for saving the model; if None, the model won’t be saved
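To get a feel for how nb_epoch and batch_size interact, the sketch below counts gradient updates. It assumes that fit passes over the training set once per epoch in mini-batches and keeps the last partial batch, which is a common convention; dlim's actual training loop may differ. The function name and the example row count are hypothetical.

```python
import math

def count_updates(n_train, batch_size, nb_epoch):
    """Number of mini-batch updates under the assumptions stated above."""
    batches_per_epoch = math.ceil(n_train / batch_size)
    return batches_per_epoch * nb_epoch

# Hypothetical training set of 1000 rows with the settings used in this tutorial.
updates = count_updates(n_train=1000, batch_size=32, nb_epoch=40)
```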

[15]:
model = DLIM(n_variables = train_data.nb_val, hid_dim = 32, nb_layer = 1)
dlim_regressor = DLIM_API(model=model, flag_spectral=True)
losses = dlim_regressor.fit(train_data, lr = 1e-3, nb_epoch=40, batch_size=32, emb_regularization=0, \
                            save_path= './pretrain/harry_epis_model.pt')
spectral gap = 0.865142822265625
spectral gap = 0.568690836429596
Model saved to ./pretrain/harry_epis_model.pt

Plot the loss recorded during training

[16]:
plt.figure()
plt.plot(losses)
plt.show()
../_images/tutorials_tutorial_1_plot_landscape_10_0.png
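If the raw curve is noisy, a moving average can make the trend easier to read. This is a minimal sketch that assumes losses is a flat sequence of floats; the helper moving_average and the example values are hypothetical.

```python
def moving_average(values, window=5):
    """Trailing moving average; early points average over what is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]   # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed

# Hypothetical loss values, for illustration only.
example_losses = [4.0, 2.0, 3.0, 1.0]
smoothed = moving_average(example_losses, window=2)
```

The smoothed list could be drawn alongside the raw curve, for instance with plt.plot(moving_average(losses)).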

Prediction on validation data

dlim_regressor.predict

  • data: the input, which should be a tensor

  • detach: bool; True to return NumPy arrays, False to return tensors

[11]:
fit_a, var_a, lat_a = dlim_regressor.predict(val_data.data[:,:-1], detach=True)

Visualization of the landscape

  • first panel: the landscape inferred by DLIM; the dots are the measurements

  • second panel: the inferred phenotype of each mutation in gene 1 versus all fitness values measured for that mutation

  • third panel: the inferred phenotype of each mutation in gene 2 versus all fitness values measured for that mutation

[12]:
score = pearsonr(fit_a.flatten(), val_data.data[:, [-1]].flatten())[0]
print(score)

fig, (bx, cx, dx) = plt.subplots(1, 3, figsize=(6, 2))
dlim_regressor.plot(bx, data)

for xx in [bx, cx, dx]:
    for el in ["top", "right"]:
        xx.spines[el].set_visible(False)

# Scatter the inferred phenotype of each gene against the measured fitness
print(pearsonr(lat_a[:, 0], val_data[:, -1]))
cx.scatter(lat_a[:, 0], val_data[:, -1], s=5, c="grey")
dx.scatter(lat_a[:, 1], val_data[:, -1], s=5, c="grey")
cx.set_ylabel("F")
cx.set_xlabel("$\\varphi^1$")
dx.set_xlabel("$\\varphi^2$")
plt.tight_layout()
plt.show()
0.9723149104776725
PearsonRResult(statistic=-0.010524778629337813, pvalue=0.8407427503016924)
../_images/tutorials_tutorial_1_plot_landscape_14_1.png
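Both printed values are Pearson correlations: the first compares predicted and measured fitness, and the second compares the first latent phenotype with measured fitness. For reference, here is a minimal pure-Python sketch of that statistic. It is an illustration of the formula, not the scipy implementation, and the helper name pearson is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances, without the shared 1/n factor (it cancels out).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r_up = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_down = pearson([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
```

A value near 1 means the two sequences rise together along a line, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.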

Use pretrained model

DLIM model:

  • n_variables: the number of variables (e.g., genes, environments)

  • hid_dim: the number of neurons in hidden layers

  • nb_layer: the number of hidden layers of your NN

DLIM_API:

If the model has already been saved, you can load it by adding:

  • load_model: the path of the saved model

  • flag_spectral: bool; True to use spectral initialization, False to use Xavier-Glorot initialization

  • dlim_regressor.predict: uses the saved model to predict on the data

[13]:
# nb_layer matches the value used when the saved model was trained
model = DLIM(n_variables = data.nb_val, hid_dim = 32, nb_layer = 1)
dlim_regressor = DLIM_API(model=model, flag_spectral=True, load_model='./pretrain/harry_epis_model.pt')
fit_a, var_a, lat_a = dlim_regressor.predict(data.data[:,:-1], detach=True)
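Before passing a path as load_model, it can help to check that the file exists so that a wrong path fails with a clear message. The sketch below uses only the standard library; require_checkpoint is a hypothetical helper, not part of dlim, and the empty temporary file merely stands in for a real saved model.

```python
import tempfile
from pathlib import Path

def require_checkpoint(path):
    """Return the path as a string, or raise if no file exists there."""
    checkpoint = Path(path)
    if not checkpoint.is_file():
        raise FileNotFoundError(
            f"No saved model at {checkpoint}; train with save_path first."
        )
    return str(checkpoint)

# An existing file is accepted (empty stand-in for a real checkpoint).
with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "model.pt"
    saved.write_bytes(b"")
    found_ok = require_checkpoint(saved) == str(saved)

# A missing file is rejected.
try:
    require_checkpoint("./does/not/exist.pt")
    missing_raises = False
except FileNotFoundError:
    missing_raises = True
```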

Visualization of obtained landscape

[14]:
score = pearsonr(fit_a.flatten(), data.data[:, [-1]].flatten())[0]
print(score)

fig, (bx, cx, dx) = plt.subplots(1, 3, figsize=(6, 2))
dlim_regressor.plot(bx, data)

for xx in [bx, cx, dx]:
    for el in ["top", "right"]:
        xx.spines[el].set_visible(False)

# Scatter the inferred phenotype of each gene against the measured fitness
print(pearsonr(lat_a[:, 0], data[:, -1]))
cx.scatter(lat_a[:, 0], data[:, -1], s=5, c="grey")
dx.scatter(lat_a[:, 1], data[:, -1], s=5, c="grey")
cx.set_ylabel("F")
cx.set_xlabel("$\\varphi^1$")
dx.set_xlabel("$\\varphi^2$")
plt.tight_layout()
plt.show()
0.9763892968817973
PearsonRResult(statistic=0.03691106761164796, pvalue=0.19706734038768178)
../_images/tutorials_tutorial_1_plot_landscape_19_1.png