Tutorial 1: Built-in demonstration scripts

[1]:

# import dlim packages
from dlim.model import DLIM
from dlim.dataset import Data_model
from dlim.api import DLIM_API


import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from numpy.random import choice

Load the data and split it into training and validation sets

Load the data. The input must be a CSV file without a header row:

the first n-1 columns correspond to n-1 genes (or other variables), and each gene can carry multiple mutations
the last column is the fitness value
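To make the expected layout concrete, here is a minimal sketch with two variable columns and one fitness column. The mutation labels (`g1_m1`, etc.) and the fitness values are made up for illustration only; the actual encoding used in `data_epis_1.csv` may differ.

```python
import io
import pandas as pd

# Made-up rows for illustration only:
# mutation on gene 1, mutation on gene 2, fitness
toy_csv = io.StringIO(
    "g1_wt,g2_wt,1.00\n"
    "g1_m1,g2_wt,0.82\n"
    "g1_wt,g2_m1,0.91\n"
    "g1_m1,g2_m1,0.55\n"
)

# Same reading options as in the tutorial: comma-separated, no header row
toy_df = pd.read_csv(toy_csv, sep=",", header=None)

# Every column except the last is a variable; the last one is the fitness
n_variables = toy_df.shape[1] - 1
fitness = toy_df.iloc[:, -1]
mutations_per_gene = toy_df.iloc[:, :-1].nunique().tolist()

print(n_variables)          # number of variable columns
print(mutations_per_gene)   # distinct mutations seen for each gene
```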

[2]:

df_data = pd.read_csv("./data/data_epis_1.csv", sep = ',', header = None)

Convert the data to tensor format so that it can be fed to DLIM later on

Manually split the data into 70% for training and 30% for validation

[ ]:

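data = Data_model(data=df_data, n_variables=2)
n_samples = data.data.shape[0]
# draw 70% of the row indices without replacement for training
train_id = choice(range(n_samples), int(n_samples*0.7), replace=False)
# a set makes the membership test below fast; the remaining rows form the validation set
train_set = set(train_id.tolist())
val_id = [i for i in range(n_samples) if i not in train_set]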
train_data = data.subset(train_id)
val_data = data.subset(val_id)
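As an alternative way to build the two index sets, the sketch below shuffles all row indices once and cuts the result at 70%. The helper name `split_indices` and its `seed` argument are hypothetical and not part of dlim; fixing the seed simply makes the split reproducible between runs.

```python
import numpy as np

def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Return disjoint (train, validation) index arrays covering range(n_samples)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)       # every index exactly once, shuffled
    n_train = int(n_samples * train_fraction)
    return order[:n_train], order[n_train:]

train_idx, val_idx = split_indices(100)
print(len(train_idx), len(val_idx))
```

The resulting arrays could then be passed to `data.subset(...)` in the same way as `train_id` and `val_id` above.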

Construct your model and train it on your data

DLIM model:

n_variables: the number of variables (e.g., genes, environments)
hid_dim: the number of neurons in the hidden layers
nb_layer: the number of hidden layers of the neural network

DLIM_API:

This is the API used to train the model, make predictions, and plot the resulting landscape.

model: the DLIM model
flag_spectral: bool; if True, use spectral initialization; if False, use Xavier-Glorot initialization
dlim_regressor.fit: trains the model on the training data
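For readers unfamiliar with the Xavier-Glorot scheme named above, here is a minimal NumPy sketch of its uniform variant, which samples weights from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)). This is not dlim's own code; the function name `xavier_uniform` and the layer shape are chosen only for illustration.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, seed=0):
    """Sample a (fan_in, fan_out) weight matrix from U(-a, a),
    where a = sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Illustrative shape only: 2 inputs feeding 32 hidden units
weights = xavier_uniform(fan_in=2, fan_out=32)
print(weights.shape)
```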

Hyperparameters

lr: learning rate
nb_epoch: number of epochs
batch_size: batch size
emb_regularization: weight of the regularization on the embeddings
save_path: path for saving the model; if None, the model won’t be saved

[15]:

model = DLIM(n_variables = train_data.nb_val, hid_dim = 32, nb_layer = 1)
dlim_regressor = DLIM_API(model=model, flag_spectral=True)
losses = dlim_regressor.fit(train_data, lr = 1e-3, nb_epoch=40, batch_size=32, emb_regularization=0, \
                            save_path= './pretrain/harry_epis_model.pt')

spectral gap = 0.865142822265625
spectral gap = 0.568690836429596
Model saved to ./pretrain/harry_epis_model.pt

Plot the training loss over the course of training

[16]:

plt.figure()
plt.plot(losses)
plt.show()

../_images/tutorials_tutorial_1_plot_landscape_10_0.png
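If the loss curve is noisy, a moving average can make the trend easier to read. The sketch below assumes that `losses` is a flat sequence of numbers, which this tutorial does not state explicitly; `moving_average` is a hypothetical helper, and the input values are synthetic.

```python
import numpy as np

def moving_average(values, window=5):
    """Average each run of `window` consecutive values."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    # "valid" keeps only positions where the window fits entirely
    return np.convolve(values, kernel, mode="valid")

# Synthetic values for illustration, not real training losses
toy_losses = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = moving_average(toy_losses, window=2)
print(smoothed)
```

The smoothed array could be passed to `plt.plot` in place of `losses`; note that it is `window - 1` entries shorter than the input.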

Prediction on validation data

dlim_regressor.predict

data: the input data, which should be a tensor
detach: bool; if True, the output is returned as NumPy arrays; if False, as tensors

[11]:

fit_a, var_a, lat_a = dlim_regressor.predict(val_data.data[:,:-1], detach=True)

Visualization of the landscape

first panel: the landscape inferred by DLIM; the dots are measurements
second panel: the inferred phenotype of each mutation on gene 1 versus all fitness values measured for that mutation
third panel: the inferred phenotype of each mutation on gene 2 versus all fitness values measured for that mutation

[12]:

score = pearsonr(fit_a.flatten(), val_data.data[:, [-1]].flatten())[0]
print(score)

fig, (bx, cx, dx) = plt.subplots(1, 3, figsize=(6, 2))
dlim_regressor.plot(bx, data)

for xx in [bx, cx, dx]:
    for el in ["top", "right"]:
        xx.spines[el].set_visible(False)

# Scatter each inferred phenotype against the measured fitness
print(pearsonr(lat_a[:, 0], val_data[:, -1]))
cx.scatter(lat_a[:, 0], val_data[:, -1], s=5, c="grey")
dx.scatter(lat_a[:, 1], val_data[:, -1], s=5, c="grey")
cx.set_ylabel("F")
cx.set_xlabel("$\\varphi^1$")
dx.set_xlabel("$\\varphi^2$")
plt.tight_layout()
plt.show()

0.9723149104776725
PearsonRResult(statistic=-0.010524778629337813, pvalue=0.8407427503016924)

../_images/tutorials_tutorial_1_plot_landscape_14_1.png
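The first number printed above is the Pearson correlation between predicted and measured fitness. For reference, here is a minimal NumPy sketch of that statistic; `pearson_r` is a hypothetical helper, the inputs are synthetic, and the tutorial itself relies on `scipy.stats.pearsonr`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc = x - x.mean()   # center both variables
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Synthetic example: y is an exact increasing linear function of x
r_linear = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(r_linear)
```

A value close to 1 means that predictions and measurements rise and fall together almost linearly, while a value near 0 indicates no linear relationship.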

Use a pretrained model

DLIM model:

n_variables: the number of variables (e.g., genes, environments)
hid_dim: the number of neurons in the hidden layers
nb_layer: the number of hidden layers of the neural network

DLIM_API:

If the model has already been saved, you can load it by adding:

load_model: the path to the saved model
flag_spectral: bool; if True, use spectral initialization; if False, use Xavier-Glorot initialization
dlim_regressor.predict: uses the saved model to make predictions on the data

[13]:

# use the same architecture as the model that was trained and saved above
model = DLIM(n_variables = data.nb_val, hid_dim = 32, nb_layer = 1)
dlim_regressor = DLIM_API(model=model, flag_spectral=True, load_model='./pretrain/harry_epis_model.pt')
fit_a, var_a, lat_a = dlim_regressor.predict(data.data[:,:-1], detach=True)

Visualization of the obtained landscape

[14]:

score = pearsonr(fit_a.flatten(), data.data[:, [-1]].flatten())[0]
print(score)

fig, (bx, cx, dx) = plt.subplots(1, 3, figsize=(6, 2))
dlim_regressor.plot(bx, data)

for xx in [bx, cx, dx]:
    for el in ["top", "right"]:
        xx.spines[el].set_visible(False)

# Scatter each inferred phenotype against the measured fitness
print(pearsonr(lat_a[:, 0], data[:, -1]))
cx.scatter(lat_a[:, 0], data[:, -1], s=5, c="grey")
dx.scatter(lat_a[:, 1], data[:, -1], s=5, c="grey")
cx.set_ylabel("F")
cx.set_xlabel("$\\varphi^1$")
dx.set_xlabel("$\\varphi^2$")
plt.tight_layout()
plt.show()

0.9763892968817973
PearsonRResult(statistic=0.03691106761164796, pvalue=0.19706734038768178)

../_images/tutorials_tutorial_1_plot_landscape_19_1.png
