DNN model for Gene Ontology (GO) protein embeddings based on the TensorFlow framework

  • Foreword: Research papers related to Gene Ontology GO
  • (1) Gene Ontology GO
    • 1.1 Model purpose
    • 1.2 Protein sequence
    • 1.3 Gene Ontology:
    • 1.4 Composition of GO terms
    • 1.5 File description
    • 1.6 About the labels of the data set
  • (2) Protein embedding for training and test data
    • 2.1 Library import
  • (3) Loading the data set
  • (4) Loading protein embedding
  • (5) Prepare data set
  • (6) Training model
  • (7) Plot model loss and accuracy for each epoch
  • (8) Load the test set
  • (9) Prediction and submission

Foreword: Research papers related to Gene Ontology GO

1. Gene ontology: tool for the unification of biology
Author: Michael Ashburner

2. The Gene Ontology Resource: 20 years and still GOing strong
Author: The Gene Ontology Consortium

3. The Gene Ontology (GO) database and informatics resource
Author: Midori A. Harris

(1) Gene Ontology GO

1.1 Model Purpose

The purpose of the model is to predict the functions of a set of proteins, expressed as GO term IDs, from their amino acid sequences and other data.

1.2 Protein sequence

Each protein is made up of dozens to hundreds of amino acids linked in sequence. Each amino acid can be represented by a one-letter or three-letter code, so a protein sequence is usually written as a string of letters.
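
As a minimal illustration, a protein sequence in Python is simply a string of one-letter codes (the sequence and partial mapping below are made up for demonstration, not taken from the dataset):

# A protein sequence as a string of one-letter amino acid codes (toy example)
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

# Map a few one-letter codes to their three-letter equivalents
one_to_three = {"M": "Met", "K": "Lys", "T": "Thr", "A": "Ala", "Y": "Tyr"}
print([one_to_three.get(aa, "?") for aa in sequence[:5]])  # ['Met', 'Lys', 'Thr', 'Ala', 'Tyr']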

1.3 Gene Ontology:

Gene Ontology (GO) is used to define the functional properties of proteins [1]. GO describes our knowledge of the biological domain in three aspects:

1. Molecular Function (MF)
The molecular-level activities of a single gene product (a protein or RNA) or of a complex of multiple gene products, such as “catalysis” and “transport”.

Note that these terms describe only the activity; they do not specify the entity (molecule or complex) that performs the function, nor the place, time, or context in which the action occurs.

Broad examples are catalytic activity and transporter activity; specific examples are adenylyl cyclase activity and Toll-like receptor binding.

To avoid confusion between the name of a gene product and its molecular function, the word “activity” is usually appended to a GO molecular function. For example, the protein kinase gene product has the GO molecular function “protein kinase activity”.

2. Cellular Component (CC)
The location in the cellular structure where the gene product performs its function, such as the mitochondrion or the ribosome.

Note: Cellular components are anatomical structures of the cell and do not refer to processes.

3. Biological Process (BP)
Biological processes are accomplished through multiple molecular activities. Broad examples are DNA repair or signal transduction; more specific examples are the pyrimidine nucleoside biosynthetic process or glucose transmembrane transport. (Biological processes are not equivalent to pathways: GO currently lacks the information about dynamics and dependencies required to represent complete pathways.)

Example:

Gene product: cytochrome c
Molecular function: oxidoreductase activity
Cellular component: mitochondrial matrix
Biological process: oxidative phosphorylation

1.4 The composition of GO terms

1. Basic elements

  • Unique identifier (GO ID) and name: e.g. GO:0005739, GO:1904659, GO:0016597, named mitochondrion, glucose transmembrane transport, and amino acid binding
  • Aspect: which of cellular component, biological process, or molecular function the term belongs to.
  • Definition: A textual description of a term, with a citation to the source of the information.
  • Relationships: how the term relates to other terms in the ontology. For example, glucose transmembrane transport (GO:1904659) is_a monosaccharide transport (GO:0015749). [2]

Official website documentation: GO overview

2. GO graph
The structure of GO can be described as a graph in which each GO term is a node and each relationship between terms is an edge. GO is loosely hierarchical: “child” terms are more specialized than their “parent” terms, but unlike a strict hierarchy, a term may have more than one parent (note that the parent/child model does not hold for every relationship type; see the GO relations documentation). For example, the biological process term hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process. This reflects the fact that a biosynthetic process is a subtype of metabolic process and that hexose is a subtype of monosaccharide.
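
The multiple-parent structure can be sketched as a small directed graph. A minimal sketch in Python, using the hexose example above (the extra parent terms are illustrative of GO's shape, not an authoritative extract of the ontology):

# Minimal sketch of the GO graph as a dict of child -> parents ("is_a" edges).
# Term names follow the hexose example above; this is illustrative only.
go_parents = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": ["carbohydrate biosynthetic process"],
}

def ancestors(term, graph):
    """Collect all ancestor terms reachable via is_a edges."""
    found = set()
    for parent in graph.get(term, []):
        found.add(parent)
        found |= ancestors(parent, graph)
    return found

print(ancestors("hexose biosynthetic process", go_parents))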

1.5 File Description

The file “train_terms.tsv” contains a list of annotated terms (ground truth) for the proteins in “train_sequences.fasta”.
In “train_terms.tsv”:

  • The first column, EntryID, is the protein's UniProt accession (a unique protein ID)
  • The second column, term, is the GO term ID
  • The third column, aspect, indicates in which ontology the term appears (see the example rows below)
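
Illustrative first rows of the file (the specific IDs and terms here are examples of the format, not quoted from the data):

EntryID       term          aspect
A0A009IHW8    GO:0008152    BPO
A0A009IHW8    GO:0034220    BPO
A0A021WW32    GO:0005886    CCO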

1.6 About the labels of the data set

The goal of the model is to predict the terms (functions) of a protein sequence. A protein sequence can have many functions and can therefore be annotated with any number of terms. Each term is uniquely identified by a “GO term ID”, so the model must predict all “GO term IDs” of the protein sequence. This makes the task a multi-label classification problem.
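
As a sketch of what multi-label targets look like, scikit-learn's MultiLabelBinarizer (scikit-learn is an assumption here; it is not used elsewhere in this notebook) turns per-protein term lists into a binary indicator matrix:

from sklearn.preprocessing import MultiLabelBinarizer

# Toy example: each protein maps to the set of GO term IDs annotated to it
protein_terms = [
    ["GO:0005739", "GO:0016597"],  # protein 1
    ["GO:1904659"],                # protein 2
]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(protein_terms)
print(mlb.classes_)  # ['GO:0005739' 'GO:0016597' 'GO:1904659']
print(y)             # [[1 1 0] [0 0 1]]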

(2) Protein embedding for training and test data

To train a machine learning model, the letter sequences in train_sequences.fasta cannot be used directly; they must first be converted to vectors [3].
The models are trained on embeddings of the protein sequences. (Protein embeddings can be thought of as analogous to the word embeddings used to train NLP models.)

To make computation and data preparation easier, precomputed protein embeddings are used [4].

A protein embedding is a machine-friendly representation that captures the structural and functional characteristics of a protein, mainly from its sequence [5].
One approach is to train a custom ML model [6]. Since this dataset represents proteins by their amino acid sequences (the standard representation), any publicly available pretrained protein embedding model can be used to generate the embeddings [7].
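
As a hedged sketch of how such embeddings could be produced from scratch: the ProtT5 encoder published by Rost Lab on Hugging Face ("Rostlab/prot_t5_xl_half_uniref50-enc") yields per-residue vectors of length 1024 that can be mean-pooled into one vector per protein. The model name, preprocessing, and pooling below follow the Rostlab model card conventions; verify against the model card before relying on this:

# Sketch: computing a per-protein embedding with Rost Lab's ProtT5 encoder.
# Assumes the `transformers`, `sentencepiece`, and `torch` packages and the
# model "Rostlab/prot_t5_xl_half_uniref50-enc" (see its model card).
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

model_name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name)

sequence = "MKTAYIAKQR"  # toy sequence
# ProtT5 expects space-separated residues, with rare amino acids mapped to X
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(prepared, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len + 1, 1024)
# Drop the trailing special token, then mean-pool over residues
embedding = hidden[0, :-1].mean(dim=0)
print(embedding.shape)  # torch.Size([1024])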

2.1 Library import

import tensorflow as tf
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Required for progressbar widget
import progressbar
print("TensorFlow v" + tf.__version__)
print("Numpy v" + np.__version__)

(3) Loading the data set

Load the file “train_terms.tsv”, which contains the list of annotated terms (functions) for the proteins. The labels, i.e. the “GO term IDs”, will be extracted, and a label dataframe [8] will be created for the protein embeddings.

train_terms = pd.read_csv("/kaggle/input/cafa-5-protein-function-prediction/Train/train_terms.tsv", sep="\t")
print(train_terms.shape)

The train_terms dataframe consists of 3 columns and 5,363,863 entries. The first 5 entries (across all 3 columns) can be viewed with:

train_terms.head()


(4) Loading protein embedding

Using Rost Lab’s T5 protein language model, the protein embeddings used for training are stored in “train_embeds.npy”, and the corresponding protein IDs are available in “train_ids.npy”.

Load the protein IDs of the protein embeddings in the training dataset, contained in “train_ids.npy”, into a numpy array.

train_protein_ids = np.load('/kaggle/input/t5embeds/train_ids.npy')
print(train_protein_ids.shape)
train_protein_ids[:5]

After loading the files as numpy arrays, convert them into Pandas dataframes.

Each protein embedding is a vector of length 1024. Build the resulting dataframe with 1024 columns, one for each position in the vector.

train_embeddings = np.load('/kaggle/input/t5embeds/train_embeds.npy')

# Now convert the embeddings numpy array (train_embeddings) into a pandas dataframe
column_num = train_embeddings.shape[1]
train_df = pd.DataFrame(train_embeddings, columns = ["Column_" + str(i) for i in range(1, column_num + 1)])
print(train_df.shape)

(5) Prepare data set

Extract all required labels (“GO term IDs”) from the “train_terms.tsv” file. There are over 40,000 unique terms in total.

To simplify the model, the 1500 most common “GO term IDs” will be selected as labels.

Plot the 100 most frequent “GO term IDs” in “train_terms.tsv”:

# Select the top 100 term counts for plotting
plot_df = train_terms['term'].value_counts().iloc[:100]

figure, axis = plt.subplots(1, 1, figsize=(12, 6))

bp = sns.barplot(ax=axis, x=np.array(plot_df.index), y=plot_df.values)
bp.set_xticklabels(bp.get_xticklabels(), rotation=90, size = 6)
axis.set_title('Top 100 frequent GO term IDs')
bp.set_xlabel("GO term IDs", fontsize = 12)
bp.set_ylabel("Count", fontsize = 12)
plt.show()


Save the top 1500 most common GO term IDs into a list.

# Set the limit for label
num_of_labels = 1500

# Take value counts in descending order and fetch first 1500 `GO term ID` as labels
labels = train_terms['term'].value_counts().index[:num_of_labels].tolist()

Create a new dataframe using the selected “GO Term ID” to filter the training terms

# Fetch the train_terms data for the relevant labels only
train_terms_updated = train_terms.loc[train_terms['term'].isin(labels)]

Plot the distribution of the aspect column in the new train_terms_updated dataframe as a pie chart:

pie_df = train_terms_updated['aspect'].value_counts()
palette_color = sns.color_palette('bright')
plt.pie(pie_df.values, labels=np.array(pie_df.index), colors=palette_color, autopct='%.0f%%')
plt.show()


About 63% of the selected “GO term IDs” belong to the BPO (Biological Process Ontology) aspect.

Since this is a multi-label classification problem, use 1 or 0 in the label array to represent the presence or absence of each GO term ID for each protein ID.
First, create a numpy array train_labels of the required size for the labels. Then iterate over the list of labels to fill train_labels with the appropriate values.

# Set up the progress bar.
# This is purely cosmetic.
bar = progressbar.ProgressBar(maxval=num_of_labels, \
    widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
bar.start()

# Create an empty dataframe of required size for storing the labels,
# i.e, train_size x num_of_labels (142246 x 1500)
train_size = train_protein_ids.shape[0] # len(X)
train_labels = np.zeros((train_size ,num_of_labels))

# Convert from numpy to pandas series for better handling
series_train_protein_ids = pd.Series(train_protein_ids)

# Loop through each label
for i in range(num_of_labels):
    # For each label, fetch the corresponding train_terms data
    n_train_terms = train_terms_updated[train_terms_updated['term'] == labels[i]]
    
    # Fetch all the unique EntryIDs (proteins) related to the current label (GO term ID)
    label_related_proteins = n_train_terms['EntryID'].unique()
    
    # In the series_train_protein_ids pandas series, if a protein is related
    # to the current label, mark it as 1, else 0.
    # Store the result in the ith column of train_labels.
    train_labels[:,i] = series_train_protein_ids.isin(label_related_proteins).astype(float)
    
    # Progress bar percentage increase
    bar.update(i + 1)

# Notify the end of progress bar
bar.finish()

# Convert the train_labels numpy array into a pandas dataframe
labels_df = pd.DataFrame(data = train_labels, columns = labels)
print(labels_df.shape)
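
As a side note, the per-label loop above can be replaced by a single vectorized pivot. A hedged sketch, assuming the same train_terms_updated, labels, and train_protein_ids as above (equivalent output up to dtype):

# Vectorized alternative to the label loop (sketch, not used below):
# pivot the (EntryID, term) pairs into a binary indicator matrix in one step.
indicator = (
    train_terms_updated
    .drop_duplicates(["EntryID", "term"])
    .assign(value=1.0)
    .pivot(index="EntryID", columns="term", values="value")
    .fillna(0.0)
)
# Align rows to the embedding order and columns to the chosen label order
labels_df_alt = indicator.reindex(index=train_protein_ids, columns=labels).fillna(0.0)
print(labels_df_alt.shape)  # (142246, 1500)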

The final label dataframe (labels_df) consists of 1500 columns and 142246 entries. Print the first 5 entries to inspect all 1500 label dimensions:

labels_df.head()

(6) Training model

Use TensorFlow to train a deep neural network on the protein embeddings:

INPUT_SHAPE = [train_df.shape[1]]
BATCH_SIZE = 5120

model = tf.keras.Sequential([
    tf.keras.layers.BatchNormalization(input_shape=INPUT_SHAPE),
    tf.keras.layers.Dense(units=512, activation='relu'),
    tf.keras.layers.Dense(units=512, activation='relu'),
    tf.keras.layers.Dense(units=512, activation='relu'),
    tf.keras.layers.Dense(units=num_of_labels,activation='sigmoid')
])


# Compile model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['binary_accuracy', tf.keras.metrics.AUC()],
)

history = model.fit(
    train_df, labels_df,
    batch_size=BATCH_SIZE,
    epochs=5
)
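
A common refinement, not part of the original run, is to hold out part of the training data for validation and stop training when the validation loss stops improving. A sketch (the patience value and 10% split are arbitrary choices):

# Optional sketch: train with a held-out validation split and early stopping
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)
history = model.fit(
    train_df, labels_df,
    batch_size=BATCH_SIZE,
    epochs=50,
    validation_split=0.1,  # hold out 10% of the proteins for validation
    callbacks=[early_stop],
)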

(7) Plot the model loss and accuracy for each epoch

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss']].plot(title="Cross-entropy")
history_df.loc[:, ['binary_accuracy']].plot(title="Accuracy")


(8) Load the test set

test_embeddings = np.load('/kaggle/input/t5embeds/test_embeds.npy')

# Convert test_embeddings to dataframe
column_num = test_embeddings.shape[1]
test_df = pd.DataFrame(test_embeddings, columns = ["Column_" + str(i) for i in range(1, column_num + 1)])
print(test_df.shape)

test_df.head()

(9) Prediction and submission

predictions = model.predict(test_df)

# Reference: https://www.kaggle.com/code/alexandervc/baseline-multilabel-to-multitarget-binary

df_submission = pd.DataFrame(columns = ['Protein Id', 'GO Term Id','Prediction'])
test_protein_ids = np.load('/kaggle/input/t5embeds/test_ids.npy')
l = []
for k in list(test_protein_ids):
    l += [k] * predictions.shape[1]

df_submission['Protein Id'] = l
df_submission['GO Term Id'] = labels * predictions.shape[0]
df_submission['Prediction'] = predictions.ravel()
df_submission.to_csv("submission.tsv",header=False, index=False, sep="\t")
df_submission
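
The full cross product is 1500 rows per protein, so the submission file can get large. A hedged sketch of dropping near-zero scores before writing (the 0.01 cutoff is an arbitrary choice, not part of the original notebook):

# Optional sketch: keep only predictions above a small threshold to shrink
# the submission file; the 0.01 cutoff is arbitrary.
mask = df_submission['Prediction'] > 0.01
df_submission[mask].to_csv("submission.tsv", header=False, index=False, sep="\t")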


  1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000 May;25(1):25-9. doi: 10.1038/75556. PMID: 10802651; PMCID: PMC3037419.

  2. https://zhuanlan.zhihu.com/p/99789859

  3. Heinzinger, M., Bagheri, M., & Kelley, L. A. (2017). Deep learning for predicting protein backbone structure from amino acid sequences. Scientific reports, 7(1), 1-11.

  4. https://zhuanlan.zhihu.com/p/593730989

  5. https://blog.csdn.net/weixin_43841338/article/details/103171724

  6. AlQuraishi, M. (2019). End-to-end differentiable learning of protein structure. Cell Systems, 8(4), 292-301.

  7. https://zhuanlan.zhihu.com/p/498353259

  8. Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., … & Zisserman, A. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 577(7792), 706-710.