Welcome to follow my CSDN: https://spike.blog.csdn.net/
URL of this article: https://spike.blog.csdn.net/article/details/134144591
In protein complex structure prediction, when the sequence is a heterologous multi-chain, whether it is AB or AABB, MSA pairing is required, that is, MSA Pairing. During the search process of MSA, the search is performed according to the single chain dimension, merged through MSA Pairing, and input as features to Multimer structure prediction.
Controlling the number of MSA includes 3 hyperparameters:
max_msa_crop_size
, used to determine the length of the MSA, default setting is 2048max_msa_clusters
, used to determine the length of MSA features in inference, default setting 252max_extra_msa
, used to limit the length of Extra MSA features in inference, default setting 1024
These 3 parameters are set in sequence, including each other from front to back, and can be adjusted according to different situations, where the 1st parameter > (2nd parameter + 3rd parameter).
The default single-chain search files are as follows:
bfd_uniref_hits.a3m mgnify_hits.sto pdb_hits.sto uniprot_hits.sto uniref90_hits.sto
Among them, uniref90_hits.sto
is used for MSA Pairing, pdb_hits.sto
is used for template features, bfd_uniref_hits.a3m
, mgnify_hits. sto
, uniref90_hits.sto
, for single-chain MSA features. We assume a 4-chain PDB in ABAB format.
Optimization 1: MSA Pairing only uses uniprot_hits.sto
by default. When the number is small, you can use uniref90_hits.sto
as a supplement. It is mainly necessary to add additional Fake species< /strong>.
The source code openfold/data/data_pipeline.py
is as follows:
# + + + + + + + + + + Supplement the logic of MSA Pairing source + + + + + + + + + + # # target_seq is not used in the standard AF2 Multimer process, that is, target_seq is None # logger.info(f"[CL] target_seq: {target_seq}") msa = parsers.parse_stockholm(result, query_seq=target_seq) msa = msa.truncate(max_seqs=self._max_uniprot_hits) msa_extra = parsers.parse_stockholm(result_extra, query_seq=target_seq) msa_extra = msa_extra.truncate(max_seqs=self._max_uniprot_hits) logger.info(f"[CL] all_seq msa: {<!-- -->len(msa.sequences)}, add uniref msa: {<!-- -->len(msa_extra.sequences)}") all_seq_features = make_msa_features([msa, msa_extra]) logger.info(f"[CL] all_seq msa: {<!-- -->all_seq_features['msa'].shape}") # + + + + + + + + + + Supplement the logic of MSA Pairing source + + + + + + + + + + #
Optimization 2: When the number of single-chain MSA is small, use uniprot_hits.sto
as a supplement to MSA.
The source code openfold/data/data_pipeline.py
is as follows:
# + + + + + + + + + + Supplementary logic for single-stranded MSA sequences + + + + + + + + + + # msa_seq_list = set() for _, msa in msa_dict.items(): for sequence_index, sequence in enumerate(msa.sequences): msa_seq_list.add(sequence) msa_seq_list = list(msa_seq_list) thr = 64 # This affects sequences without pairing. The value should not be too large. msa_size = len(msa_seq_list) if msa_size <thr and uniprot_path: logger.info(f"[CL] single msa too small {<!-- -->msa_size} < {<!-- -->thr} (thr), uniprot_path: {<!-- -->uniprot_path} ") with open(uniprot_path) as f: sto_string = f.read() msa_obj = parsers.parse_stockholm(sto_string) msa_seq_list + = msa_obj.sequences msa_seq_list = list(set(msa_seq_list)) diff_size = len(msa_seq_list) - msa_size logger.info(f"[CL] single msa from {<!-- -->msa_size} to {<!-- -->len(msa_seq_list)}, add {<!-- -->diff_size}") if diff_size > 0: msa_list.append(msa_obj) #Add additional data # + + + + + + + + + + Supplementary logic for single-stranded MSA sequences + + + + + + + + + + #
Optimization 3: When the number of MSA Pairing is too small, especially when the number of full-chain Pairing is too small, use MSA from other species as a supplement to MSA Pairing.
The source code openfold/data/msa_pairing.py
is as follows:
# + + + + + + + + + + Supplement the logic of MSA Pairing + + + + + + + + + + # thr = 128 num_all_pairing = len(tmp_dict1[num_examples]) if num_all_pairing <thr: logger.info(f"[CL] full msa pairing ({<!-- -->num_examples} chains) is too little ({<!-- -->num_all_pairing}<{<!-- -->thr} ), " f"so add more!") tmp_dict2 = process_species( num_examples, common_species, all_chain_species_dict, prokaryotic, is_fake=True) # all_paired_msa_rows_dict = tmp_dict2 tmp_item = list(tmp_dict1[num_examples]) + list(tmp_dict2[num_examples]) # Supplementary part MSA tmp_item = np.unique(tmp_item, axis=0) # Remove duplicates first tmp_item = tmp_item[:thr] # intercept again if len(tmp_item) > num_all_pairing: all_paired_msa_rows_dict[num_examples] = tmp_item logger.info(f"[CL] full msa pairing ({<!-- -->num_examples} chains) add to {<!-- -->len(tmp_item)}! ") # + + + + + + + + + + Supplement the logic of MSA Pairing + + + + + + + + + + #
Assume the sequence is AABB, the order is not important, it can also be ABAB, the chain is
N
c
N_{c}
Nc?, MSA Pairing only considers the msa_all_seq
field (uniprot_hits
and uniref90_hits
optimization), that is, the number of MSA included in A chain is
L
A
L_{A}
LA?, the number of B chain including MSA is
L
B
L_{B}
LB?, MSA Pairing quantity is
L
P
a
b
L_{P_{ab}}
LPab. Among them, MSA Pairing includes 2 to
N
c
N_{c}
Nc?, for example, 4 chains, that is, it can be paired into 2 chains, 3 chains, 4 chains, etc. 4 situations. When there is only 1 chain, it is discarded.
Source code openfold/data/msa_pairing.py
, that is:
# Skip species that are present in only one chain. if species_dfs_present <= 1: continue
In the process of MSA Pairing, modify the MSA order of the msa_all_seq
field, and remove the situation where there is only 1 chain (no pairing). Assume that the final number of MSA Pairing is
L
P
a
b
L_{P_{ab}}
LPab, all chains are the same, fill in the gaps.
Through the msa_pairing.merge_chain_features()
function, merge the single-chain MSA together, namely bfd_uniref_hits.a3m
, mgnify_hits.sto
, All MSA of uniref90_hits.sto
form the msa
field feature. Among them, MSA parameter 1 is max_msa_crop_size
, which indicates the maximum number of merged MSA. For example, the number of msa_all_seq
in chain A is 900, and the maximum number is 2048. The maximum number of msa
fields in a single chain is 1148, and the rest are randomly discarded, that is, 1148 + 900= 2048
.
Source code openfold/data/msa_pairing.py
, note that feat_all_seq
comes first and feat
comes after, that is, MSA Pairing is more important, that is:
def _concatenate_paired_and_unpaired_features( example: pipeline.FeatureDict, ) -> pipeline.FeatureDict: """Merges paired and block-diagonalised features.""" features = MSA_FEATURES for feature_name in features: if feature_name in example: feat = example[feature_name] feat_all_seq = example[feature_name + "_all_seq"] merged_feat = np.concatenate([feat_all_seq, feat], axis=0) example[feature_name] = merged_feat example["num_alignments"] = np.array(example["msa"].shape[0], dtype=np.int32) return example
Through the openfold/data/data_transforms_multimer.py
function, the input msa
features (merging msa
and msa_all_seq
) are processed Intercept, first intercept max_seq
, then intercept max_extra_msa_seq
, that is, the second and third parameters, max_msa_clusters
and max_extra_msa
, as the final training or inference msa
feature.
logits + = cluster_bias_mask * inf index_order = gumbel_argsort_sample_idx(logits, generator=g) logger.info(f"[CL] truly use msa raw size: {<!-- -->len(index_order)}, msa: {<!-- -->max_seq}, extra_msa: {<!-- - ->max_extra_msa_seq}") sel_idx = index_order[:max_seq] extra_idx = index_order[max_seq:][:max_extra_msa_seq] for k in ["msa", "deletion_matrix", "msa_mask", "bert_mask"]: if k in batch: batch["extra_" + k] = batch[k][extra_idx] batch[k] = batch[k][sel_idx]
Predict the structure of protein complexes through different training models and different parameters.