ModelAngelo: atomic modelling
As of release 5.0, RELION comes with a machine-learning approach for automated atomic model building called ModelAngelo [JKZ+24]. ModelAngelo is a graph neural network that combines information from the cryo-EM map with sequence information of the proteins that are present in the map to build an atomic model.
Because we know that the sample in this data set is beta-galactosidase from E. coli, we can provide the protein sequence as a FASTA file. You can download it from UniProt through this link. Save it as betagal.fasta in the project directory.
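Before starting the job, it can be worth confirming that the downloaded file really is a plain protein FASTA. The following is a minimal sketch using only the Python standard library; the function names `read_fasta` and `check_protein_fasta` are illustrative and not part of RELION or ModelAngelo.

```python
from pathlib import Path


def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records = []
    header, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_protein_fasta(path):
    """List entries containing characters outside the 20 standard amino acids."""
    standard = set("ACDEFGHIKLMNPQRSTVWY")
    problems = []
    for header, seq in read_fasta(path):
        unexpected = sorted(set(seq.upper()) - standard)
        if unexpected:
            problems.append((header, unexpected))
    return problems


if __name__ == "__main__":
    fasta = Path("betagal.fasta")  # file name used in this tutorial
    if fasta.exists():
        for header, seq in read_fasta(fasta):
            print(f"{len(seq):6d} residues  {header}")
        print("entries with non-standard characters:", check_protein_fasta(fasta))
```

An empty list from `check_protein_fasta` suggests the file contains only standard amino acid codes; an HTML page saved by mistake would typically fail with the `ValueError` above.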
Running the job
Using the ModelAngelo building job type, we set on the I/O tab:
- B-factor sharpened map: PostProcess/job030/postprocess_masked.mrc
- FASTA sequence for proteins: betagal.fasta
- FASTA sequence for DNA: (leave empty; there is no DNA in this complex.)
- FASTA sequence for RNA: (leave empty; there is no RNA in this complex.)
- Which GPUs to use: 0,1,2,3
On the Hmmer tab, make sure this is set:
- Perform HMMer search?: No
Analysing the results
Running with 4 GPUs (relatively old 1080s), this job took approximately 18 minutes. The output is a coordinate file called ModelAngelo/job032/job032.cif that may be opened in UCSF Chimera together with the PostProcess/job030/postprocess.mrc map.
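For a quick look at the output before opening it in a viewer, the number of residues built per chain can be tallied directly from the coordinate file. This is a minimal sketch, assuming the file stores its atoms in a simple whitespace-separated `_atom_site` loop with `auth_asym_id` or `label_asym_id` columns and no multi-line values; the function name `residues_per_chain` is illustrative, and a dedicated mmCIF library would be more robust.

```python
from pathlib import Path


def residues_per_chain(cif_text):
    """Count distinct residues per chain in the _atom_site loop of an mmCIF string."""
    tags, counts, seen = [], {}, set()
    in_loop = False
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site."):
            # Collect column names, e.g. "auth_asym_id"
            tags.append(line.split()[0][len("_atom_site."):])
            in_loop = True
            continue
        if not in_loop:
            continue
        if not line or line.startswith(("#", "loop_", "_", "data_")):
            break  # end of the atom table
        fields = line.split()
        if len(fields) != len(tags):
            continue  # skip rows this simple parser cannot handle
        row = dict(zip(tags, fields))
        chain = row.get("auth_asym_id", row.get("label_asym_id"))
        resnum = row.get("auth_seq_id", row.get("label_seq_id"))
        if chain is None or (chain, resnum) in seen:
            continue
        seen.add((chain, resnum))
        counts[chain] = counts.get(chain, 0) + 1
    return counts


if __name__ == "__main__":
    cif = Path("ModelAngelo/job032/job032.cif")  # output path from this tutorial
    if cif.exists():
        for chain, n in residues_per_chain(cif.read_text()).items():
            print(f"chain {chain}: {n} residues")
```

A few long chains suggest a largely continuous model, whereas many short chains point to fragmented building.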
Note that although ModelAngelo did a very good job on this relatively easy test case, you should always check its results carefully in a program like Coot [ELSC10]. You will also need to perform a stereochemical refinement of the coordinates. For this, we like Servalcat [YPBM21].
Discovering unknown proteins
One does not always know the identity of all proteins that are present in a cryo-EM reconstruction. In that case, one can run ModelAngelo without providing a FASTA file with the known protein sequence. It will still try to model protein chains in the map, but will perform less well without knowledge of the sequence. However, one can then perform a Hidden Markov Model (HMM) search using the full probabilities for all 20 amino acids for every residue, for example against a FASTA file that contains the entire genome for that organism. In that way, one can identify unknown proteins in the map.
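The idea of searching a sequence library with per-residue amino acid probabilities can be illustrated with a toy example. The sketch below is a deliberately simplified, ungapped log-odds scan. It is not the algorithm used by HMMER or ModelAngelo, which work with profile HMMs that also model insertions and deletions. The profile, the two-entry library, the uniform background, and all function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies


def window_score(profile, window):
    """Sum of log-odds scores for aligning `window` to `profile` without gaps."""
    score = 0.0
    for probs, aa in zip(profile, window):
        p = max(probs.get(aa, 0.0), 1e-6)  # floor avoids log(0)
        score += math.log(p / BACKGROUND)
    return score


def best_hit(profile, database):
    """Return (name, score, offset) of the best ungapped match in `database`."""
    best = None
    n = len(profile)
    for name, seq in database.items():
        for offset in range(len(seq) - n + 1):
            s = window_score(profile, seq[offset:offset + n])
            if best is None or s > best[1]:
                best = (name, s, offset)
    return best


def make_profile(sequence, confidence=0.6):
    """Hypothetical stand-in for per-residue predictions from the network."""
    rest = (1.0 - confidence) / (len(AMINO_ACIDS) - 1)
    return [
        {aa: (confidence if aa == true else rest) for aa in AMINO_ACIDS}
        for true in sequence
    ]


# Toy data: one profile for a built fragment and a two-entry "genome".
profile = make_profile("MITDSLAV")
database = {
    "protein_a": "GGSHMKKLLPTAAAG",
    "protein_b": "TMITDSLAVVLQRRD",
}
print(best_hit(profile, database))
```

Even when no single residue is predicted with certainty, summing evidence over many positions lets the correct library entry stand out, which is why the search can succeed on a model built without a sequence.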
Download the E. coli genome (K-12 substrain) from this link to NCBI. Use the Send to dropdown menu on the top right to select Coding Sequences, select the FASTA protein format, and click Create File. Save the resulting file as Ecoli.fasta in your project directory.
Using the same ModelAngelo building job type, this time we set on the I/O tab:
- B-factor sharpened map: PostProcess/job030/postprocess_masked.mrc
- FASTA sequence for proteins: (leave empty; no protein sequence is provided this time.)
- FASTA sequence for DNA: (leave empty; there is no DNA in this complex.)
- FASTA sequence for RNA: (leave empty; there is no RNA in this complex.)
- Which GPUs to use: 0,1,2,3
On the Hmmer tab, we now set:
- Perform HMMer search?: Yes
- Library with sequences for HMMer search: Ecoli.fasta
- Alphabet for HMMer search: amino
And we leave the rest of the HMMSearch parameters at their defaults.
Which one is my protein?
Running with 4 GPUs (again our relatively old 1080s), this job took approximately 12 minutes: running without the sequence is faster than running with the sequence. However, without knowledge of the sequence, ModelAngelo has trouble building a single chain, as you can see by visualising ModelAngelo/job033/job033.cif in UCSF Chimera.
The subsequent HMM search easily identifies beta-galactosidase, as you can see in ModelAngelo/job033/best_hits.csv. Cool, huh?
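The hits file can also be inspected from a script rather than a spreadsheet. The following is a minimal sketch using the Python standard library. It makes no assumption about which columns best_hits.csv contains and simply reports whatever header it finds; the function name `read_hits` is illustrative.

```python
import csv
from pathlib import Path


def read_hits(path, limit=5):
    """Return the column names and the first `limit` rows of a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        # zip stops after `limit` rows without reading the rest of the file
        rows = [row for _, row in zip(range(limit), reader)]
        return reader.fieldnames, rows


if __name__ == "__main__":
    hits = Path("ModelAngelo/job033/best_hits.csv")  # path from this tutorial
    if hits.exists():
        columns, rows = read_hits(hits)
        print("columns:", columns)
        for row in rows:
            print(row)
```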