ModelAngelo: atomic modelling

As of release 5.0, relion comes with a machine-learning approach for automated atomic model building called ModelAngelo [JKZ+23]. ModelAngelo is a graph neural network that combines information from the cryo-EM map with sequence information of the proteins that are present in the map to build an atomic model.

Because we know that the sample in this data set is beta-galactosidase from E.coli, we can provide the protein sequence as a FASTA file. You can download it from uniprot through this link. Save it as betagal.fasta in the project directory.

Running the job

Using the ModelAngelo building job type, we set on the I/O tab:

B-factor sharpened map::

PostProcess/job030/postprocess_masked.mrc

FASTA sequence for proteins::

betagal.fasta

FASTA sequence for DNA::

(There is no DNA in this complex.)

FASTA sequence for RNA::

(There is no RNA in this complex.)

Which GPUs to use::

0,1,2,3

On the Hmmer tab make sure you this is set:

Perform HMMer search?:

No

Analysing the results

Running with 4 GPUs (relatively old 1080s), this job took approximately 18 minutes. The output is a coordinate file called ModelAngelo/job032/job032.cif that may be used in UCSF chimera together wit the Postprocess/job030/postprocess.mrc map. Note that although ModelAngelo did a very good job on this relatively easy test case, you should always check its results carefully in a program like coot [ELSC10] . You will also need to perform a stereochemical refinement of the coordinates. For this, we like servalcat [YPBM21].

Discovering unknown proteins

One does not always know the identity of all proteins that are present in a cryo-EM reconstruction. One can also run ModelAngelo without providing a FASTA file with the known protein sequence. It will still try to model protein chains in the map, but will perform less well without the knowledge of the sequence. But, one can then perform a Hidden Markov Model search using the full probabilities for all 20 amino acids for every residue, for example against a FASTA file that contains the entire genome for that organism. In that way, one can identify unknown proteins in the map.

Download the E.coli genome (K-12 substrain) from this link to NCBI. Use the Send to dropdown menu on the top right to select Coding Sequences, select the FASTA protein format, and click Create File. Save the resulting file as Ecoli.fasta in your project directory.

Using the same ModelAngelo building job type, this time we set on the I/O tab:

B-factor sharpened map::

PostProcess/job030/postprocess_masked.mrc

FASTA sequence for proteins::

FASTA sequence for DNA::

(There is no DNA in this complex.)

FASTA sequence for RNA::

(There is no RNA in this complex.)

Which GPUs to use::

0,1,2,3

On the Hmmer tab, we now set:

Perform HMMer search?:

Yes

Library with sequences for HMMer search::

Ecoli.fasta

Alphabet for HMMer search::

amino

And we leave the rest of the HMMSearch parameters at their defaults.

Which one is my protein?

Running with 4 GPUs (again our relatively old 1080s), this job took approximately 12 minutes: running without the sequence is faster than running with the sequence. However, without knowledge of the sequence, ModelAngelo has trouble building a single chain, as you can see by visualising ModelAngelo/job033/job033.cif in UCSF chimera. The subsequent HMM search easily identifies beta-galactosidase, as you can see in ModelAngelo/job033/best_hits.csv. Cool, huh?