The following inquiry was sent to the Jmol-Users email list on 2017-12-22. Determination of chain termini was implemented following the methods described below. (original email list post).
Quite recently, I realized that Jmol's groupindex may offer
the best solution. If there is a better or simpler solution, I
would like to know it! Groupindex is not the sequence
number. It is a group/residue number assigned by Jmol for internal
use. Jmol assigns somewhat different groupindex numbering schemes
to PDB format vs. mmCIF format files.
I will make two assumptions that appear to me to be true.
Let's try for the N-terminal residue of chain A by finding the lowest groupindex:
AMin = {chain=A and protein}.groupindex.min;
In determining the minimum groupindex, one cannot exclude HETERO
groups because termini may be non-standard amino acids, or not
even amino acids (e.g. acetyl, ACE). Examples: the N-terminal
residues of 12 of the 24 chains in 4gxu are pyroglutamic acid
[PCA]1, a hetero group. The N-terminal residues in 4mdh are ACE,
acetyl.
2. Is the end candidate peptide-bonded to the adjacent
residue?
Next we need to determine if the chain A residue with the lowest groupindex is peptide-bonded to the residue with the next higher groupindex. At the N-terminus I expect that it always will be, but perhaps I will be surprised.
select groupindex=0 and *.c and connected(groupindex=1 and *.n)
Some ligands are deemed "protein" by Jmol, presumably because they have an alpha carbon (e.g. 3ES in 2xy9), but are not covalent members of the protein chain.
When the group with the highest index is not covalently
peptide-bonded to the chain, our next candidate will be the next
lower groupindex, and so forth until we find the highest
groupindex that is peptide-bonded to the chain.
3. Mono- and dipeptide ligands
All chains of length 3 or more amino acids are represented with ATOM records, have distinct chain identifiers (A, B, C, etc.), and have SEQRES records.
However, single amino acid ligands (e.g. Gly308 in 4cpa) and
dipeptides are by PDB rules deemed HETERO, have no SEQRES records
(but HET and HETNAM records), and are assigned the same chain
identifier as the chain to which they are bound. Rachel Kramer
Green of RCSB has confirmed this rule. (There are a handful of
cases deposited by PDBe that do not conform to this rule -- they
will be remediated.)
A single amino acid is easily excluded since it is not
peptide-bonded to the residue having one groupindex lower.
However, the C-terminal residue of a dipeptide ligand is
peptide-bonded to the residue one groupindex lower, but is not
part of the larger protein chain with the same chain identifier.
Therefore we must require that any candidate for a terminal
residue be peptide bonded to the adjacent residue, and that
adjacent residue be peptide-bonded to the next adjacent residue
(THREE peptide bonded residues).
For example 3deq contains chain B of length 345 amino acids, 341 of which have coordinates (2 missing at each end). Sequence numbers are 3-343 and groupindices 436-776. Bound to chain B is a dipeptide (HETATM, also deemed "chain B") Ala411-Leu412. Jmol assigns this dipeptide groupindices 777-778 for the PDB format file, and 1366-1367 for the mmCIF file.
Taking the PDB file of 3deq, dipeptide groupindex 778 (the highest protein groupindex in chain B) is peptide-bonded to 777, but 777 is NOT peptide-bonded to 776, and thus 778 is REJECTED as the C-terminus of chain B. When we get to 776, we find it is peptide-bonded to 775, which is peptide-bonded to 774, and thus we ACCEPT 776 as the C-terminal residue.
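The acceptance rule above can be sketched outside Jmol. The following Python is a minimal illustration, not the Jmol or FirstGlance implementation. It assumes that the peptide bonds have already been detected (for example with a "connected" test like the one shown earlier) and are supplied as pairs of groupindices, and that neighboring residues have consecutive groupindices. The function name find_c_terminus and the data layout are hypothetical.

```python
def find_c_terminus(group_indices, peptide_bonds):
    """Return the highest groupindex that heads a run of three
    peptide-bonded residues, or None if no such residue exists.

    group_indices: groupindex values of the "protein" groups in one chain.
    peptide_bonds: set of (lower, higher) groupindex pairs that are
                   peptide-bonded (assumed to be computed elsewhere).
    """
    def bonded(i, j):
        return (min(i, j), max(i, j)) in peptide_bonds

    # Walk down from the highest groupindex, as described in the text.
    for g in sorted(group_indices, reverse=True):
        # Require g bonded to g-1, and g-1 bonded to g-2 (three residues).
        if bonded(g, g - 1) and bonded(g - 1, g - 2):
            return g
    return None


# Toy data modeled on the 3deq description (PDB-format numbering):
# 774-775-776 are chain residues, 777-778 is the bound dipeptide.
indices = [774, 775, 776, 777, 778]
bonds = {(774, 775), (775, 776), (777, 778)}
c_term = find_c_terminus(indices, bonds)
print(c_term)
```

With this toy input, 778 and 777 are rejected and 776 is accepted, matching the reasoning above. A search for the N-terminus would presumably mirror this, walking upward from the lowest groupindex.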
4. Is the terminal residue with coordinates charged?
Now we have determined the terminal residues with coordinates in
each chain. But are their amino or carboxy termini charged?
4A. Is the terminal residue missing coordinates?
If the terminal residue is missing coordinates due to disorder,
then the terminal residue with coordinates (determined by the
above methods) is NOT charged.
If the C-terminal residue has an OXT atom, it is charged (and
there cannot be a more-C-terminal residue that is missing
coordinates).
For all N-terminal residues, and C-terminal residues lacking OXT,
we need to ask whether the terminal residue in the experimental
protein is missing coordinates. (Recall that authors often fail to
put OXT atoms on C-terminal residues, and this is "legal".)
I am aware of two possible methods.
I plan to use method 1, although it would fail in the rare cases where near-terminal sequence numbers are not sequential.
(mmCIF files contain the SEQRES to ATOM sequence alignment,
lacking in PDB format files, but I have not yet adapted
FirstGlance to use mmCIF.)
Missing residues are listed in REMARK 465 by chain, residue name,
and sequence number.
Take the case of 3deq. We have decided, using the methods above, that Lys343 (groupindex 776) is the C-terminal residue with coordinates in chain B. But we find that chain B sequence number 344 is listed as missing coordinates in REMARK 465. Therefore, we know that Lys343 is not the real C-terminus, and hence it has no carboxy-terminal charge.
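The REMARK 465 check in this example can be sketched as follows. This is a rough Python illustration under stated assumptions, not the FirstGlance code: the regular expression reflects the usual layout of REMARK 465 data lines (optional model number, residue name, chain, sequence number) and is not a full PDB-format parser, and the function names are hypothetical. As noted above, testing only the adjacent sequence number fails when near-terminal numbering is not sequential.

```python
import re

# Data lines look roughly like "REMARK 465     GLY B   344"; an optional
# model number may precede the residue name. This pattern is an
# assumption based on the usual layout, not a full PDB-format parser.
REMARK_465 = re.compile(
    r"^REMARK 465\s+(?:\d+\s+)?([A-Z0-9]{1,3})\s+(\S)\s+(-?\d+)[A-Z]?\s*$"
)


def missing_residues(pdb_lines):
    """Collect (chain, sequence number) pairs listed in REMARK 465."""
    missing = set()
    for line in pdb_lines:
        m = REMARK_465.match(line)
        if m:
            missing.add((m.group(2), int(m.group(3))))
    return missing


def is_real_c_terminus(chain, seq_num, missing):
    """False if the next sequence number is listed as missing."""
    return (chain, seq_num + 1) not in missing


# Invented lines for illustration only (residue names are placeholders,
# not copied from the 3deq file).
lines = [
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     GLY B   344",
    "REMARK 465     GLY B   345",
]
missing = missing_residues(lines)
print(is_real_c_terminus("B", 343, missing))
```

Here residue 343 of chain B is reported as not being the real C-terminus because 344 is listed as missing. The N-terminal test would check seq_num - 1 instead.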
Although I don't know of an example (and don't know how to search
for one), there may be cases where the terminal residue is missing
and is not protein. Presumably an N-terminal acetyl group that is
missing coordinates would be listed in REMARK 465.
4B. Is the terminal residue blocked?
Fairly often, the N-terminus is acetylated with an ACE hetero group (e.g. 4mdh).
We can determine whether the N-terminus is blocked by asking
whether its main chain nitrogen atom is covalently bonded (Jmol
"connected" function) to two non-hydrogen atoms (three in the case
of proline). If yes, then it is not charged. If no, and if the
real N-terminal residue is not missing, then our N-terminus
candidate is charged.
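The decision just described can be written down as a small Python sketch. It assumes the number of non-hydrogen atoms bonded to the main chain nitrogen has already been counted (in Jmol, presumably with the "connected" function). The function and parameter names are hypothetical, and the thresholds are the ones stated above.

```python
def n_terminus_is_blocked(heavy_neighbors_of_n, is_proline=False):
    """True if the main-chain N carries an extra non-hydrogen substituent."""
    # Threshold from the text: two non-hydrogen neighbors, three for proline
    # (whose ring already bonds N to CD as well as CA).
    threshold = 3 if is_proline else 2
    return heavy_neighbors_of_n >= threshold


def n_terminus_is_charged(heavy_neighbors_of_n, is_proline,
                          real_terminus_missing):
    """Charged only if unblocked and the real N-terminal residue is present."""
    if real_terminus_missing:
        return False
    return not n_terminus_is_blocked(heavy_neighbors_of_n, is_proline)


# Free N-terminus: N bonded only to CA.
print(n_terminus_is_charged(1, False, False))
# Acetylated N-terminus (e.g. ACE): N bonded to CA and the acetyl carbon.
print(n_terminus_is_charged(2, False, False))
```

The first call reports a charged terminus and the second an uncharged one. A free N-terminal proline has two non-hydrogen neighbors and is still treated as charged.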