Primer: Understanding PDB vs CIF files - what's inside your PyMOL file?

When you visualize a protein structure in PyMOL or Chimera, have you ever wondered what’s really inside the file? I have been getting into informatics lately, and I hope this series helps beginners like me get started.

Today, let's explore the PDB and CIF file formats and how they differ.

Let's start by going to RCSB PDB and searching for 5GZA, a kinase enzyme that interacts with mannose sugar. Click the download tab and download the structure in both PDB and CIF formats.
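If you would rather script the download than click through the website, here is a minimal sketch. It assumes the files.rcsb.org/download/<ID>.<format> URL pattern; the function names rcsb_download_url and download_structure are my own and not part of any library.

```python
from urllib.request import urlretrieve


def rcsb_download_url(pdb_id, file_format="pdb"):
    """Build the download URL for an entry (file_format: 'pdb' or 'cif')."""
    if file_format not in ("pdb", "cif"):
        raise ValueError("file_format must be 'pdb' or 'cif'")
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.{file_format}"


def download_structure(pdb_id, file_format="pdb"):
    """Save the file in the current directory and return its name (needs internet)."""
    filename = f"{pdb_id.upper()}.{file_format}"
    urlretrieve(rcsb_download_url(pdb_id, file_format), filename)
    return filename


# Only the URLs are printed here; call download_structure("5GZA", "cif") to fetch.
print(rcsb_download_url("5GZA", "pdb"))
print(rcsb_download_url("5GZA", "cif"))
```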

Exploring the PDB File Format

To see the contents of a PDB file, open it in VS Code (or any text editor, though I personally dislike Notepad!).

PDB files contain structured information with well-defined column formats. The first few sections include:

HEADER, TITLE, COMPND → Metadata about the protein
SOURCE, AUTHOR → Source organism and depositing authors
ATOM & HETATM → Atomic coordinates of the protein (ATOM) and of ligands, waters, and ions (HETATM)
CONECT → Connectivity (bonds between atoms)
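To get a quick feel for how a file is divided into these record types, you can count them. This is a small sketch: the record name sits in the first six columns of every line, and the sample text below is made up in PDB layout, not copied from 5GZA.

```python
from collections import Counter

# Made-up excerpt in PDB layout (not the real 5GZA file).
SAMPLE_PDB = """\
HEADER    EXAMPLE
TITLE     EXAMPLE STRUCTURE
ATOM      1  N   CYS A   1      11.104   6.134  -6.504  1.00 20.00           N
ATOM      2  CA  CYS A   1      12.560   6.210  -6.310  1.00 21.50           C
HETATM 2001  C1  MAN A 401      20.100   5.200  -3.300  1.00 25.00           C
CONECT 2001 2002
END
"""


def count_record_types(pdb_text):
    """Count lines per record type; the record name lives in columns 1-6."""
    return Counter(
        line[:6].strip() for line in pdb_text.splitlines() if line.strip()
    )


counts = count_record_types(SAMPLE_PDB)
for record, n in counts.items():
    print(record, n)
```

For a downloaded file, pass open("5GZA.pdb").read() instead of SAMPLE_PDB.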

The most useful part is the ATOM section, which contains the 3D atomic coordinates (x, y, z), making it crucial for molecular docking and MD simulations.

Now let's scroll down and see what we have.

Each line in the ATOM section represents one atom, and its columns hold the key information:

Atom number, atom name, residue name (amino acid name), chain identifier, residue number, X, Y, Z coordinates, occupancy, and B-factor.
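Because PDB is a fixed-width format, each field is read by its column position rather than by splitting on spaces. Here is a minimal sketch of that idea; parse_atom_line is a name I made up, and the sample line is invented in PDB layout rather than taken from 5GZA.

```python
def parse_atom_line(line):
    """Split one ATOM/HETATM line using the fixed PDB column positions."""
    return {
        "record": line[0:6].strip(),       # columns 1-6
        "serial": int(line[6:11]),         # columns 7-11, atom number
        "atom_name": line[12:16].strip(),  # columns 13-16
        "alt_loc": line[16].strip(),       # column 17, alternate location
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26, residue number
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
    }


# Made-up line in PDB layout, not taken from 5GZA.
line = "ATOM      1  N   CYS A   1      11.104   6.134  -6.504  1.00 20.00           N"
atom = parse_atom_line(line)
print(atom["atom_name"], atom["res_name"], atom["chain"], atom["res_seq"])
print(atom["x"], atom["y"], atom["z"], atom["occupancy"], atom["b_factor"])
```

Slicing by position matters because neighbouring fields can touch with no space between them (large coordinates, four-character atom names), so line.split() is not reliable here.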

Take cysteine, the first residue of this protein, as an example:

N (Nitrogen) – The N-terminal atom of the first residue.

CA (C-alpha) – The backbone carbon atom of the protein.

C (Carbonyl Carbon) – Forms the amide bond with the next residue.

CB (C-beta) – The first side-chain carbon of cysteine.

SG (Sulfur Gamma) – The sulfur atom involved in disulfide bonds.

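Occupancy is usually 1.00, meaning the atom sits at that position in essentially every molecule of the crystal. If it is below 1, the atom was typically modeled in more than one position, for example as alternative side-chain conformers (rotamers) that share the occupancy between them.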
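To find such atoms in a file, you can filter on the occupancy column. This sketch uses made-up lines in which the SG atom is modeled in two positions (marked A and B in column 17); partial_occupancy_atoms is a hypothetical helper name.

```python
# Made-up lines: the SG atom is modeled in two positions (altLoc A and B).
SAMPLE = [
    "ATOM      5  CB  CYS A   1      13.000   7.600  -6.000  1.00 22.00           C",
    "ATOM      6  SG ACYS A   1      13.200   8.900  -5.100  0.60 35.20           S",
    "ATOM      7  SG BCYS A   1      14.100   8.300  -7.200  0.40 36.80           S",
]


def partial_occupancy_atoms(lines):
    """Return (atom name, altLoc, residue, occupancy) for atoms with occupancy < 1."""
    hits = []
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        occupancy = float(line[54:60])
        if occupancy < 1.0:
            residue = f"{line[17:20].strip()}{int(line[22:26])}"
            hits.append((line[12:16].strip(), line[16].strip(), residue, occupancy))
    return hits


for hit in partial_occupancy_atoms(SAMPLE):
    print(hit)
```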

B-factor is another super useful value. A higher B-factor may indicate a flexible region (so enzymologists like to analyze the B-factors of loops adjacent to the substrate). As a rough rule of thumb, a B-factor above about 30 Å² is often read as significant motion or disorder, although typical values depend on the resolution of the structure, so it is safer to compare residues within the same file. Also, we can visualize B-factors in PyMOL!
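One simple way to look for flexible regions is to average the B-factor over the atoms of each residue. The sketch below does that on made-up lines; mean_b_per_residue is my own helper name, and the numbers are invented for illustration.

```python
from collections import defaultdict

# Made-up lines in PDB layout: CYS 1 is rigid, GLY 2 is more mobile.
SAMPLE = [
    "ATOM      1  N   CYS A   1      11.104   6.134  -6.504  1.00 20.00           N",
    "ATOM      2  CA  CYS A   1      12.560   6.210  -6.310  1.00 22.00           C",
    "ATOM      9  N   GLY A   2      14.000   5.000  -4.000  1.00 40.00           N",
    "ATOM     10  CA  GLY A   2      15.200   4.100  -3.500  1.00 44.00           C",
]


def mean_b_per_residue(lines):
    """Average the B-factor (columns 61-66) per (chain, residue number, name)."""
    sums = defaultdict(lambda: [0.0, 0])
    for line in lines:
        if line.startswith("ATOM"):
            key = (line[21], int(line[22:26]), line[17:20].strip())
            sums[key][0] += float(line[60:66])
            sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


for residue, mean_b in mean_b_per_residue(SAMPLE).items():
    print(residue, mean_b)
```

Inside PyMOL itself, a command such as spectrum b, blue_white_red colors the structure by B-factor, which is a quick visual check of the same information.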

Where is Ligand Information Stored?

Ligands (e.g., small molecules, sugars, drugs) are stored in HETATM records (please scroll down further!). They use the same column layout as amino acid residues but are kept as a separate record type.

Like amino acid residues, each ligand has its own residue number within its chain.
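Since each ligand has its own residue name, chain, and residue number, you can group HETATM atoms by those three fields to list the ligands in a file. This is a sketch on made-up lines (the MAN and HOH entries are illustrative, not read from 5GZA), and list_ligands is a hypothetical name.

```python
# Made-up lines in PDB layout; MAN 401 and the water are for illustration only.
SAMPLE = [
    "ATOM      1  N   CYS A   1      11.104   6.134  -6.504  1.00 20.00           N",
    "HETATM 2001  C1  MAN A 401      20.100   5.200  -3.300  1.00 25.00           C",
    "HETATM 2002  C2  MAN A 401      21.300   4.600  -2.700  1.00 26.10           C",
    "HETATM 2101  O   HOH A 501      15.000   9.000  -1.000  1.00 30.00           O",
]


def list_ligands(lines, skip=("HOH",)):
    """Group HETATM atom names by (residue name, chain, residue number)."""
    ligands = {}
    for line in lines:
        if not line.startswith("HETATM"):
            continue
        res_name = line[17:20].strip()
        if res_name in skip:  # waters are HETATM too, so leave them out
            continue
        key = (res_name, line[21], int(line[22:26]))
        ligands.setdefault(key, []).append(line[12:16].strip())
    return ligands


for ligand, atoms in list_ligands(SAMPLE).items():
    print(ligand, atoms)
```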

Lastly, the CONECT section lists which atoms are bonded to each other, but PDB files do not store bond types (single/double bonds)! This explains why PyMOL sometimes draws ligand bonds incorrectly.
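To see how little a CONECT record actually carries, here is a sketch that reads one. Each record holds atom serial numbers in five-character fields after the record name: first the atom itself, then the atoms it is bonded to. The serial numbers below are made up, and parse_conect is my own helper name.

```python
# Made-up CONECT records: atom serial first, then its bonded partners.
SAMPLE = [
    "CONECT 2001 2002 2006",
    "CONECT 2002 2001 2003",
]


def parse_conect(lines):
    """Map each atom serial to the serials it is bonded to (no bond order)."""
    bonds = {}
    for line in lines:
        if not line.startswith("CONECT"):
            continue
        # Columns 7-11 hold the atom, columns 12-31 up to four partners.
        fields = [line[i:i + 5] for i in range(6, 31, 5)]
        serials = [int(field) for field in fields if field.strip()]
        if serials:
            bonds.setdefault(serials[0], []).extend(serials[1:])
    return bonds


print(parse_conect(SAMPLE))
```

Notice that the result is only a list of neighbours: nothing in the record says whether a bond is single, double, or aromatic.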

Where Do We Find Bond Information? Check CIF!

Since PDB files lack bond type information, we need to check the CIF file. Search for _chem_comp_bond. This section lists the bond type (single/double/triple) for each pair of bonded atoms.

Unlike PDB, CIF files use a loop structure: the "_tag" lines at the top of a loop name the columns, and each data row below them lists its values in the same order. This makes it easy to map every value to its _tag.

In the bond section, we can see which atoms are involved in each bond, the bond type (sing for single, doub for double), aromaticity, and stereochemistry. The CIF format is also useful when you run AlphaFold 3 with a user-defined ligand in CCD format.
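To make the tag-to-column mapping concrete, here is a deliberately simplified sketch that reads one loop and pairs each value with its tag. The sample block uses the _chem_comp_bond tag names with a made-up component called LIG, and read_loop is my own helper. It ignores multi-line text fields and other corners of the CIF syntax, so treat it as a teaching aid only.

```python
import shlex

# Made-up bond table for a hypothetical component "LIG".
SAMPLE_CIF = """\
loop_
_chem_comp_bond.comp_id
_chem_comp_bond.atom_id_1
_chem_comp_bond.atom_id_2
_chem_comp_bond.value_order
_chem_comp_bond.pdbx_aromatic_flag
_chem_comp_bond.pdbx_stereo_config
_chem_comp_bond.pdbx_ordinal
LIG C1 C2 SING N N 1
LIG C2 O2 DOUB N N 2
#
"""


def read_loop(cif_text, category):
    """Return rows (as dicts) of loops whose tags belong to `category`."""
    tags, rows, state = [], [], "outside"
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line == "loop_":
            tags, state = [], "tags"          # a new loop starts
        elif state == "tags" and line.startswith("_"):
            tags.append(line)                 # one tag per column
        elif state != "outside" and line and line[0] not in "#_":
            state = "data"                    # a row of values
            if tags and tags[0].startswith(category + "."):
                names = [tag.split(".", 1)[1] for tag in tags]
                rows.append(dict(zip(names, shlex.split(line))))
        else:
            state = "outside"                 # '#', blank line, or a new item
    return rows


for bond in read_loop(SAMPLE_CIF, "_chem_comp_bond"):
    print(bond["atom_id_1"], bond["atom_id_2"], bond["value_order"])
```

For real files, a dedicated parser such as gemmi or Biopython's MMCIF2Dict is a safer choice than hand-rolled code like this.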

The next topic will be RDKit and ligand alignment in PyMOL.

Diary of (Sort of) a Chemical Biologist
