Pfam MySQL database

The Pfam MySQL database contains all of the data accessible via the website. The database currently consists of 63 tables. Below is some basic documentation on the schema layout and how smaller numbers of tables can be put together to enable access to a subset of the data. At the time of writing, release 36.0, the fields within the tables and the results of queries are correct. The data within the tables will change with each release. Although we do not anticipate any major changes to the database, we reserve the right to make changes with or without warning; we will endeavour to update this document if such changes are made. Please note that we do not provide a public version of the Pfam database after release 36.0.

A red diamond in the images below indicates a foreign key. In some images there are tables which appear not to be linked to any other table in the image. This is due to a foreign key being populated late in production of the database. The ‘floating’ table can still be joined and example queries of how to do so are given under each image.

Version table

_images/version.png

The version table contains information that relates to a particular Pfam release. It contains the version number of the Pfam database, the version numbers of the Swiss-Prot and TrEMBL databases that were used to build Pfam, and some statistics about the number of families and coverage. This table is stand-alone and does not link to any of the other tables.

Example query: Give me all of the version information for the Pfam database

SELECT *
FROM version

Example output:

          pfam_release: 36.0
     pfam_release_date: 2023-07-18
    swiss_prot_version: 2022_05
        trembl_version: 2022_05
         hmmer_version: 3.3
        pfamA_coverage: 76.2
pfamA_residue_coverage: 48.6

Domain information

_images/domain.png

Two of the central tables in the Pfam database are pfamseq, which contains UniProtKB reference proteomes and pfamA, which contains information about the Pfam-A families. Most of the other tables in the database link to one or both of these tables, either directly or indirectly. Note that prior to Pfam 29.0, the pfamseq table contained the whole of UniProtKB. From Pfam 29.0, this table contains only the reference proteome portion of UniProtKB. The full alignments in Pfam are based on the sequences in the pfamseq table.

The table pfamA_reg_seed contains the Pfam regions that are present in a seed alignment. All sequences in pfamA_reg_seed are in the pfamseq table or the uniprot table (the uniprot table contains all the sequences in UniProtKB). The pfamA_reg_full_significant table contains all of the sequence regions from the pfamseq table that match the HMM and score above the curated threshold, i.e. are significant matches, for each family. There is also a table named pfamA_reg_full_insignificant which contains, as the name suggests, all the insignificant matches for each family. Insignificant matches are those which match the HMM with an E-value less than 1000, but score below the curated bit score threshold for each family.

In addition to providing matches to the sequences in the pfamseq table, we also provide the significant matches for the sequences in the uniprot table. These can be found in the table uniprot_reg_full.

The tables pfamA_reg_full_significant and uniprot_reg_full contain a column called ‘in_full’. The matches that are present in the full alignment for a Pfam family have this column set to 1, while those that are not present in the full alignment have the ‘in_full’ column set to 0. A significant match will only be excluded from the full alignment (in_full = 0) if it matches a family that belongs to a clan, and the match overlaps with another more significant (lower E-value) match to a family within the clan.

For each sequence match we store two sets of coordinates, the envelope coordinates and the alignment coordinates. The envelope co-ordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate where HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates. In the database, envelope start and end positions are stored in the seq_start and seq_end fields columns, and the alignment coordinates are stored in the ali_start and ali_end fields.

The Pfam database has historically been built on the UniProtKB database. However, as of release 22.0 we also provide Pfam domain data for the NCBI sequence database (GenPept) and a set of metagenomics sequences. As of release 28.0, we no longer store Pfam information at the sequence level for the NCBI and metagenomics data sets in the MySQL database, but we still provide the family alignments for them in the alignment_and_tree table.

Example query: Give me all of the domains for sequence ‘VAV_HUMAN’

SELECT   pfamA.pfamA_acc, pfamA_id, seq_start, seq_end
FROM     pfamseq, pfamA, pfamA_reg_full_significant
WHERE    pfamseq_id = 'VAV_HUMAN'
AND      in_full = 1
AND      pfamseq.pfamseq_acc = pfamA_reg_full_significant.pfamseq_acc
AND      pfamA_reg_full_significant.pfamA_acc = pfamA.pfamA_acc
ORDER BY seq_start

Example output:

pfamA_acc  pfamA_id  seq_start  seq_end
PF00307    CH                1      121
PF00621    RhoGEF          198      371
PF00169    PH              403      504
PF00130    C1_1            516      568
PF00018    SH3_1           615      652
PF00017    SH2             671      745
PF00018    SH3_1           788      834

To report all of the overlapping domains within any clans, leave out the ‘in_full =1’ clause. More information on clans can be found later in this document.

Example query: Give me all the sequences in the full alignment for the family ‘B12D’

SELECT pfamseq_id, pfamseq.pfamseq_acc, seq_start, seq_end, pfamA_id
FROM   pfamA, pfamseq, pfamA_reg_full_significant
WHERE  pfamA_id = 'B12D'
AND    in_full = 1
AND    pfamA.pfamA_acc = pfamA_reg_full_significant.pfamA_acc
AND    pfamA_reg_full_significant.pfamseq_acc = pfamseq.pfamseq_acc

Example output:

pfamseq_id        pfamseq_acc   seq_start  seq_end  pfamA_id
A0A4W6ED51_LATCA  A0A4W6ED51    11         80       B12D
A0A2C6L090_9APIC  A0A2C6L090    43         96       B12D
A0A1U8P0E7_GOSHI  A0A1U8P0E7     5         66       B12D
A0A6P6JK09_CARAU  A0A6P6JK09    11         79       B12D
A0A436ZUI8_9PEZI  A0A436ZUI8    55         96       B12D
A0A0D9WNU3_9ORYZ  A0A0D9WNU3     8         76       B12D
A0A5N5FAS6_9ROSA  A0A5N5FAS6     7         38       B12D
A0A1V4KBT8_PATFA  A0A1V4KBT8    11         79       B12D
B6DDV9_ANODA      B6DDV9        11         78       B12D
A0A0C9M2H1_9FUNG  A0A0C9M2H1    19         86       B12D
...

Example query: Give me all the sequences in the seed alignment for the family ‘B12D’

SELECT pfamseq_id, pfamseq.pfamseq_acc, seq_start, seq_end, pfamA_id
FROM   pfamA, pfamseq, pfamA_reg_seed
WHERE  pfamA_id = 'B12D'
AND    pfamA.pfamA_acc = pfamA_reg_seed.pfamA_acc
AND    pfamA_reg_seed.pfamseq_acc = pfamseq.pfamseq_acc

Example output:

pfamseq_id        pfamseq_acc  seq_start  seq_end  pfamA_id
M3ZJB0_XIPMA      M3ZJB0       36         105      B12D
M1C9M8_SOLTU      M1C9M8        7          75      B12D
E2B831_HARSA      E2B831       11          78      B12D
A0A4S8III1_MUSBA  A0A4S8III1    5          73      B12D
I1H296_BRADI      I1H296        6          74      B12D
K7DYW7_MONDO      K7DYW7        9          77      B12D
I1H293_BRADI      I1H293       14          82      B12D
F4WMZ4_ACREC      F4WMZ4       11          78      B12D
J3MJY8_ORYBR      J3MJY8       17          85      B12D
A0A804JLQ3_MUSAM  A0A804JLQ3    5          73      B12D
...

Pfamseq - other tables

_images/pfamseq_other.png

This section contains a few tables that link to the pfamseq table, but don’t fit nicely into any of the sections described above.

The evidence table contains the UniProtKB evidence code key that is used in the evidence field in the pfamseq and uniprot tables.

UniProtKB sequences have secondary accessions if they have been merged or split. Secondary accession numbers are stored in the table called secondary_pfamseq_acc.

Example query: Give me the secondary accession(s) for the sequence ‘P15455’

SELECT secondary_acc
FROM   pfamseq, secondary_pfamseq_acc
WHERE  pfamseq.pfamseq_acc = secondary_pfamseq_acc.pfamseq_acc
AND    pfamseq.pfamseq_acc= 'P15455'

Example output:

secondary_acc
Q3E711
Q56Z11
Q9FFH7

Other regions

_images/other_regions.png

These tables contain sequence specific information about the reference proteome sequences. The other_regions table contains coiled coil, low complexity, signal peptide, transmembrane and disordered regions data. The pfamseq_markup table contains active site information which is taken from the UniProtKB feature table. Additional active site residues are predicted by Pfam based on conserved residues in a Pfam alignment. The pfamseq_disulphide table contains disulphide bond information from the UniProtKB feature table.

Example query: Give me all of the transmembrane, signal-peptide, coiled-coils, low-complexity and disorder information for the sequence ‘VAV_HUMAN’

SELECT type_id, source_id, seq_start, seq_end
FROM   other_reg, pfamseq
WHERE  pfamseq.pfamseq_id = 'VAV_HUMAN'
AND    other_reg.pfamseq_acc = pfamseq.pfamseq_acc

Example output:

type_id         source_id  seq_start  seq_end
disorder        IUPred           128      129
disorder        IUPred           140      141
disorder        IUPred           160      161
disorder        IUPred           173      177
disorder        IUPred           179      180
disorder        IUPred           568      588
disorder        IUPred           635      636
low_complexity  segmasker         42       51
low_complexity  segmasker        356      367

Example query: Give me all of the active site information for sequence ‘F7PG13’

SELECT pfamseq.pfamseq_acc, pfamseq_id, residue, label
FROM   pfamseq, pfamseq_markup, markup_key
WHERE  pfamseq.pfamseq_acc = pfamseq_markup.pfamseq_acc
AND    pfamseq_markup.auto_markup = markup_key.auto_markup
AND    pfamseq.pfamseq_acc = 'F7PG13'

Example output:

pfamseq_acc  pfamseq_id    residue  label
F7PG13       F7PG13_9EURY       92  Pfam predicted active site
F7PG13       F7PG13_9EURY      248  Pfam predicted active site
F7PG13       F7PG13_9EURY       92  UniProt predicted active site

Example query: Give me all the residues involved in disulphide bonds in the sequence ‘Q43495’

SELECT pfamseq.pfamseq_acc, pfamseq_id, bond_start, bond_end
FROM   pfamseq, pfamseq_disulphide
WHERE  pfamseq_disulphide.pfamseq_acc = pfamseq.pfamseq_acc
AND    pfamseq.pfamseq_acc = 'Q43495'

Example output:

pfamseq_acc  pfamseq_id  bond_start  bond_end
Q43495       108_SOLLC           41        77
Q43495       108_SOLLC           79        99
Q43495       108_SOLLC           51        66
Q43495       108_SOLLC           67        92

Annotation information for a family

_images/annotation.png

In addition to the Pfam annotation, we also store InterPro annotation and their associated GO terms for each family. Links to other databases, e.g. SCOP) are also stored where appropriate. The pfamA table contains the GA, TC and NC cut-offs for each family, and additional information surrounding the Pfam-A family, including the number of sequences in the seed and full alignment. The pfamA_interactions table contains, where data are available, pairs of interacting Pfam domains. The data in this table are taken from the iPfam resource (no longer maintained), which describes physical interactions between Pfam domains that have a representative structure in the PDB.

Example query: Give me the Pfam annotation for the family ‘CBS’

SELECT comment
FROM   pfamA WHERE pfamA_id = 'CBS'

Example output:

comment: CBS domains are small intracellular modules that pair together
to form a stable globular domain [2]. This family represents a single CBS
domain. Pairs of these domains have been termed a Bateman domain [6]. CBS
domains have been shown to bind ligands with an adenosyl group such as
AMP, ATP and S-AdoMet [5].  CBS domains are found attached to a wide
range of other protein domains suggesting that CBS domains may play a
regulatory role making proteins sensitive to adenosyl carrying ligands.
The region containing the CBS domains in Cystathionine-beta synthase is
involved in regulation by S-AdoMet [4]. CBS domain pairs from AMPK bind
AMP or ATP [5]. The CBS domains from IMPDH and the chloride channel CLC2
bind ATP [5].

Example query: Give me all of the literature references for the family ‘CBS’

SELECT pfamA_literature_reference.comment, order_added, pmid, title, literature_reference.author, journal
FROM   pfamA, pfamA_literature_reference, literature_reference
WHERE  pfamA_id = 'CBS'
AND    pfamA.pfamA_acc = pfamA_literature_reference.pfamA_acc
AND    pfamA_literature_reference.auto_lit = literature_reference.auto_lit

Example output:

comment: Discovery and naming of the CBS domain.
order_added: 1
pmid: 9020585
title: The structure of a domain common to archaebacteria and the homocystinuria disease protein.
author: Bateman A;
journal: Trends Biochem Sci 1997;22:12-13.
...

Example query: Give me all of the database references for the family ‘A2M’

SELECT db_id, pfamA_database_links.comment, db_link, other_params
FROM   pfamA, pfamA_database_links
WHERE  pfamA_id = 'A2M'
AND    pfamA.pfamA_acc = pfamA_database_links.pfamA_acc

Example output:

db_id     comment  db_link      other_params
PROSITE            PDOC00440
SCOP               1c3d         fa
HOMSTRAD           A2M_A
HOMSTRAD           A2M_B

Note: The other_params column contains ‘fa;’ where the Pfam family corresponds to a SCOP family, and ‘sf;’ where the Pfam family corresponds to a SCOP superfamily.

Clan data

_images/clan.png

A Pfam clan is a set of related Pfam-A families. The information we use to determine which families belong to the same clan includes related structure, related function, matching of the same sequence to HMMs from different families, and profile-profile comparisons. Note that not all Pfam-A families belong to a clan and that a Pfam-A family cannot belong to more than one clan.

Example query: Give me the id and accession of the clan to which Pfam family ‘EGF’ belongs

SELECT clan_id, clan.clan_acc
FROM   clan, clan_membership, pfamA
WHERE  clan.clan_acc = clan_membership.clan_acc
AND    clan_membership.pfamA_acc = pfamA.pfamA_acc
AND    pfamA.pfamA_id = 'EGF'

Example output:

clan_id  clan_acc
EGF      CL0001

Example query: Give me all of the Pfam-A families that belong to clan ‘CL0001’

SELECT pfamA.pfamA_acc, pfamA_id
FROM   clan, clan_membership, pfamA
WHERE  clan.clan_acc = clan_membership.clan_acc
AND    clan_membership.pfamA_acc = pfamA.pfamA_acc
AND    clan.clan_acc = 'CL0001'

Example output:

PF01414   DSL
PF04863   EGF_alliinase
PF00053   Laminin_EGF
PF07645   EGF_CA
PF00008   EGF
PF07974   EGF_2
PF09064   Tme5_EGF_like
PF09289   FOLN
PF12661   hEGF
PF12662   cEGF
PF12946   EGF_MSP1_1
PF12947   EGF_3
PF14670   FXa_inhibition
PF06247   Plasmod_Pvs28
PF09443   CFC
PF00084   Sushi
PF09014   Sushi_2
PF00594   Gla
PF18193   Fibrillin_U_N
PF18372   I-EGF_1
PF18720   EGF_Tenascin
PF07699   Ephrin_rec_like
PF20626   Sp38_C
PF21195   C8A_B_C6_EGF-like
PF21284   C7_FIM2_N
PF21286   CFAI_FIMAC_N
PF21364   FBN_EGF_st1
PF21700   DL-JAG_EGF-like
PF21795   JAG1-like_EGF2

Example query: Give me the clan description and comment for clan ‘CL0001’

SELECT clan.clan_acc, clan_id, clan_description, clan_comment
FROM   clan
WHERE  clan_acc = 'CL0001'

Example output:

clan_acc: CL0001
clan_id: EGF
clan_description: EGF superfamily
clan_comment: Members of this clan all belong to the EGF superfamily ...

Example query: Give me the literature references for clan ‘CL0001’

SELECT comment, order_added, pmid, title, author, journal
FROM   clan, literature_reference, clan_lit_ref
WHERE  clan.clan_acc = clan_lit_ref.clan_acc
AND    clan_lit_ref.auto_lit = literature_reference.auto_lit
AND    clan.clan_acc = 'CL0001'

Example output:

    comment: NULL
order_added: 1
       pmid: 3282918
      title: Structure and function of epidermal growth factor-like regions in proteins.
     author: Appella E, Weber IT, Blasi F;
    journal: FEBS Lett 1988;231:1-4.

    comment: NULL
order_added: 2
       pmid: 11852228
      title: Domain structure and organisation in extracellular matrix proteins.
     author: Hohenester E, Engel J;
    journal: Matrix Biol 2002;21:115-128.

Example query: Give me the database links for clan ‘CL0001’

SELECT db_id, comment, db_link, other_params
FROM   clan_database_links, clan
WHERE  clan_database_links.clan_acc = clan.clan_acc
AND    clan.clan_acc = 'CL0001'

Example output:

db_id  comment  db_link     other_params
SCOP   NULL     57196
CATH   NULL     2.10.25.10

Dead families and clans

_images/dead.png

Sometimes we find that two or more Pfam-A families can be merged into a single family, which leads to the deletion of Pfam-A families. Likewise we might merge two clans together, which results in the deletion of a clan. The dead_family and dead_clan tables contain information about Pfam-A families and clans that have been deleted. These tables may be of use if you need to track what happened to the members of a particular family/clan that is no longer in Pfam.

Example query: Give me all of the information about ‘dead’ Pfam-A family ‘PF09410’

SELECT *
FROM dead_family
WHERE pfamA_acc = 'PF09410'

Example output:

 pfamA_acc: PF09410
  pfamA_id: DUF2006
   comment: Merged into PF07143
forward_to: PF07143
      user: jm14
    killed: 2009-08-25 10:33:41
     title: NULL

Example query: Give me all of the information about ‘dead’ clan ‘CL0152’

SELECT *
FROM dead_clan
WHERE clan_acc = 'CL0152'

Example output:

        clan_acc: CL0152
         clan_id: XI_TIM
clan_description: Xylose isomerase-like TIM barrel superfamily
 clan_membership:
         comment: Merged clan in to TIM_barrel clan
      forward_to: CL0036
            user: rdf
          killed: 2009-06-22 17:47:17

Nested domains

_images/nested1.png

Some Pfam-A domains are disrupted by the insertion of another domain (or domains) within them. The domain that is inserted into another is known as a nested domain. The nested_locations table stores all the nested Pfam-A domains. It also stores the coordinates of the nested domain with respect to a sequence that is present in the seed alignment of the domain in which it nests.

Example query: Give me all of the nested domains and the domains in which they are nested

SELECT A.pfamA_id, B.pfamA_id AS nested_domain
FROM   pfamA AS A, pfamA AS B, nested_domains
WHERE  A.pfamA_acc = nested_domains.pfamA_acc
AND    B.pfamA_acc = nested_domains.nests_pfamA_acc

Example output:

pfamA_id     nested_domain
CusB_dom_1   HlyD_D4
CusB_dom_1   HlyD_D23
RTC          RTC_insert
RtcB         HNH_3
RtcB         LAGLIDADG_3
RtcB         Intein_splicing
RtcB         Intein_splicing
Pro_dh       EF-hand_7
LGT          PDZ
HcyBio       Fer4
...

Example query: Give me the nested data for the family IMPDH

SELECT pfamA_id, nested_pfamA_acc, pfamseq_acc, seq_version, seq_start, seq_end
FROM   pfamA, nested_locations
WHERE  pfamA.pfamA_acc = nested_locations.pfamA_acc
AND    pfamA_id ="IMPDH"

Example output:

pfamA_id  nested_pfamA_acc  pfamseq_acc  seq_version  seq_start  seq_end
IMPDH     PF00571           Q21KQ8                 1        155      271

Proteomes

_images/proteome.png

As of Pfam 29.0, all sequences in the pfamseq table belong to a reference proteome, and therefore a complete proteome. Prior to Pfam 29.0 this was not the case. The complete_proteomes table contains statistics about the number of families and coverage. The tables in this section allow you to retrieve domain information about a particular species, or to retrieve all of the species which contain a particular Pfam domain.

Example query: Give me the Pfam summary for the human (Homo sapiens, ncbi taxid 9606) proteome

SELECT *
FROM   complete_proteomes
WHERE  ncbi_taxid=9606

Example output:

           ncbi_taxid: 9606
              species: Homo sapiens (Human)
             grouping: Eukaryota
 num_distinct_regions: 0
    num_total_regions: 115682
         num_proteins: 81828
    sequence_coverage: 72
     residue_coverage: 45
total_genome_proteins: 81828
      total_aa_length: 29701382
     total_aa_covered: 13467634
   total_seqs_covered: 58973

Example query: Give me all the Pfam-A domains for the species ‘Arabidopsis thaliana’

SELECT   r.pfamA_acc, pfamA_id, description, sum(number_domains)
FROM     pfamA p, proteome_regions r
WHERE    p.pfamA_acc=r.pfamA_acc
AND      ncbi_taxid=3702
GROUP BY r.pfamA_acc

Example output:

pfamA_acc  pfamA_id       description                                                      sum(number_domains)
PF00004    AAA            ATPase family associated with various cellular activities (AAA)                  178
PF00005    ABC_tran       ABC transporter                                                                  284
PF00006    ATP-synt_ab    ATP synthase alpha/beta family, nucleotide-binding domain                         13
PF00009    GTP_EFTU       Elongation factor Tu GTP binding domain                                           48
PF00010    HLH            Helix-loop-helix DNA-binding domain                                              233
PF00012    HSP70          Hsp70 protein                                                                     29
PF00013    KH_1           KH domain                                                                        106
...

Note: The ncbi_code for the species ‘Arabidopsis thaliana’ is 3702. This information can be found in the ncbi_taxonomy table.

Example query: Give me all of the protein sequences for the species ‘Arabidopsis thaliana’

SELECT pfamseq_acc
FROM   pfamseq
WHERE  ncbi_taxid = '3702'

Example output:

pfamseq_acc
A0A068FL09
A0A0A7EPL0
A0A140JWM8
A0A178U7J2
A0A178U7Y7
A0A178U807
A0A178U889
...

Example query: Give me all of the protein sequences from the species ‘Arabidopsis thaliana’ that belong to Pfam-A domain ‘PF00106’

SELECT pfamseq.pfamseq_acc
FROM   pfamseq, pfamA_reg_full_significant
WHERE  ncbi_taxid = '3702'
AND    pfamseq.pfamseq_acc = pfamA_reg_full_significant.pfamseq_acc
AND    pfamA_acc = 'PF00106'

Example output:

pfamseq_acc
A0A1I9LMA8
A0A1I9LNL3
A0A1I9LNL4
A0A1I9LNT2
A0A1I9LNT2
A0A1I9LQ55
A0A1I9LQ55
...

Data Files - Alignments, trees and HMMs

_images/alignments_and_trees.png

The seed, full, UniProtKB, NCBI, representative proteome and metaseq alignments are all stored as gzipped files in the database, as is the HMM for each family. Note that the NCBI and metaseq alignments may contain overlapping matches to Pfam-A families that belong to the same clan, however, the UniprotKB alignments (seed, full, uniprot and representative proteome sets) will not. This is because we have performed a clan filtering step on the UniProtKB data such that where there are overlapping Pfam-A matches within a clan, only the lowest E-value scoring match is included in the full alignment.