Welcome to Pfam’s documentation

Pfam is a large collection of protein families, each represented by multiple sequence alignments and profile hidden Markov models (HMMs).

Summary

Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and a profile hidden Markov model (HMM).

Each Pfam family, usually referred to as a Pfam-A entry, consists of a curated seed alignment containing a small set of representative members of the family, profile HMMs built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.

Pfam entries are classified in one of six types:

Types of Pfam entries.

Pfam clans

Structural properties are often more conserved than the underlying sequence, so a single profile HMM is often insufficient to model an entire, diverse structural superfamily. Related Pfam entries are therefore sometimes grouped together into clans. The relationship may be defined by:

  • sequence similarity (whilst still originating from a common ancestor)

  • similarity of known three-dimensional structures

  • functional similarity

  • and/or similarity between their profile HMMs (as determined by algorithms such as HHsearch).

The majority of Pfam Clans are groupings of domains and families.

Getting Started

Pfam is hosted by InterPro. All the information contained within Pfam is accessible on the InterPro website by browsing by member database and choosing Pfam. For more information about InterPro, see its documentation.

Site organisation

Scheme of the organisation of the information in the Pfam database.

Schematic representation of the organisation of the information in the Pfam database. The arrows represent the flow of data.

Note that the InterPro website uses the term ‘Conserved site’ instead of ‘Motif’, for consistency with the other member databases that form part of the InterPro consortium.

Searching Pfam

There are multiple ways to look for information in Pfam using the InterPro website.

Searching a specific Pfam entry

Users can navigate to specific Pfam entry pages by entering the Pfam identifier, the accession number or a keyword that forms part of the entry name in any of three different Search boxes:

  1. When selecting the Browse + By member database option, the search box is located in the header of the results table.

Selecting the 'Browse + By member database' option and Pfam.

Example of browsing the Pfam database. A paginated list of all available Pfam entries is displayed. A Search box appears on top of this list.

  2. After selecting Search + By text, a larger text box is shown in the center of the page.

Selecting **Search + By text**

Example of searching specific Pfam entry pages by entering the Pfam identifier or accession number or a keyword.

  3. In the top right corner of any InterPro page, next to the magnifying glass.

Search box available on the top right corner of any InterPro page.

On the InterPro website header, a search box appears when hovering the mouse next to the magnifying glass on the right; it can be used to search for Pfam information.

This text box allows you to go quickly to the relevant page in the InterPro site, depending on the search term entered:

Search                    | Find
Pfam accession number     | Pfam entry page
Pfam identifier or name   | Pfam entry page
Clan identifier           | Pfam Clan page
UniProt accession         | InterPro protein page, which includes Pfam matches (with coordinates)
Gene names                | InterPro protein page, which includes Pfam matches (with coordinates)
PDB identifier            | InterPro structure page, which includes a 3D visualisation of Pfam matches
Proteomes                 | If it is a reference proteome, the InterPro proteome page will be displayed
Keywords, free text       | List of possible matches
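
As a rough illustration of the mapping above, the following sketch guesses which kind of page a search term would lead to, based only on the format of the term. The function name and the regular expressions are assumptions made for this example (they follow identifiers quoted in this documentation, such as PF02171 and CL0219, and the published UniProt accession format). It is not a description of how the InterPro website resolves queries.

```python
import re

# Assumed identifier formats, for illustration only.
PATTERNS = [
    ("Pfam entry page", re.compile(r"PF\d{5}")),
    ("Pfam Clan page", re.compile(r"CL\d{4}")),
    ("InterPro protein page", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    # Classic four-character PDB identifiers start with a digit.
    ("InterPro structure page", re.compile(r"[0-9][A-Z0-9]{3}")),
]


def guess_target_page(term: str) -> str:
    """Return the kind of page a search term most likely points to."""
    candidate = term.strip().upper()
    for page, pattern in PATTERNS:
        if pattern.fullmatch(candidate):
            return page
    # Anything else is treated as a keyword or free-text search.
    return "List of possible matches"


if __name__ == "__main__":
    for term in ["PF02171", "CL0219", "Q6QME8", "argonaute"]:
        print(term, "->", guess_target_page(term))
```

Gene names and proteome identifiers are deliberately left out of this sketch, because their formats cannot be recognised reliably from the text alone.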

Searching a protein sequence against Pfam

Searching a protein sequence against the Pfam library of HMMs will enable you to find out the domain architecture of the protein, and thus what its potential function might be. If your protein is present in the UniProt version used to make the current release of InterPro, we have already calculated its domain architecture. You can access this by entering the UniProt sequence identifier in any of the Search boxes mentioned above (see Searching a specific Pfam entry).

Finding proteins with a specific set of domain combinations (Domain architectures)

Users can search protein sequences that contain specific Pfam entries in a particular arrangement by selecting Search + By Domain architecture in the InterPro website menu. Pfam entries that the proteins should or should not contain can be included or excluded from the domain architecture. The Order of domain matters option offers the possibility to arrange the domains in a particular order. The Exact match option fine tunes the search to find only proteins containing the selected domains (no extra domain in the proteins). Domains can be selected by entering a domain name, Pfam accession or InterPro accession.
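
The way these options combine can be sketched in a few lines of Python. This is only an illustration of one plausible reading of the options described above: the function name, its parameters and the handling of repeated domains are assumptions, not the logic used by the InterPro website.

```python
from typing import Sequence


def matches_architecture(
    protein_domains: Sequence[str],
    include: Sequence[str],
    exclude: Sequence[str] = (),
    order_matters: bool = False,
    exact_match: bool = False,
) -> bool:
    """Check whether a protein's ordered domain list satisfies a query."""
    # Any excluded domain rules the protein out immediately.
    if any(domain in protein_domains for domain in exclude):
        return False
    if exact_match:
        # No extra domains allowed; order is compared only if requested.
        if order_matters:
            return list(protein_domains) == list(include)
        return sorted(protein_domains) == sorted(include)
    if order_matters:
        # The included domains must appear in the given order,
        # possibly with other domains in between.
        remaining = iter(protein_domains)
        return all(domain in remaining for domain in include)
    return all(domain in protein_domains for domain in include)


if __name__ == "__main__":
    protein = ["PF02170", "PF02171"]  # example accessions, N- to C-terminus
    print(matches_architecture(protein, include=["PF02171"]))
    print(matches_architecture(protein, include=["PF02171"], exact_match=True))
```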

Selecting Search + By Domain architecture

Select Search + By Domain architecture in the InterPro menu, enter the desired Pfam entries and select/unselect the relevant options.

Pfam entry page organisation

In each Pfam entry page, different tabs with relevant information are available, as shown in the figure below.

Example of a Pfam entry with the default tab selected (Overview)

Example of a Pfam entry page (PF02171). All the tabs explained below can be found on the left-hand side menu. The Overview tab is displayed by default.

Overview

The entry Overview tab is the default display. The type of Pfam entry, the short name and the clan (if the entry belongs to one) are shown at the top; more information about how clans are defined can be found in Summary. Usually, a curated description of the entry is displayed below, with the relevant literature references.

If there is a Wikipedia page for the entry, the first paragraph and the box with an image of a three-dimensional structure and some cross-links are displayed. The full Wikipedia article can be opened in a new tab by clicking on the title.

Proteins

The list of proteins matching this entry is displayed in this tab. This view can be customised to show:

  1. All proteins (from the whole UniProtKB database).

  2. Only Reviewed proteins (from SwissProt - manually curated).

  3. Only Unreviewed proteins (from TrEMBL - derived from public databases automatically integrated into UniProt).

For each protein, the corresponding protein page in InterPro can be accessed by clicking on the protein accession or name; the InterPro taxonomy page can be accessed by clicking on the species name; and a small-size protein viewer displays the location of the Pfam entry in the protein. The coordinates of the match can be shown by hovering the mouse over it. You can also export this data in different formats, by clicking on the Export button, and customise the page settings, by clicking on the wheel icon.

Example of a Pfam entry with the tab Proteins selected.

Example of a Pfam entry page (PF02171) with the Proteins tab selected. The table is customised to show only Reviewed proteins. The screenshot was taken when hovering the mouse over the small-size protein viewer of UniProt Q6QME8.

Domain architectures

This tab shows the various domain arrangements of the proteins matched by the entry, in descending order of the number of proteins in which each architecture is seen. Identifying the different domains present in proteins is crucial to understanding how they function.

The protein viewer displays a representative sequence for each domain architecture, where the domain size is based on the real length of the domain in the protein. When hovering over a domain, more details are shown in a tooltip, including the domain’s position.

From this page, all related Pfam entry pages can also be accessed by clicking on a Pfam accession at the top of the viewer or on a short name on the right-hand side of the viewer. The list of proteins with this architecture is available by clicking on the protein number.
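
The ordering used in this tab can be reproduced for your own data with a simple count. The sketch below assumes a hypothetical mapping from protein accession to an ordered tuple of Pfam accessions; the function name and the example data are invented for illustration.

```python
from collections import Counter


def rank_architectures(
    proteins: dict[str, tuple[str, ...]],
) -> list[tuple[tuple[str, ...], int]]:
    """Count how often each domain architecture occurs, most frequent first."""
    counts = Counter(proteins.values())
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical proteins and their domain architectures.
    example = {
        "protein_1": ("PF02170", "PF02171"),
        "protein_2": ("PF02170", "PF02171"),
        "protein_3": ("PF02171",),
    }
    for architecture, n_proteins in rank_architectures(example):
        print(" - ".join(architecture), n_proteins)
```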

Example of a Pfam entry with the Domain architectures tab selected.

Example of a Pfam entry page (PF02171) with the tab Domain architectures selected.

Taxonomy

This tab shows by default a sunburst chart of all the species that the proteins matched by the Pfam entry belong to.

By default, eight individual nodes that are derived from the taxonomic lineage of each protein sequence, ranging from superkingdoms down to species, are displayed. For each node in the taxonomy tree there is a separate ring - and each ring is arranged radially, with the superkingdoms at the centre and the species around the outermost ring. The length of each ring is proportional to the number of proteins found within each taxon. You can choose how many rings you want to see from the options on the right-hand side of the page.

Segments of the sunburst chart are coloured according to their superkingdom, as explained in the Legends section. Mousing over any part of the sunburst chart shows the taxonomic name and level, with both the number of sequences and the number of species found at that level shown on the right-hand side.

These data can also be seen as a table and as a tree. In addition, it is possible to choose to see only data from key species instead. These visualisation options can be chosen from the icon panel above the sunburst. All this information can be downloaded in different formats.

Example of a Pfam entry with the Taxonomy tab selected.

Example of a Pfam entry page (PF02171) with the Taxonomy tab selected. The default sunburst chart is shown on the left-hand side, with the mouse hovering over the taxon Mammalia, and tables listing the species having proteins belonging to this Pfam entry are displayed on the right-hand side.

Proteomes

A list of the reference proteomes matched by the entry is displayed in this tab. Each item in this list shows the Proteome ID (which is a link to the Proteome page in InterPro), the name of the species carrying this proteome and the number of proteins in this proteome that match the entry. From the Actions column, users can also see a list of these proteins by clicking the first icon (View matching proteins), download the data in different formats or View proteome information.

Example of a Pfam entry with the Proteomes tab selected.

Example of a Pfam entry page (PF02171) with the tab Proteomes selected.

Structures

This tab displays a list of all the PDB structures linked to the proteins matching the Pfam entry. For each structure, you can see the PDB accession, the name of the structure in PDB, and a small-sized protein sequence viewer displaying the location of the Pfam entry in the protein structure chain.

Example of a Pfam entry with the Structures tab selected.

Example of a Pfam entry page (PF02171) with the tab Structures selected.

Viewing the structures of domains and proteins helps to understand what their function might be, and how individual residues are arranged in the three-dimensional space. Often, two residues which seem distant along the linear protein sequence can be very close in the folded protein.

By clicking on a PDB accession, name or small image of the structure, a view of the corresponding InterPro structure page that summarises all of the entries of Pfam and other databases and resources for each chain of the structure will be displayed in a protein sequence viewer.

The position of each entry within the overall 3D structure can be visualised by choosing the Pfam entry of interest in the drop-down list Highlight Entry in the 3D structure or by clicking on the bar corresponding to the entry match in the protein sequence viewer. Additionally, links to similar PDB viewers and cross-references to other structural databases are provided in the External links section.

Signature

This tab shows the HMM logo of the Pfam model, visualised using Skylign. HMM logos are one way of visualising profile HMMs. Logos provide a quick overview of the properties of an HMM in a graphical form.

The visualisation displays the amino acid conservation for each residue in the model. The rendered area can be dragged to a desired position to navigate large logos. Alternatively, a specific residue number can be written in the Model column text box. When selecting a particular residue in the logo, the probabilities of each amino acid are displayed in the bottom part.

Example of a Pfam entry with the Signature tab selected.

Example of a Pfam entry page (PF02171) with the tab Signature and the second residue position in the protein sequence selected.

AlphaFold

Many of the proteins matching the Pfam entry may have a structure predicted by AlphaFold and available in AlphaFoldDB. A list of all the predicted structures available in AlphaFoldDB for the proteins belonging to this entry is displayed in this tab. For each protein in the list, the tab displays its UniProt accession, name, species and length, together with a button that shows the predicted structure of the protein in the structure viewer.

It is also possible to click on the UniProt accession to go to the InterPro protein page and open its AlphaFold tab, where the position of each entry in the 3D structure viewer is displayed by clicking on the bar corresponding to the entry match in the protein sequence viewer.

Example of a Pfam entry with the AlphaFold tab selected.

Example of a Pfam entry page (PF02171) with the AlphaFold tab selected.

Alignment

Three different alignments can be chosen and visualised in this tab:

  1. The seed alignment shows the multiple sequence alignment used to create the HMM in Pfam. This is a representative set of sequences of the family and it normally contains a relatively small number of protein sequences (from the UniProt Reference Proteomes).

  2. The full alignment shows all the protein sequences from the UniProt Reference Proteomes that match this model.

  3. The uniprot alignment includes all the protein sequences matched by this Pfam model in the whole UniProt database.

The colour coding of the alignment can be customised through the options available in the Colors section.

All the alignments can be downloaded by clicking on the Download button.

Example of a Pfam entry with the Alignment tab selected.

Example of a Pfam entry page (PF02171) with the Alignment tab and the seed alignment selected. The right edge of the grey bar was dragged to the left to zoom in on a specific region of the selected protein sequence.

Curation

This tab is divided into two subsections:

  1. In the first section, you can see details about Pfam curators and Sequence ontology.

  2. The second section displays the HMM building command used to generate the HMM profile defining the Pfam entry and offers the possibility to download it.

Example of a Pfam entry with the Curation tab selected.

Example of a Pfam entry page (PF02171) with the tab Curation selected.

Pfam entries creation and annotation

For each Pfam entry, the HMM model is run against the protein sequences belonging to the UniProt Reference Proteomes. Subsequently, Pfam curators set a statistical cut-off, known as a gathering threshold (GA) for an entry. Sequences failing to make a statistical match above this threshold are not reported as hits. The threshold is quite conservative, to minimise false positives (although they are unavoidable sometimes). The Pfam model is then run against the whole UniProtKB database before every InterPro release and these are the matches shown in the Proteins tab on the Pfam entry page.
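
The effect of the gathering threshold can be pictured as a simple filter over search hits. The sketch below is a simplification under stated assumptions: it represents the cut-off as a single bit score and treats a score equal to the threshold as passing. The class, field and function names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    sequence_id: str
    bit_score: float


def apply_gathering_threshold(hits: list[Hit], ga: float) -> list[Hit]:
    """Keep only the hits that score at or above the gathering threshold."""
    return [hit for hit in hits if hit.bit_score >= ga]


if __name__ == "__main__":
    # Hypothetical hits and threshold.
    hits = [Hit("seq_a", 30.0), Hit("seq_b", 20.0)]
    for hit in apply_gathering_threshold(hits, ga=25.0):
        print(hit.sequence_id, hit.bit_score)
```

A higher threshold removes more false positives at the cost of missing more distant family members, which is why the cut-off is set per entry by a curator.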

Different Pfam entries have annotations providing diverse amounts of information. Many of them have a description created by Pfam curators. Anyone can contribute to this annotation by contacting the curators directly through the Add your annotation toolbox located on the right-hand side of the Overview tab.

If you know of a domain that is not present in Pfam, you can submit it to the Pfam helpdesk and we will endeavour to build a Pfam entry for it. We ask that you supply us with a multiple sequence alignment of the domain and associated literature evidence, if available. Please send the alignment as a plain text file (e.g. .txt) rather than in an application-specific format such as a Microsoft Word .doc file.

Give feedback to the curators.

Select Add your annotation to give feedback to curators.

In addition, Pfam encourages the annotation of Pfam families via Wikipedia. Below the traditional description of the Pfam entry, you may find the text from a Wikipedia article that we feel provides a good description of the Pfam family.

If a family does not yet have a Wikipedia article assigned to it, there are several ways for you to help us add one. You can find more information about the process in the Wikipedia section.

Clan page organisation

If a Pfam entry is included in a Pfam clan this information will be displayed in the Overview tab in the Pfam entry page, next to Clan, below the Pfam short name, with a link to the corresponding clan page. More information about how clans are defined can be found in Summary.

Additionally, it is possible to browse through the Pfam clans by selecting Browse + By Clan/Set in the InterPro website menu and then selecting Pfam in the database section.

Example of a Pfam clan page with the default tab selected (Overview)

Example of a Pfam clan page (CL0219). All the tabs described below can be found on the left-hand side menu. The Overview tab is displayed by default.

In each Pfam clan page, different tabs with relevant information are available. The information they contain is described below.

Overview

The clan Overview tab is the default display, where the clan accession number, its short name and the author(s) are shown at the top. A description of the clan is displayed below, with the relevant literature references.

An interactive view of the Pfam entries included in the clan is also displayed. Different label types can be chosen through the Label Content menu: Accession, Name and Short name.

Entries

The list of Pfam entries included in the clan is provided in this tab. For each entry, the accession, name, short name and links to the entry’s seed alignment and domain architectures pages are available.

Users can export this data in different formats, by clicking on the Export button, and customise the page settings, by clicking on the wheel icon.

Example of a Pfam clan page with the Entries tab selected.

Example of a Pfam clan page (CL0219) with the Entries tab selected.

Proteins

The list of proteins matching any Pfam entry belonging to the clan is displayed in this tab. The view can be customised to show:

  1. All proteins (from the whole UniProtKB database).

  2. Only Reviewed proteins (from SwissProt - manually curated).

  3. Only Unreviewed proteins (from TrEMBL - derived from public databases automatically integrated into UniProt).

For each protein, the corresponding protein page in InterPro can be accessed by clicking on the protein accession or name, and the InterPro taxonomy page can be accessed by clicking on the species name.

Users can export this data in different formats, by clicking on the Export button, and customise the page settings, by clicking on the wheel icon.

Example of a Pfam clan page with the tab Proteins selected.

Example of a Pfam clan page (CL0219) with the Proteins tab selected. The table is customised to show only Reviewed proteins.

Structures

This tab displays a list of all the PDB structures linked to the proteins matching any Pfam entry belonging to the clan. For each structure, you can see the PDB accession and the name of the structure in PDB.

By clicking on a PDB accession, name or small image of the structure, a view of the corresponding InterPro structure page that summarises all of the entries of Pfam and other databases and resources for each chain of the structure will be displayed in a protein sequence viewer.

The position of each entry within the overall 3D structure can be visualised by choosing the Pfam entry of interest in the drop-down list Highlight Entry in the 3D structure or by clicking on the bar corresponding to the entry match in the protein sequence viewer. Additionally, links to similar PDB viewers and cross-references to other structural databases are provided in the External links section.

Example of a Pfam clan page with the Structures tab selected.

Example of a Pfam clan page (CL0219) with the Structures tab selected.

Taxonomy

This tab shows by default a list of all the species that the proteins matched by any Pfam entry of the clan belong to.

These data can also be seen as a tree. These visualisation options can be chosen from the icon panel above the list. All this information can be downloaded in different formats.

Example of a Pfam clan page with the Taxonomy tab selected.

Example of a Pfam clan page (CL0219) with the Taxonomy tab selected. The default table listing the species having proteins belonging to this Pfam clan is displayed on top and an example view of a taxonomic tree for this clan is shown below.

Proteomes

A list of the reference proteomes matched by any Pfam entry belonging to the clan is displayed in this tab. For each item in this list the Proteome ID (which is a link to the Proteome page in InterPro), the name of the species carrying this proteome and the number of proteins in this proteome that match the entry are displayed. From the Actions column, users can also access a list of these proteins by clicking the first icon (View matching proteins), download the data in different formats or View proteome information.

Example of a Pfam clan page with the Proteomes tab selected.

Example of a Pfam clan page (CL0219) with the tab Proteomes selected.

Alignment

This tab shows a list of the Pfam entries belonging to the clan with a relationship to each other. By clicking on each entry, users can see a small-size protein viewer showing the alignment of the related entries.

Example of a Pfam clan page with the Alignment tab selected.

Example of a Pfam clan page (CL0219) with the tab Alignment selected.

Training materials

Pfam Quick tour

  • Quick tour provides a brief introduction to the Pfam database and how to access its annotations through the InterPro website.

Creating Families

Repeats in Pfam

  • Repeats describes how repeats are represented in Pfam.

Finding Pfam information in the InterPro website

  • Webinar explaining where to find Pfam annotations in the InterPro website.

Frequently Asked Questions (FAQs)

What is Pfam?

Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam profile HMM represents a protein family or domain. By searching a protein sequence against the Pfam library of profile HMMs, you can determine which domains it carries, i.e. its domain architecture. Pfam can also be used to analyse proteomes and to address questions about more complex domain architectures.

For each Pfam accession, we have an entry page. See Searching a specific Pfam entry for more information on how to access them.

What is a Pfam entry page?

On the Pfam entry page you can view all the associated information, from annotation to structure predictions of the protein members. See Pfam entry page organisation for a detailed description on how this data is presented.

What is a clan?

Some of the Pfam entries are grouped into clans. Pfam defines a clan as a collection of entries that have arisen from a single evolutionary origin. Evidence of their evolutionary relationship can be in the form of similarity in tertiary structures, or, when structures are not available, from common sequence motifs.

When a sequence region has overlapping matches to more than one entry within the same clan, we only show one of those matches. If the sequence region is also in the seed alignment for an entry, only the match to that entry is shown. Otherwise we show the entry that corresponds to the match with the lowest E-value.
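
The rule above can be written down as a short selection function. This is a minimal sketch, assuming that the overlapping matches for one sequence region have already been collected; the class and function names are invented, and falling back to the lowest E-value when several seed matches are present is an assumption not stated above.

```python
from dataclasses import dataclass


@dataclass
class Match:
    entry: str
    evalue: float
    in_seed: bool = False  # True if the region is in this entry's seed alignment


def pick_clan_match(overlapping: list[Match]) -> Match:
    """Choose which of several overlapping same-clan matches to show."""
    seed_matches = [m for m in overlapping if m.in_seed]
    # A match backed by the seed alignment wins over any other match.
    candidates = seed_matches if seed_matches else overlapping
    # Otherwise (or to break ties) the lowest E-value wins.
    return min(candidates, key=lambda m: m.evalue)


if __name__ == "__main__":
    # Hypothetical overlapping matches from two entries in the same clan.
    matches = [Match("entry_A", 1e-30), Match("entry_B", 1e-5, in_seed=True)]
    print(pick_clan_match(matches).entry)
```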

The clan pages can be accessed by following a link from the Pfam entry page, or alternatively by selecting Browse + By Clan/Set in the InterPro website menu and then selecting Pfam in the database section.

For each clan page, you can access all the related data. See Clan page organisation for more information.

What criteria do you use for adding families into clans?

We use a variety of measures. Where possible we do use experimental and predicted structures to guide us and that is always the gold standard. We also intend to harmonise this organisation with the ECOD classification. In the absence of a structure we use:

  • Profile comparisons such as HHsearch

  • The fact that a sequence significantly matches two profile HMMs in the same region of the sequence

  • A method called SCOOP, that looks for common matches in search results that may indicate a relationship

All of this information is used by the Pfam curators to decide whether families are related, and we strive to find information in the literature that supports the relationship, e.g. common function.

What is Pfam-N?

Pfam-N (N for network) provides additional Pfam matches identified by the Google Research team using deep learning approaches. You can read more about it in this initial blog post and this update. The matches for Pfam-N are displayed under the ‘Other features’ section in the protein sequence viewer.

Example of InterPro protein page showing the protein viewer

Example of an InterPro protein page for the UniProt accession A1AA27. The protein viewer shows the integrated and unintegrated Pfam entries matching this protein sequence, as well as other features such as the Pfam-N matches. The colour code of the protein viewer is customised as Colour By + Member Database so that all Pfam entries are highlighted in blue. The tooltip is active and the mouse was hovering over one of the Pfam-N matches when this screenshot was taken.

What is the relation between Pfam and InterPro?

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and a profile hidden Markov model (HMM) and has information associated. All the information in the Pfam database can be accessed through the InterPro website, where it is hosted. See Getting started for more information.

InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites through the use of predictive models, known as signatures, provided by several collaborating databases (referred to as member databases). One of its 13 member databases is Pfam. For further information you can explore the InterPro About pages.

Members of the Pfam team at the EMBL-EBI are also part of the InterPro team. In this way, while both protein resources are independently maintained, there is a really close relation between them, with feedback constantly going in both directions to improve protein classification.

This Pfam entry is not integrated into InterPro, is it useful anyway?

Yes! The criteria for creating a new Pfam entry and a new InterPro entry are different. A Pfam entry might not yet be curated in InterPro or might not reach InterPro’s standards for integration. However, it can still provide very important information about a protein of interest.

Is it possible to build Wise2 with HMMER3 support?

We work around the difference in HMMER versions by converting the profile HMMs from HMMER3 format to HMMER2 format using the HMMER3 program “hmmconvert” (with the -2 flag). To make the searches feasible, we screen the DNA for potential domains using ncbi-blast with Pfam-A.fasta as the target library. GeneWise is then used to search a subset of profile HMMs against the DNA. There is some down-weighting of the bits-per-position between H2 and H3 HMMs that the conversion does not account for, leading inevitably to some false negatives for some families/sequences. However, until GeneWise is patched to deal with HMMER3 models, this is the best course of action.

How can I search Pfam locally?

If you have a large number of sequences or you don’t want to post your sequence across the web, you can search your sequence locally using InterProScan.

Why doesn’t Pfam include my sequence?

Pfam is built from a fixed release of UniProtKB. At each InterPro release we incorporate sequences from the latest release of UniProtKB. This means that, at any time, the sequences used by Pfam might be several weeks behind those in the most up-to-date versions of the sequence databases. If your sequence isn’t in Pfam, you can still find out what domains it contains by pasting it into the sequence search box (see InterPro online sequence search for more information).

Why is there apparent redundancy of UniProtKB IDs in the full-length FASTA sequence file?

A given Pfam family may match a single protein sequence multiple times, if the domain/family is a repeating unit, for example, or when the profile HMM matches only to short stretches of the sequence but matches several times. In such cases the FASTA file with the full length sequences will contain multiple copies of the same sequence.
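
If the repeated records get in the way of your analysis, they can be collapsed by identifier. The sketch below assumes that the identifier is the first whitespace-delimited token of each FASTA header line; the function name and the example records are invented for illustration, so check the header format of the file you downloaded before relying on it.

```python
def unique_fasta_records(fasta_text: str) -> dict[str, str]:
    """Return one sequence per identifier, keeping the first copy seen."""
    records: dict[str, str] = {}
    current_id = None
    keep = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumed header layout: ">identifier optional description".
            current_id = line[1:].split()[0]
            keep = current_id not in records
            if keep:
                records[current_id] = ""
        elif keep and current_id is not None:
            records[current_id] += line
    return records


if __name__ == "__main__":
    # Made-up records: SEQ1 appears twice because it matched the family twice.
    text = ">SEQ1 example\nMKV\nLLA\n>SEQ1 example\nMKV\nLLA\n>SEQ2\nMSS\n"
    print(unique_fasta_records(text))
```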

How can I submit a new domain?

If you know of a domain that is not present in Pfam, you can submit it to the Pfam helpdesk and we will endeavour to build a Pfam entry for it. We ask that you supply us with a multiple sequence alignment of the domain or a list of UniProt accessions, and associated literature evidence if available. Please send the alignment as a plain text file (e.g. .txt) rather than in an application-specific format such as a Microsoft Word .doc file.

Can I search my protein against Pfam?

Of course! Please look at the sequence search section for instructions on how to do it.

What is the difference between the ‘-’ and ‘.’ characters in your full alignments?

The ‘-’ and ‘.’ characters both represent gaps. However, they tell you some extra information about how the profile HMM has generated the alignment. The ‘-’ symbols mark positions where the alignment of the sequence has used a delete state in the profile HMM to jump past a match state. This means that the sequence is missing a column that the profile HMM was expecting to be there. The ‘.’ character is used to pad gaps where another sequence in the alignment has residues emitted from the profile HMM’s insert state. See the alignment below, where both characters are used. The profile HMM states emitting each column are shown. Note that residues emitted from the Insert (I) state are in lower case.

[Figure: example alignment using both gap characters, with the profile HMM state emitting each column]
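
The conventions described above can be checked programmatically. The following sketch tallies the kinds of characters in one aligned row, following the rules given in this answer (upper case for match-state residues, lower case for insert-state residues, ‘-’ for deletions and ‘.’ for padding). The function name and the example row are invented for illustration.

```python
def describe_alignment_row(row: str) -> dict[str, int]:
    """Count match, insert, delete and padding characters in an aligned row."""
    counts = {"match": 0, "insert": 0, "delete": 0, "pad": 0}
    for char in row:
        if char == "-":
            counts["delete"] += 1   # delete state: expected column is missing
        elif char == ".":
            counts["pad"] += 1      # padding opposite another sequence's insert
        elif char.islower():
            counts["insert"] += 1   # residue emitted from an insert state
        elif char.isupper():
            counts["match"] += 1    # residue emitted from a match state
    return counts


if __name__ == "__main__":
    # Made-up aligned row, not taken from a real Pfam alignment.
    print(describe_alignment_row("AC-Dgh.E"))
```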

How can I visualise the position of a Pfam entry in a structure?

In the Structures tab of a Pfam entry or a Pfam clan page you can find links to relevant InterPro structure pages.

In an InterPro structure page, the matches to Pfam and other databases and resources are displayed for each chain of the structure in a protein sequence viewer. The 3D structure viewer is shown above it.

The position of each Pfam entry within the overall 3D structure can be visualised by:

  • hovering the mouse over the coloured bar representing the Pfam match in the protein sequence viewer.

  • choosing the Pfam entry of interest in the drop-down list Highlight Entry in the 3D structure.

The AlphaFold tab of a Pfam entry provides links to the predicted structure of every protein matching the entry. In the AlphaFold tab of InterPro protein pages, the position of each Pfam entry within the overall 3D structure can be visualised by hovering the mouse over the coloured bar representing the Pfam match in the protein sequence viewer.

Example of the AlphaFold tab of an InterPro protein page showing the structure viewer

Example of the AlphaFold tab in the InterPro protein page for the UniProt accession A1AA27. When the screenshot was taken, the mouse was hovering over the Pfam entry PF20258.

Why don’t you have domain YYYY in Pfam?

We are very keen to be alerted to new domains. If you can provide us with a multiple sequence alignment then we will try hard to incorporate it into the database. If you know of a domain but don’t have a multiple sequence alignment, we still want to know: for simple families, just one sequence is enough. Again, please contact the Pfam helpdesk.

Are there other databases which do this?

To a certain extent, yes. There are a number of “second generation” databases which are trying to organise protein space into evolutionarily conserved regions. InterPro combines information from several of them in a single searchable resource.

So which database is better?

As with everything, it depends on your problem: we would certainly suggest using more than one method. Pfam is likely to provide more interpretable results, with crisp definitions of domains in a protein.

Glossary

Alignment coordinates

HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.

Architecture

The collection of domains that are present on a protein.

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Domain

A structural unit.

Domain score

The score of a single domain aligned to a profile HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.

DUF

Domain of unknown function.

Envelope coordinates

See Alignment coordinates.

Family

A collection of related protein regions.

Full alignment

An alignment of the set of related sequences which score higher than the manually set threshold values for the profile HMMs of a particular Pfam entry.

Gathering threshold (GA)

Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam profile HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.

HMMER

The suite of programs that Pfam uses to build and search profile HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.

Hidden Markov model (HMM)

A profile HMM is a probabilistic model. In Pfam we use profile HMMs to transform the information contained within a multiple sequence alignment into a position-specific scoring system. We search our profile HMMs against the UniProt protein database to find homologous sequences.

Motif

A short unit found outside globular domains.

Noise cutoff (NC)

The bit scores of the highest scoring match not in the full alignment.

Pfam-A

A hand-curated, profile HMM-based Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each profile HMM and search our models against the UniProtKB database. All of the sequences which score above the threshold for a Pfam entry are included in the entry’s full alignment.

Pfam-B

A set of unannotated, computationally generated multiple sequence alignments. They are one of the sources we use for creating Pfam-A entries.

Posterior probability

HMMER reports a posterior probability for each residue that matches a ‘match’ or ‘insert’ state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with ‘*’ being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability, and red ones indicate a lower posterior probability.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the profile HMMs for the Pfam entry.

Sequence score

The total score of a sequence aligned to a profile HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence score and the domain score for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.

Trusted cutoff (TC)

The bit scores of the lowest scoring match in the full alignment.

Pfam scores

E-values and Bit-scores

Pfam-A is based around hidden Markov model (HMM) searches, as provided by the HMMER3 package. In HMMER3, like BLAST, E-values (expectation values) are calculated. The E-value is the number of hits that would be expected to have a score equal to or better than this value by chance alone. A good E-value is much less than 1. A value of 1 is what would be expected just by chance. In principle, all you need to decide on the significance of a match is the E-value.

E-values are dependent on the size of the database searched, so we use a second system in-house for maintaining Pfam models, based on a bit score (see below), which is independent of the size of the database searched. For each Pfam family, we set a bit score gathering (GA) threshold by hand, such that all sequences scoring at or above this threshold appear in the full alignment. It works out that a bit score of 24 equates to an E-value of approximately 0.1, and a score of 27 to approximately 0.01. From the gathering threshold both a “trusted cutoff” (TC) and a “noise cutoff” (NC) are recorded automatically. The TC is the score of the lowest scoring match at or above the GA, i.e. the lowest scoring sequence included in the full alignment, and the NC is the score of the sequence next below the GA, i.e. the highest scoring sequence not included in the full alignment.
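
The relationship between the three cutoffs can be illustrated with a short sketch. It assumes a plain list of bit scores for one family and a GA chosen by a curator; the function name and the example scores are made up for illustration, and the sketch ignores the distinction between sequence and domain cutoffs.

```python
def trusted_and_noise_cutoffs(bit_scores, gathering_threshold):
    """Return (TC, NC) for a list of bit scores and a curated GA.

    TC: lowest score at or above the GA (weakest match that is kept).
    NC: highest score below the GA (best match that is left out).
    Either value is None when no score falls on that side of the GA.
    """
    included = [s for s in bit_scores if s >= gathering_threshold]
    excluded = [s for s in bit_scores if s < gathering_threshold]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


# Hypothetical search results for one family, GA set to 25.0 bits.
scores = [85.2, 40.1, 27.3, 24.9, 19.8, 12.0]
tc, nc = trusted_and_noise_cutoffs(scores, 25.0)
```

With these made-up scores the GA of 25.0 sits between a TC of 27.3 and an NC of 24.9.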

Sequence versus domain scores

There’s an additional wrinkle in the scoring system. HMMER3 calculates two kinds of scores, the first for the sequence as a whole and the second for the domain(s) on that sequence. The “sequence score” is the total score of a sequence aligned to the model (the HMM); the “domain score” is the score for a single domain. These two scores are virtually identical where only one domain is present on a sequence. Where there are multiple occurrences of the domain on a sequence, any individual match may be quite weak, but the sequence score is the sum of all the individual domain scores, since finding multiple instances of a domain increases our confidence that the sequence belongs to the protein family, i.e. truly matches the model.

Meaning of bit-score for non-mathematicians

A bit score of 0 means that the likelihood of the match having been emitted by the model is equal to that of it having been emitted by the Null model (by chance). A bit score of 1 means that the match is twice as likely to have been emitted by the model as by the Null. A bit score of 2 means that the match is 4 times as likely to have been emitted by the model as by the Null. So, a bit score of 20 means that the match is 2 to the power 20 times as likely to have been emitted by the model as by the Null.
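
The arithmetic above amounts to raising 2 to the power of the bit score. A minimal sketch follows; the function names are illustrative and not part of HMMER.

```python
import math


def odds_ratio_from_bits(bit_score: float) -> float:
    """How many times more likely the match is under the model than the Null."""
    return 2.0 ** bit_score


def bits_from_odds_ratio(odds_ratio: float) -> float:
    """Inverse conversion: the bit score for a given likelihood ratio."""
    return math.log2(odds_ratio)


# The values quoted in the text: 0, 1, 2 and 20 bits.
ratios = [odds_ratio_from_bits(b) for b in (0, 1, 2, 20)]
```

A negative bit score gives a ratio below 1, meaning the match is better explained by the Null model than by the family model.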

Citing Pfam

Pfam References

Pfam: The protein families database in 2021: J. Mistry, S. Chuguransky, L. Williams, M. Qureshi, G.A. Salazar, E.L.L. Sonnhammer, S.C.E. Tosatto, L. Paladin, S. Raj, L.J. Richardson, R.D. Finn, A. Bateman Nucleic Acids Research (2020) doi: 10.1093/nar/gkaa913

The Pfam protein families database in 2019: S. El-Gebali, J. Mistry, A. Bateman, S.R. Eddy, A. Luciani, S.C. Potter, M. Qureshi, L.J. Richardson, G.A. Salazar, A. Smart, E.L.L. Sonnhammer, L. Hirsh, L. Paladin, D. Piovesan, S.C.E. Tosatto, R.D. Finn Nucleic Acids Research (2019) doi: 10.1093/nar/gky995

The Pfam protein families database: towards a more sustainable future: R.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, S.C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, G.A. Salazar, J. Tate, A. Bateman Nucleic Acids Research (2016) Database Issue 44:D279-D285

The Pfam protein families database: R.D. Finn, A. Bateman, J. Clements, P. Coggill, R.Y. Eberhardt, S.R. Eddy, A. Heger, K. Hetherington, L. Holm, J. Mistry, E.L.L. Sonnhammer, J. Tate, M. Punta Nucleic Acids Research (2014) Database Issue 42:D222-D230

The Pfam protein families database: M. Punta, P.C. Coggill, R.Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E.L.L. Sonnhammer, S.R. Eddy, A. Bateman, R.D. Finn Nucleic Acids Research (2012) Database Issue 40:D290-D301

The Pfam protein families database: R.D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J.E. Pollington, O.L. Gavin, P. Gunesekaran, G. Ceric, K. Forslund, L. Holm, E.L. Sonnhammer, S.R. Eddy, A. Bateman Nucleic Acids Research (2010) Database Issue 38:D211-D222

The Pfam protein families database: R.D. Finn, J. Tate, J. Mistry, P.C. Coggill, J.S. Sammut, H.R. Hotz, G. Ceric, K. Forslund, S.R. Eddy, E.L. Sonnhammer and A. Bateman Nucleic Acids Research (2008) Database Issue 36:D281-D288

Pfam: clans, web tools and services: R.D. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S.R. Eddy, E.L.L. Sonnhammer and A. Bateman Nucleic Acids Research (2006) Database Issue 34:D247-D51

Enhanced protein domain discovery by using language modeling techniques from speech recognition: L. Coin, A. Bateman and R. Durbin Proc. Natl. Acad. Sci. USA. (2003) 100(8):4516-20

The Pfam Protein Families Database: A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats and S.R. Eddy Nucleic Acids Research (2004) 32:D138-D141

The Pfam Protein Families Database: A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S.R. Eddy, S. Griffiths-Jones, K.L. Howe, M. Marshall and E.L. Sonnhammer Nucleic Acids Research (2002) 30(1):276-280

The Pfam Protein Families Database: A. Bateman, E. Birney, R. Durbin, S.R. Eddy, K.L. Howe and E.L. Sonnhammer Nucleic Acids Research (2000) 28:263-266

Pfam 3.1: 1313 multiple alignments match the majority of proteins: A. Bateman, E. Birney, R. Durbin, S.R. Eddy, R.D. Finn and E.L.L. Sonnhammer Nucleic Acids Research (1999) 27:260-262

Pfam: multiple sequence alignments and HMM-profiles of protein domains: E.L.L. Sonnhammer, S.R. Eddy, E. Birney, A. Bateman and R. Durbin Nucleic Acids Research (1998) 26:320-322

Pfam: a comprehensive database of protein families based on seed alignments: E.L.L. Sonnhammer, S.R. Eddy and R. Durbin Proteins (1997) 28:405-420

Book Chapters on Pfam

Homology-Based Annotation of Large Protein Datasets M. Punta, J. Mistry Data Mining Techniques for the Life Sciences. Methods in Molecular Biology vol 1415 (2016) doi: 10.1007/978-1-4939-3572-7_8

Identifying Protein Domains with the Pfam Database P. Coggill, R.D. Finn, A. Bateman Current Protocols in Bioinformatics Chapter 2, Unit 2.5 (2008) doi: 10.1002/0471250953.bi0205s23

Pfam: a domain-centric method for analysing proteins and proteomes J. Mistry and R.D. Finn Comparative Genomics. Methods in Molecular Biology vol 396 (2007) doi: 10.1007/978-1-59745-515-2_4

Pfam: the protein families database R.D. Finn (eds M.J. Dunn, L.B. Jorde, P.F.R. Little, S. Subramaniam) Genetics, Genomics, Proteomics and Bioinformatics, Section 6: Protein Families (2005) doi: 10.1002/047001153X.g306303

Identifying protein domains with the Pfam database R.D. Finn, A. Bateman and S. Griffiths-Jones Current Protocols in Bioinformatics (2003) doi: 10.1002/0471250953.bi0205s01

Pfam Annotation in Wikipedia

Pfam encourages the annotation of Pfam entries via Wikipedia. Below the traditional description of the Pfam entry, you may find the text from a Wikipedia article that we feel provides a good description of the Pfam entry.

Wikipedia content in the website

When we build a new Pfam family, we try to find a Wikipedia article that describes the family and provides what we feel to be a valuable annotation for it.

Where a Wikipedia article has been assigned to a family, the Overview tab of the Pfam family page will show the first paragraph of the article together with the image and main table on it, below the traditional Pfam annotation created by curators. Click on the title of the Wikipedia article for the full article to open in a new tab.

The Overview tab of the Pfam entry page displays the associated Wikipedia article when available.

Contributing annotations

One of the advantages of using Wikipedia to provide our annotations is that any user can now contribute to that annotation text. In many cases, families that do not yet have a Wikipedia article can be assigned an article that already exists. In some cases, however, no suitable article exists, and in that case we would encourage you to consider adding one to Wikipedia yourself.

You can now contribute to the improvement of Pfam annotations in several ways. Besides giving feedback directly to the curators to improve the traditional description, you can improve existing Wikipedia articles linked to Pfam families. In addition, if you come across a family that does not yet have a Wikipedia article assigned to it, we would really like to add one. If you know of an article that would provide a useful description of a family, please let us know via our annotation submission form (click the Add your annotation button on the family page).

Editing Wikipedia articles

Before you edit for the first time

Wikipedia is a free, online encyclopedia. Although anyone can edit or contribute to an article, Wikipedia has some strong editing guidelines and policies, which promote the Wikipedia standard of style and etiquette. Your edits and contributions are more likely to be accepted (and remain) if they are in accordance with this policy.

You should take a few minutes to view the following pages:

How your contribution will be recorded

Anyone can edit a Wikipedia entry. You can do this either as a new user or you can register with Wikipedia and log on. When you click on the “Edit Wikipedia article” button, your browser will direct you to the edit page for this entry in Wikipedia. If you are a registered user and currently logged in, your changes will be recorded under your Wikipedia user name. However, if you are not a registered user or are not logged on, your changes will be logged under your computer’s IP address. This has two main implications. Firstly, as a registered Wikipedia user your edits are more likely to be seen as a valuable contribution (although all edits are open to community scrutiny regardless). Secondly, if you edit under an IP address you may be sharing this IP address with other users. If your IP address has previously been blocked (due to being flagged as a source of ‘vandalism’) your edits will also be blocked. You can find more information on this and on creating a user account in Wikipedia.

Does Pfam agree with the content of the Wikipedia entry?

Pfam has chosen to link families to Wikipedia articles. In some cases we have created or edited these articles, but in many other cases we have not made any direct contribution to the content of the article. The Wikipedia community does monitor edits to try to ensure that (a) the quality of article annotation increases, and (b) vandalism is very quickly dealt with. However, we would like to emphasise that Pfam does not curate the Wikipedia entries and we cannot guarantee the accuracy of the information on the Wikipedia page.

Contact us

If you have problems editing or experience problems with these pages please contact us through the Pfam helpdesk.

Generating graphics

We provide different tools to generate graphical representations of the features found within a sequence. There are a variety of different shapes and styles and each one has a particular meaning. This page gives an in-depth description of the elements of the library from the Nightingale component and the Domain graphics tool.

Domain visualisation using Nightingale

The Nightingale component is used throughout the InterPro website to display protein features in the protein sequence viewer. We provide a tool that allows you to generate a personalised representation of protein features using Nightingale v4.

In the JavaScript part in the link above, you can edit the sequence and feature variables to display the features for your protein of interest. You can then take a screenshot of the graphical representation generated.

For each component, you can specify the following parameters:

{ // family/single domain
  accession: "PF14826",
  start: 19,
  end: 181,
  color: "blue",
  short_name: "FACT-Spt16_Nlob",
  shape: "roundRectangle"
},

{ // discontinuous domain
  accession: "PF08644",
  locations: [{ fragments: [{ start: 520, end: 616 }, { start: 725, end: 810 }] }],
  color: "#A42ea2",
  short_name: "SPT16",
  shape: "roundRectangle"
}
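
The two snippets above describe positions in two different ways: a continuous feature uses start and end directly, while a discontinuous one lists fragments inside locations. As a hedged illustration of how the two forms relate, the sketch below mirrors them as Python dictionaries and reduces both to a list of (start, end) pairs. The helper names are hypothetical and are not part of Nightingale.

```python
def feature_fragments(feature: dict) -> list[tuple[int, int]]:
    """Return the (start, end) pairs covered by one feature description."""
    if "locations" in feature:
        return [
            (fragment["start"], fragment["end"])
            for location in feature["locations"]
            for fragment in location["fragments"]
        ]
    # Continuous feature: a single start/end pair.
    return [(feature["start"], feature["end"])]


def covered_length(feature: dict) -> int:
    """Number of residues covered, counting both ends of each fragment."""
    return sum(end - start + 1 for start, end in feature_fragments(feature))


# The two features from the text, written as Python dictionaries.
single_domain = {"accession": "PF14826", "start": 19, "end": 181}
discontinuous_domain = {
    "accession": "PF08644",
    "locations": [
        {"fragments": [{"start": 520, "end": 616}, {"start": 725, "end": 810}]}
    ],
}

single_parts = feature_fragments(single_domain)
split_parts = feature_fragments(discontinuous_domain)
```

Normalising both forms to the same list makes it straightforward to check that features do not extend past the end of the sequence before pasting them into the tool.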

Recommended shapes:

  • Family/domain: rectangle with curved ends (roundRectangle)

  • Repeat/motif: rectangle

  • Other sequence motifs (e.g. signal peptides, low complexity regions, coiled-coils and transmembrane regions): rectangle

  • disulphide bridges: bridge

  • signal peptide: diamond

_images/nightingale_dom_graph.png

Example of a domain visualisation using Nightingale v4.

For more information about how to use the Nightingale component, you can have a look at its documentation.

Domain graphics tool

The domain graphics tool provides graphical representations of the features found within a sequence, termed domain graphics. There are a variety of different shapes and styles and each one has a particular meaning. This page gives an in-depth description of the elements of the Domain graphics library. Please note that we no longer recommend using this tool; use the Domain visualisation using Nightingale instead.

The library that generates the images in this page uses a JSON string to describe the domain graphic.

You can generate your own graphics using the domain graphics library available on GitHub.

The sequence

The base sequence, undecorated by any domains or features, is represented by a plain grey bar:

_images/seq.png
{
  "length" : "400"
}

The length of the domain graphic that is drawn is proportional to the length of the sequence itself. Any domains or features which are drawn on the sequence are also scaled by the same factor.
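
The scaling rule can be sketched in a few lines. This is an illustration only: it assumes a simple linear mapping from residue positions to horizontal pixel positions, and the function name, the image_width parameter and the output keys are hypothetical rather than taken from the domain graphics library.

```python
def scale_to_pixels(description: dict, image_width: float) -> list[dict]:
    """Scale region coordinates by the same factor as the sequence itself."""
    # One factor for the whole graphic: pixels per residue.
    factor = image_width / int(description["length"])
    scaled = []
    for region in description.get("regions", []):
        scaled.append(
            {
                "text": region.get("text", ""),
                "x_start": int(region["start"]) * factor,
                "x_end": int(region["end"]) * factor,
            }
        )
    return scaled


# A 400-residue sequence drawn 800 pixels wide: 2 pixels per residue.
description = {
    "length": "400",
    "regions": [{"text": "Domain", "start": "40", "end": "200"}],
}
scaled_regions = scale_to_pixels(description, 800)
```

Because every feature uses the same factor, relative positions and sizes are preserved whatever width the image is drawn at.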

Pfam-A

The high quality, curated Pfam-A domains are classified into one of six different types: family, domain, coiled-coil, disordered, repeat and motif (for more details see Summary). These different classification types are rendered slightly differently.

Family/domain

It is possible for a sequence to match either the full length of a Pfam HMM (a full length match), or to match a portion of an HMM (a fragment match). The two types of match are rendered differently.

Both family and domain entries are rendered as rectangles with curved ends when the sequence is a full length match. Different types of domain are displayed with different colours. When the domain image is long enough, the domain name is shown within the domain itself. In most cases, you can click on the domains to visit the “family page” for that domain. Moving the mouse over the domain image should also display a tooltip showing the domain name, as well as the start and end positions of the domain.

_images/seqDomain.png
{
  "length" : "400",
  "regions" : [
    {
      "type" : "pfama",
      "text" : "Domain",
      "colour" : "#9999ff",
      "display": "true",
      "startStyle" : "curved",
      "endStyle" : "curved",
      "start" : "40",
      "end" : "200",
      "aliStart" : "50",
      "aliEnd" : "175"
    },
    {
      "type" : "pfama",
      "text" : "LongFamilyNamesNotShown",
      "colour" : "#399",
      "display" : true,
      "startStyle" : "straight",
      "endStyle" : "straight",
      "start" : "210",
      "end" : "250",
      "aliStart" : "215",
      "aliEnd" : "245"
    }
  ]
}

From Pfam 24.0 onwards, Pfam has been generated using HMMER3, which introduces the concept of “envelope coordinates” for a match. Envelope regions are represented in domain graphics as lighter coloured regions. The graphic above shows short envelope regions at the ends of both domains.
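
To make the lighter regions concrete, the sketch below computes the stretches that lie inside the envelope but outside the alignment. It assumes that start and end hold the envelope coordinates and that aliStart and aliEnd hold the alignment coordinates, which is how the example above reads; the function name and the exact treatment of boundary residues are illustrative assumptions.

```python
def envelope_flanks(region: dict) -> list[tuple[int, int]]:
    """Return envelope-only stretches on either side of the aligned core."""
    start, end = int(region["start"]), int(region["end"])
    # Without alignment coordinates, treat the whole region as aligned.
    ali_start = int(region.get("aliStart", start))
    ali_end = int(region.get("aliEnd", end))
    flanks = []
    if ali_start > start:
        flanks.append((start, ali_start - 1))
    if ali_end < end:
        flanks.append((ali_end + 1, end))
    return flanks


# Coordinates of the two regions in the example above.
first = {"start": "40", "end": "200", "aliStart": "50", "aliEnd": "175"}
second = {"start": "210", "end": "250", "aliStart": "215", "aliEnd": "245"}

first_flanks = envelope_flanks(first)
second_flanks = envelope_flanks(second)
```

For the first region this gives the flanks 40-49 and 176-200, which would correspond to the lighter ends of the domain in the graphic.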

When the sequence does not match the full length of the HMM that models a Pfam entry, matching domain fragments are shown. When a sequence match does not pass through the first position in the HMM, the N-terminal side of the domain graphic is drawn with a jagged edge instead of a curved edge. Similarly, when a sequence match does not pass through the last position of the HMM, the C-terminal side of the domain graphic is drawn with a jagged edge. In some rarer cases, the sequence match may not pass through either of the first or last positions of the HMM, in which case both sides are drawn with jagged edges. Examples of all three cases are shown below.

_images/partial.png
{
  "length" : "400",
  "regions" : [
    {
      "type" : "pfama",
      "text" : "PartN",
      "colour" : "#9999ff",
      "display": "true",
      "startStyle" : "jagged",
      "endStyle" : "curved",
      "start" : "10",
      "end" : "110"
    },
    {
      "type" : "pfama",
      "text" : "PartN_C",
      "colour" : "#399",
      "display" : true,
      "startStyle" : "jagged",
      "endStyle" : "jagged",
      "start" : "115",
      "end" : "204"
    },
    {
      "type" : "pfama",
      "text" : "PartC",
      "colour" : "#1fc01f",
      "display" : true,
      "startStyle" : "curved",
      "endStyle" : "jagged",
      "start" : "210",
      "end" : "350"
    }
  ]
}
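
The choice between curved and jagged edges follows directly from where the match starts and ends on the HMM. The sketch below is one way to express that rule; the parameter names hmm_from, hmm_to and model_length are hypothetical, and the library itself may derive the styles differently.

```python
def edge_styles(hmm_from: int, hmm_to: int, model_length: int,
                full_style: str = "curved") -> dict:
    """Pick startStyle and endStyle for a match on a profile HMM.

    A side is jagged when the match does not reach the corresponding
    end of the model; otherwise it uses the style for a full-length
    match ("curved" for families and domains, "straight" for repeats
    and motifs).
    """
    return {
        "startStyle": full_style if hmm_from == 1 else "jagged",
        "endStyle": full_style if hmm_to == model_length else "jagged",
    }


# A match missing the first four positions of a 100-position model.
n_terminal_fragment = edge_styles(5, 100, 100)
```

Passing full_style="straight" gives the corresponding styles for repeat and motif entries.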
Repeat/motif

Repeats and motifs are types of Pfam domain which do not form independently folded units. In order to distinguish them from domains of type family and domain, repeats and motifs are represented by rectangles with straight edges. As for families and domains, partial matches are represented with jagged edges.

_images/repeat.png
{
  "length" : "200",
  "regions" : [
    {
      "type" : "pfama",
      "text" : "HEAT",
      "colour" : "#1fc01f",
      "display": "true",
      "startStyle" : "straight",
      "endStyle" : "straight",
      "start" : "2",
      "end" : "34"
    },
    {
      "type" : "pfama",
      "text" : "HEAT",
      "colour" : "#1fc01f",
      "display": "true",
      "startStyle" : "straight",
      "endStyle" : "straight",
      "start" : "82",
      "end" : "118"
    },
    {
      "type" : "pfama",
      "text" : "HEAT",
      "colour" : "#1fc01f",
      "display": "true",
      "startStyle" : "straight",
      "endStyle" : "straight",
      "start" : "120",
      "end" : "155"
    },
    {
      "type" : "pfama",
      "text" : "HEAT",
      "colour" : "#1fc01f",
      "display": "true",
      "startStyle" : "straight",
      "endStyle" : "straight",
      "start" : "159",
      "end" : "195"
    }
  ]
}

Discontinuous nested domains

Some domains in Pfam are disrupted by the insertion of another domain (or domains) within them. A number of names have been given to this arrangement: discontinuous (referring to the outer domain), inserted or nested (both referring to the inner domain). For example, in many sequences containing an IMPDH domain (PF00478), the IMPDH domain is continuous along the primary sequence. However, in some cases the linear sequence of the IMPDH domain is broken by the insertion of a CBS domain (PF00571), as shown below.

Where three-dimensional structures are available for representatives of a Pfam domain, it is generally clear that the three-dimensional arrangement of the domain containing the nested domain is maintained. Typically the nested domain is found inserted within a surface exposed loop, having little or no effect on the structure of the other domain. Such an arrangement explains why and how these nested domains can be functionally tolerated.

To represent this arrangement of domains graphically, the discontinuous domain is represented in two parts (as shown below). These two parts are joined by a line bridging them.

_images/nested.png
{
  "length" : "200",
  "regions" : [
    {
      "type" : "pfama",
      "text" : "IMPDH",
      "colour" : "#1fc01f",
      "display": "true",
      "startStyle" : "curved",
      "endStyle" : "jagged",
      "start" : "5",
      "end" : "80"
    },
    {
      "type" : "pfama",
      "text" : "CBS",
      "colour" : "#c00f0f",
      "display": "true",
      "startStyle" : "curved",
      "endStyle" : "curved",
      "start" : "81",
      "end" : "135"
    },
    {
      "type" : "pfama",
      "text" : "IMPDH",
      "colour" : "#1fc01f",
      "display": "true",
      "startStyle" : "jagged",
      "endStyle" : "curved",
      "start" : "136",
      "end" : "197"
    }
  ],
  "markups" : [
    {
      "type" : "Nested",
      "colour" : "#000000",
      "display" : true,
      "v_align" : "top",
      "start" : "76",
      "end" : "136"
    }
  ]
}

Other sequence motifs

In addition to domains, smaller sequence motifs are represented by the domain graphics. Currently the following motifs are represented: signal peptides, low complexity regions, coiled-coils and transmembrane regions. These usually take lower priority than other regions that are drawn and they are therefore often obscured by, for example, a Pfam-A graphic being drawn over the top of them. An example of each motif is shown below.

_images/motifs.png
{
  "length" : "200",
  "motifs" : [
    {
      "type" : "sig_p",
      "colour" : "#ff9c00",
      "display" : true,
      "start" : 1,
      "end" : 27
    },
    {
      "type" : "low_complexity",
      "colour" : "#0FF",
      "display" : true,
      "start" : 39,
      "end" : 47
    },
    {
      "type" : "low_complexity",
      "colour" : "#0FF",
      "display" : true,
      "start" : 67,
      "end" : 76
    },
    {
      "type" : "coiled_coil",
      "colour" : "#9cff00",
      "display" : true,
      "start" : 103,
      "end" : 123
    },
    {
      "type" : "transmembrane",
      "colour" : "#F00",
      "display" : true,
      "start" : 155,
      "end" : 175
    },
    {
      "type" : "transmembrane",
      "colour" : "#F00",
      "display" : true,
      "start" : 180,
      "end" : 195
    }
  ]
}

Signal peptides

Signal peptides are short regions (<60 residues long) found at the N-terminus of proteins, which direct the post-translational transport of a protein and are subsequently removed by peptidases. More specifically, a signal peptide is characterised by a short hydrophobic helix (approximately 7-15 residues). This helix is preceded by a slightly positively charged region of highly variable length (approximately 1-12 residues). Between the hydrophobic helix and the cleavage site is a somewhat polar and uncharged region of between 3 and 8 amino acids. In InterPro, we use Phobius and SignalP for the prediction of signal peptides, and they can be represented graphically by a small orange box.

Low complexity regions

Low complexity regions are regions of biased sequence composition, usually comprised of different types of repeats. These regions have been shown to be functionally important in some proteins, but they are generally not well understood and are masked out to focus on globular domains within the protein.

The presence of a low complexity region can be indicated by a cyan rectangle.

Disordered regions

We use MobiDB-lite for the prediction of disordered regions in the query sequence.

Coiled-coils

Coiled coils are motifs found in proteins that structurally form alpha-helices that wrap or wind around each other. Normally, two to three helices are involved, but cases of up to seven alpha-helices have been reported. Coiled-coils are found in a wide variety of proteins, many functionally very important. In InterPro they are obtained from COILS.

Coiled-coils can be represented by a small lime-green rectangle.

Transmembrane regions

Integral membrane proteins contain one or more transmembrane regions that are comprised of an alpha-helix that passes through or “spans” a membrane. Transmembrane helices are quite variable in length, with the average being about 20 amino-acids in length. Phobius and TMHMM are used for the annotation of transmembrane regions, which can be represented by a red rectangle.

Other Sequence features

Below is a demonstration of how disulphide bridges and active site residues can be represented. Each of these features can appear above or below the sequence, but in the example below the disulphide bridges are shown above the sequence and the active site residues below it.

_images/activeSite.png
{
  "length" : "400",
  "regions" : [
    {
      "colour" : "#1fc01f",
      "endStyle" : "curved",
      "startStyle" : "curved",
      "display" : true,
      "end" : "104",
      "href" : "/family/Inhibitor_I29",
      "text" : "Inhibitor_I29",
      "metadata" : {
        "scoreName" : "e-value",
        "score" : "1.3e-38",
        "description" : "Inhibitor_I29",
        "accession" : "PF08246",
        "end" : "104",
        "database" : "pfam",
        "identifier" : "Inhibitor_I29",
        "type" : "Domain",
        "start" : "48"
      },
      "type" : "pfama",
      "start" : "48"
    },
    {
      "colour" : "#c00f0f",
      "endStyle" : "curved",
      "startStyle" : "curved",
      "display" : true,
      "end" : "343",
      "href" : "/family/Peptidase_C1",
      "text" : "Peptidase_C1",
      "modelLength" : "307",
      "metadata" : {
        "scoreName" : "e-value",
        "score" : "1.3e-38",
        "description" : "Peptidase_C1",
        "accession" : "PF00112",
        "end" : "343",
        "database" : "pfam",
        "identifier" : "Peptidase_C1",
        "type" : "Domain",
        "start" : "134"
      },
      "type" : "pfama",
      "start" : "134"
    }
  ],
  "markups" : [
    {
      "lineColour" : "#CCC",
      "colour" : "#CCC",
      "display" : true,
      "end" : "196",
      "v_align" : "top",
      "metadata" : {
        "database" : "pfam",
        "type" : "Disulphide, 155-196",
        "end" : "196",
        "start" : "155"
      },
      "type" : "Disulphide",
      "start" : "155"
    },
    {
      "lineColour" : "#CCC",
      "colour" : "#CCC",
      "display" : true,
      "end" : "228",
      "v_align" : "top",
      "metadata" : {
        "database" : "pfam",
        "type" : "Disulphide, 189-228",
        "end" : "228",
        "start" : "189"
      },
      "type" : "Disulphide",
      "start" : "189"
    },
    {
      "lineColour" : "#CCC",
      "colour" : "#CCC",
      "display" : true,
      "end" : "333",
      "v_align" : "top",
      "metadata" : {
        "database" : "pfam",
        "type" : "Disulphide, 286-333",
        "end" : "333",
        "start" : "286"
      },
      "type" : "Disulphide",
      "start" : "286"
    },
    {
      "lineColour" : "#000",
      "colour" : "#F36",
      "display" : true,
      "residue" : "C",
      "headStyle" : "diamond",
      "v_align" : "bottom",
      "type" : "Active site",
      "metadata" : {
        "database" : "pfam",
        "description" : "Active site, C158",
        "start" : "158"
      },
      "start" : "158"
    },
    {
      "lineColour" : "#000",
      "colour" : "#90C",
      "display" : true,
      "residue" : "H",
      "headStyle" : "diamond",
      "v_align" : "bottom",
      "type" : "Pfam predicted active site, H292",
      "metadata" : {
        "database" : "pfam",
        "description" : "Pfam predicted active site, H292",
        "start" : "292"
      },
      "start" : "292"
    },
    {
      "lineColour" : "#000",
      "colour" : "#F6F",
      "display" : true,
      "residue" : "N",
      "headStyle" : "diamond",
      "v_align" : "bottom",
      "type" : "Pfam predicted active site, N308",
      "metadata" : {
        "database" : "pfam",
        "description" : "Pfam predicted active site, N308",
        "start" : "308"
      },
      "start" : "308"
    }
  ],
  "motifs" : [
    {
      "colour" : "#ff9c00",
      "metadata" : {
        "database" : "seq",
        "type" : "Signal peptide",
        "end" : "26",
        "start" : "1"
      },
      "type" : "sig_p",
      "display" : true,
      "end" : 26,
      "start" : 1
    }
  ]
}
Disulphide bridges

Disulphide bridges play a fundamental role in the folding and stability of some proteins. They are formed by covalent bonding between the thiol groups from two cysteine residues. The disulphide bridge annotations can be represented by a solid bridge-shaped line. When multiple disulphide bonds occur, the heights of the bridges are adjusted to avoid overlaps between them. Inter-protein disulphides are represented by single vertical lines. Moving the mouse over the “bridge graphic” shows the details of the bond in a tooltip.
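The height adjustment described above can be pictured as a small interval-levelling problem. The following is a minimal sketch, not the domain graphics library's actual algorithm: the function name assign_bridge_levels and the greedy strategy are assumptions made for illustration. Each bridge is given the lowest level at which it does not overlap a bridge that has already been placed.

```python
def assign_bridge_levels(bridges):
    """Map each (start, end) bridge to the lowest non-overlapping level.

    Hypothetical helper: a greedy sketch of "adjust heights to avoid
    overlaps", not the code used by the Pfam graphics library.
    """
    levels = {}
    placed = []  # (start, end, level) for bridges handled so far
    for start, end in sorted(bridges):
        # levels already taken by bridges that overlap this one
        used = {lvl for s, e, lvl in placed if start <= e and s <= end}
        level = 0
        while level in used:
            level += 1
        placed.append((start, end, level))
        levels[(start, end)] = level
    return levels


# The three disulphide bridges from the example description above
bridges = [(155, 196), (189, 228), (286, 333)]
for bridge, level in assign_bridge_levels(bridges).items():
    print(bridge, "-> level", level)
```

With the example coordinates, the bridge 189-228 overlaps 155-196 and is therefore raised one level, while 286-333 can stay at the base level.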

Active site residues

Within an enzyme, a small number of residues are directly involved in catalysis of a reaction. These are termed active site residues. Within Pfam there are three categories of active site: those that are experimentally determined, those that are predicted by UniProt and those predicted by Pfam. All three types can be represented by a “lollipop” with a diamond head. The head is coloured red, pink and purple for each of the three types respectively.

“Lollipops”

A wide range of lollipop styles can be created by combining different line and head colours with different drawing styles. The lollipop head can be drawn as a square, circle or diamond, as a simple coloured bar, or as an arrow (pointing away from the sequence) or a “pointer” (an arrow pointing towards the sequence).

_images/lollipop.png
{
  "length" : "200",
  "markups" : [
    {
      "lineColour" : "#666",
      "colour" : "#F36",
      "display" : true,
      "v_align" : "top",
      "headStyle" : "square",
      "type" : "Red square, above sequence",
      "start" : "20"
    },
    {
      "lineColour" : "#F00",
      "colour" : "#F0F",
      "display" : true,
      "v_align" : "bottom",
      "headStyle" : "square",
      "type" : "Purple square, red line, below sequence",
      "start" : "40"
    },
    {
      "lineColour" : "#666",
      "colour" : "#F00",
      "display" : true,
      "v_align" : "top",
      "headStyle" : "diamond",
      "type" : "Red diamond, above sequence",
      "start" : "60"
    },
    {
      "lineColour" : "#666",
      "colour" : "#0F0",
      "display" : true,
      "v_align" : "bottom",
      "headStyle" : "circle",
      "type" : "Green circle, below sequence",
      "start" : "80"
    },
    {
      "lineColour" : "#666",
      "colour" : "#0F0",
      "display" : true,
      "v_align" : "top",
      "headStyle" : "arrow",
      "type" : "Green arrow, above sequence",
      "start" : "100"
    },
    {
      "lineColour" : "#666",
      "colour" : "#08F",
      "display" : true,
      "v_align" : "bottom",
      "headStyle" : "pointer",
      "type" : "Blue pointer, below sequence",
      "start" : "120"
    },
    {
      "lineColour" : "#666",
      "colour" : "#F80",
      "display" : true,
      "v_align" : "top",
      "headStyle" : "line",
      "type" : "Orange line, above sequence",
      "start" : "140"
    }
  ]
}
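Descriptions like the one above can also be generated programmatically. The sketch below is illustrative only: make_lollipop and HEAD_STYLES are hypothetical names, and the set of accepted head styles is simply taken from the values that appear in the example. It builds one markup entry per lollipop and serialises the result as JSON.

```python
import json

# Head styles that appear in the example description above
HEAD_STYLES = {"square", "circle", "diamond", "line", "arrow", "pointer"}


def make_lollipop(start, head_style, colour, line_colour="#666",
                  v_align="top", label=""):
    """Build one markup entry (hypothetical helper, not a library call)."""
    if head_style not in HEAD_STYLES:
        raise ValueError("unknown head style: " + head_style)
    if v_align not in ("top", "bottom"):
        raise ValueError("v_align must be 'top' or 'bottom'")
    return {
        "lineColour": line_colour,
        "colour": colour,
        "display": True,
        "v_align": v_align,
        "headStyle": head_style,
        "type": label,
        "start": str(start),  # the examples store coordinates as strings
    }


description = {
    "length": "200",
    "markups": [
        make_lollipop(20, "square", "#F36", label="Red square, above sequence"),
        make_lollipop(80, "circle", "#0F0", v_align="bottom",
                      label="Green circle, below sequence"),
    ],
}
print(json.dumps(description, indent=2))
```

Validating the head style and alignment up front means that a typo is reported when the description is built rather than when the graphic is drawn.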
Tooltips

If appropriate metadata are present in the sequence description, the domain graphics library can also add tooltips to the image. The example below shows a domain graphic whose description includes the metadata needed to generate tooltips.

_images/tooltip.png
{
  "length" : "950",
  "regions" : [
    {
      "modelStart" : "5",
      "modelEnd" : "292",
      "colour" : "#2dcf00",
      "endStyle" : "jagged",
      "startStyle" : "jagged",
      "display" : true,
      "end" : "361",
      "aliEnd" : "361",
      "href" : "/family/PF00082",
      "text" : "Peptidase_S8",
      "modelLength" : "307",
      "metadata" : {
        "scoreName" : "e-value",
        "score" : "1.3e-38",
        "description" : "Subtilase family",
        "accession" : "PF00082",
        "end" : "587",
        "database" : "pfam",
        "aliEnd" : "573",
        "identifier" : "Peptidase_S8",
        "type" : "Domain",
        "aliStart" : "163",
        "start" : "159"
      },
      "type" : "pfama",
      "aliStart" : "163",
      "start" : "159"
    },
    {
      "modelStart" : "5",
      "modelEnd" : "292",
      "colour" : "#2dcf00",
      "endStyle" : "jagged",
      "startStyle" : "jagged",
      "display" : true,
      "end" : "587",
      "aliEnd" : "573",
      "href" : "/family/PF00082",
      "text" : "Peptidase_S8",
      "modelLength" : "307",
      "metadata" : {
        "scoreName" : "e-value",
        "score" : "1.3e-38",
        "description" : "Subtilase family",
        "accession" : "PF00082",
        "end" : "587",
        "database" : "pfam",
        "aliEnd" : "573",
        "identifier" : "Peptidase_S8",
        "type" : "Domain",
        "aliStart" : "163",
        "start" : "159"
      },
      "type" : "pfama",
      "aliStart" : "470",
      "start" : "470"
    },
    {
      "modelStart" : "12",
      "modelEnd" : "100",
      "colour" : "#ff5353",
      "endStyle" : "curved",
      "startStyle" : "jagged",
      "display" : true,
      "end" : "469",
      "aliEnd" : "469",
      "href" : "/family/PF02225",
      "text" : "PA",
      "modelLength" : "100",
      "metadata" : {
        "scoreName" : "e-value",
        "score" : "7.1e-09",
        "description" : "PA domain",
        "accession" : "PF02225",
        "end" : "469",
        "database" : "pfam",
        "aliEnd" : "469",
        "identifier" : "PA",
        "type" : "Family",
        "aliStart" : "385",
        "start" : "362"
      },
      "type" : "pfama",
      "aliStart" : "385",
      "start" : "362"
    },
    {
      "modelStart" : "1",
      "modelEnd" : "112",
      "colour" : "#5b5bff",
      "endStyle" : "curved",
      "startStyle" : "curved",
      "display" : true,
      "end" : "726",
      "aliEnd" : "726",
      "href" : "/family/PF06280",
      "text" : "DUF1034",
      "modelLength" : "112",
      "metadata" : {
        "scoreName" : "e-value",
        "score" : "1.1e-13",
        "description" : "Fn3-like domain (DUF1034)",
        "accession" : "PF06280",
        "end" : "726",
        "database" : "pfam",
        "aliEnd" : "726",
        "identifier" : "DUF1034",
        "type" : "Domain",
        "aliStart" : "613",
        "start" : "613"
      },
      "type" : "pfama",
      "aliStart" : "613",
      "start" : "613"
    }
  ],
  "markups" : [
    {
      "lineColour" : "#ff0000",
      "colour" : "#000000",
      "display" : true,
      "end" : "470",
      "v_align" : "top",
      "metadata" : {
        "database" : "pfam",
        "type" : "Link between discontinuous regions",
        "end" : "470",
        "start" : "361"
      },
      "type" : "Nested",
      "start" : "361"
    },
    {
      "lineColour" : "#333333",
      "colour" : "#e469fe",
      "display" : true,
      "residue" : "S",
      "headStyle" : "diamond",
      "v_align" : "top",
      "type" : "Pfam predicted active site",
      "metadata" : {
        "database" : "pfam",
        "description" : "S Pfam predicted active site",
        "start" : "538"
      },
      "start" : "538"
    },
    {
      "lineColour" : "#333333",
      "colour" : "#e469fe",
      "display" : true,
      "residue" : "D",
      "headStyle" : "diamond",
      "v_align" : "top",
      "type" : "Pfam predicted active site",
      "metadata" : {
        "database" : "pfam",
        "description" : "D Pfam predicted active site",
        "start" : "185"
      },
      "start" : "185"
    },
    {
      "lineColour" : "#333333",
      "colour" : "#e469fe",
      "display" : true,
      "residue" : "H",
      "headStyle" : "diamond",
      "v_align" : "top",
      "type" : "Pfam predicted active site",
      "metadata" : {
        "database" : "pfam",
        "description" : "H Pfam predicted active site",
        "start" : "235"
      },
      "start" : "235"
    }
  ],
  "metadata" : {
    "database" : "uniprot",
    "identifier" : "Q560V8_CRYNE",
    "organism" : "Cryptococcus neoformans (Filobasidiella neoformans)",
    "description" : "Putative uncharacterized protein",
    "taxid" : "5207",
    "accession" : "Q560V8"
  },
  "motifs" : [
    {
      "colour" : "#ffa500",
      "metadata" : {
        "database" : "Phobius",
        "type" : "sig_p",
        "end" : "23",
        "start" : "1"
      },
      "type" : "sig_p",
      "display" : true,
      "end" : 23,
      "start" : 1
    },
    {
      "colour" : "#00ffff",
      "metadata" : {
        "database" : "seg",
        "type" : "low_complexity",
        "score" : "2.5100",
        "end" : "21",
        "start" : "3"
      },
      "type" : "low_complexity",
      "display" : false,
      "end" : 21,
      "start" : 3
    },
    {
      "colour" : "#86bcff",
      "metadata" : {
        "database" : "seg",
        "type" : "low_complexity",
        "score" : "1.4900",
        "end" : "156",
        "start" : "134"
      },
      "type" : "low_complexity",
      "display" : true,
      "end" : "156",
      "start" : "134"
    },
    {
      "colour" : "#00ffff",
      "metadata" : {
        "database" : "seg",
        "type" : "low_complexity",
        "score" : "2.0200",
        "end" : "187",
        "start" : "173"
      },
      "type" : "low_complexity",
      "display" : false,
      "end" : "187",
      "start" : "173"
    },
    {
      "colour" : "#00ffff",
      "metadata" : {
        "database" : "seg",
        "type" : "low_complexity",
        "score" : "2.0800",
        "end" : "218",
        "start" : "207"
      },
      "type" : "low_complexity",
      "display" : false,
      "end" : "218",
      "start" : "207"
    },
    {
      "colour" : "#00ffff",
      "metadata" : {
        "database" : "seg",
        "type" : "low_complexity",
        "score" : "2.1300",
        "end" : "231",
        "start" : "220"
      },
      "type" : "low_complexity",
      "display" : false,
      "end" : "231",
      "start" : "220"
    },
    {
      "colour" : "#00ffff",
      "metadata" : {
        "database" : "seg",
        "type" : "low_complexity",
        "score" : "2.0000",
        "end" : "554",
        "start" : "538"
      },
      "type" : "low_complexity",
      "display" : false,
      "end" : "554",
      "start" : "538"
    },
    {
      "colour" : "#86bcff",
      "metadata" : {
        "database" : "seg",
        "type" : "low_complexity",
        "score" : "1.9100",
        "end" : "590",
        "start" : "578"
      },
      "type" : "low_complexity",
      "display" : true,
      "end" : "590",
      "start" : 588
    },
    {
      "colour" : "#00ffff",
      "metadata" : {
        "database" : "seg",
        "type" : "low_complexity",
        "score" : "1.7600",
        "end" : "831",
        "start" : "822"
      },
      "type" : "low_complexity",
      "display" : false,
      "end" : "831",
      "start" : "822"
    }
  ]
}
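To make the role of the metadata blocks concrete, the sketch below assembles a one-line tooltip string from the metadata of a single feature. The function name tooltip_text and the layout of the resulting string are assumptions made for illustration; the text that the domain graphics library actually displays may differ.

```python
def tooltip_text(feature):
    """Return a one-line summary of a region, motif or markup.

    Hypothetical helper: returns None when the feature carries no
    "metadata" block, in which case no tooltip could be built.
    """
    meta = feature.get("metadata")
    if not meta:
        return None
    title = meta.get("identifier") or meta.get("description") or meta.get("type", "")
    if "accession" in meta:
        title += " (" + meta["accession"] + ")"
    pieces = [title]
    if "start" in meta:
        span = str(meta["start"])
        if "end" in meta:
            span += "-" + str(meta["end"])
        pieces.append("residues " + span)
    if "score" in meta and "scoreName" in meta:
        pieces.append(meta["scoreName"] + " " + str(meta["score"]))
    return ", ".join(pieces)


# Metadata copied from the PA region in the example above
pa_region = {
    "type": "pfama",
    "metadata": {
        "scoreName": "e-value",
        "score": "7.1e-09",
        "description": "PA domain",
        "accession": "PF02225",
        "end": "469",
        "database": "pfam",
        "identifier": "PA",
        "type": "Family",
        "start": "362",
    },
}
print(tooltip_text(pa_region))
```

A markup without a metadata block, such as the lollipops in the previous section, yields None here, which mirrors the requirement that metadata must be present for a tooltip to be generated.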

Querying Pfam using the InterPro API

This section introduces the use of the InterPro API to retrieve Pfam annotations. A programmatic interface, commonly called an Application Programming Interface (API), allows users to write scripts or programs to access data, rather than having to rely on a browser to view a site.

Basic concepts

URLs

A RESTful service typically sends and receives data over HTTP, the same protocol that’s used by websites and browsers. As such, the services provided through a RESTful interface are identified using URLs.

The InterPro website uses different URLs for the standard HTML representation of Pfam data and for the JSON representation served through the API.

To see the data for a particular Pfam-A family, you would visit the page of that entry in your browser, for example /entry/pfam/PF02171/ on the InterPro website.

To retrieve the same data in JSON format, add the api path segment in front of the entry path, for example /api/entry/pfam/PF02171/.

The response from the server will now be a JSON document, rather than an HTML page.

The table below lists example website URLs and the corresponding API URLs:

| Data | Example website url | Example API url |
|------|---------------------|-----------------|
| List all Pfam entries | /entry/pfam/#table | /api/entry/pfam/ |
| List all Pfam entries of type Family | /entry/integrated/pfam/?type=family#table | /api/entry/pfam/?type=family |
| Information about a specific Pfam entry | /entry/pfam/PF02171/ | /api/entry/pfam/PF02171/ |
| List of proteins matching a specific entry | /entry/pfam/PF02171/protein/UniProt/ | /api/protein/UniProt/entry/pfam/PF02171/ |
| Different domain architectures matching a specific entry | /entry/pfam/PF02171/domain_architecture/ | /api/entry/pfam/PF02171?ida |
| List of PDB structures matching a specific entry | /entry/pfam/PF02171/structure/PDB/ | /api/structure/PDB/entry/pfam/PF02171/ |
| Download the PDB file of the predicted structure from RoseTTAFold | /entry/pfam/PF14331/rosettafold/ | /api/entry/pfam/PF14331?model:structure |
| List all Pfam clans | /set/all/entry/pfam/#table | /api/set/pfam |
| List of Pfam entries in a specific clan | /set/pfam/CL0219/entry/pfam/ | /api/entry/pfam/set/pfam/CL0219?page_size=100 |
| General information about a specific clan | /set/pfam/CL0219/ | /api/set/pfam/CL0219 |
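The API paths listed above follow a regular pattern of path segments, so they can be assembled in code. The sketch below is only an illustration: api_url is a hypothetical helper, and the API_ROOT value is taken from the absolute URLs that appear elsewhere in this documentation.

```python
from urllib.parse import quote

# Root of the API, as seen in the absolute URLs used in this documentation
API_ROOT = "https://www.ebi.ac.uk/interpro/api"


def api_url(*segments):
    """Join path segments into an API URL (hypothetical helper)."""
    path = "/".join(quote(str(segment), safe="") for segment in segments)
    return API_ROOT + "/" + path + "/"


# Information about a specific Pfam entry
print(api_url("entry", "pfam", "PF02171"))
# Proteins matching a specific entry
print(api_url("protein", "UniProt", "entry", "pfam", "PF02171"))
# General information about a specific clan
print(api_url("set", "pfam", "CL0219"))
```

Query modifiers such as ?type=family or ?page_size=100 are deliberately left out of this helper and would be appended separately.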

Available output formats

By default, the output of an API call is in JSON format. However, Text and TSV formats are also supported. To obtain the results in Text or TSV format, add ?format=txt or ?format=tsv to the API URL.
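When the URL already carries a query string (for example ?type=family), the format parameter has to be merged with it rather than appended with a second question mark. The sketch below shows one way to do this with the standard library; with_format is a hypothetical helper name, and only the two formats named above are accepted.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_format(url, fmt):
    """Return url with its format query parameter set to fmt.

    Hypothetical helper; accepts only the txt and tsv values named in
    the text above.
    """
    if fmt not in ("txt", "tsv"):
        raise ValueError("unsupported format: " + fmt)
    parts = urlsplit(url)
    # keep existing parameters, dropping any previous format value
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "format"]
    query.append(("format", fmt))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_format("https://www.ebi.ac.uk/interpro/api/entry/pfam/", "tsv"))
print(with_format("https://www.ebi.ac.uk/interpro/api/entry/pfam/?type=family", "txt"))
```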

Examples of API outputs

Pfam-A annotations

You can retrieve a subset of the data shown on a Pfam-A family page as a JSON document using the following URL: /api/entry/pfam/PF02171

HTTP 200 OK
Allow: GET, HEAD
Content-Type: application/json
InterPro-Version: 94.0
InterPro-Version-Minor: 3
Vary: Accept

{
    "metadata": {
        "accession": "PF02171",
        "entry_id": null,
        "type": "family",
        "go_terms": null,
        "source_database": "pfam",
        "member_databases": null,
        "integrated": "IPR003165",
        "hierarchy": null,
        "name": {
            "name": "Piwi domain",
            "short": "Piwi"
        },
        "description": [
            "<p>This domain is found in the protein Piwi and its relatives.  The function of this domain is the dsRNA guided hydrolysis of ssRNA. Determination of the crystal structure of Argonaute reveals that PIWI is an RNase H domain, and identifies Argonaute as Slicer, the enzyme that cleaves mRNA in the RNAi RISC complex [[cite:PUB00020128]].  In addition, Mg+2 dependence and production of 3'-OH and 5' phosphate products are shared characteristics of RNaseH and RISC. The PIWI domain core has a tertiary structure belonging to the RNase H family of enzymes.  RNase H fold proteins all have a five-stranded mixed beta-sheet surrounded by helices. By analogy to RNase H enzymes which cleave single-stranded RNA guided by the DNA strand in an RNA/DNA hybrid, the PIWI domain can be inferred to cleave single-stranded RNA, for example mRNA, guided by double stranded siRNA.</p>"
        ],
        "wikipedia": {
            "title": "Argonaute",
            "extract": "<p>The <b>Argonaute</b> protein family, first discovered for its evolutionarily conserved stem cell function, plays a central role in RNA silencing processes as essential components of the RNA-induced silencing complex (RISC). RISC is responsible for the gene silencing phenomenon known as RNA interference (RNAi). Argonaute proteins bind different classes of small non-coding RNAs, including microRNAs (miRNAs), small interfering RNAs (siRNAs) and Piwi-interacting RNAs (piRNAs). Small RNAs guide Argonaute proteins to their specific targets through sequence complementarity, which then leads to mRNA cleavage, translation inhibition, and/or the initiation of mRNA decay.</p>",
            "thumbnail": "iVBORw0KGgoAAAANSUhEUgAAAUAAAAERCAYAAAAKQn74AAAABmJLR0QA/wD/AP+plGSa52gioXAaa3IEVq0YtzeLBALwLopkkEqrtLV1UFV1Bi5X3tc2hQbOd2BxHJxefuXrDIOykhL2rFvHJ+3tlE6fzkljxnD1+..."
        },
        "literature": {
            "PUB00020128": {
                "PMID": 15284453,
                "ISBN": null,
                "volume": "305",
                "issue": "5689",
                "year": 2004,
                "title": "Crystal structure of Argonaute and its implications for RISC slicer activity.",
                "URL": null,
                "raw_pages": "1434-7",
                "medline_journal": "Science",
                "ISO_journal": "Science",
                "authors": [
                    "Song JJ",
                    "Smith SK",
                    "Hannon GJ",
                    "Joshua-Tor L."
                ],
                "DOI_URL": "http://dx.doi.org/10.1126/science.1102514"
            },
            "PUB00018283": {
                "PMID": 11050429,
                "ISBN": null,
                "volume": "25",
                "issue": "10",
                "year": 2000,
                "title": "Domains in gene silencing and cell differentiation proteins: the novel PAZ domain and redefinition of the Piwi domain.",
                "URL": null,
                "raw_pages": "481-2",
                "medline_journal": "Trends Biochem Sci",
                "ISO_journal": "Trends Biochem. Sci.",
                "authors": [
                    "Cerutti L",
                    "Mian N",
                    "Bateman A."
                ],
                "DOI_URL": "http://dx.doi.org/10.1016/S0968-0004(00)01641-8"
            }
        },
        "set_info": {
            "accession": "CL0219",
            "name": "RNase_H"
        },
        "overlaps_with": null,
        "counters": {
            "subfamilies": 0,
            "domain_architectures": 602,
            "interactions": 0,
            "matches": 29456,
            "pathways": 0,
            "proteins": 28420,
            "proteomes": 2266,
            "sets": 1,
            "structural_models": {
                "alphafold": 21504,
                "rosettafold": 0
            },
            "structures": 112,
            "taxa": 9708
        },
        "entry_annotations": {
            "hmm": 0,
            "logo": 0,
            "alignment:uniprot": 22300,
            "alignment:full": 14038,
            "alignment:seed": 15
        },
        "cross_references": {}
    }
}
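Once the JSON document has been parsed, the fields of interest can be picked out of the metadata object. The sketch below works on a trimmed copy of the response shown above; summarise_entry is a hypothetical helper, and the selection of fields is only an example.

```python
# Trimmed copy of the response above (only the fields used below)
sample = {
    "metadata": {
        "accession": "PF02171",
        "type": "family",
        "integrated": "IPR003165",
        "name": {"name": "Piwi domain", "short": "Piwi"},
        "set_info": {"accession": "CL0219", "name": "RNase_H"},
        "counters": {"proteins": 28420, "structures": 112},
    }
}


def summarise_entry(payload):
    """Pick a few fields out of an entry response (hypothetical helper)."""
    meta = payload["metadata"]
    clan = meta.get("set_info") or {}      # null when the entry is in no clan
    counters = meta.get("counters") or {}
    return {
        "accession": meta["accession"],
        "short_name": meta["name"]["short"],
        "type": meta["type"],
        "interpro": meta.get("integrated"),
        "clan": clan.get("accession"),
        "proteins": counters.get("proteins"),
    }


print(summarise_entry(sample))
```

Several fields in the response can be null, so the helper falls back to empty dictionaries instead of indexing into them directly.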

Some Pfam families are removed or merged into others, in which case they become “dead” families. If you try to retrieve annotation information about a dead family, you will get a simple JSON document that only tells you that there is no content associated with this entry.

GET /api/entry/pfam/PF06700

HTTP 204 No Content
Allow: GET, HEAD
Content-Type: application/json
InterPro-Version: 94.0
InterPro-Version-Minor: 3
Vary: Accept

{
    "detail": "There is no data associated with the requested URL.\nList of endpoints: ['entry', 'pfam', 'PF06700']"
}
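A client therefore needs to distinguish a 204 response from a normal 200 response before reading any metadata. The sketch below shows one way to do so; entry_or_none is a hypothetical helper that takes the status code and body already obtained from whichever HTTP client is in use.

```python
import json


def entry_or_none(status, body):
    """Return the entry metadata, or None when the API reports no data.

    Hypothetical helper: status and body come from the caller's HTTP
    client. A 204 status is what the API returns for a dead family.
    """
    if status == 204:
        return None
    if status != 200:
        raise RuntimeError("unexpected HTTP status " + str(status))
    return json.loads(body)["metadata"]


# A live entry, using a trimmed body modelled on the PF02171 response
live = entry_or_none(200, '{"metadata": {"accession": "PF02171"}}')
print(live["accession"])

# A dead entry such as PF06700
print(entry_or_none(204, ""))
```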
Pfam-A family list

You can retrieve a list of all Pfam-A families in the latest Pfam release, either as a JSON document or as a tab-delimited text file, for example through the /api/entry/pfam/ URL listed in the table above. Both formats contain the Pfam-A accession, Pfam-A identifier and description:

You can also view the list in a web browser by removing the format=json parameter from the URL.

HTTP 200 OK
Allow: GET, HEAD
Content-Type: application/json
InterPro-Version: 94.0
InterPro-Version-Minor: 3
Vary: Accept

{
  "count": 19632,
  "next": "https://www.ebi.ac.uk/interpro/api/entry/all/pfam/?cursor=cD1QRjAwMDIw",
  "previous": null,
  "results": [
      {
          "metadata": {
              "accession": "PF00001",
              "name": "7 transmembrane receptor (rhodopsin family)",
              "source_database": "pfam",
              "type": "family",
              "integrated": "IPR000276",
              "member_databases": null,
              "go_terms": null
          }
      },
      ...
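The list is returned one page at a time, and the next field holds the URL of the following page (null on the last page). A minimal sketch of walking through all pages is shown below. The name iter_results is hypothetical, and the fetch argument stands for any function that takes a URL and returns the parsed JSON page; here it is replaced by made-up stand-in pages so that the example runs without network access.

```python
def iter_results(first_url, fetch):
    """Yield every item from a paginated list response.

    Hypothetical helper: fetch(url) must return the parsed JSON page.
    """
    url = first_url
    while url:
        page = fetch(url)
        yield from page["results"]
        url = page.get("next")  # None on the last page ends the loop


# Stand-in pages (not real API output), keyed by fake URLs
PAGES = {
    "page1": {
        "count": 3,
        "next": "page2",
        "previous": None,
        "results": [{"metadata": {"accession": "PF00001"}},
                    {"metadata": {"accession": "PF00002"}}],
    },
    "page2": {
        "count": 3,
        "next": None,
        "previous": "page1",
        "results": [{"metadata": {"accession": "PF00003"}}],
    },
}

accessions = [item["metadata"]["accession"]
              for item in iter_results("page1", PAGES.__getitem__)]
print(accessions)
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP handling, retries and delays between requests.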
Protein data

You can retrieve a subset of the data shown on a protein page as a JSON document using the following URL: /api/protein/uniprot/P00789

HTTP 200 OK
Allow: GET, HEAD
Content-Type: application/json
InterPro-Version: 94.0
InterPro-Version-Minor: 3
Vary: Accept

{
  "metadata": {
      "accession": "P00789",
      "id": "CANX_CHICK",
      "source_organism": {
          "taxId": "9031",
          "scientificName": "Gallus gallus",
          "fullName": "Gallus gallus (Chicken)"
      },
      "name": "Calpain-1 catalytic subunit",
      "description": [
          "Calcium-regulated non-lysosomal thiol-protease which catalyze limited proteolysis of substrates involved in cytoskeletal remodeling and signal transduction"
      ],
      "length": 705,
      "sequence": "MMPFGGIAARLQRDRLRAEGVGEHNNAVKYLNQDYEALKQECIESGTLFRDPQFPAGPTALGFKELGPYSSKTRGVEWKRPSELVDDPQFIVGGATRTDICQGALGDCWLLAAIGSLTLNEELLHRVVPHGQSFQEDYAGIFHFQIWQFGEWVDVVVDDLLPTKDGELLFVHSAECTEFWSALLEKAYAKLNGCYESLSGGSTTEGFEDFTGGVAEMYDLKRAPRNMGHIIRKALERGSLLGCSIDITSAFDMEAVTFKKLVKGHAYSVTAFKDVNYRGQQEQLIRIRNPWGQVEWTGAWSDGSSEWDNIDPSDREELQLKMEDGEFWMSFRDFMREFSRLEICNLTPDALTKDELSRWHTQVFEGTWRRGSTAGGCRNNPATFWINPQFKIKLLEEDDDPGDDEVACSFLVALMQKHRRRERRVGGDMHTIGFAVYEVPEEAQGSQNVHLKKDFFLRNQSRARSETFINLREVSNQIRLPPGEYIVVPSTFEPHKEADFILRVFTEKQSDTAELDEEISADLADEEEITEDDIEDGFKNMFQQLAGEDMEISVFELKTILNRVIARHKDLKTDGFSLDSCRNMVNLMDKDGSARLGLVEFQILWNKIRSWLTIFRQYDLDKSGTMSSYEMRMALESAGFKLNNKLHQVVVARYADAETGVDFDNFVCCLVKLETMFRFFHSMDRDGTGTAVMNLAEWLLLTMCG",
      "proteome": "UP000000539",
      "gene": null,
      "go_terms": [
          {
              "identifier": "GO:0004198",
              "name": "calcium-dependent cysteine-type endopeptidase activity",
              "category": {
                  "code": "F",
                  "name": "molecular_function"
              }
          },
          {
              "identifier": "GO:0006508",
              "name": "proteolysis",
              "category": {
                  "code": "P",
                  "name": "biological_process"
              }
          },
          {
              "identifier": "GO:0005509",
              "name": "calcium ion binding",
              "category": {
                  "code": "F",
                  "name": "molecular_function"
              }
          }
      ],
      "protein_evidence": 1,
      "source_database": "reviewed",
      "is_fragment": false,
      "ida_accession": "664e4b66bad68bfc279e99cc8deefa39a1edf04a",
      "counters": {
          "domain_architectures": 10280,
          "entries": 30,
          "isoforms": 0,
          "proteomes": 1,
          "sets": 5,
          "structures": 0,
          "taxa": 1,
          "dbEntries": {
              "prosite": 2,
              "panther": 1,
              "prints": 1,
              "profile": 2,
              "smart": 2,
              "pfam": 2,
              "cathgene3d": 3,
              "cdd": 3,
              "ssf": 3,
              "interpro": 11
          },
          "proteome": 1,
          "taxonomy": 1,
          "similar_proteins": 10280
      }
  }
}
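As an example of working with this response, the sketch below groups the GO terms of a protein by category. It uses a trimmed copy of the go_terms list shown above; go_terms_by_category is a hypothetical helper name.

```python
from collections import defaultdict

# Trimmed copy of the P00789 response above (GO terms only)
protein = {
    "metadata": {
        "accession": "P00789",
        "go_terms": [
            {"identifier": "GO:0004198",
             "category": {"code": "F", "name": "molecular_function"}},
            {"identifier": "GO:0006508",
             "category": {"code": "P", "name": "biological_process"}},
            {"identifier": "GO:0005509",
             "category": {"code": "F", "name": "molecular_function"}},
        ],
    }
}


def go_terms_by_category(payload):
    """Group GO identifiers by category name (hypothetical helper)."""
    grouped = defaultdict(list)
    # go_terms can be null in some responses, hence the fallback
    for term in payload["metadata"].get("go_terms") or []:
        grouped[term["category"]["name"]].append(term["identifier"])
    return dict(grouped)


for category, identifiers in go_terms_by_category(protein).items():
    print(category, identifiers)
```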

Sending requests using a script

Most programming languages have the ability to send HTTP requests and receive HTTP responses. A Python script to retrieve data about a Pfam family might be as trivial as this:

#!/usr/bin/env python3

# standard library modules
import json
import sys
from time import sleep
from urllib import request
from urllib.error import HTTPError

BASE_URL = "https://www.ebi.ac.uk:443/interpro/api/entry/pfam/PF02171"


def output_list():
  next_url = BASE_URL
  attempts = 0

  while next_url:
    try:
      req = request.Request(next_url, headers={"Accept": "application/json"})
      with request.urlopen(req) as res:
        if res.status == 204:
          # no data (for example a dead family), so leave the loop
          break
        payload = json.loads(res.read().decode())
      attempts = 0
    except HTTPError as e:
      if e.code == 408:
        # the API timed out on a long-running query: wait just over
        # a minute, then retry the same URL
        sleep(61)
        continue
      # for any other HTTP error, retry 3 times before failing
      if attempts < 3:
        attempts += 1
        sleep(61)
        continue
      sys.stderr.write("LAST URL: " + next_url + "\n")
      raise

    # A single-entry response has a top-level "metadata" key and no paging;
    # a list response has "results" and a "next" URL (null on the last page).
    items = payload.get("results", [payload])
    next_url = payload.get("next")

    for item in items:
      name = item["metadata"]["name"]
      # "name" is an object with a "short" field for a single entry
      # and a plain string in list responses
      sys.stdout.write((name["short"] if isinstance(name, dict) else name) + "\n")

    # Don't overload the server, give it time before asking for more
    if next_url:
      sleep(1)


if __name__ == "__main__":
  output_list()

This script prints out the short name (Piwi) for the family (PF02171).

FTP Site

The Pfam FTP site is organised into the following structure:

The most important directory is probably the current_release directory. It contains the flat-files for the current release.

AntiFam

The AntiFam directory contains the different releases of the AntiFam database, identifying spurious proteins.

RoseTTAfold_aln

The RoseTTAfold_aln directory contains the Pfam alignments used by RoseTTAFold to predict structural models.

Tools

The Tools directory contains code for running pfam_scan.pl.

The README file in this directory contains detailed information on how to install and run the script. Note that the script has a modular design, so that its functionality can easily be incorporated into other Perl scripts. The ChangeLog file lists the versions of, and changes to, pfam_scan.pl and its modules.

There is also an archived version of pfam_scan.pl that works with HMMER2. This is no longer supported.

There is also Perl code for predicting active sites found in the ActSitePred directory, the functionality of which has been rolled into the latest version of pfam_scan.pl.

current_release

This directory contains the flat-files for the current release. Some of these files may be very large (of the order of several hundred megabytes). Please check the sizes on the FTP site before trying to download them over a slow connection. The files, most of which are compressed using gzip, are:

Pfam-A.dead.gz

Listing of families that have been deleted from the database

Pfam-A.fasta.gz

A 90% non-redundant set of fasta formatted sequence for each Pfam-A family. The sequences are only the regions hit by the model and not full length protein sequences.

Pfam-A.full.gz

The full alignments of the curated families, searched against pfamseq/UniProtKB reference proteomes (prior to Pfam 29.0, this file contained matches against the whole of UniProtKB).

Pfam-A.full.uniprot.gz

The full alignments of the curated families, searched against UniProtKB.

Pfam-A.full.metagenomics.gz

The full alignments of the curated families, searched against Metagenomic proteins.

Pfam-A.full.ncbi.gz

The full alignments of the curated families, searched against NCBI GenPept proteins.

Pfam-A.hmm.dat.gz

A data file that contains information about each Pfam-A family

Pfam-A.hmm.gz

The Pfam HMM library for Pfam-A families

Pfam-A.seed.gz

The SEED alignments of the curated families. Please note that from Pfam 36.0 onwards we do not process PDB data. Hence secondary structure annotations aren’t available in the SEED alignments anymore. However, PDBe provides mappings to Pfam which might be of interest.

Pfam-C.gz

A file that contains the information about clans and the Pfam-A membership

active_site.dat.gz

Tar-ball of data required for the predictions of active sites by Pfam scan.

database.tar

A tar-ball of the database_files directory.

database_files

This directory contains two files per table from the MySQL database. The .sql.gz file contains the table structure; the .txt.gz file contains the content of the table as a tab-delimited file with fields enclosed by single quotes (').

diff.gz

Stores the change status of entries between this release and the previous one.

md5_checksums

A file containing the MD5 checksum for each release file

metaseq.gz

Metagenomic sequence database used in this release

ncbi.gz

NCBI GenPept sequence database used in this release.

pdbmap.gz

Mapping between PDB structures and Pfam domains.

pfamseq.gz

A fasta version of Pfam’s underlying sequence database

relnotes.txt

Release notes

swisspfam.gz

ASCII representation of the domain structure of UniProt proteins according to Pfam

uniprot_sprot.dat.gz

Data files from UniProt containing SwissProt annotations.

uniprot_trembl.dat.gz

Data files from UniProt containing TrEMBL annotations.

userman.txt

File containing information about the flatfile format

Pfam-A.regions.tsv.gz

A tab separated file containing UniProtKB reference proteome sequences and Pfam-A family information

Pfam-A.regions.uniprot.tsv.gz

A tab separated file containing UniProtKB sequences and Pfam-A family information

Pfam-A.clans.tsv.gz

A tab separated file containing Pfam-A family and clan information for all Pfam-A families
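The table dumps in database_files, described above as tab-delimited with fields enclosed by single quotes, can be read with the csv module from the standard library. The sketch below is a starting point under stated assumptions: the helper names parse_rows and read_table_dump are hypothetical, UTF-8 encoding is assumed, the three columns in the sample row are made up rather than taken from a real table, and MySQL-specific escaping or NULL markers are not handled.

```python
import csv
import gzip
import io


def parse_rows(handle):
    """Parse tab-delimited, single-quoted rows from an open text handle."""
    return csv.reader(handle, delimiter="\t", quotechar="'")


def read_table_dump(path):
    """Yield rows from a gzip-compressed .txt.gz table dump.

    Assumes UTF-8 text; adjust the encoding if the dump differs.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from parse_rows(handle)


# Made-up sample row in the quoting style described above
sample = "'PF02171'\t'Piwi'\t'Piwi domain'\n"
for row in parse_rows(io.StringIO(sample)):
    print(row)
```

The column order of a real table is defined by the matching .sql.gz file, which should be consulted before relying on field positions.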

mappings

The mappings directory contains the mapping between PDB structures and Pfam entries.

papers

The papers directory contains each NAR database issue article describing Pfam. For a detailed description of the latest changes to Pfam, please consult (and cite) these papers.

releases

The releases directory contains all the flat files and database dumps (where appropriate) for all versions of Pfam to date. The files in more recent releases are the same as described for the current release, but the contents of older releases differ.

About Pfam

Pfam version 35.0 was produced at the European Bioinformatics Institute using a sequence database called Pfamseq, which is based on UniProt release 2021_03.

Pfam is freely available under the Creative Commons Zero (“CC0”) licence.

Pfam is powered by the HMMER3 package, written by Sean Eddy and his group at HHMI/Harvard University, and is built by the Xfam consortium.


Pfam is supported by the following organisations:

EMBL logo

EMBL is EMBL-EBI’s parent organisation. It provides core funding (staff, space, equipment) for Pfam.

Wellcome Trust logo

The Wellcome Trust has supported Pfam since the inception of the database, via core funding when it was based at the Wellcome Trust Sanger Institute. As well as providing and maintaining the campus on which EMBL-EBI is located, the Wellcome Trust also now provides significant funding for Pfam (grant 221320/Z/20/Z). The current grant runs from October 2020 to September 2025.

BBSRC is supporting Pfam activities (BB/S020381/1) from November 2019 to October 2023 and has previously supported Pfam activities via grants BB/L024136/1 and BB/N00521X/1.

The Howard Hughes Medical Institute supports the Eddy group.


Many organisations have supported Pfam activities in the past.

For more information, please contact the Pfam helpdesk.

Authorship

We greatly appreciate the contributions made to Pfam by our user community. To acknowledge these contributions, and allow them to be an integral part of researchers’ profiles, we have incorporated ORCID identifiers, displaying these in the ‘curation and model’ tab of each Pfam entry.

To claim Pfam entries against your ORCID, first go to the EMBL-EBI website, enter your ORCID in the search box and select ‘Protein Families’ from the drop-down menu.

EMBL-EBI website search

From the results page, select Pfam on the left-hand side. You should then see a link at the top of the results inviting you to Claim to ORCID. Select all the entries you want to add to your ORCID and click on the button. A pop-up window will appear, inviting you to authenticate on the ORCID website. Once you are logged in, click on the Claim button.

ORCID claim in the EMBL-EBI website

Team Members

The Pfam Consortium

Pfam is the product of an international consortium of researchers that was born out of its original development by Erik Sonnhammer, Sean Eddy and Richard Durbin. The current consortium members, their institutes and primary roles are listed below.

European Bioinformatics Institute (EMBL-EBI), UK

Harvard University, USA
  • Sean Eddy - Founding developer and author of HMMER software

Stockholm Bioinformatics Center, Sweden
  • Erik Sonnhammer - Coordinator of Pfam-Sweden and founding developer

External contributors

Pfam includes families that have been built by external contributors:

NCBI, USA
  • Lakshminarayan Iyer

  • L. Aravind

  • Zhang Dapeng

  • Vivek Anantharaman

Sanford-Burnham Medical Research Institute, USA
  • Adam Godzik

Previous contributors

  • Gabriel Aldam

  • Shimelis Assefa

  • Matthew Bashton

  • Ewan Birney

  • Lorenzo Cerrutti

  • Yuanyuan Chang

  • Jody Clements

  • Penny Coggill

  • Lachlan Coin

  • Robson De Souza

  • Richard Durbin

  • Ruth Eberhardt

  • Sara El-Gebali

  • Kyle Ellrott

  • Matthew Fenech

  • Kristoffer Forslund

  • O. Luke Gavin

  • Prasad Gunasekaran

  • Sam Griffiths-Jones

  • Kevin Howe

  • Lukasz Jaroszewski

  • Nicola Kerrison

  • Marta Llagostera

  • Aurélien Luciani

  • Mhairi Marshall

  • Nina Mian

  • William Mifsud

  • Jaina Mistry

  • Simon Moxon

  • Simon Potter

  • Joanne Pollington

  • Marco Punta

  • Matloob Qureshi

  • Lorna Richardson

  • Stephen-John Sammut

  • Benjamin Schuster-Böckler

  • David Studholme

  • John Tate

  • Benjamin Vella-Briffa

  • Lowri Williams

  • Arthur Wuster

  • Corin Yeats

Pfam is a collaborative venture and we hope to interact with as many people as possible in order to provide a quality database. Please get in touch with any one of us for more information about Pfam. You can contact us through the Pfam helpdesk.

Contact us

Helpdesk

We run a helpdesk, which handles annotation comments, data enquiries and general problems with the Pfam database. We use a request tracking system to monitor emails to the helpdesk, so you should receive an automated response to your email, letting you know that the system has logged your mail and notified us of its arrival.

Xfam blog

The Pfam group contributes to the Xfam blog. The blog is used to announce releases, new features and important changes to Pfam, as well as for posts discussing general issues surrounding the Pfam resource. You can see blog posts that are specific to Pfam here.

Twitter

You can follow the Pfam team at EMBL-EBI on Twitter at @PfamDB.

License

Pfam is freely available under the Creative Commons Zero (“CC0”) licence.

Citing Pfam

If you use Pfam in your work, please consider citing the Pfam References.

Get in touch

If you have any questions or feedback, contact us through the Pfam helpdesk.