Welcome to Pfam’s documentation¶
Pfam is a large collection of protein families, each represented by multiple sequence alignments and profile hidden Markov models (HMMs).
Summary¶
Proteins are generally composed of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and a profile hidden Markov model (HMM).
Each Pfam family, usually referred to as a Pfam-A entry, consists of a curated seed alignment containing a small set of representative members of the family, profile HMMs built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
Pfam entries are classified in one of six types:

Types of Pfam entries.¶
Pfam clans¶
Structural properties are often more conserved than the underlying sequence. A single profile HMM is therefore often insufficient to model an entire, diverse structural superfamily, so related Pfam entries are sometimes grouped together into clans. The relationship may be defined by:
sequence similarity (whilst still originating from a common ancestor)
similarity of known three-dimensional structures
functional similarity
and/or similarity between their profile HMMs (as determined by algorithms such as HHsearch).
The majority of Pfam Clans are groupings of domains and families.
Getting Started¶
Pfam is hosted by InterPro. All the information contained within Pfam is accessible on the InterPro website by browsing by member database and choosing Pfam. For more information about InterPro, see its documentation.
Site organisation¶

Schematic representation of the organisation of the information in the Pfam database. The arrows represent the flow of data.¶
Note that InterPro entries use the term Conserved site instead of Motif, for consistency with the rest of the member databases that form part of the InterPro consortium.
Searching Pfam¶
There are multiple ways to look for information in Pfam using the InterPro website.
Searching a specific Pfam entry¶
Users can navigate to specific Pfam entry pages by entering the Pfam identifier, the accession number or a keyword that forms part of the entry name in any of three different Search boxes:
When selecting the Browse + By member database option, the search box is located in the header of the results table.

Example of browsing the Pfam database. A paginated list of all available Pfam entries is displayed. A Search box appears on top of this list.¶
After selecting Search + By text, a larger text box is shown in the center of the page.

Example of searching specific Pfam entry pages by entering the Pfam identifier or accession number or a keyword.¶
In the top right corner of any InterPro page, next to the magnifying glass.

On the InterPro website header, a search box appears when hovering the mouse next to the magnifying glass on the right; it can be used to search for Pfam information.¶
This text box allows you to go quickly to the relevant page in the InterPro site, by using:
Search | Find |
---|---|
Pfam accession number | Pfam entry page |
Pfam identifier or name | Pfam entry page |
Clan identifier | Pfam Clan page |
UniProt accession | InterPro protein page, which includes Pfam matches (with coordinates) |
Gene names | InterPro protein page, which includes Pfam matches (with coordinates) |
PDB identifier | InterPro structure page, which includes a 3D visualisation of Pfam matches |
Proteomes | If it is a reference proteome, the InterPro proteome page will be displayed |
Keywords, free text | List of possible matches |
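As a rough illustration of the table above, the sketch below guesses which kind of page a query would lead to from the shape of the text entered. The regular expressions are assumptions made for this example (the Pfam and clan patterns follow the PF02171 and CL0219 accessions used elsewhere in this documentation); they are not the rules that the InterPro search box actually applies, and gene names and proteomes are not handled.

```python
import re

# Heuristic patterns, listed in the order they are tried.
# These are illustrative assumptions, not InterPro's real search logic.
PATTERNS = [
    ("Pfam entry page", re.compile(r"^PF\d{5}$", re.IGNORECASE)),
    ("Pfam Clan page", re.compile(r"^CL\d{4}$", re.IGNORECASE)),
    (
        "InterPro protein page",
        re.compile(
            r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
            r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
        ),
    ),
    ("InterPro structure page", re.compile(r"^[0-9][A-Za-z0-9]{3}$")),
]


def guess_destination(query: str) -> str:
    """Return the page type a query most plausibly refers to."""
    text = query.strip()
    for destination, pattern in PATTERNS:
        if pattern.match(text):
            return destination
    # Anything else is treated as a keyword or free-text search.
    return "List of possible matches"


if __name__ == "__main__":
    for example in ["PF02171", "CL0219", "A1AA27", "kinase"]:
        print(example, "->", guess_destination(example))
```

A query that matches none of the patterns falls through to the free-text case, which mirrors the last row of the table.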
Searching a protein sequence against Pfam¶
Searching a protein sequence against the Pfam library of HMMs will enable you to find out the domain architecture of the protein, and thus what its potential function might be. If your protein is present in the UniProt version used to make the current release of InterPro, we have already calculated its domain architecture. You can access this by entering the UniProt sequence identifier in any of the Search boxes mentioned above (see Searching a specific Pfam entry).
Using the InterPro online sequence search¶
If your sequence is not in the InterPro database, you can perform a single-sequence or batch search against the Pfam database on the InterPro website. This search uses the web-based InterProScan tool, which allows you to scan up to 100 sequences at a time, with a maximum length of 40,000 amino acids. To run an online search, follow these steps:
Click the Search + By Sequence in the InterPro website menu. This opens the InterPro sequence search page.

Selecting Search + By Sequence in the InterPro website menu.¶
Provide the FASTA formatted protein sequence(s) of interest by pasting them into the text box or by importing them from a file.

Example of protein sequence in FASTA format in the text box.¶
Expand the Advanced options, click on Unselect all protein sequence applications and select Pfam.

Select only Pfam to search your sequence(s) against this database.¶
Click on the Search button.
While the sequence search is running, you can continue to navigate through the website, other browser tabs or applications and will get a pop-up notification when the job has been completed (this requires the browser notifications to be enabled).
The results of the submitted job are accessible by selecting Results + Your InterProScan Searches in the InterPro website menu.

Select Results + Your InterProScan Searches in the InterPro website menu.¶
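Before pasting a batch into the sequence search page, it can be useful to check it against the limits quoted above. The following is a minimal sketch, assuming that the 40,000 amino acid limit applies to each individual sequence; the function names are hypothetical and the parser only handles plain FASTA text.

```python
MAX_SEQUENCES = 100   # batch size limit quoted in the text above
MAX_LENGTH = 40_000   # assumed to be a per-sequence limit


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_batch(text):
    """Return a list of problems; an empty list means the batch fits."""
    records = parse_fasta(text)
    problems = []
    if not records:
        problems.append("no FASTA records found")
    if len(records) > MAX_SEQUENCES:
        problems.append(
            f"{len(records)} sequences exceeds the limit of {MAX_SEQUENCES}"
        )
    for header, sequence in records:
        if len(sequence) > MAX_LENGTH:
            problems.append(
                f"{header}: {len(sequence)} residues exceeds {MAX_LENGTH}"
            )
    return problems
```

Calling `check_batch` on the text of a FASTA file returns an empty list when the batch is within both limits, and one message per problem otherwise.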
Interpreting the protein viewer¶
All Pfam entries - and the InterPro entries where they are integrated - are displayed in the protein sequence viewer. The Pfam and InterPro entries are grouped by type (family, domain, repeat, site). The coloured bars indicate the location of entry matches on the protein sequence. Each matched InterPro entry is displayed on a separate line, with the Pfam entries integrated in it displayed below where relevant. The Pfam entries that remain unintegrated in InterPro entries are displayed separately in the Unintegrated category.
Above the protein sequence viewer, icons allow you to display the viewer in full screen and to zoom in and out of the protein sequence. The Options button lets you personalise the display by changing the colour code of the entries, choosing the labels (accession number, short name and/or description can be displayed on the right-hand side of the viewer), and either collapsing the visualisation to show InterPro entries only or also displaying the contributing entries from the member databases. Keep the tooltip active to see a pop-up box with the accession number, description and amino acid coordinates of a match when hovering the mouse over it. Snapshots of the results can be taken in PNG format.

Results of the submitted job. The integrated and unintegrated Pfam entries matching this protein sequence are shown in the protein viewer. The colour code of the protein viewer is customised as Colour By + Member Database for all Pfam entries to be highlighted in blue.¶
Local protein search¶
Alternatively, if you have a very large number of protein searches to perform, or you do not wish to share your sequence, it may be more convenient to install and run InterProScan.
Finding proteins with a specific set of domain combinations (Domain architectures)¶
Users can search for protein sequences that contain specific Pfam entries in a particular arrangement by selecting Search + By Domain architecture in the InterPro website menu. Pfam entries that the proteins should contain can be included in the domain architecture, and those they should not contain can be excluded from it. The Order of domain matters option offers the possibility to arrange the domains in a particular order. The Exact match option fine-tunes the search to find only proteins containing the selected domains (with no extra domains in the proteins). Domains can be selected by entering a domain name, Pfam accession or InterPro accession.

Select Search + By Domain architecture in the InterPro menu, enter the desired Pfam entries and select/unselect the relevant options.¶
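The options described above can be thought of as a filter applied to the ordered list of domains in each protein. The sketch below is one possible reading of those options, not the implementation used by the InterPro website; the function names are hypothetical and the single-letter domain labels are placeholders rather than real accessions.

```python
def is_subsequence(wanted, domains):
    """True if `wanted` occurs within `domains` in the same relative order."""
    remaining = iter(domains)
    # `d in remaining` consumes the iterator, so order is enforced.
    return all(d in remaining for d in wanted)


def matches_architecture(domains, include, exclude=(), ordered=False, exact=False):
    """Check one protein's domain list against a domain architecture query.

    domains: domains of the protein, from N- to C-terminus.
    include: domains the protein should contain.
    exclude: domains the protein should not contain.
    ordered: assumed meaning of the "Order of domain matters" option.
    exact:   assumed meaning of the "Exact match" option (no extra domains).
    """
    if any(d in domains for d in exclude):
        return False
    if exact:
        if ordered:
            return list(domains) == list(include)
        return sorted(domains) == sorted(include)
    if ordered:
        return is_subsequence(include, domains)
    return all(d in domains for d in include)


if __name__ == "__main__":
    protein = ["A", "B", "C"]
    print(matches_architecture(protein, ["A", "C"]))
    print(matches_architecture(protein, ["C", "A"], ordered=True))
```

In this sketch an unordered, non-exact query ignores how many times a domain occurs; that is a simplification chosen for the example.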
Pfam entry page organisation¶
In each Pfam entry page, different tabs with relevant information are available, as shown in the figure below.

Example of a Pfam entry page (PF02171). All the tabs explained below can be found on the left-hand side menu. The Overview tab is displayed by default.¶
Overview¶
The entry overview tab is the default display, where the type of Pfam entry, the short name and the clan (if the entry belongs to one) are shown at the top. More information about how clans are defined can be found in Summary. Usually, a curated description of the entry is displayed below, with the relevant literature references.
If there is a Wikipedia page for the entry, the first paragraph and the box with an image of a three-dimensional structure and some cross-links are displayed. The full Wikipedia article can be opened in a new tab by clicking on the title.
Proteins¶
The list of proteins matching this entry is displayed in this tab. This view can be customised to show:
All proteins (from the whole UniProtKB database).
Only Reviewed proteins (from Swiss-Prot - manually curated).
Only Unreviewed proteins (from TrEMBL - derived from public databases automatically integrated into UniProt).
For each protein, the corresponding protein page in InterPro can be accessed by clicking on the protein accession or name; the InterPro taxonomy page can be accessed by clicking on the species name; and a small-size protein viewer displays the location of the Pfam entry in the protein. The coordinates of the match can be shown by hovering the mouse over it. You can also export this data in different formats, by clicking on the Export button, and customise the page settings, by clicking on the wheel icon.
Domain architectures¶
This tab shows the various domain arrangements of the proteins matched by the entry, in descending order of the number of times each architecture is seen. Identifying the different domains present in proteins is crucial to understanding how they function.
The protein viewer displays a representative sequence for each domain architecture, where the domain size is based on the real length of the domain in the protein. When hovering over a domain, more details are shown in a tooltip, including the domain’s position.
From this page, all related Pfam entry pages can also be accessed by clicking on a Pfam accession at the top of the viewer or on a short name on the right-hand side of the viewer. The list of proteins with this architecture is available by clicking on the protein number.
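The ordering used in this tab can be pictured as a simple count of identical domain arrangements. The following is a minimal sketch with made-up data; the function name, protein names and single-letter domain labels are placeholders, not values from the database.

```python
from collections import Counter


def rank_architectures(protein_domains):
    """Return (architecture, protein_count) pairs, most frequent first.

    protein_domains maps a protein name to its ordered list of domains.
    """
    counts = Counter(tuple(domains) for domains in protein_domains.values())
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical proteins with placeholder domain labels.
    proteins = {
        "protein_1": ["A", "B"],
        "protein_2": ["A", "B"],
        "protein_3": ["A"],
    }
    for architecture, count in rank_architectures(proteins):
        print("-".join(architecture), count)
```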
Taxonomy¶
This tab shows by default a sunburst chart of all the species that the proteins matched by the Pfam entry belong to.
By default, eight individual nodes that are derived from the taxonomic lineage of each protein sequence, ranging from superkingdoms down to species, are displayed. For each node in the taxonomy tree there is a separate ring - and each ring is arranged radially, with the superkingdoms at the centre and the species around the outermost ring. The length of each ring is proportional to the number of proteins found within each taxon. You can choose how many rings you want to see from the options on the right-hand side of the page.
Segments of the sunburst chart are coloured according to their superkingdom, as explained in the Legends section. Mousing over any part of the sunburst chart shows the taxonomic name and level, with both the number of sequences and the number of species found at that level shown on the right-hand side.
These data can also be seen as a table and as a tree. In addition, it is possible to choose to see only data from key species instead. These visualisation options can be chosen from the icon panel above the sunburst. All this information can be downloaded in different formats.
Proteomes¶
A list of the reference proteomes matched by the entry is displayed in this tab. Each item in this list shows the Proteome ID (which is a link to the Proteome page in InterPro), the name of the species carrying this proteome and the number of proteins in this proteome that match the entry. From the Actions column, users can also see a list of these proteins by clicking the first icon (View matching proteins), download the data in different formats or View proteome information.
Structures¶
This tab displays a list of all the PDB structures linked to the proteins matching the Pfam entry. For each structure, you can see the PDB accession, the name of the structure in PDB, and a small-sized protein sequence viewer displaying the location of the Pfam entry in the protein structure chain.
Viewing the structures of domains and proteins helps to understand what their function might be, and how individual residues are arranged in the three-dimensional space. Often, two residues which seem distant along the linear protein sequence can be very close in the folded protein.
By clicking on a PDB accession, name or small image of the structure, a view of the corresponding InterPro structure page that summarises all of the entries of Pfam and other databases and resources for each chain of the structure will be displayed in a protein sequence viewer.
The position of each entry within the overall 3D structure can be visualised by choosing the Pfam entry of interest in the drop-down list Highlight Entry in the 3D structure or by clicking on the bar corresponding to the entry match in the protein sequence viewer. Additionally, links to similar PDB viewers and cross-references to other structural databases are provided in the External links section.
Signature¶
This tab shows the HMM logo of the Pfam model, visualised using Skylign. HMM logos are one way of visualising profile HMMs. Logos provide a quick overview of the properties of an HMM in a graphical form.
The visualisation displays the amino acid conservation for each residue in the model. The rendered area can be dragged to a desired position to navigate large logos. Alternatively, a specific residue number can be written in the Model column text box. When selecting a particular residue in the logo, the probabilities of each amino acid are displayed in the bottom part.
AlphaFold¶
Many of the proteins found in the Pfam entry may have a predicted structure available in AlphaFoldDB. A list of all the predicted structures available in AlphaFoldDB for the proteins belonging to this entry is displayed in this tab. For each protein in the list, its UniProt accession, name, the species it belongs to, its length, and a button that allows you to show the predicted structure of this protein in the structure viewer are displayed.
It is also possible to click on the UniProt accession to go to the InterPro protein page and then to its AlphaFold tab, where the position of each entry in the 3D structure viewer is displayed by clicking on the bar corresponding to the entry match in the protein sequence viewer.
Alignment¶
Three different alignments can be chosen and visualised in this tab:
The seed alignment shows the multiple sequence alignment used to create the HMM model in Pfam. This is a representative set of sequences of the family and it normally contains a relatively small number of protein sequences (from the UniProt Reference proteomes).
The full alignment shows all the protein sequences from the UniProt Reference proteomes that match this model.
The uniprot alignment includes all the protein sequences matched by this Pfam model in the whole UniProt database.
The colour coding of the alignment can be customised through the options available in the Colors section.
All the alignments can be downloaded by clicking on the Download button.
Curation¶
This tab is divided into two subsections:
In the first section, you can see details about Pfam curators and Sequence ontology.
The second section displays the HMM building command used to generate the HMM profile defining the Pfam entry and offers the possibility to download it.
Pfam entries creation and annotation¶
For each Pfam entry, the HMM model is run against the protein sequences belonging to the UniProt Reference Proteomes. Subsequently, Pfam curators set a statistical cut-off, known as a gathering threshold (GA) for an entry. Sequences failing to make a statistical match above this threshold are not reported as hits. The threshold is quite conservative, to minimise false positives (although they are unavoidable sometimes). The Pfam model is then run against the whole UniProtKB database before every InterPro release and these are the matches shown in the Proteins tab on the Pfam entry page.
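The effect of the gathering threshold can be illustrated with a small filter over search hits. This is a sketch under stated assumptions: the hits are represented as dictionaries with a hypothetical "score" field, the threshold value is invented for the example, and a hit scoring exactly at the threshold is assumed to be reported.

```python
def filter_hits(hits, gathering_threshold):
    """Keep only hits whose score reaches the gathering threshold (GA)."""
    return [hit for hit in hits if hit["score"] >= gathering_threshold]


if __name__ == "__main__":
    # Hypothetical hits and threshold, for illustration only.
    hits = [
        {"id": "seq1", "score": 30.5},
        {"id": "seq2", "score": 21.9},
        {"id": "seq3", "score": 25.0},
    ]
    for hit in filter_hits(hits, gathering_threshold=25.0):
        print(hit["id"], hit["score"])
```

A higher threshold removes more false positives at the cost of missing some true members, which is the trade-off the curators balance when setting it.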
Different Pfam entries have annotations providing diverse amounts of information. Many of them have a description created by Pfam curators. Anyone can contribute to this annotation by contacting the curators directly through the Add your annotation toolbox located on the right-hand side of the Overview tab.
If you know of a domain that is not present in Pfam, you can submit it to the Pfam helpdesk and we will endeavour to build a Pfam entry for it. Please note that our interest does not currently extend to small, species-specific protein families of unknown function, unless they are supported by a publication or other significant functional predictions. We kindly ask you to follow the How can I submit a new domain? section of the FAQ before submitting information for the creation of a new Pfam entry.

Select Add your annotation to give feedback to curators.¶
In addition, Pfam encourages the annotation of Pfam families via Wikipedia. Below the traditional description of the Pfam entry, you may find the text from a Wikipedia article that we feel provides a good description of the Pfam family.
If a family does not yet have a Wikipedia article assigned to it, there are several ways for you to help us add one. You can find more information about the process in the Wikipedia section.
Clan page organisation¶
If a Pfam entry is included in a Pfam clan, this information is displayed in the Overview tab of the Pfam entry page, next to Clan, below the Pfam short name, with a link to the corresponding clan page. More information about how clans are defined can be found in Summary.
Additionally, it is possible to browse through the Pfam clans by selecting Browse + By Clan/Set in the InterPro website menu and selecting Pfam in the database section.

Example of a Pfam clan page (CL0219). All the tabs described below can be found on the left-hand side menu. The Overview tab is displayed by default.¶
In each Pfam clan page, different tabs with relevant information are available. The information they contain is described below.
Overview¶
The clan Overview tab is the default display, where the clan accession number, its short name and the author(s) are shown at the top. A description of the clan is displayed below, with the relevant literature references.
An interactive view of the Pfam entries included in the clan is also displayed. Different label types can be chosen through the Label Content menu: Accession, Name and Short name.
Entries¶
The list of Pfam entries included in the clan is provided in this tab. For each entry, the accession, name, short name and links to the entry's SEED alignment and domain architectures pages are available.
Users can export this data in different formats, by clicking on the Export button, and customise the page settings, by clicking on the wheel icon.
Proteins¶
The list of proteins matching any Pfam entry belonging to the clan is displayed in this tab. The view can be customised to show:
All proteins (from the whole UniProtKB database).
Only Reviewed proteins (from Swiss-Prot - manually curated).
Only Unreviewed proteins (from TrEMBL - derived from public databases automatically integrated into UniProt).
For each protein, the corresponding protein page in InterPro can be accessed by clicking on the protein accession or name, and the InterPro taxonomy page can be accessed by clicking on the species name.
Users can export this data in different formats, by clicking on the Export button, and customise the page settings, by clicking on the wheel icon.
Structures¶
This tab displays a list of all the PDB structures linked to the proteins matching any Pfam entry belonging to the clan. For each structure, you can see the PDB accession and the name of the structure in PDB.
By clicking on a PDB accession, name or small image of the structure, a view of the corresponding InterPro structure page that summarises all of the entries of Pfam and other databases and resources for each chain of the structure will be displayed in a protein sequence viewer.
The position of each entry within the overall 3D structure can be visualised by choosing the Pfam entry of interest in the drop-down list Highlight Entry in the 3D structure or by clicking on the bar corresponding to the entry match in the protein sequence viewer. Additionally, links to similar PDB viewers and cross-references to other structural databases are provided in the External links section.
Taxonomy¶
This tab shows by default a list of all the species that the proteins matched by any Pfam entry of the clan belong to.
These data can also be seen as a tree. These visualisation options can be chosen from the icon panel above the list. All this information can be downloaded in different formats.
Proteomes¶
A list of the reference proteomes matched by any Pfam entry belonging to the clan is displayed in this tab. For each item in this list the Proteome ID (which is a link to the Proteome page in InterPro), the name of the species carrying this proteome and the number of proteins in this proteome that match the entry are displayed. From the Actions column, users can also access a list of these proteins by clicking the first icon (View matching proteins), download the data in different formats or View proteome information.
Training materials¶
Pfam Quick tour¶
Quick tour provides a brief introduction to the Pfam database and how to access its annotations through the InterPro website.
Creating Families¶
Creating families provides a tutorial on how to create a Pfam entry.
Frequently Asked Questions (FAQs)¶
This Pfam entry is not integrated into InterPro, is it useful anyway?
Why is there apparent redundancy of UniProtKB IDs in the full-length FASTA sequence file?
What is the difference between the ‘-’ and ‘.’ characters in your full alignments?
How can I visualise the position of a Pfam entry in a structure?
What is Pfam?¶
Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam profile HMM represents a protein family or domain. By searching a protein sequence against the Pfam library of profile HMMs, you can determine which domains it carries, i.e. its domain architecture. Pfam can also be used to analyse proteomes and to address questions about more complex domain architectures.
For each Pfam accession, we have an entry page. See Searching a specific Pfam entry for more information on how to access them.
What is a Pfam entry page?¶
On the Pfam entry page you can view all the associated information, from annotation to structure predictions of the protein members. See Pfam entry page organisation for a detailed description of how this data is presented.
What is a clan?¶
Some of the Pfam entries are grouped into clans. Pfam defines a clan as a collection of entries that have arisen from a single evolutionary origin. Evidence of their evolutionary relationship can be in the form of similarity in tertiary structures, or, when structures are not available, from common sequence motifs.
When a sequence region has overlapping matches to more than one entry within the same clan, we only show one of those matches. If the sequence region is also in the seed alignment for an entry, only the match to that entry is shown. Otherwise we show the entry that corresponds to the match with the lowest E-value.
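The rule above for choosing between overlapping matches within a clan can be written down as a short selection function. This is a minimal sketch of the rule as described, not Pfam's actual code; the `Match` class, its field names and the example values are hypothetical, and the case where several overlapping entries have the region in their seed alignments is resolved arbitrarily by taking the first.

```python
from dataclasses import dataclass


@dataclass
class Match:
    entry: str          # Pfam accession of the matching entry
    start: int          # start of the match on the sequence
    end: int            # end of the match on the sequence
    evalue: float       # E-value of the match
    in_seed: bool = False  # True if the region is in the entry's seed alignment


def pick_clan_match(overlapping):
    """Choose the single match to show among overlapping same-clan matches."""
    seed_matches = [m for m in overlapping if m.in_seed]
    if seed_matches:
        # The seed alignment takes priority over the E-value.
        return seed_matches[0]
    return min(overlapping, key=lambda m: m.evalue)


if __name__ == "__main__":
    # Placeholder entries and values, for illustration only.
    candidates = [
        Match("entry_1", 10, 120, evalue=1e-20),
        Match("entry_2", 15, 118, evalue=1e-5, in_seed=True),
    ]
    print(pick_clan_match(candidates).entry)
```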
The clan pages can be accessed by following a link from the Pfam entry page, or alternatively by selecting Browse + By Clan/Set in the InterPro website menu and selecting Pfam in the database section.
For each clan page, you can access all the related data. See Clan page organisation for more information.
What criteria do you use for adding families into clans?¶
We use a variety of measures. Where possible we do use experimental and predicted structures to guide us and that is always the gold standard. We also intend to harmonise this organisation with the ECOD classification. In the absence of a structure we use:
Profile comparisons such as HHsearch
The fact that a sequence significantly matches two profile HMMs in the same region of the sequence
A method called SCOOP, that looks for common matches in search results that may indicate a relationship
All of this information is used by the Pfam curators to decide whether families are related, and we strive to find information in the literature that supports the relationship, e.g. common function.
What is Pfam-N?¶
Pfam-N (N for network) provides additional Pfam matches identified by the Google Research team using deep learning approaches. You can read more about it in this initial blog post and this update. The matches for Pfam-N are displayed under the ‘Other features’ section in the protein sequence viewer.

Example of InterPro protein page for the UniProt accession A1AA27. The protein viewer shows the integrated and unintegrated Pfam entries matching this protein sequence, as well as other features such as the Pfam-N matches. The colour code of the protein viewer is customised as Colour By + Member Database for all Pfam entries to be highlighted in blue. The tooltip is active and the mouse was hovering over one of the Pfam-N matches when this screenshot was taken.¶
What is the relation between Pfam and InterPro?¶
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and a profile hidden Markov model (HMM), together with associated information. All the information in the Pfam database can be accessed through the InterPro website, where it is hosted. See Getting Started for more information.
InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites through the use of predictive models, known as signatures, provided by several collaborating databases (referred to as member databases). One of its 13 member databases is Pfam. For further information you can explore the InterPro About pages.
Members of the Pfam team at the EMBL-EBI are also part of the InterPro team. In this way, while both protein resources are independently maintained, there is a really close relation between them, with feedback constantly going in both directions to improve protein classification.
This Pfam entry is not integrated into InterPro, is it useful anyway?¶
Yes! The criteria for creating a new Pfam entry and a new InterPro entry are different. A Pfam entry might not yet be curated in InterPro or might not reach InterPro’s standards for integration. However, it can still provide very important information about a protein of interest.
Is it possible to build Wise2 with HMMER3 support?¶
We get round the difference in HMMER versions by converting the profile HMMs from HMMER3 format to HMMER2 format using the HMMER3 program “hmmconvert” (with the -2 flag). To make the searches feasible, we screen the DNA for potential domains using ncbi-blast with Pfam-A.fasta as the target library. GeneWise is then used to compare the resulting subset of profile HMMs against the DNA. There is some down-weighting of the bits-per-position between H2 and H3 HMMs that the conversion does not account for, leading inevitably to some false negatives for some families/sequences. However, until GeneWise is patched to deal with HMMER3 models, this is the best course of action.
How can I search Pfam locally?¶
If you have a large number of sequences or you don’t want to post your sequence across the web, you can search your sequence locally using InterProScan.
Why doesn’t Pfam include my sequence?¶
Pfam is built from a fixed release of UniProtKB. At each InterPro release we incorporate sequences from the latest release of UniProtKB. This means that, at any time, the sequences used by Pfam might be several weeks behind those in the most up-to-date versions of the sequence databases. If your sequence isn’t in Pfam, you can still find out what domains it contains by pasting it into the sequence search box (see InterPro online sequence search for more information).
Why is there apparent redundancy of UniProtKB IDs in the full-length FASTA sequence file?¶
A given Pfam family may match a single protein sequence multiple times, if the domain/family is a repeating unit, for example, or when the profile HMM matches only to short stretches of the sequence but matches several times. In such cases the FASTA file with the full length sequences will contain multiple copies of the same sequence.
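If the repeated copies are not wanted, they can be collapsed after download. The sketch below keeps the first record for each identifier; it assumes that the identifier is the first whitespace-delimited token of each header line, which may not hold for every file, and the function name is hypothetical.

```python
def unique_records(fasta_text):
    """Return FASTA text with only the first record kept for each identifier.

    Assumes the identifier is the first token after '>' on the header line.
    """
    seen = set()
    kept = []
    keep_current = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            identifier = tokens[0] if tokens else ""
            # Keep the record only the first time its identifier is seen.
            keep_current = identifier not in seen
            seen.add(identifier)
        if keep_current:
            kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    # Placeholder identifiers and sequences, for illustration only.
    example = ">seqA first\nMKT\n>seqB\nGGG\n>seqA first\nMKT\n"
    print(unique_records(example))
```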
How can I submit a new domain?¶
If you know of a domain/family that is not present in Pfam, you can submit it to the Pfam helpdesk and we will endeavour to build a Pfam entry for it. Please note that our interest does not currently extend to small, species-specific protein families of unknown function, unless they are supported by a publication or other significant functional predictions.
Pfam SEED¶
We need at least one sequence to start building a model. Here are some options:
Sequence UniProt ID, from the UniProt Reference Proteomes if possible, and the coordinates (start and end) when appropriate (if it is a domain, or motif). This is the preferred submission form for us.
Sequence/MSA in FASTA format
Sequence/MSA in a text file (e.g. .txt)
If sequences are not in UniProt, we won’t be able to build a model as we need UniProt IDs and versions (Stockholm format). When possible, try not to submit gene IDs in the alignments, give UniProt IDs instead.
Pfam description¶
In addition to the sequence alignment, to build the Pfam SEED, we also need you to provide:
Suggested name and ID for the Pfam entry
Description of the protein/domain function if known
Reference to a scientific publication whenever possible
Your ORCID ID, to add you as an author of the Pfam entry
Can I search my protein against Pfam?¶
Of course! Please look at the sequence search section for instructions on how to do it.
What is the difference between the ‘-’ and ‘.’ characters in your full alignments?¶
The ‘-’ and ‘.’ characters both represent gaps. However, they tell you some extra information about how the profile HMM has generated the alignment. The ‘-’ symbols are where the alignment of the sequence has used a delete state in the profile HMM to jump past a match state. This means that the sequence is missing a column that the profile HMM was expecting to be there. The ‘.’ character is used to pad gaps where another sequence in the alignment has residues emitted from the profile HMM’s insert state. See the alignment below where both characters are used. The profile HMM states emitting each column are shown. Note that residues emitted from the Insert (I) state are in lower case.
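The conventions described above can be applied mechanically to a single row of an alignment. The following is a minimal sketch that follows only the rules stated in this answer; the function names and the example row are invented for illustration.

```python
def classify(symbol):
    """Name the profile HMM state implied by one alignment character."""
    if symbol == "-":
        return "delete"          # match column skipped via a delete state
    if symbol == ".":
        return "insert-padding"  # padding where another sequence has an insert
    if symbol.islower():
        return "insert"          # residue emitted from an insert state
    return "match"               # residue emitted from a match state


def match_columns_only(aligned_row):
    """Drop insert columns, keeping match residues and delete gaps."""
    return "".join(c for c in aligned_row if classify(c) in ("match", "delete"))


def ungapped(aligned_row):
    """Recover the plain sequence by removing both kinds of gap character."""
    return "".join(c.upper() for c in aligned_row if c not in "-.")


if __name__ == "__main__":
    row = "AC-gt.DE"  # made-up aligned row
    print(match_columns_only(row))
    print(ungapped(row))
```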

How can I visualise the position of a Pfam entry in a structure?¶
In the Structures tab of a Pfam entry or a Pfam clan page you can find links to relevant InterPro structure pages.
In an InterPro structure page, or each chain of the structure matches to Pfam and other databases and resources are displayed in a protein sequence viewer. On top you can see the 3D structure viewer.
The position of each Pfam entry within the overall 3D structure can be visualised by:
hovering the mouse over the coloured bar representing the Pfam match in the protein sequence viewer
choosing the Pfam entry of interest in the drop-down list Highlight Entry in the 3D structure
The AlphaFold tab of a Pfam entry provides links to the predicted structure of every protein matching the entry. In the AlphaFold tab of InterPro protein pages, the position of each Pfam entry within the overall 3D structure can be visualised by hovering the mouse over the coloured bar representing the Pfam match in the protein sequence viewer.
Why don’t you have domain YYYY in Pfam?¶
We are very keen to be alerted to new domains. If you can provide us with a multiple sequence alignment, we will try hard to incorporate it into the database. If you know of a domain but don’t have a multiple sequence alignment, we still want to know: for simple families, just one sequence is enough. In either case, please contact the Pfam helpdesk.
Are there other databases which do this?¶
To a certain extent yes, there are a number of “second generation” databases which are trying to organise protein space into evolutionarily conserved regions. InterPro combines information from several of them in a single searchable resource.
So which database is better?¶
As with everything, it depends on your problem: we would certainly suggest using more than one method. Pfam is likely to provide more interpretable results, with crisp definitions of domains in a protein.
Glossary¶
Alignment coordinates¶
HMMER3 reports two sets of domain coordinates for each profile HMM match. The envelope coordinates delineate the region on the sequence where the match has been probabilistically determined to lie, whereas the alignment coordinates delineate the region over which HMMER is confident that the alignment of the sequence to the profile HMM is correct. Our full alignments contain the envelope coordinates from HMMER3.
Architecture¶
The collection of domains that are present on a protein.
Clan¶
A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.
Domain score¶
The score of a single domain aligned to a profile HMM. Note that, for HMMER2, if there was more than one domain, the sequence score was the sum of all the domain scores for that Pfam entry. This is not quite true for HMMER3.
Envelope coordinates¶
See Alignment coordinates.
Full alignment¶
An alignment of the set of related sequences which score higher than the manually set threshold values for the profile HMMs of a particular Pfam entry.
Gathering threshold (GA)¶
Also called the gathering cut-off, this value is the search threshold used to build the full alignment. The gathering threshold is assigned by a curator when the family is built. The GA is the minimum score a sequence must attain in order to belong to the full alignment of a Pfam entry. For each Pfam profile HMM we have two GA cutoff values, a sequence cutoff and a domain cutoff.
HMMER¶
The suite of programs that Pfam uses to build and search profile HMMs. Since Pfam release 24.0 we have used HMMER version 3 to make Pfam. See the HMMER site for more information.
Noise cutoff (NC)¶
The bit scores of the highest scoring match not in the full alignment.
Pfam-A¶
A profile HMM based hand curated Pfam entry which is built using a small number of representative sequences. We manually set a threshold value for each profile-HMM and search our models against the UniProtKB database. All of the sequences which score above the threshold for a Pfam entry are included in the entry’s full alignment.
Pfam-B¶
A set of unannotated, computationally generated multiple sequence alignments. They are one of the sources we use for creating Pfam-A entries.
Posterior probability¶
HMMER reports a posterior probability for each residue that matches a ‘match’ or ‘insert’ state in the profile HMM. A high posterior probability shows that the alignment of the amino acid to the match/insert state is likely to be correct, whereas a low posterior probability indicates that there is alignment uncertainty. This is indicated on a scale with ‘*’ being 10, the highest certainty, down to 1 being complete uncertainty. Within Pfam we display this information as a heat map view, where green residues indicate high posterior probability, and red ones indicate a lower posterior probability.
Repeat¶
A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.
Seed alignment¶
An alignment of a set of representative sequences for a Pfam entry. We use this alignment to construct the profile HMMs for the Pfam entry.
Sequence score¶
The total score of a sequence aligned to a profile HMM. If there is more than one domain, the sequence score is the sum of all the domain scores for that Pfam entry. If there is only a single domain, the sequence and domain scores for the protein will be identical. We use the sequence score to determine whether a sequence belongs to the full alignment of a particular Pfam entry.
Trusted cutoff (TC)¶
The bit scores of the lowest scoring match in the full alignment.
Pfam scores¶
E-values and Bit-scores¶
Pfam-A is based around hidden Markov model (HMM) searches, as provided by the HMMER3 package. In HMMER3, like BLAST, E-values (expectation values) are calculated. The E-value is the number of hits that would be expected to have a score equal to or better than this value by chance alone. A good E-value is much less than 1. A value of 1 is what would be expected just by chance. In principle, all you need to decide on the significance of a match is the E-value.
E-values are dependent on the size of the database searched, so we use a second system in-house for maintaining Pfam models, based on a bit score (see below), which is independent of the size of the database searched. For each Pfam family, we set a bit score gathering (GA) threshold by hand, such that all sequences scoring at or above this threshold appear in the full alignment. It works out that a bit score of 24 equates to an E-value of approximately 0.1, and a score of 27 to approximately 0.01. From the gathering threshold both a “trusted cutoff” (TC) and a “noise cutoff” (NC) are recorded automatically. The TC is the score for the next highest scoring match above the GA, and the NC is the score for the sequence next below the GA, i.e. the highest scoring sequence not included in the full alignment.
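The relationship between the three cutoffs can be sketched in a few lines of Python. This is a minimal illustration that assumes a flat list of bit scores and a single GA value; it ignores the separate sequence and domain cutoffs, and the function name and scores are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def trusted_and_noise_cutoffs(
    scores: Sequence[float], ga: float
) -> Tuple[Optional[float], Optional[float]]:
    """Derive TC and NC from a list of bit scores and a gathering threshold.

    TC is the lowest score at or above the GA (the weakest included match);
    NC is the highest score below the GA (the strongest excluded match).
    """
    included = [s for s in scores if s >= ga]
    excluded = [s for s in scores if s < ga]
    tc = min(included) if included else None
    nc = max(excluded) if excluded else None
    return tc, nc


# Hypothetical bit scores for one family, invented for this example.
scores = [112.4, 58.0, 27.3, 25.1, 22.9, 18.6]
tc, nc = trusted_and_noise_cutoffs(scores, ga=25.0)
print(f"TC = {tc}, NC = {nc}")
```

With these made-up scores the GA of 25.0 sits between a TC of 25.1 and an NC of 22.9, so the curated threshold is bracketed by the two recorded values.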
Sequence versus domain scores¶
There’s an additional wrinkle in the scoring system. HMMER3 calculates two kinds of scores, the first for the sequence as a whole and the second for the domain(s) on that sequence. The “sequence score” is the total score of a sequence aligned to the model (the HMM); the “domain score” is the score for a single domain — these two scores are virtually identical where only one domain is present on a sequence. Where there are multiple occurrences of the domain on a sequence any individual match may be quite weak, but the sequence score is the sum of all the individual domain scores, since finding multiple instances of a domain increases our confidence that that sequence belongs to that protein family, i.e. truly matches the model.
Meaning of bit-score for non-mathematicians¶
A bit score of 0 means that the likelihood of the match having been emitted by the model is equal to that of it having been emitted by the Null model (by chance). A bit score of 1 means that the match is twice as likely to have been emitted by the model as by the Null. A bit score of 2 means that the match is 4 times as likely to have been emitted by the model as by the Null. So, a bit score of 20 means that the match is 2 to the power 20 times as likely to have been emitted by the model as by the Null.
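The arithmetic above amounts to a base-2 logarithm of a likelihood ratio. The short Python sketch below shows the conversion in both directions; the function names are chosen for this example only.

```python
import math


def likelihood_ratio(bit_score: float) -> float:
    """How many times more likely the match is under the model than the Null."""
    return 2.0 ** bit_score


def bit_score(p_model: float, p_null: float) -> float:
    """Bit score as the base-2 log of the model/Null likelihood ratio."""
    return math.log2(p_model / p_null)


for bits in (0, 1, 2, 20):
    print(f"bit score {bits:>2}: {likelihood_ratio(bits):,.0f} times as likely")
```

A bit score of 20 therefore corresponds to a ratio of 1,048,576, which is why even modest bit scores represent strong evidence.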
Citing Pfam¶
Pfam References¶
Pfam: The protein families database in 2021 J. Mistry, S. Chuguransky, L. Williams, M. Qureshi, G.A. Salazar, E.L.L. Sonnhammer, S.C.E. Tosatto, L. Paladin, S. Raj, L.J. Richardson, R.D. Finn, A. Bateman Nucleic Acids Research (2020) doi: 10.1093/nar/gkaa913
The Pfam protein families database in 2019: S. El-Gebali, J. Mistry, A. Bateman, S.R. Eddy, A. Luciani, S.C. Potter, M. Qureshi, L.J. Richardson, G.A. Salazar, A. Smart, E.L.L. Sonnhammer, L. Hirsh, L. Paladin, D. Piovesan, S.C.E. Tosatto, R.D. Finn Nucleic Acids Research (2019) doi: 10.1093/nar/gky995
The Pfam protein families database: towards a more sustainable future: R.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, S.C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, G.A. Salazar, J. Tate, A. Bateman Nucleic Acids Research (2016) Database Issue 44:D279-D285
The Pfam protein families database: R.D. Finn, A. Bateman, J. Clements, P. Coggill, R.Y. Eberhardt, S.R. Eddy, A. Heger, K. Hetherington, L. Holm, J. Mistry, E.L.L. Sonnhammer, J. Tate, M. Punta Nucleic Acids Research (2014) Database Issue 42:D222-D230
The Pfam protein families database: M. Punta, P.C. Coggill, R.Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E.L.L. Sonnhammer, S.R. Eddy, A. Bateman, R.D. Finn Nucleic Acids Research (2012) Database Issue 40:D290-D301
The Pfam protein families database: R.D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J.E. Pollington, O.L. Gavin, P. Gunesekaran, G. Ceric, K. Forslund, L. Holm, E.L. Sonnhammer, S.R. Eddy, A. Bateman Nucleic Acids Research (2010) Database Issue 38:D211-D222
The Pfam protein families database: R.D. Finn, J. Tate, J. Mistry, P.C. Coggill, J.S. Sammut, H.R. Hotz, G. Ceric, K. Forslund, S.R. Eddy, E.L. Sonnhammer and A. Bateman Nucleic Acids Research (2008) Database Issue 36:D281-D288
Pfam: clans, web tools and services: R.D. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S.R. Eddy, E.L.L. Sonnhammer and A. Bateman Nucleic Acids Research (2006) Database Issue 34:D247-D251
Enhanced protein domain discovery by using language modeling techniques from speech recognition: L. Coin, A. Bateman and R. Durbin Proc. Natl. Acad. Sci. USA. (2003) 100(8):4516-20
The Pfam Protein Families Database: A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats and S.R. Eddy Nucleic Acids Research (2004) 32:D138-D141
The Pfam Protein Families Database: A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S.R. Eddy, S. Griffiths-Jones, K.L. Howe, M. Marshall and E.L. Sonnhammer Nucleic Acids Research (2002) 30(1):276-280
The Pfam Protein Families Database: A. Bateman, E. Birney, R. Durbin, S.R. Eddy, K.L. Howe and E.L. Sonnhammer Nucleic Acids Research (2000) 28:263-266
Pfam 3.1: 1313 multiple alignments match the majority of proteins: A. Bateman, E. Birney, R. Durbin, S.R. Eddy, R.D. Finn and E.L.L. Sonnhammer Nucleic Acids Research (1999) 27:260-262
Pfam: multiple sequence alignments and HMM-profiles of protein domains: E.L.L. Sonnhammer, S.R. Eddy, E. Birney, A. Bateman and R. Durbin Nucleic Acids Research (1998) 26:320-322
Pfam: a comprehensive database of protein families based on seed alignments: E.L.L. Sonnhammer, S.R. Eddy and R. Durbin Proteins (1997) 28:405-420
Book Chapters on Pfam¶
Homology-Based Annotation of Large Protein Datasets M. Punta, J. Mistry Data Mining Techniques for the Life Sciences. Methods in Molecular Biology vol 1415 (2016) doi: 10.1007/978-1-4939-3572-7_8
Identifying Protein Domains with the Pfam Database P. Coggill, R.D. Finn, A. Bateman Current Protocols in Bioinformatics Chapter 2, Unit 2.5 (2008) doi: 10.1002/0471250953.bi0205s23
Pfam: a domain-centric method for analysing proteins and proteomes J. Mistry and R.D. Finn Comparative Genomics. Methods in Molecular Biology vol 396 (2007) doi: 10.1007/978-1-59745-515-2_4
Pfam: the protein families database R.D. Finn (eds M.J. Dunn, L.B. Jorde, P.F.R. Little, S. Subramaniam) Genetics, Genomics, Proteomics and Bioinformatics, Section 6: Protein Families (2005) doi: 10.1002/047001153X.g306303
Identifying protein domains with the Pfam database R.D. Finn, A. Bateman and S. Griffiths-Jones Current Protocols in Bioinformatics (2003) doi: 10.1002/0471250953.bi0205s01
Pfam Annotation in Wikipedia¶
Pfam encourages the annotation of Pfam entries via Wikipedia. Below the traditional description of the Pfam entry, you may find the text from a Wikipedia article that we feel provides a good description of the Pfam entry.
Wikipedia content in the website¶
When we build a new Pfam family, we try to find a Wikipedia article that describes the family and provides what we feel to be a valuable annotation for it.
Where a Wikipedia article has been assigned to a family, the Overview tab of the Pfam family page will show the first paragraph of the article together with the image and main table on it, below the traditional Pfam annotation created by curators. Click on the title of the Wikipedia article for the full article to open in a new tab.

The Overview tab of the Pfam entry page displays the associated Wikipedia article when available.¶
Contributing annotations¶
One of the advantages of using Wikipedia to provide our annotations is that any user can now contribute to that annotation text. In many cases, families that do not yet have a Wikipedia article can be assigned an article that already exists. In some cases, however, no suitable article exists, and in that case we would encourage you to consider adding one to Wikipedia yourself.
You can now contribute to the improvement of Pfam annotations in several ways. Besides giving feedback directly to the curators to improve the traditional description, you can improve existing Wikipedia articles linked to Pfam families. In addition, if you come across a family that does not yet have a Wikipedia article assigned to it, we would really like to add one. If you know of an article that would provide a useful description of a family, please let us know via our annotation submission form (click the Add your annotation button on the family page).
Editing Wikipedia articles¶
Before you edit for the first time¶
Wikipedia is a free, online encyclopedia. Although anyone can edit or contribute to an article, Wikipedia has some strong editing guidelines and policies, which promote the Wikipedia standard of style and etiquette. Your edits and contributions are more likely to be accepted (and remain) if they are in accordance with this policy.
You should take a few minutes to view the following pages:
How your contribution will be recorded¶
Anyone can edit a Wikipedia entry. You can do this either as a new user or you can register with Wikipedia and log on. When you click on the “Edit Wikipedia article” button, your browser will direct you to the edit page for this entry in Wikipedia. If you are a registered user and currently logged in, your changes will be recorded under your Wikipedia user name. However, if you are not a registered user or are not logged on, your changes will be logged under your computer’s IP address. This has two main implications. Firstly, as a registered Wikipedia user your edits are more likely to be seen as a valuable contribution (although all edits are open to community scrutiny regardless). Secondly, if you edit under an IP address you may be sharing this IP address with other users. If your IP address has previously been blocked (due to being flagged as a source of ‘vandalism’) your edits will also be blocked. You can find more information on this and creating a user account in Wikipedia.
Does Pfam agree with the content of the Wikipedia entry?¶
Pfam has chosen to link families to Wikipedia articles. In some cases we have created or edited these articles, but in many other cases we have not made any direct contribution to the content of the article. The Wikipedia community does monitor edits to try to ensure that (a) the quality of article annotation increases, and (b) vandalism is very quickly dealt with. However, we would like to emphasise that Pfam does not curate the Wikipedia entries and we cannot guarantee the accuracy of the information on the Wikipedia page.
Contact us¶
If you have problems editing or experience problems with these pages please contact us through the Pfam helpdesk.
Generating graphics¶
We provide different tools to generate graphical representations of the features found within a sequence. There are a variety of different shapes and styles, and each one has a particular meaning. This page gives an in-depth description of the elements of the library from the Nightingale component and the Domain graphic tool.
Domain visualisation using Nightingale¶
The Nightingale component is used throughout the InterPro website to display protein features in the protein sequence viewer. We provide a tool that lets you generate a personalised representation of protein features using Nightingale v4.
In the JavaScript part in the link above, you can edit the sequence and feature variables to display the features for your protein of interest. You can then take a screenshot of the graphical representation generated.
For each component, you can specify the following parameters:
{ // family/single domain
  accession: "PF14826",
  start: 19,
  end: 181,
  color: "blue",
  short_name: "FACT-Spt16_Nlob",
  shape: "roundRectangle"
},
{ // discontinuous domain
  accession: "PF08644",
  locations: [{ fragments: [{ start: 520, end: 616 }, { start: 725, end: 810 }] }],
  color: "#A42ea2",
  short_name: "SPT16",
  shape: "roundRectangle"
}
Recommended shapes:
Family or domain components are rendered as rectangles with curved ends (roundRectangle), while other components are represented by rectangle shapes
Repeat/motif: rectangle
Other sequence motifs (e.g. signal peptides, low complexity regions, coiled-coils and transmembrane regions): rectangle
disulphide bridges: bridge
signal peptide: diamond

Example of a domain visualisation using Nightingale v4.¶
For more information about how to use the Nightingale component, you can have a look at its documentation.
Domain graphics tool¶
The domain graphics tool provides a graphical representation of the features found within a sequence, termed domain graphics. There are a variety of different shapes and styles, and each one has a particular meaning. This page gives an in-depth description of the elements of the Domain graphics library. Please note that we no longer recommend this tool; use the Domain visualisation using Nightingale instead.
The library that generates the images in this page uses a JSON string to describe the domain graphic.
You can generate your own graphics using the domain graphics library available on GitHub.
The sequence¶
The base sequence, undecorated by any domains or features, is represented by a plain grey bar:
{
"length" : "400"
}
The length of the domain graphic that is drawn is proportional to the length of the sequence itself. Any domains or features which are drawn on the sequence are also scaled by the same factor.
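To make the scaling concrete, here is a minimal Python sketch that maps residue coordinates to horizontal positions on an image. The image width, the 1-based inclusive coordinate convention and the function name are assumptions made for this example; the domain graphics library may compute positions differently.

```python
def scale_region(start: int, end: int, seq_length: int, image_width: float) -> tuple:
    """Map 1-based inclusive residue coordinates to (x offset, width) in pixels.

    The same scale factor is applied to the sequence bar and to every
    feature drawn on it, so relative sizes are preserved.
    """
    scale = image_width / seq_length      # pixels per residue
    x = (start - 1) * scale               # left edge of the feature
    width = (end - start + 1) * scale     # feature width
    return x, width


# A 400-residue sequence drawn on a hypothetical 800-pixel-wide image.
print(scale_region(1, 400, seq_length=400, image_width=800))    # whole sequence
print(scale_region(40, 200, seq_length=400, image_width=800))   # one domain
```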
Pfam-A¶
The high quality, curated Pfam-A domains are classified into one of six different types: family, domain, coiled-coil, disordered, repeat and motif (for more details see Summary). These different classification types are rendered slightly differently.
Family/domain¶
It is possible for a sequence to match either the full length of a Pfam HMM (a full length match), or to match a portion of an HMM (a fragment match). The two types of match are rendered differently.
Both family and domain entries are rendered as rectangles with curved ends when the sequence is a full length match. Different types of domain are displayed with different colours. When the domain image is long enough, the domain name is shown within the domain itself. In most cases, you can click on the domains to visit the “family page” for that domain. Moving the mouse over the domain image should also display a tooltip showing the domain name, as well as the start and end positions of the domain.
{
"length" : "400",
"regions" : [
{
"type" : "pfama",
"text" : "Domain",
"colour" : "#9999ff",
"display": "true",
"startStyle" : "curved",
"endStyle" : "curved",
"start" : "40",
"end" : "200",
"aliStart" : "50",
"aliEnd" : "175"
},
{
"type" : "pfama",
"text" : "LongFamilyNamesNotShown",
"colour" : "#399",
"display" : true,
"startStyle" : "straight",
"endStyle" : "straight",
"start" : "210",
"end" : "250",
"aliStart" : "215",
"aliEnd" : "245"
}
]
}
From Pfam 24.0 onwards, Pfam has been generated using HMMER3, which introduces the concept of “envelope coordinates” for a match. Envelope regions are represented in domain graphics as lighter coloured regions. The graphic above shows short envelope regions at the ends of both domains.
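One way to think about the lighter regions is as the parts of the envelope that lie outside the alignment coordinates. The Python sketch below computes those flanks from a region description that uses the start, end, aliStart and aliEnd keys of the JSON above; the function name is hypothetical, and the way the library actually draws envelopes may differ.

```python
def envelope_flanks(region: dict) -> list:
    """Return the (start, end) stretches covered by the envelope only.

    "start"/"end" are taken as envelope coordinates and "aliStart"/"aliEnd"
    as alignment coordinates. Values may be strings, as in the JSON
    description, so they are converted to integers first.
    """
    start, end = int(region["start"]), int(region["end"])
    ali_start = int(region.get("aliStart", start))
    ali_end = int(region.get("aliEnd", end))
    flanks = []
    if ali_start > start:
        flanks.append((start, ali_start - 1))   # N-terminal flank
    if ali_end < end:
        flanks.append((ali_end + 1, end))       # C-terminal flank
    return flanks


# Coordinates taken from the first region of the example above.
region = {"start": "40", "end": "200", "aliStart": "50", "aliEnd": "175"}
print(envelope_flanks(region))
```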
When the sequence does not match the full length of the HMM that models a Pfam entry, matching domain fragments are shown. When a sequence match does not pass through the first position in the HMM, the N-terminal side of the domain graphic is drawn with a jagged edge instead of a curved edge. Similarly, when a sequence match does not pass through the last position of the HMM, the C-terminal side of the domain graphic is drawn with a jagged edge. In some rarer cases, the sequence match may not pass through either of the first or last positions of the HMM, in which case both sides are drawn with jagged edges. Examples of all three cases are shown below.
{
"length" : "400",
"regions" : [
{
"type" : "pfama",
"text" : "PartN",
"colour" : "#9999ff",
"display": "true",
"startStyle" : "jagged",
"endStyle" : "curved",
"start" : "10",
"end" : "110"
},
{
"type" : "pfama",
"text" : "PartN_C",
"colour" : "#399",
"display" : true,
"startStyle" : "jagged",
"endStyle" : "jagged",
"start" : "115",
"end" : "204"
},
{
"type" : "pfama",
"text" : "PartC",
"colour" : "#1fc01f",
"display" : true,
"startStyle" : "curved",
"endStyle" : "jagged",
"start" : "210",
"end" : "350"
}
]
}
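The rule for choosing edge styles can be summarised as a small Python function. This is a sketch under stated assumptions: it takes the first and last matched HMM positions and the HMM length as inputs (named after the modelStart, modelEnd and modelLength keys that appear in the Tooltips example), and it uses straight edges for repeats and motifs as described in the Repeat/motif section. The function itself is not part of the domain graphics library.

```python
def edge_styles(model_start: int, model_end: int, model_length: int,
                entry_type: str = "domain") -> dict:
    """Pick startStyle/endStyle for a Pfam-A match.

    A side is "jagged" when the match does not reach that end of the HMM.
    Otherwise families and domains get "curved" edges, while repeats and
    motifs get "straight" edges.
    """
    full_edge = "curved" if entry_type in ("family", "domain") else "straight"
    return {
        "startStyle": full_edge if model_start == 1 else "jagged",
        "endStyle": full_edge if model_end == model_length else "jagged",
    }


print(edge_styles(1, 307, 307))             # full length match
print(edge_styles(5, 307, 307))             # misses the first HMM position
print(edge_styles(5, 292, 307))             # misses both ends
print(edge_styles(1, 30, 40, "repeat"))     # repeat truncated at the C-terminus
```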
Repeat/motif¶
Repeats and motifs are types of Pfam domain which do not form independently folded units. In order to distinguish them from domains of type family and domain, repeats and motifs are represented by rectangles with straight edges. As for families and domains, partial matches are represented with jagged edges.
{
"length" : "200",
"regions" : [
{
"type" : "pfama",
"text" : "HEAT",
"colour" : "#1fc01f",
"display": "true",
"startStyle" : "straight",
"endStyle" : "straight",
"start" : "2",
"end" : "34"
},
{
"type" : "pfama",
"text" : "HEAT",
"colour" : "#1fc01f",
"display": "true",
"startStyle" : "straight",
"endStyle" : "straight",
"start" : "82",
"end" : "118"
},
{
"type" : "pfama",
"text" : "HEAT",
"colour" : "#1fc01f",
"display": "true",
"startStyle" : "straight",
"endStyle" : "straight",
"start" : "120",
"end" : "155"
},
{
"type" : "pfama",
"text" : "HEAT",
"colour" : "#1fc01f",
"display": "true",
"startStyle" : "straight",
"endStyle" : "straight",
"start" : "159",
"end" : "195"
}
]
}
Discontinuous nested domains¶
Some domains in Pfam are disrupted by the insertion of another domain (or domains) within them. A number of names have been given to this arrangement: discontinuous (referring to the outer domain), inserted or nested (both referring to the inner domain). For example, in many sequences containing an IMPDH domain (PF00478), the IMPDH domain is continuous along the primary sequence. However, in some cases the linear sequence of the IMPDH domain is broken by the insertion of a CBS domain (PF00571), as shown below.
Where three-dimensional structures are available for representatives of a Pfam domain, it is generally clear that the three-dimensional arrangement of the domain containing the nested domain is maintained. Typically the nested domain is found inserted within a surface exposed loop, having little or no effect on the structure of the other domain. Such an arrangement explains why and how these nested domains can be functionally tolerated.
To represent this arrangement of domains graphically, the discontinuous domain is represented in two parts (as shown below). These two parts are joined by a line bridging them.
{
"length" : "200",
"regions" : [
{
"type" : "pfama",
"text" : "IMPDH",
"colour" : "#1fc01f",
"display": "true",
"startStyle" : "curved",
"endStyle" : "jagged",
"start" : "5",
"end" : "80"
},
{
"type" : "pfama",
"text" : "CBS",
"colour" : "#c00f0f",
"display": "true",
"startStyle" : "curved",
"endStyle" : "curved",
"start" : "81",
"end" : "135"
},
{
"type" : "pfama",
"text" : "IMPDH",
"colour" : "#1fc01f",
"display": "true",
"startStyle" : "jagged",
"endStyle" : "curved",
"start" : "136",
"end" : "197"
}
],
"markups" : [
{
"type" : "Nested",
"colour" : "#000000",
"display" : true,
"v_align" : "top",
"start" : "76",
"end" : "136"
}
]
}
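The two-part representation above can be generated from a list of fragments. The Python sketch below builds the region entries for a discontinuous domain, using curved outer edges and jagged inner edges as in the IMPDH example; the function name is hypothetical, and the bridging "Nested" markup is deliberately left out because its exact coordinates depend on how the library draws the line.

```python
import json


def discontinuous_regions(text: str, colour: str, fragments: list) -> list:
    """Build one "pfama" region per fragment of a discontinuous domain.

    The first fragment keeps a curved start, the last keeps a curved end,
    and every edge facing an inserted domain is drawn jagged.
    """
    regions = []
    last = len(fragments) - 1
    for i, (start, end) in enumerate(fragments):
        regions.append({
            "type": "pfama",
            "text": text,
            "colour": colour,
            "display": True,
            "startStyle": "curved" if i == 0 else "jagged",
            "endStyle": "curved" if i == last else "jagged",
            "start": str(start),
            "end": str(end),
        })
    return regions


# Fragment coordinates taken from the IMPDH example above.
regions = discontinuous_regions("IMPDH", "#1fc01f", [(5, 80), (136, 197)])
print(json.dumps({"length": "200", "regions": regions}, indent=2))
```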
Other sequence motifs¶
In addition to domains, smaller sequence motifs are represented by the domain graphics. Currently the following motifs are represented: signal peptides, low complexity regions, coiled-coils and transmembrane regions. These usually take lower priority than other regions that are drawn and they are therefore often obscured by, for example, a Pfam-A graphic being drawn over the top of them. An example of each motif is shown below.
{
"length" : "200",
"motifs" : [
{
"type" : "sig_p",
"colour" : "#ff9c00",
"display" : true,
"start" : 1,
"end" : 27
},
{
"type" : "low_complexity",
"colour" : "#0FF",
"display" : true,
"start" : 39,
"end" : 47
},
{
"type" : "low_complexity",
"colour" : "#0FF",
"display" : true,
"start" : 67,
"end" : 76
},
{
"type" : "coiled_coil",
"colour" : "#9cff00",
"display" : true,
"start" : 103,
"end" : 123
},
{
"type" : "transmembrane",
"colour" : "#F00",
"display" : true,
"start" : 155,
"end" : 175
},
{
"type" : "transmembrane",
"colour" : "#F00",
"display" : true,
"start" : 180,
"end" : 195
}
]
}
Signal peptides¶
Signal peptides are short regions (<60 residues long) found at the N-terminus of proteins, which direct the post-translational transport of a protein and are subsequently removed by peptidases. More specifically, a signal peptide is characterised by a short hydrophobic helix (approximately 7-15 residues). This helix is preceded by a slightly positively charged region of highly variable length (approximately 1-12 residues). Between the hydrophobic helix and the cleavage site is a somewhat polar and uncharged region of between 3 and 8 amino acids. In InterPro, we use Phobius and SignalP for the prediction of signal peptides, and they can be represented graphically by a small orange box.
Low complexity regions¶
Low complexity regions are regions of biased sequence composition, usually comprised of different types of repeats. These regions have been shown to be functionally important in some proteins, but they are generally not well understood and are masked out to focus on globular domains within the protein.
The presence of a low complexity region can be indicated by a cyan rectangle.
Disordered regions¶
We use MobiDB-lite for the prediction of disordered regions in the query sequence.
Coiled-coils¶
Coiled coils are motifs found in proteins that structurally form alpha-helices that wrap or wind around each other. Normally, two to three helices are involved, but cases of up to seven alpha-helices have been reported. Coiled-coils are found in a wide variety of proteins, many functionally very important. In InterPro they are obtained from COILS.
Coiled-coils can be represented by a small lime-green rectangle.
Transmembrane regions¶
Integral membrane proteins contain one or more transmembrane regions that are comprised of an alpha-helix that passes through or “spans” a membrane. Transmembrane helices are quite variable in length, with the average being about 20 amino-acids in length. Phobius and TMHMM are used for the annotation of transmembrane regions, which can be represented by a red rectangle.
Other Sequence features¶
Below is a demonstration of how disulphide bridges and active site residues can be represented. Each of these features can appear above or below the sequence, but in the example below the disulphide bridges are shown above the sequence and the active site residues below it.
{
"length" : "400",
"regions" : [
{
"colour" : "#1fc01f",
"endStyle" : "curved",
"startStyle" : "curved",
"display" : true,
"end" : "104",
"href" : "/family/Inhibitor_I29",
"text" : "Inhibitor_I29",
"metadata" : {
"scoreName" : "e-value",
"score" : "1.3e-38",
"description" : "Inhibitor_I29",
"accession" : "PF08246",
"end" : "104",
"database" : "pfam",
"identifier" : "Inhibitor_I29",
"type" : "Domain",
"start" : "48"
},
"type" : "pfama",
"start" : "48"
},
{
"colour" : "#c00f0f",
"endStyle" : "curved",
"startStyle" : "curved",
"display" : true,
"end" : "343",
"href" : "/family/Peptidase_C1",
"text" : "Peptidase_C1",
"modelLength" : "307",
"metadata" : {
"scoreName" : "e-value",
"score" : "1.3e-38",
"description" : "Peptidase_C1",
"accession" : "PF00112",
"end" : "343",
"database" : "pfam",
"identifier" : "Peptidase_C1",
"type" : "Domain",
"start" : "134"
},
"type" : "pfama",
"start" : "134"
}
],
"markups" : [
{
"lineColour" : "#CCC",
"colour" : "#CCC",
"display" : true,
"end" : "196",
"v_align" : "top",
"metadata" : {
"database" : "pfam",
"type" : "Disulphide, 155-196",
"end" : "196",
"start" : "155"
},
"type" : "Disulphide",
"start" : "155"
},
{
"lineColour" : "#CCC",
"colour" : "#CCC",
"display" : true,
"end" : "228",
"v_align" : "top",
"metadata" : {
"database" : "pfam",
"type" : "Disulphide, 189-228",
"end" : "228",
"start" : "189"
},
"type" : "Disulphide",
"start" : "189"
},
{
"lineColour" : "#CCC",
"colour" : "#CCC",
"display" : true,
"end" : "333",
"v_align" : "top",
"metadata" : {
"database" : "pfam",
"type" : "Disulphide, 286-333",
"end" : "333",
"start" : "286"
},
"type" : "Disulphide",
"start" : "286"
},
{
"lineColour" : "#000",
"colour" : "#F36",
"display" : true,
"residue" : "C",
"headStyle" : "diamond",
"v_align" : "bottom",
"type" : "Active site",
"metadata" : {
"database" : "pfam",
"description" : "Active site, C158",
"start" : "158"
},
"start" : "158"
},
{
"lineColour" : "#000",
"colour" : "#90C",
"display" : true,
"residue" : "H",
"headStyle" : "diamond",
"v_align" : "bottom",
"type" : "Pfam predicted active site, H292",
"metadata" : {
"database" : "pfam",
"description" : "Pfam predicted active site, H292",
"start" : "292"
},
"start" : "292"
},
{
"lineColour" : "#000",
"colour" : "#F6F",
"display" : true,
"residue" : "N",
"headStyle" : "diamond",
"v_align" : "bottom",
"type" : "Pfam predicted active site, N308",
"metadata" : {
"database" : "pfam",
"description" : "Pfam predicted active site, N308",
"start" : "308"
},
"start" : "308"
}
],
"motifs" : [
{
"colour" : "#ff9c00",
"metadata" : {
"database" : "seq",
"type" : "Signal peptide",
"end" : "26",
"start" : "1"
},
"type" : "sig_p",
"display" : true,
"end" : 26,
"start" : 1
}
]
}
Disulphide bridges¶
Disulphide bridges play a fundamental role in the folding and stability of some proteins. They are formed by covalent bonding between the thiol groups from two cysteine residues. The disulphide bridge annotations can be represented by a solid bridge-shaped line. When multiple disulphide bonds occur, the heights of the bridges are adjusted to avoid overlaps between them. Inter-protein disulphides are represented by single vertical lines. Moving the mouse over the “bridge graphic” shows the details of the bond in a tooltip.
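As an illustration of how bridge heights could be chosen to avoid overlaps, the Python sketch below assigns each bridge to the lowest level on which it does not collide with an earlier bridge. This is one simple greedy approach, not necessarily the algorithm used by the domain graphics library, and the function name is invented for this example.

```python
def assign_bridge_levels(bridges: list) -> list:
    """Assign a drawing level (0 = lowest) to each (start, end) bridge.

    Bridges are processed from left to right. Each one is placed on the
    lowest level whose most recent bridge ends before this one starts,
    and a new level is opened when every existing level is occupied.
    """
    order = sorted(range(len(bridges)), key=lambda i: bridges[i])
    level_ends = []                 # end position of the last bridge on each level
    levels = [0] * len(bridges)
    for i in order:
        start, end = bridges[i]
        for level, last_end in enumerate(level_ends):
            if start > last_end:    # no overlap on this level
                level_ends[level] = end
                levels[i] = level
                break
        else:                       # overlaps on every level so far
            level_ends.append(end)
            levels[i] = len(level_ends) - 1
    return levels


# Disulphide positions taken from the example above.
print(assign_bridge_levels([(155, 196), (189, 228), (286, 333)]))
```

In this example the second bridge overlaps the first and is raised by one level, while the third starts after the first has ended and can reuse the lowest level.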
Active site residues¶
Within an enzyme, a small number of residues are directly involved in catalysis of a reaction. These are termed active site residues. Within Pfam there are three categories of active site: those that are experimentally determined, those that are predicted by UniProt and those predicted by Pfam. All three types can be represented by a “lollipop” with a diamond head. The head is coloured red, pink and purple for each of the three types respectively.
“Lollipops”¶
A wide range of different lollipop styles can be created by combining different line and head colours with different drawing styles. The lollipop head can be drawn as a square, circle or diamond, as a simple coloured bar, or as an arrow (pointing away from the sequence) or a “pointer” (an arrow pointing towards the sequence).
{
"length" : "200",
"markups" : [
{
"lineColour" : "#666",
"colour" : "#F36",
"display" : true,
"v_align" : "top",
"headStyle" : "square",
"type" : "Red square, above sequence",
"start" : "20"
},
{
"lineColour" : "#F00",
"colour" : "#F0F",
"display" : true,
"v_align" : "bottom",
"headStyle" : "square",
"type" : "Purple square, red line, below sequence",
"start" : "40"
},
{
"lineColour" : "#666",
"colour" : "#F00",
"display" : true,
"v_align" : "top",
"headStyle" : "diamond",
"type" : "Red diamond, above sequence",
"start" : "60"
},
{
"lineColour" : "#666",
"colour" : "#0F0",
"display" : true,
"v_align" : "bottom",
"headStyle" : "circle",
"type" : "Green circle, below sequence",
"start" : "80"
},
{
"lineColour" : "#666",
"colour" : "#0F0",
"display" : true,
"v_align" : "top",
"headStyle" : "arrow",
"type" : "Green arrow, above sequence",
"start" : "100"
},
{
"lineColour" : "#666",
"colour" : "#08F",
"display" : true,
"v_align" : "bottom",
"headStyle" : "pointer",
"type" : "Blue pointer, below sequence",
"start" : "120"
},
{
"lineColour" : "#666",
"colour" : "#F80",
"display" : true,
"v_align" : "top",
"headStyle" : "line",
"type" : "Orange line, above sequence",
"start" : "140"
}
]
}
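A description like the one above can also be generated programmatically. The following is a minimal sketch: `make_lollipop` is a hypothetical helper, and the set of head styles is simply the set of values that appear in the example, not a confirmed exhaustive list.

```python
import json

# Head styles that appear in the example above
HEAD_STYLES = {"square", "circle", "diamond", "arrow", "pointer", "line"}


def make_lollipop(start, head_style, colour, label, line_colour="#666", above=True):
    """Build one entry for the "markups" list (hypothetical helper)."""
    if head_style not in HEAD_STYLES:
        raise ValueError(f"unknown headStyle: {head_style}")
    return {
        "lineColour": line_colour,
        "colour": colour,
        "display": True,
        "v_align": "top" if above else "bottom",
        "headStyle": head_style,
        "type": label,
        "start": str(start),  # the example stores positions as strings
    }


description = {
    "length": "200",
    "markups": [
        make_lollipop(20, "square", "#F36", "Red square, above sequence"),
        make_lollipop(80, "circle", "#0F0", "Green circle, below sequence", above=False),
    ],
}
print(json.dumps(description, indent=2))
```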
Tooltips¶
If appropriate metadata are present in the sequence description, the domain graphics library can also add tooltips to the image. The example below shows a domain graphic and its description includes the necessary metadata for generating tooltips.
{
"length" : "950",
"regions" : [
{
"modelStart" : "5",
"modelEnd" : "292",
"colour" : "#2dcf00",
"endStyle" : "jagged",
"startStyle" : "jagged",
"display" : true,
"end" : "361",
"aliEnd" : "361",
"href" : "/family/PF00082",
"text" : "Peptidase_S8",
"modelLength" : "307",
"metadata" : {
"scoreName" : "e-value",
"score" : "1.3e-38",
"description" : "Subtilase family",
"accession" : "PF00082",
"end" : "587",
"database" : "pfam",
"aliEnd" : "573",
"identifier" : "Peptidase_S8",
"type" : "Domain",
"aliStart" : "163",
"start" : "159"
},
"type" : "pfama",
"aliStart" : "163",
"start" : "159"
},
{
"modelStart" : "5",
"modelEnd" : "292",
"colour" : "#2dcf00",
"endStyle" : "jagged",
"startStyle" : "jagged",
"display" : true,
"end" : "587",
"aliEnd" : "573",
"href" : "/family/PF00082",
"text" : "Peptidase_S8",
"modelLength" : "307",
"metadata" : {
"scoreName" : "e-value",
"score" : "1.3e-38",
"description" : "Subtilase family",
"accession" : "PF00082",
"end" : "587",
"database" : "pfam",
"aliEnd" : "573",
"identifier" : "Peptidase_S8",
"type" : "Domain",
"aliStart" : "163",
"start" : "159"
},
"type" : "pfama",
"aliStart" : "470",
"start" : "470"
},
{
"modelStart" : "12",
"modelEnd" : "100",
"colour" : "#ff5353",
"endStyle" : "curved",
"startStyle" : "jagged",
"display" : true,
"end" : "469",
"aliEnd" : "469",
"href" : "/family/PF02225",
"text" : "PA",
"modelLength" : "100",
"metadata" : {
"scoreName" : "e-value",
"score" : "7.1e-09",
"description" : "PA domain",
"accession" : "PF02225",
"end" : "469",
"database" : "pfam",
"aliEnd" : "469",
"identifier" : "PA",
"type" : "Family",
"aliStart" : "385",
"start" : "362"
},
"type" : "pfama",
"aliStart" : "385",
"start" : "362"
},
{
"modelStart" : "1",
"modelEnd" : "112",
"colour" : "#5b5bff",
"endStyle" : "curved",
"startStyle" : "curved",
"display" : true,
"end" : "726",
"aliEnd" : "726",
"href" : "/family/PF06280",
"text" : "DUF1034",
"modelLength" : "112",
"metadata" : {
"scoreName" : "e-value",
"score" : "1.1e-13",
"description" : "Fn3-like domain (DUF1034)",
"accession" : "PF06280",
"end" : "726",
"database" : "pfam",
"aliEnd" : "726",
"identifier" : "DUF1034",
"type" : "Domain",
"aliStart" : "613",
"start" : "613"
},
"type" : "pfama",
"aliStart" : "613",
"start" : "613"
}
],
"markups" : [
{
"lineColour" : "#ff0000",
"colour" : "#000000",
"display" : true,
"end" : "470",
"v_align" : "top",
"metadata" : {
"database" : "pfam",
"type" : "Link between discontinuous regions",
"end" : "470",
"start" : "361"
},
"type" : "Nested",
"start" : "361"
},
{
"lineColour" : "#333333",
"colour" : "#e469fe",
"display" : true,
"residue" : "S",
"headStyle" : "diamond",
"v_align" : "top",
"type" : "Pfam predicted active site",
"metadata" : {
"database" : "pfam",
"description" : "S Pfam predicted active site",
"start" : "538"
},
"start" : "538"
},
{
"lineColour" : "#333333",
"colour" : "#e469fe",
"display" : true,
"residue" : "D",
"headStyle" : "diamond",
"v_align" : "top",
"type" : "Pfam predicted active site",
"metadata" : {
"database" : "pfam",
"description" : "D Pfam predicted active site",
"start" : "185"
},
"start" : "185"
},
{
"lineColour" : "#333333",
"colour" : "#e469fe",
"display" : true,
"residue" : "H",
"headStyle" : "diamond",
"v_align" : "top",
"type" : "Pfam predicted active site",
"metadata" : {
"database" : "pfam",
"description" : "H Pfam predicted active site",
"start" : "235"
},
"start" : "235"
}
],
"metadata" : {
"database" : "uniprot",
"identifier" : "Q560V8_CRYNE",
"organism" : "Cryptococcus neoformans (Filobasidiella neoformans)",
"description" : "Putative uncharacterized protein",
"taxid" : "5207",
"accession" : "Q560V8"
},
"motifs" : [
{
"colour" : "#ffa500",
"metadata" : {
"database" : "Phobius",
"type" : "sig_p",
"end" : "23",
"start" : "1"
},
"type" : "sig_p",
"display" : true,
"end" : 23,
"start" : 1
},
{
"colour" : "#00ffff",
"metadata" : {
"database" : "seg",
"type" : "low_complexity",
"score" : "2.5100",
"end" : "21",
"start" : "3"
},
"type" : "low_complexity",
"display" : false,
"end" : 21,
"start" : 3
},
{
"colour" : "#86bcff",
"metadata" : {
"database" : "seg",
"type" : "low_complexity",
"score" : "1.4900",
"end" : "156",
"start" : "134"
},
"type" : "low_complexity",
"display" : true,
"end" : "156",
"start" : "134"
},
{
"colour" : "#00ffff",
"metadata" : {
"database" : "seg",
"type" : "low_complexity",
"score" : "2.0200",
"end" : "187",
"start" : "173"
},
"type" : "low_complexity",
"display" : false,
"end" : "187",
"start" : "173"
},
{
"colour" : "#00ffff",
"metadata" : {
"database" : "seg",
"type" : "low_complexity",
"score" : "2.0800",
"end" : "218",
"start" : "207"
},
"type" : "low_complexity",
"display" : false,
"end" : "218",
"start" : "207"
},
{
"colour" : "#00ffff",
"metadata" : {
"database" : "seg",
"type" : "low_complexity",
"score" : "2.1300",
"end" : "231",
"start" : "220"
},
"type" : "low_complexity",
"display" : false,
"end" : "231",
"start" : "220"
},
{
"colour" : "#00ffff",
"metadata" : {
"database" : "seg",
"type" : "low_complexity",
"score" : "2.0000",
"end" : "554",
"start" : "538"
},
"type" : "low_complexity",
"display" : false,
"end" : "554",
"start" : "538"
},
{
"colour" : "#86bcff",
"metadata" : {
"database" : "seg",
"type" : "low_complexity",
"score" : "1.9100",
"end" : "590",
"start" : "578"
},
"type" : "low_complexity",
"display" : true,
"end" : "590",
"start" : 588
},
{
"colour" : "#00ffff",
"metadata" : {
"database" : "seg",
"type" : "low_complexity",
"score" : "1.7600",
"end" : "831",
"start" : "822"
},
"type" : "low_complexity",
"display" : false,
"end" : "831",
"start" : "822"
}
]
}
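To show how the "metadata" blocks above could feed a tooltip, here is a rough sketch that turns one region's metadata into a single line of text. The function name `tooltip_text` and the layout of the resulting string are assumptions; the layout actually used by the domain graphics library is not described here.

```python
def tooltip_text(region):
    """Compose a one-line summary from a region's "metadata" block.

    Hypothetical helper; returns None when the region carries no metadata.
    """
    md = region.get("metadata")
    if not md:
        return None  # without metadata there is nothing to show
    parts = [
        f'{md["identifier"]} ({md["accession"]})',
        md["description"],
        f'{md["start"]}-{md["end"]}',
    ]
    if "score" in md:
        parts.append(f'{md.get("scoreName", "score")} {md["score"]}')
    return ", ".join(parts)


# Metadata copied from the PA region in the example above
region = {
    "text": "PA",
    "metadata": {
        "scoreName": "e-value",
        "score": "7.1e-09",
        "description": "PA domain",
        "accession": "PF02225",
        "end": "469",
        "database": "pfam",
        "identifier": "PA",
        "type": "Family",
        "start": "362",
    },
}
print(tooltip_text(region))  # PA (PF02225), PA domain, 362-469, e-value 7.1e-09
```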
Querying Pfam using the InterPro API¶
This is an introduction to using the InterPro API to retrieve Pfam annotations. A programmatic interface, commonly called an Application Programming Interface (API), allows users to write scripts or programs to access data, rather than having to rely on a browser to view a site.
Basic concepts¶
URLs¶
A RESTful service typically sends and receives data over HTTP, the same protocol that’s used by websites and browsers. As such, the services provided through a RESTful interface are identified using URLs.
In the InterPro website we use a different URL to provide the standard HTML representation of Pfam data and the alternative programmatic JSON format through the API.
To see the data for a particular Pfam-A family, you would visit the following URL in your browser:
To retrieve the data in JSON format, just add an extra path component, api, to the URL:
The response from the server will now be a JSON document, rather than an HTML page.
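The relationship between the two kinds of URL can be sketched in a few lines of Python. This assumes that website pages live under `https://www.ebi.ac.uk/interpro/` and that the API mirrors the same paths under `/interpro/api/`, as the API examples later in this section suggest; `to_api_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def to_api_url(website_url, site_root="/interpro"):
    """Insert the "api" path component after the site root (assumed layout)."""
    parts = urlsplit(website_url)
    if not parts.path.startswith(site_root + "/"):
        raise ValueError(f"not an InterPro URL: {website_url}")
    if parts.path.startswith(site_root + "/api/"):
        return website_url  # already an API URL
    api_path = site_root + "/api" + parts.path[len(site_root):]
    return urlunsplit(parts._replace(path=api_path))


print(to_api_url("https://www.ebi.ac.uk/interpro/entry/pfam/PF02171"))
```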
The table below lists the website URL and the corresponding API URL for each kind of data:
Data | Example website url | Example API url |
---|---|---|
List all Pfam entries | | |
List all Pfam entries of type Family | | |
Information about a specific Pfam entry | | |
List of proteins matching a specific entry | | |
a specific entry | | |
a specific entry | | |
structure from RoseTTAFold | | |
List all Pfam clans | | |
a specific clan | | |
General information about a specific clan | | |
Available output formats¶
By default, the output of an API call is in JSON format. However, we also support Text and TSV formats. To obtain the results in Text or TSV format, add ?format=txt or ?format=tsv to the API URL.
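When an API URL already carries a query string (for example a pagination cursor), the format parameter has to be merged in rather than appended blindly. The sketch below shows one way to do that with the standard library; `with_format` is a hypothetical helper, and the accepted values are the ones named in this section.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_format(api_url, fmt):
    """Return api_url with its "format" query parameter set to fmt."""
    if fmt not in ("json", "txt", "tsv"):
        raise ValueError(f"unsupported format: {fmt}")
    parts = urlsplit(api_url)
    query = dict(parse_qsl(parts.query))  # keep any existing parameters
    query["format"] = fmt
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_format("https://www.ebi.ac.uk/interpro/api/entry/pfam/PF02171", "tsv"))
```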
Examples of API outputs¶
Pfam-A annotations¶
You can retrieve a sub-set of the data in a Pfam-A family page as a JSON document using the following URL: /api/entry/pfam/PF02171
HTTP 200 OK
Allow: GET, HEAD
Content-Type: application/json
InterPro-Version: 94.0
InterPro-Version-Minor: 3
Vary: Accept
{
"metadata": {
"accession": "PF02171",
"entry_id": null,
"type": "family",
"go_terms": null,
"source_database": "pfam",
"member_databases": null,
"integrated": "IPR003165",
"hierarchy": null,
"name": {
"name": "Piwi domain",
"short": "Piwi"
},
"description": [
"<p>This domain is found in the protein Piwi and its relatives. The function of this domain is the dsRNA guided hydrolysis of ssRNA. Determination of the crystal structure of Argonaute reveals that PIWI is an RNase H domain, and identifies Argonaute as Slicer, the enzyme that cleaves mRNA in the RNAi RISC complex [[cite:PUB00020128]]. In addition, Mg+2 dependence and production of 3'-OH and 5' phosphate products are shared characteristics of RNaseH and RISC. The PIWI domain core has a tertiary structure belonging to the RNase H family of enzymes. RNase H fold proteins all have a five-stranded mixed beta-sheet surrounded by helices. By analogy to RNase H enzymes which cleave single-stranded RNA guided by the DNA strand in an RNA/DNA hybrid, the PIWI domain can be inferred to cleave single-stranded RNA, for example mRNA, guided by double stranded siRNA.</p>"
],
"wikipedia": {
"title": "Argonaute",
"extract": "<p>The <b>Argonaute</b> protein family, first discovered for its evolutionarily conserved stem cell function, plays a central role in RNA silencing processes as essential components of the RNA-induced silencing complex (RISC). RISC is responsible for the gene silencing phenomenon known as RNA interference (RNAi). Argonaute proteins bind different classes of small non-coding RNAs, including microRNAs (miRNAs), small interfering RNAs (siRNAs) and Piwi-interacting RNAs (piRNAs). Small RNAs guide Argonaute proteins to their specific targets through sequence complementarity, which then leads to mRNA cleavage, translation inhibition, and/or the initiation of mRNA decay.</p>",
"thumbnail": "iVBORw0KGgoAAAANSUhEUgAAAUAAAAERCAYAAAAKQn74AAAABmJLR0QA/wD/AP+plGSa52gioXAaa3IEVq0YtzeLBALwLopkkEqrtLV1UFV1Bi5X3tc2hQbOd2BxHJxefuXrDIOykhL2rFvHJ+3tlE6fzkljxnD1+..."
},
"literature": {
"PUB00020128": {
"PMID": 15284453,
"ISBN": null,
"volume": "305",
"issue": "5689",
"year": 2004,
"title": "Crystal structure of Argonaute and its implications for RISC slicer activity.",
"URL": null,
"raw_pages": "1434-7",
"medline_journal": "Science",
"ISO_journal": "Science",
"authors": [
"Song JJ",
"Smith SK",
"Hannon GJ",
"Joshua-Tor L."
],
"DOI_URL": "http://dx.doi.org/10.1126/science.1102514"
},
"PUB00018283": {
"PMID": 11050429,
"ISBN": null,
"volume": "25",
"issue": "10",
"year": 2000,
"title": "Domains in gene silencing and cell differentiation proteins: the novel PAZ domain and redefinition of the Piwi domain.",
"URL": null,
"raw_pages": "481-2",
"medline_journal": "Trends Biochem Sci",
"ISO_journal": "Trends Biochem. Sci.",
"authors": [
"Cerutti L",
"Mian N",
"Bateman A."
],
"DOI_URL": "http://dx.doi.org/10.1016/S0968-0004(00)01641-8"
}
},
"set_info": {
"accession": "CL0219",
"name": "RNase_H"
},
"overlaps_with": null,
"counters": {
"subfamilies": 0,
"domain_architectures": 602,
"interactions": 0,
"matches": 29456,
"pathways": 0,
"proteins": 28420,
"proteomes": 2266,
"sets": 1,
"structural_models": {
"alphafold": 21504,
"rosettafold": 0
},
"structures": 112,
"taxa": 9708
},
"entry_annotations": {
"hmm": 0,
"logo": 0,
"alignment:uniprot": 22300,
"alignment:full": 14038,
"alignment:seed": 15
},
"cross_references": {}
}
}
Some Pfam families are removed or merged into others, in which case they become “dead” families. If you try to retrieve annotation information about a dead family, you’ll get a simple JSON document that only tells you that there isn’t any content associated with this entry.
GET /api/entry/pfam/PF06700
HTTP 204 No Content
Allow: GET, HEAD
Content-Type: application/json
InterPro-Version: 94.0
InterPro-Version-Minor: 3
Vary: Accept
{
"detail": "There is no data associated with the requested URL.\nList of endpoints: ['entry', 'pfam', 'PF06700']"
}
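A client can tell a dead family from a live one by checking the status code before reading the body. The following sketch assumes the two responses shown above (200 with a "metadata" object, 204 for a dead family); `parse_entry_response` is a hypothetical helper that works on a status code and body you have already fetched.

```python
import json


def parse_entry_response(status, body):
    """Return the entry's metadata dict, or None for a dead or absent entry."""
    if status == 204:
        return None  # no content: do not try to read entry data
    if status != 200:
        raise RuntimeError(f"unexpected HTTP status {status}")
    return json.loads(body)["metadata"]


live = parse_entry_response(200, '{"metadata": {"accession": "PF02171"}}')
dead = parse_entry_response(204, "")
print(live["accession"], dead)
```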
Pfam-A family list¶
You can retrieve a list of all Pfam-A families in the latest Pfam release, either as a JSON document or as a tab-delimited text file. Both formats contain the Pfam-A accession, Pfam-A identifier and description:
You can also view the list in a web browser by removing the format=json parameter from the URL.
HTTP 200 OK
Allow: GET, HEAD
Content-Type: application/json
InterPro-Version: 94.0
InterPro-Version-Minor: 3
Vary: Accept
{
"count": 19632,
"next": "https://www.ebi.ac.uk/interpro/api/entry/all/pfam/?cursor=cD1QRjAwMDIw",
"previous": null,
"results": [
{
"metadata": {
"accession": "PF00001",
"name": "7 transmembrane receptor (rhodopsin family)",
"source_database": "pfam",
"type": "family",
"integrated": "IPR000276",
"member_databases": null,
"go_terms": null
}
},
...
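The list is paginated: each page carries a "next" URL that is null on the last page. A minimal sketch of walking the pages follows. The `fetch` argument is a hypothetical callable that takes a URL and returns the decoded JSON, which keeps the example runnable without network access; the two pages here are stand-ins, and the second entry's name is a placeholder.

```python
def iter_entries(first_url, fetch):
    """Yield (accession, name) pairs, following "next" links until exhausted."""
    url = first_url
    while url:
        page = fetch(url)
        for result in page["results"]:
            metadata = result["metadata"]
            yield metadata["accession"], metadata["name"]
        url = page.get("next")  # None on the last page ends the loop


# Two fake pages standing in for real API responses
pages = {
    "page1": {
        "next": "page2",
        "results": [
            {"metadata": {"accession": "PF00001",
                          "name": "7 transmembrane receptor (rhodopsin family)"}}
        ],
    },
    "page2": {
        "next": None,
        "results": [
            {"metadata": {"accession": "PF00002", "name": "placeholder name"}}
        ],
    },
}
entries = list(iter_entries("page1", pages.__getitem__))
for accession, name in entries:
    print(accession, name, sep="\t")
```

In a real client, `fetch` would wrap an HTTP request and should pause between pages so as not to overload the server.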
Protein data¶
You can retrieve a sub-set of the data in a protein page as a JSON document using the following URL: /api/protein/uniprot/P00789
HTTP 200 OK
Allow: GET, HEAD
Content-Type: application/json
InterPro-Version: 94.0
InterPro-Version-Minor: 3
Vary: Accept
{
"metadata": {
"accession": "P00789",
"id": "CANX_CHICK",
"source_organism": {
"taxId": "9031",
"scientificName": "Gallus gallus",
"fullName": "Gallus gallus (Chicken)"
},
"name": "Calpain-1 catalytic subunit",
"description": [
"Calcium-regulated non-lysosomal thiol-protease which catalyze limited proteolysis of substrates involved in cytoskeletal remodeling and signal transduction"
],
"length": 705,
"sequence": "MMPFGGIAARLQRDRLRAEGVGEHNNAVKYLNQDYEALKQECIESGTLFRDPQFPAGPTALGFKELGPYSSKTRGVEWKRPSELVDDPQFIVGGATRTDICQGALGDCWLLAAIGSLTLNEELLHRVVPHGQSFQEDYAGIFHFQIWQFGEWVDVVVDDLLPTKDGELLFVHSAECTEFWSALLEKAYAKLNGCYESLSGGSTTEGFEDFTGGVAEMYDLKRAPRNMGHIIRKALERGSLLGCSIDITSAFDMEAVTFKKLVKGHAYSVTAFKDVNYRGQQEQLIRIRNPWGQVEWTGAWSDGSSEWDNIDPSDREELQLKMEDGEFWMSFRDFMREFSRLEICNLTPDALTKDELSRWHTQVFEGTWRRGSTAGGCRNNPATFWINPQFKIKLLEEDDDPGDDEVACSFLVALMQKHRRRERRVGGDMHTIGFAVYEVPEEAQGSQNVHLKKDFFLRNQSRARSETFINLREVSNQIRLPPGEYIVVPSTFEPHKEADFILRVFTEKQSDTAELDEEISADLADEEEITEDDIEDGFKNMFQQLAGEDMEISVFELKTILNRVIARHKDLKTDGFSLDSCRNMVNLMDKDGSARLGLVEFQILWNKIRSWLTIFRQYDLDKSGTMSSYEMRMALESAGFKLNNKLHQVVVARYADAETGVDFDNFVCCLVKLETMFRFFHSMDRDGTGTAVMNLAEWLLLTMCG",
"proteome": "UP000000539",
"gene": null,
"go_terms": [
{
"identifier": "GO:0004198",
"name": "calcium-dependent cysteine-type endopeptidase activity",
"category": {
"code": "F",
"name": "molecular_function"
}
},
{
"identifier": "GO:0006508",
"name": "proteolysis",
"category": {
"code": "P",
"name": "biological_process"
}
},
{
"identifier": "GO:0005509",
"name": "calcium ion binding",
"category": {
"code": "F",
"name": "molecular_function"
}
}
],
"protein_evidence": 1,
"source_database": "reviewed",
"is_fragment": false,
"ida_accession": "664e4b66bad68bfc279e99cc8deefa39a1edf04a",
"counters": {
"domain_architectures": 10280,
"entries": 30,
"isoforms": 0,
"proteomes": 1,
"sets": 5,
"structures": 0,
"taxa": 1,
"dbEntries": {
"prosite": 2,
"panther": 1,
"prints": 1,
"profile": 2,
"smart": 2,
"pfam": 2,
"cathgene3d": 3,
"cdd": 3,
"ssf": 3,
"interpro": 11
},
"proteome": 1,
"taxonomy": 1,
"similar_proteins": 10280
}
}
}
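Once decoded, a protein document like the one above is an ordinary nested dictionary. As an illustration, the sketch below groups the GO terms by category and reads the number of Pfam matches from the counters; `go_terms_by_category` is a hypothetical helper, and the sample dictionary is a trimmed copy of the response shown above.

```python
from collections import defaultdict


def go_terms_by_category(protein):
    """Group GO identifiers by category name; tolerates a null "go_terms"."""
    grouped = defaultdict(list)
    for term in protein["metadata"].get("go_terms") or []:
        grouped[term["category"]["name"]].append(term["identifier"])
    return dict(grouped)


protein = {
    "metadata": {
        "accession": "P00789",
        "go_terms": [
            {"identifier": "GO:0004198",
             "category": {"code": "F", "name": "molecular_function"}},
            {"identifier": "GO:0006508",
             "category": {"code": "P", "name": "biological_process"}},
            {"identifier": "GO:0005509",
             "category": {"code": "F", "name": "molecular_function"}},
        ],
        "counters": {"dbEntries": {"pfam": 2}},
    }
}
grouped = go_terms_by_category(protein)
pfam_matches = protein["metadata"]["counters"]["dbEntries"]["pfam"]
print(grouped, pfam_matches)
```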
Sending requests using a script¶
Most programming languages have the ability to send HTTP requests and receive HTTP responses. A Python script to retrieve data about a Pfam family might be as trivial as this:
#!/usr/bin/env python3
# standard library modules
import json
import ssl
import sys
from time import sleep
from urllib import request
from urllib.error import HTTPError

BASE_URL = "https://www.ebi.ac.uk:443/interpro/api/entry/pfam/PF02171"


def output_list():
    # disable SSL verification to avoid config issues
    context = ssl._create_unverified_context()
    next_url = BASE_URL
    attempts = 0
    while next_url:
        try:
            req = request.Request(next_url, headers={"Accept": "application/json"})
            res = request.urlopen(req, context=context)
            # If the API times out due to a long-running query
            if res.status == 408:
                # wait just over a minute
                sleep(61)
                # then continue this loop with the same URL
                continue
            elif res.status == 204:
                # no data, so leave the loop
                break
            payload = json.loads(res.read().decode())
            # a single-entry response has no "next" key; a paginated list does
            next_url = payload.get("next")
            attempts = 0
        except HTTPError as e:
            if e.code == 408:
                sleep(61)
                continue
            else:
                # For any other HTTP error, retry 3 times before failing
                if attempts < 3:
                    attempts += 1
                    sleep(61)
                    continue
                else:
                    sys.stderr.write("LAST URL: " + next_url + "\n")
                    raise e
        # a single entry is one object; a list response wraps its items in "results"
        for item in payload.get("results", [payload]):
            name = item["metadata"]["name"]
            # "name" is an object with a "short" key for a single entry, a plain string in lists
            sys.stdout.write((name["short"] if isinstance(name, dict) else name) + "\n")
        # Don't overload the server, give it time before asking for more
        if next_url:
            sleep(1)


if __name__ == "__main__":
    output_list()
This script prints out the short name (Piwi) for the family (PF02171).
FTP Site¶
The Pfam FTP site is organised into the following structure:
The most important directory is probably the current_release directory. It contains the flat-files for the current release.
AntiFam¶
The AntiFam directory contains the different releases of the AntiFam database, identifying spurious proteins.
RoseTTAfold_aln¶
The RoseTTAfold_aln directory contains the Pfam alignments used by RoseTTAfold to predict structural models.
Tools¶
The Tools directory contains code for running pfam_scan.pl.
The README file in this directory contains detailed information on how to install and run the script. Note that we have gone for a modular design for the script, enabling the functionality of the script to be easily incorporated into other Perl scripts. The ChangeLog file lists the versions and changes to the current version of pfam_scan.pl (and modules).
There is also an archived version of pfam_scan.pl that works with HMMER2. This is no longer supported.
There is also Perl code for predicting active sites found in the ActSitePred directory, the functionality of which has been rolled into the latest version of pfam_scan.pl.
current_release¶
This directory contains the flat-files for the current release. Some of these files may be very large (of the order of several hundred megabytes). Please check the sizes on the FTP site before trying to download them over a slow connection. The files, most of which are compressed using gzip, are:
- Pfam-A.dead.gz
Listing of families that have been deleted from the database
- Pfam-A.fasta.gz
A 90% non-redundant set of FASTA-formatted sequences for each Pfam-A family. The sequences are only the regions hit by the model and not full-length protein sequences.
- Pfam-A.full.gz
The full alignments of the curated families, searched against pfamseq/UniProtKB reference proteomes (prior to Pfam 29.0, this file contained matches against the whole of UniProtKB).
- Pfam-A.full.uniprot.gz
The full alignments of the curated families, searched against UniProtKB.
- Pfam-A.full.metagenomics.gz
The full alignments of the curated families, searched against Metagenomic proteins.
- Pfam-A.full.ncbi.gz
The full alignments of the curated families, searched against NCBI GenPept proteins.
- Pfam-A.hmm.dat.gz
A data file that contains information about each Pfam-A family
- Pfam-A.hmm.gz
The Pfam HMM library for Pfam-A families
- Pfam-A.seed.gz
The SEED alignments of the curated families. Please note that from Pfam 36.0 onwards we do not process PDB data. Hence secondary structure annotations aren’t available in the SEED alignments anymore. However, PDBe provides mappings to Pfam which might be of interest.
- Pfam-C.gz
A file that contains the information about clans and the Pfam-A membership
- active_site.dat.gz
Tar-ball of data required for the predictions of active sites by Pfam scan.
- database_files
Directory containing two files per table from the MySQL database. The .sql.gz file contains the table structure; the .txt.gz file contains the content of the table as a tab-delimited file with fields enclosed by a single quote (‘).
- diff.gz
Stores the change status of entries between this release and last.
- md5_checksums
A file containing the MD5 checksum for each release file
- metaseq.gz
Metagenomic sequence database used in this release
- ncbi.gz
NCBI GenPept sequence database used in this release.
- pdbmap.gz
Mapping between PDB structures and Pfam domains.
- pfamseq.gz
A fasta version of Pfam’s underlying sequence database
- relnotes.txt
Release notes
- swisspfam.gz
ASCII representation of the domain structure of UniProt proteins according to Pfam
- uniprot_sprot.dat.gz
Data files from UniProt containing SwissProt annotations.
- uniprot_trembl.dat.gz
Data files from UniProt containing TrEMBL annotations.
- userman.txt
File containing information about the flatfile format
- Pfam-A.regions.tsv.gz
A tab separated file containing UniProtKB reference proteome sequences and Pfam-A family information
- Pfam-A.regions.uniprot.tsv.gz
A tab separated file containing UniProtKB sequences and Pfam-A family information
- Pfam-A.clans.tsv.gz
A tab separated file containing Pfam-A family and clan information for all Pfam-A families
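Because some of these files are large, it can be worth checking a download against md5_checksums before using it. The sketch below is written under an assumption: that md5_checksums uses the common md5sum layout of one "digest  filename" pair per line, which is not confirmed by this page. The helper names `md5_of` and `parse_checksums` are hypothetical.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Map file name -> digest, assuming md5sum-style '<digest>  <name>' lines."""
    checksums = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            checksums[fields[1].lstrip("*")] = fields[0].lower()
    return checksums


def is_intact(path, name, checksums):
    """True when the local file's digest matches the published one."""
    return checksums.get(name) == md5_of(path)
```

A typical use would be `is_intact("Pfam-A.hmm.gz", "Pfam-A.hmm.gz", parse_checksums(open("md5_checksums").read()))`, run after both files have been downloaded into the working directory.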
mappings¶
The mappings directory contains the mapping between PDB structures and Pfam entries.
papers¶
The papers directory contains each NAR database issue article describing Pfam. For a detailed description of the latest changes to Pfam, please consult (and cite) these papers.
releases¶
The releases directory contains all the flat files and database dumps (where appropriate) for all versions of Pfam to date. The files in more recent releases are the same as described for the current release, but the contents of older releases may differ.
About Pfam¶
Pfam version 35.0 was produced at the European Bioinformatics Institute using a sequence database called Pfamseq, which is based on UniProt release 2021_03.
Pfam is freely available under the Creative Commons Zero (“CC0”) licence.
Pfam is powered by the HMMER3 package written by Sean Eddy and his group at HHMI/Harvard University, and built by the Xfam consortium.
Pfam is supported by the following organisations:
EMBL is EMBL-EBI’s parent organisation. It provides core funding (staff, space, equipment) for Pfam.
The Wellcome Trust has supported Pfam since the database’s inception, via core funding when based at the Wellcome Trust Sanger Institute. As well as providing and maintaining the campus on which the EMBL-EBI is located, the Wellcome Trust also now provides significant funding for Pfam (grant 221320/Z/20/Z). The current grant runs from October 2020 to September 2025.
BBSRC is supporting Pfam activities (BB/S020381/1) from November 2019 to October 2023 and has previously supported Pfam activities via grants BB/L024136/1 and BB/N00521X/1.
The Howard Hughes Medical Institute supports the Eddy group.
Many organisations have supported Pfam activities in the past.
For more information, please contact the Pfam helpdesk.
Authorship¶
We greatly appreciate the contribution made to Pfam from our user community. To acknowledge these contributions, and allow them to be an integral part of researchers’ profiles, we have incorporated ORCID identifiers, displaying these in the ‘curation and model’ tab of each Pfam entry.
To claim Pfam entries against your ORCID, first go to the EMBL-EBI website and search by putting your ORCID into the search box and selecting ‘Protein Families’ from the drop down.
From the results page, select Pfam on the left-hand side and you should then see a link at the top of the results inviting you to Claim to ORCID. Select all the entries you want to add to your ORCID and click on the button. A pop-up window will appear, inviting you to authenticate in the ORCID website. Once you are logged-in, click on the Claim button.
Team Members¶
The Pfam Consortium¶
Pfam is the product of an international consortium of researchers that was born out of its original development by Erik Sonnhammer, Sean Eddy and Richard Durbin. The current consortium members, their institutes and primary roles are listed below.
European Bioinformatics Institute (EMBL-EBI), UK¶
Alex Bateman - Pfam team leader and head of Protein Sequence resources at EMBL-EBI
Antonina Entcheva Andreeva - Biocurator
Sara Chuguransky - Senior Biocurator
Tiago Grego - Software developer
Beatriz Lazaro Pinto - Biocurator
Typhaine Paysan-Lafosse - Curation Project Leader
Harvard University, USA¶
Sean Eddy - Founding developer and author of HMMER software
Stockholm Bioinformatics Center, Sweden¶
Erik Sonnhammer - Coordinator of Pfam-Sweden and founding developer
External contributors¶
Pfam includes families that have been built by external contributors:
NCBI, USA¶
Lakshminarayan Iyer
L. Aravind
Zhang Dapeng
Vivek Anantharaman
Sanford-Burnham Medical Research Institute, USA¶
Adam Godzik
Previous contributors¶
Gabriel Aldam
Shimelis Assefa
Matthew Bashton
Ewan Birney
Lorenzo Cerutti
Yuanyuan Chang
Jody Clements
Penny Coggill
Lachlan Coin
Robson De Souza
Richard Durbin
Ruth Eberhardt
Sara El-Gebali
Kyle Ellrott
Matthew Fenech
Kristoffer Forslund
O. Luke Gavin
Prasad Gunasekaran
Sam Griffiths-Jones
Kevin Howe
Lukasz Jaroszewski
Nicola Kerrison
Marta Llagostera
Aurélien Luciani
Mhairi Marshall
Nina Mian
William Mifsud
Jaina Mistry
Simon Moxon
Simon Potter
Joanne Pollington
Marco Punta
Matloob Qureshi
Lorna Richardson
Stephen-John Sammut
Luis Sanchez Pulido
Benjamin Schuster-Böckler
David Studholme
John Tate
Benjamin Vella-Briffa
Lowri Williams
Arthur Wuster
Corin Yeats
Pfam is a collaborative venture and we hope to be able to interact with as many people as possible, in order to provide a quality database. Please get in touch with any one of us for more information about Pfam. You can contact us through the Pfam helpdesk.
Contact us¶
Helpdesk¶
We run a helpdesk, which handles annotation comments, data enquiries and general problems with the Pfam database. We use a request tracking system to monitor emails to the helpdesk, so you should receive an automated response to your email, letting you know that the system has logged your mail and notified us of its arrival.
License¶
Pfam is freely available under the Creative Commons Zero (“CC0”) licence.
Citing Pfam¶
If you use Pfam in your work, please consider citing the Pfam References.
Get in touch¶
If you have any questions or feedback, contact us through the Pfam helpdesk.
Social media¶
You can follow @PfamDB on X and InterPro/Pfam on LinkedIn.