FTP Site

The Pfam FTP site is organised into the following structure:

antifam
rosettafold
tools
current_release
mappings
papers
releases

For most users, the most important directory is current_release, which contains the flat-files for the current release.

AntiFam

The AntiFam directory contains the different releases of the AntiFam database, which is used to identify spurious proteins.

Tools

The Tools directory contains code for running pfam_scan.pl.

The README file in this directory contains detailed information on how to install and run the script. Note that the script has a modular design, so that its functionality can easily be incorporated into other Perl scripts. The ChangeLog file lists the versions of, and changes to, pfam_scan.pl and its modules.

There is also an archived version of pfam_scan.pl that works with HMMER2. This is no longer supported.

The ActSitePred directory contains Perl code for predicting active sites. This functionality has been rolled into the latest version of pfam_scan.pl.

current_release

This directory contains the flat-files for the current release. Some of these files may be very large (of the order of several hundred megabytes). Please check the sizes on the FTP site before trying to download them over a slow connection. The files, most of which are compressed using gzip, are:

Pfam-A.clans.tsv.gz: A tab-separated file containing Pfam-A family and clan information for all Pfam-A families
Pfam-A.dead.gz: Listing of families that have been deleted from the database
Pfam-A.fasta.gz: A 90% non-redundant set of FASTA-formatted sequences for each Pfam-A family. The sequences cover only the regions hit by the model, not the full-length protein sequences.
Pfam-A.full.gz: The full alignments of the curated families, searched against pfamseq/UniProtKB reference proteomes (prior to Pfam 29.0, this file contained matches against the whole of UniProtKB).

Pfam-A.hmm.dat.gz: A data file that contains information about each Pfam-A family
Pfam-A.hmm.gz: The Pfam HMM library for Pfam-A families
Pfam-A.regions.tsv.gz: A tab-separated file containing UniProtKB reference proteome sequences and Pfam-A family information

Pfam-A.seed.gz: The SEED alignments of the curated families. Please note that from Pfam 36.0 onwards we do not process PDB data, so secondary structure annotations are no longer available in the SEED alignments. However, PDBe provides mappings to Pfam which might be of interest.
Pfam-C.gz: A file that contains the information about clans and the Pfam-A membership
active_site.dat.gz: Tar-ball of data required for the prediction of active sites by pfam_scan.pl.

diff.gz: Stores the change status of entries between this release and the previous one.
md5_checksums: A file containing the MD5 checksum for each release file

pfamseq.gz: A FASTA version of Pfam’s underlying sequence database
relnotes.txt: Release notes

uniprot_sprot.dat.gz: Data files from UniProt containing Swiss-Prot annotations.
uniprot_trembl.dat.gz: Data files from UniProt containing TrEMBL annotations.
userman.txt: File containing information about the flat-file format
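
Because several of these files are large, it can be worth checking a download against md5_checksums before using it. The following is a minimal sketch rather than an official Pfam tool. It assumes that md5_checksums uses md5sum-style lines (a hex digest, whitespace, then the file name), which is not stated above, and that the checksum file sits in the same local directory as the downloaded files. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse checksum lines into {filename: digest}.

    Assumption: md5sum-style lines, '<hex digest> <filename>'.
    Lines that do not split into exactly two fields are ignored.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary-mode entries with a leading '*'.
            expected[name.lstrip("*")] = digest.lower()
    return expected


def verify_downloads(directory, checksum_file="md5_checksums"):
    """Compare each listed file found in `directory` with its expected digest.

    Returns {filename: True or False}. Listed files that have not been
    downloaded are skipped rather than reported as failures.
    """
    directory = Path(directory)
    expected = parse_checksums((directory / checksum_file).read_text())
    results = {}
    for name, digest in expected.items():
        target = directory / name
        if target.is_file():
            results[name] = md5_of_file(target) == digest
    return results
```

Calling verify_downloads on the local download directory returns a dictionary in which any file mapped to False should be downloaded again.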

mappings

The mappings directory contains the mapping between PDB structures and Pfam entries.

papers

The papers directory contains each NAR database issue article describing Pfam. For a detailed description of the latest changes to Pfam, please consult (and cite) these papers.

releases

The releases directory contains all the flat files and database dumps (where appropriate) for all versions of Pfam to date. The files in more recent releases are the same as those described for the current release, but the contents of older releases differ.