How to choose the right annotation and sequences

When choosing a genome sequence (usually FASTA-formatted) or a genome annotation (usually GTF-formatted), the content, the source, and the wide range of choices are often an issue.

Here are four big genome sequence and annotation databases:

Gencode
Ensembl
RefSeq
MGI

Gencode and Ensembl are both cutting-edge databases for humans and mice. MGI (Mouse Genome Informatics) is dedicated to mice and also contains a lot of up-to-date information. You can perform any kind of analysis with these databases. Ensembl also covers many more organisms.

RefSeq should be used only when you are working on very well-known genes and do not care about missing up-to-date information. Do not use it for transcriptomics, splicing, or immunology.

Choose your annotations, example with Gencode:

Dedicated to both human and mouse genomes, this database is kept up to date with the latest publications on these two organisms.

If you are working on an exploratory project across the genome, and your project investigator does not yet know which genes or transcripts will interest them, then choose the complete superset:

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.chr_patch_hapl_scaff.annotation.gtf.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh38.p13.genome.fa.gz

If you are working on a project based on known, protein-coding, documented genes, then use the “chromosome only” reference:

# Complete genome sequence
ls /mnt/beegfs/database/bioinfo_structured/Human/hg38/genome/gencode/release_34/GRCh38.p13.genome.fa

# Complete genome annotation
ls /mnt/beegfs/database/bioinfo_structured/Human/hg38/gtf_gff3/gencode/release_34/gencode.v34.annotation.gtf

# Complete transcriptome sequence
ls /mnt/beegfs/database/bioinfo_structured/Human/hg38/transcriptome/gencode/release_34/complete/gencode.v34.transcripts.fa

# Protein coding transcript sequences
ls /mnt/beegfs/database/bioinfo_structured/Human/hg38/transcriptome/gencode/release_34/protein_coding/gencode.v34.pc_transcripts.fa

You could also filter the provided GTF/FASTA files based on TSL (Transcript Support Level) categories or APPRIS scores.
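As a minimal sketch of such filtering, the functions below use awk on an uncompressed GTF. They assume the GENCODE attribute conventions `transcript_support_level "1";` (1 is the best supported level, `NA` means not assessed) and `tag "appris_principal_1";`. The function names are hypothetical and not part of any GENCODE tool.

```shell
# Keep header comments, gene records, and every feature whose
# transcript_support_level is a number at or below the given threshold.
# Features with TSL "NA" or with no TSL attribute are dropped.
# Note: a gene record is kept even if none of its transcripts pass.
# Usage: filter_gtf_by_tsl MAX_TSL FILE.gtf > filtered.gtf
filter_gtf_by_tsl() {
    awk -F'\t' -v max="$1" '
        /^#/ { print; next }               # header comments
        $3 == "gene" { print; next }       # gene records carry no TSL
        match($9, /transcript_support_level "[0-9]+"/) {
            tsl = substr($9, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", tsl)        # keep only the digits
            if (tsl + 0 <= max + 0) print
        }
    ' "$2"
}

# Keep header comments and every feature tagged as an APPRIS principal isoform.
# Gene records have no APPRIS tag, so they are not kept by this filter.
# Usage: filter_gtf_by_appris_principal FILE.gtf > filtered.gtf
filter_gtf_by_appris_principal() {
    awk -F'\t' '/^#/ || $9 ~ /tag "appris_principal/' "$1"
}
```

For example, `filter_gtf_by_tsl 2 gencode.v34.annotation.gtf` would keep transcripts, exons and other features with TSL 1 or 2. If the sequences are also needed, the transcript FASTA has to be subset to the same transcript identifiers separately.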

hg19 / hg38

hg19 was initially released in 2009. hg38 was initially released in 2013.

hg38 is more complete. Unless you need long-term compatibility with old data, stop using hg19.

hg19 / GRCh37

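The divergences are very minor. In the base counts listed here, the only difference is that three bases counted as N in hg19 appear as IUPAC ambiguity codes (two R, one M) in GRCh37.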

Number of bases in hg19 (RefSeq):

{T=58760485, G=38670110, A=58713343, C=38653197, N=3225295}

Number of bases in GRCh37 (Ensembl):

{T=58760485, G=38670110, A=58713343, R=2, C=38653197, M=1, N=3225292}
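Counts like these can be reproduced with a short awk pass over a FASTA file. The sketch below is one possible way to do it, not necessarily the method used to produce the figures above. The function name `count_bases` is hypothetical, and folding soft-masked (lowercase) bases to uppercase is an assumption.

```shell
# Print one "SYMBOL<TAB>COUNT" line per symbol found in the sequence lines
# of an uncompressed FASTA file. Header lines (starting with ">") are skipped.
# Lowercase (soft-masked) bases are folded to uppercase before counting.
# Usage: count_bases FILE.fa
count_bases() {
    awk '
        /^>/ { next }                          # skip sequence headers
        {
            seq = toupper($0)
            sub(/\r$/, "", seq)                # tolerate Windows line endings
            n = length(seq)
            for (i = 1; i <= n; i++) counts[substr(seq, i, 1)]++
        }
        END {
            for (b in counts) printf "%s\t%.0f\n", b, counts[b]
        }
    ' "$1" | LC_ALL=C sort
}
```

Running it on the same chromosome from each source and comparing the two outputs with `diff` would show at a glance whether symbols such as R or M are present in one file and not the other.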

Stick to one annotation, and keep the same one throughout a whole project. Changing annotations mid-project will lead to errors in SNP calling.
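One inexpensive safeguard, sketched below, is to check that the genome FASTA and the GTF used in a project agree on sequence names (for example `chr1` versus `1`). The function name is hypothetical, and the check assumes that the sequence name is the first whitespace-delimited word of each FASTA header.

```shell
# List sequence names used in the GTF that do not exist in the FASTA.
# An empty output means every annotated sequence is present in the genome file.
# Assumes a non-empty, uncompressed FASTA as first argument.
# Usage: gtf_contigs_missing_from_fasta GENOME.fa ANNOTATION.gtf
gtf_contigs_missing_from_fasta() {
    awk -F'\t' '
        NR == FNR {                            # first file: the FASTA
            if (/^>/) {
                name = substr($0, 2)           # drop the leading ">"
                sub(/[ \t].*$/, "", name)      # keep the first word only
                seen[name] = 1
            }
            next
        }
        /^#/ { next }                          # second file: skip GTF comments
        !($1 in seen) && !($1 in reported) {   # report each missing name once
            reported[$1] = 1
            print $1
        }
    ' "$1" "$2"
}
```

Matching names do not prove that both files come from the same release, so it remains worth recording the exact file names and release numbers alongside the results.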