Tutorial introduction¶
This is a tutorial written to support the Pangenomics scientific days 2024 Dec. 5 & 6, organized by Infrastucture de Recherche (IR) BioinfOmics INRAE and collaborators. Attendees will learn to create a pangenome graph with different tools and use it for downstream analysis. The expected outcomes of the course are:
- Pangenome graph construction, statistics
- Graph exploration / visualization
- Short-read mapping and genotyping
This tutorial uses the Genotoul bioinfo platform bioinformatics facility with pre-installed tools thanks to the module system simplyfing software version management.
About IT resources and Data¶
As we will run this tutorial on the Genotoul Platform bioinformatics cluster, for simplicity we will run the following commands in interactive mode via slurm job manager. So, connect to the cluster ans ask for a node with 4 cpu and 60Go of RAM.
ssh login@genobioinfo.toulouse.inrae.fr
# replace login by your user name / If you do not have an account ask the organizers for a training account
# ask for a node
srun -c 4 --mem 60G --pty bash
# mv to your work and create directory
cd work
mkdir pangbomics
cd pangbomics
Download data used for this tutorial. More about the data at the dedicated page: Data
wget https://genotoul-bioinfo.pages.mia.inra.fr/pangenome_hackathon/data/ztritici.tar
tar -x -f ztritici.tar
# unzip all genome files
gzip -d ztritici/*.fa.gz
# set the DATADIR path for the tutorial
export DATADIR=$(pwd)/ztritici
echo $DATADIR
Data preparation¶
For clarity, haplotype sequences will be renamed using PanSN.
NB we call "haplotype" each individual genome assembly. We could have two haplotypes for the same diploïd genome.
Briefly, each sequence ID should be named as follow : <genome_name>#<haplotype_id>#<chr_name or ctg_name>
.
for file in ztritici/*.fa; do
genome=$(basename $file .fa | cut -f1 -d'.')
echo $genome
sed -i -e "1 s/^.*$/>$genome#1#chr05/" $file
done
Pangenome Graph Construction¶
Before running graph constructions¶
For the next 2 steps (Minigraph-Cactus and PGGB), computations will take some time !
Please launch the corresponding construction commands, then open a second console while they are running. This will allow you to parse / explore the result files we provide.
BE CAREFUL, make sure that the first session created previously (and assigned to 60Gb of memory) is used to run Minigraph-Cactus and PGGB commands. Use this second session only to explore files, with commands such as more
or less
.
ssh login@genobioinfo.toulouse.inrae.fr
# replace login by your user name / If you do not have an account ask the organizers for a training account
# ask for a node
srun --pty bash
# mv to your work and create directory
cd work/pangbomics
wget https://genotoul-bioinfo.pages.mia.inra.fr/pangenome_hackathon/data/results_graphs.tar
Minigraph-Cactus¶
Minigraph-Cactus
is a Toil workflow that takes the output graph from Minigraph
and adds back small SVs using Cactus
.
Briefly on
Minigraph
: The first genome is used as the reference and serves as the backbone of the graph. Variations in the haplotypes are added incrementally, starting from the closest to the farthest in genomic distance.Minigraph
removes structural variants (SVs) smaller than 50 bp by default, resulting in a simpler graph compared to alternative tools.
First, we create a seqfile
that will tell Minigraph-Cactus where genome sequences are located.
# Create a directory for MC
mkdir MC
# Create the seqfile
for file in $DATADIR/*.fa; do
echo -e "$(basename $file .fa | cut -f1 -d'.').1\t${file}" >> MC/zt.seqfile
done
#$ cat MC/zt.seqfile
IPO323.1 /work/user/<user>/pangbomics/ztritici/IPO323.chr_5.fa
Aus01.1 /work/user/<user>/pangbomics/ztritici/Aus01.chr_5.fa
1A5.1 /work/user/<user>/pangbomics/ztritici/ST99CH_1A5.chr_5.fa
1E4.1 /work/user/<user>/pangbomics/ztritici/ST99CH_1E4.chr_5.fa
3D1.1 /work/user/<user>/pangbomics/ztritici/ST99CH_3D1.chr_5.fa
3D7.1 /work/user/<user>/pangbomics/ztritici/ST99CH_3D7.chr_5.fa
TN09.1 /work/user/<user>/pangbomics/ztritici/TN09.chr_5.fa
YEQ92.1 /work/user/<user>/pangbomics/ztritici/YEQ92.chr_5.fa
Now, we can run Minigraph-Cactus. It will take around 15min:
# load cactus env
module load bioinfo/Cactus/2.7.1
# info
# minimal command is: cactus-pangenome jobstore zt.seqfile --outDir zt_mc --outName zt_mc --reference IPO323
# we will run the command with several options to obtain the graph in different formats with their indexes
cactus-pangenome jobstore MC/zt.seqfile --outDir MC/zt_mc --outName zt_mc --reference IPO323.1 TN09.1 --gfa clip filter full
#cactus-pangenome jobstore_full zt.seqfile --outDir MC/zt_mc_full --outName zt_mc --reference IPO323.1 --filter 0 --clip 0 --gbz --gfa
# decompressing outputs
gzip -d MC/zt_mc/*.full.gfa.gz
# Retrieving graphs
ln MC/zt_mc/zt_mc.full.gfa MC/zt_mc.V1-1.gfa
--filter [0-9]
: create an Allele Frequency graph with nodes and edges supported by the number of haplotypes provided. By default, MC set--filter
to 2 and--clip
to 10000, in the output you will have 3 graphs:- zt_mc.d2.gfa.gz (clipped and filtered graph with node supported by at least dX haplotypes)
- zt_mc.full.gfa.gz (graph with all nodes)
- zt_mc.gfa.gz (clipped graph)
More about cactus, look at the documentation
In our case, we will focus only on the full
graph since it contains complete haplotypes and can be compared to the PGGB graph. However, keep in mind that this is not the default graph, and it undermines the purpose of the Minigraph-Cactus approach.
PGGB¶
The PGGB
approach differs from Minigraph
and Minigraph-Cactus
in that it does not use a reference genome. The entire graph is built through pairwise alignment of all haplotypes using wfmash
, followed by graph induction with seqwish
. In theory, PGGB graphs are expected to be more complete, but this comes with a computational cost.
PGGB requires a single FASTA, compressed with BGzip and indexed using samtools. As we already renamed our sequences according to PanSN, we will simply merge all FASTAs into a single file.
mkdir pggb
# Concatenating FASTAs into a single one
cat $DATADIR/*.fa | bgzip > pggb/ztchr5.fa.gz
# Indexing
module load bioinfo/samtools/1.20
samtools faidx pggb/ztchr5.fa.gz
We can then run PGGB, it will take 13min.
# load pggb
module purge
module load devel/Miniconda/Miniconda3
module load bioinfo/pggb/0.5.4
pggb -i pggb/ztchr5.fa.gz -o pggb/ztchr5 -n 8 -t 4 -p 90 -s 5k -X -v
# Retrieving final graph
mv pggb/ztchr5/*.gfa pggb/zt_pggb.V1-0.gfa
There are 3 key parameters with PGGB :
--p
: Mapping identity. This indicate the alignment tool (wfmash
) that the expected similarity between our genomes is 90%.
--s
: Segment length. This is the window the alignment tool (wfmash
) will use to align sequences. This value should be longer than the expected length of repeats within your genomes.
--X
: Skipping the graph normalization withSmoothXG
The
-n
parameter corresponds to the number of haplotypes we have in our dataset.
Notes about GFA specifications and formats¶
GFA format corresponds to a human-readable (text) representation of variation graphs. Several versions of the specification already exist. Today, v1.0 and v1.1 are the most used versions. Full specification is here: GFA specification
Here is a summary of the main differences between v1.0 and v1.1:
H VN:Z:1.0 # header, showing here version=1.0
S 11 ACCTT # a Segment (node), with ID=11, and sequence=ACCTT
S 12 TCAAGG # carefull, several programs only accepts integer IDs (no letters/symbols)
S 13 CTTGATT
L 11 + 12 - # a Link (edge), that links node 11 (read forward) to node 12 (read as revcomp)
L 12 - 13 +
L 11 + 13 +
P NA12878#1#chr1 11+,12-,13+ # >v1.0 : Path of an haplotype, e.g. a list of node ids and directions
W NA12878 1 chr1 0 18 >11<12>13 # >v1.1 : Walk of an haplotype, contig, mapped read, ...
More complex identifiers and coordinates can be associated.
Whatever if you read the path (P) or walk (W), the resulting path (genome or haplotype) is:
11+ ACCTT
12- CCTTGA # <-- reverse complement !
13+ CTTGATT
NA12878#1#chr1 ACCTTCCTTGACTTGATT
Graph statistics¶
Loading required tools :
module purge
module load bioinfo/vg/1.57.0
module load devel/Miniconda/Miniconda3
module load compilers/gcc/12.2.0
module load bioinfo/odgi/0.8.6_compil
General statistics¶
We will compare both PGGB
and Minigraph-Cactus
graphs general statistics.
Note: You may stumble on an error
warning:[GFAParser] Skipping GFA W line: GFA format error: On pass 1: On line XXX: Duplicate path YYY exists in graph
. It seems that vg complains when parsing the GFA v1.1 format. Yet, the warning does not seem to have any effect.
# MC
vg stats -lz MC/zt_mc.V1-1.gfa
nodes 331950
edges 448014
length 7651968
# PGGB
vg stats -lz pggb/zt_pggb.V1-0.gfa
nodes 155544
edges 209508
length 7500395
The length
corresponds to the total sequence length contained within the graph. Since haplotypes are complete, the shorter the length
, the more compressed the graph is. In our case, PGGB
produced the more compressed version of the two. This can have two implications:
PGGB
was better at identifying corresponding regions between haplotypes, resulting in better compression of the information.PGGB
compressed repeats and created many cycles in the graph.
In reality, it is often a combination of both.
As a sanity check, we can insure our haplotypes are complete in the graph :
# MC
vg paths -x MC/zt_mc.V1-1.gfa -E | sort
Aus01#1#chr05#0 3132306
IPO323#1#chr05 2861803
ST99CH_1A5#1#chr05#0 2700053
ST99CH_1E4#1#chr05#0 2763254
ST99CH_3D1#1#chr05#0 3035485
ST99CH_3D7#1#chr05#0 2781580
TN09#1#chr05 3220060
YEQ92#1#chr05#0 3113700
# PGGB
vg paths -x pggb/zt_pggb.V1-0.gfa -E | sort
Aus01#1#chr05#0 3132306
IPO323#1#chr05#0 2861803
ST99CH_1A5#1#chr05#0 2700053
ST99CH_1E4#1#chr05#0 2763254
ST99CH_3D1#1#chr05#0 3035485
ST99CH_3D7#1#chr05#0 2781580
TN09#1#chr05#0 3220060
YEQ92#1#chr05#0 3113700
# Fasta index from ztchr5.fa
cat pggb/ztchr5.fa.gz.fai | cut -f1-2 | sort
Aus01#1#chr05 3132306
IPO323#1#chr05 2861803
ST99CH_1A5#1#chr05 2700053
ST99CH_1E4#1#chr05 2763254
ST99CH_3D1#1#chr05 3035485
ST99CH_3D7#1#chr05 2781580
TN09#1#chr05 3220060
YEQ92#1#chr05 3113700
Here we can see that PGGB, Minigraph-Cactus and the FASTA index show the same lengths for haplotypes.
Sub-graph extraction¶
We will explore the region of chr5 with a specific cluster of metabolism in TN09. First extract region from provided information in Data page. So we therefore need to find the coordinates of genes jg9858 to jg9867.
zgrep "jg9858" $DATADIR/TN09.chr_5.genes.gff.gz
TN09.chr_5 AUGUSTUS mRNA 2758591 2763545 . + . ID=jg9858.t1;geneID=TN09.jg9858
TN09.chr_5 AUGUSTUS CDS 2758591 2758850 0.86 + 0 Parent=jg9858.t1
...
zgrep "jg9866" $DATADIR/TN09.chr_5.genes.gff.gz
TN09.chr_5 AUGUSTUS mRNA 2810703 2812735 . + . ID=jg9866.t1;geneID=TN09.jg9866
TN09.chr_5 AUGUSTUS CDS 2810703 2810992 0.60 + 0 Parent=jg9866.t1
...
TN09#1#chr05:2756591-2814735
(~60kbp).
We will extract the regions from both graphs.
There are two tool suites that can be used to manipulate graphs. Since there are no established standards, we will provide the commands to extract the region using both tools, but you should only use one during the practice. Each tool has its own binary format, so we will need to convert our graphs into one of these formats.
- VG : .vg
, .pg
, .xg
- Odgi : .og
From Minigraph-Cactus graph¶
We can extract the region from TN09, thanks to --reference
option of MC, which indexes specifically both paths, allowing genomic coordinates based queries.
Note that haplotypes that are not 'reference' can be sparse, meaning the region you are looking for is not directly queriable. In this case, you can map (align) the region's sequence onto the graph and extract the corresponding path.
VG¶
# Extracting the sub-graph
vg find -x MC/zt_mc.V1-1.gfa -p TN09#1#chr05:2756591-2814735 -E > MC/MC.TN09_chr_5_2756591_2814735.pg
# Converting back to GFA
vg convert MC/MC.TN09_chr_5_2756591_2814735.pg -f -W > MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa
Odgi¶
# Converting GFAv1.1 to GFAv1.0
vg convert -g -f -W MC/zt_mc.V1-1.gfa > MC/zt_mc.V1-0.gfa
# Extracting the sub-graph
odgi extract -i MC/zt_mc.V1-0.gfa -t $(nproc) -o MC/MC.TN09_chr_5_2756591_2814735.V1-0.og -r TN09#1#chr05:2756591-2814735 -E -P
# converting back to GFA
odgi view -i MC/MC.TN09_chr_5_2756591_2814735.V1-0.og -g -t $(nproc) -P > MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa
From PGGB graph¶
Compared to Minigraph-Cactus, PGGB retains all haplotype sequences in the final graph. Therefore, if a haplotype is present in the input assemblies used by PGGB, the region can be queried in the graph.
VG¶
# find corresponding nodes
vg find -x pggb/zt_pggb.V1-0.gfa -p TN09#1#chr05#0:2756591-2814735 -E > pggb/pggb.TN09_chr5_2756591_2814735.pg
# convert in gfa for visualization with bandage
vg convert pggb/pggb.TN09_chr5_2756591_2814735.pg -f -W > pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa
Odgi¶
# find corresponding nodes
odgi extract -i pggb/zt_pggb.V1-0.gfa -t $(nproc) -o pggb/pggb.TN09_chr5_2756591_2814735.V1-0.og -r TN09#1#chr05:2756591-2814735 -E -P
# convert in gfa for visualization with bandage
odgi view -i pggb/pggb.TN09_chr5_2756591_2814735.V1-0.og -g -t $(nproc) -P > pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa
Format conversion¶
This conversion will enable to visualize the graphs with SequenceTubeMap.
# Sub-graph conversion to XG file format
vg convert -g pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa -x -t $(nproc) > pggb/pggb.TN09_chr5_2756591_2814735.V1-0.xg
vg convert -g MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa -x -t $(nproc) > MC/MC.TN09_chr_5_2756591_2814735.V1-0.xg
# Whole graph conversion to XG file format
vg convert -g pggb/zt_pggb.V1-0.gfa -x -t $(nproc) > pggb/zt_pggb.V1-0.xg
vg convert -g MC/zt_mc.V1-1.gfa -x -t $(nproc) > MC/zt_mc.V1-1.xg
Visualisation¶
We will visualize some of the graphs that we have produced, as well as the paths for some genomes. Overall, complex variations are hard to visualise, whatever the tool. It is just hard to project graph complexity to 2 dimensions and every software has its pros and cons. But visualisation may remain useful to explore a local and simple variation pattern or to do a figure.
Bandage¶
Bandage (or its fork Bandage-NG) is designed to visualise very large graphs (millions of nodes). You will use it if you need to interact with large/complex chunks of graphs, such as the equivalent of a few megabases of a chromosome. It is not designed specifically for pangenome graphs, but it's compatible with GFA format.
Pros:
- can open very large graphs chunks
- can zoom/unzoom on complex structural variations
- sequence search via blast
- manual path highlight via node lists
- some annotation possible via CSV inputs
Cons:
- path information is not handled per say (you cannot select an haplotype / genome)
- limited in terms of graphical capabilities (no sequence display)
# download and execution
wget https://github.com/rrwick/Bandage/releases/download/v0.9.0/Bandage_Ubuntu-x86-64_v0.9.0_AppImage.zip
unzip Bandage_Ubuntu-x86-64_v0.9.0_AppImage.zip
./Bandage_Ubuntu-x86-64_v0.9.0.AppImage
First, from your LOCAL console, download the full graph and graph chunk that we extracted previously in GFA format.
# in the following commands replace login by your own login
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/zt_pggb.V1-0.gfa ~/
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa ~/
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa ~/
- Load the full PGGB graph (zt_pggb.V1-0.gfa) via menu File -> Load grap -> select the full GFA
- On the top left, click on "More info" to get some general statistics
- Click on "Draw graph", note that you can cancel the laying out when you want, it will just display, just with more overlapping node. Wait 2 minutes at most.
- On the left, set node length to 20, then uniform colors to see more things when zoomed out.
- You can zoom in & out with CTRL+MOUSE_WHEEL
The full graph should look like these 3 large circles. You can immediately notice that 2 regions show a high complexity of structural variations (e.g. where these 3 large circles are joining together). Also, you can notice that structural variations of lower complexity or shorter are actually present all along the genomes (e.g. small hairpins all along the 3 large circles).
We will know focus on the TN09 region and observe potential differences between PGGB and MC. To do so, we need to load the corresponding graph, then we will highlight each haplotype path. Because Bandage does not know how to handle paths from the GFA, we need to manually select the list of nodes of each path, then we will color them.
- As before, load the GFA from MC (MC.TN09_chr5_2756591_2814735.V1-0.gfa)
- Then, list all nodes belonging to 3 of the haplotypes with the following commands.
- For each command, copy-paste the list of nodes displayed in the console to the top right field of Bandage labelled "Search nodes"
- Then, set a color for the selected nodes
# load the gfa file in bandage
# find nodes to highlight
# IPO323 nodes (reference genome)
grep -P "IPO323" MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# TN09 nodes
grep -P "TN09" MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# Aus01 nodes
grep -P "Aus01" MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
Can you see a difference between haplotypes ? Between MC and PGGB ?
IPO323 path¶
TN09 path¶
Aus01 path¶
# load the gfa file in bandage
# find nodes to highlight
# IPO323 nodes (reference genome)
grep -P "IPO323" pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# TN09 nodes
grep -P "TN09" pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# Aus01 nodes
grep -P "Aus01" pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
SequenceTubeMap¶
SequenceTubeMap is a tool still in development. It was built with pangenome graphs in mind. It is a compromise between alignement-like and graph-like views of the pangenome graph. It produces relatively comprehensive figures for regions of low complexity. But be aware that complex regions will not be handled well, and the figure could even lead to false interpretation (figure is correct, but we are not use to read graph projected in 2D space).
Pros:
- alignement-like projection of the graph
- compressed/uncompressed view of large region
- quite "artistic" and eye-catching
Cons:
- still in development (bugs)
- installation can be difficult (web server + environment)
- for large graphs : requires preliminary indexation and registration of indexes in the web server
Go to SequenceTubeMap demo website: https://vgteam.github.io/sequenceTubeMap/
You arrive on the dodemome, which shows by default the result of some reads mapped to a simple graph with 3 SNPs. You can select a few other demos with the top menu "Data".
Let's upload our own graph extraction. We need to retrieve and upload the graph in the "PG" format (it contains the graph and some index).
On your LOCAL machine, open a terminal and download your PG graph to your home.
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/pggb.TN09_chr5_2756591_2814735.V1-0.xg ~/
- In "Data" choosebox, select "custom"
- Click on "configure track" that appeared on the right.
- Click on the "+" that appeared
- Select "Graph", "upload" and select your PG graph that you downloded.
- On the right, click on the grey wheel, set reference paths in reds and others in grays
- Close windows with the small cross on top right corner of the windows
- Click on red button "GO"
- If loaded correctly, you will see a red message saying "Wrong query: missing region"
This tool allow different ways to select a region to display, it can be by node_id or linear coordinates of one selected haplotype. A short manual is available if you click on "?". Some examples are :
- A coordinate range (e.g. "chr1:1-100")
- A node ID range (e.g. "node:100-110")
- A start position and a distance (e.g. "chr1:1+100")
- A node ID anchor and a distance (e.g. "node:100+10")
Find the first node of the GFA file with
grep '^S' pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | head -1 | cut -f 2
This will be your nodeid. Let's look around this one.
- In "region" box, set "node:nodeid+100" to select 100 nodes after the nodeid.
- Click on red button "GO"
- You can zoom/unzoom with the mouse wheel
- By scrolling from the start to the right up to node id (nodeid are displayed by letting mouse their top), you can see just after a giant node, that is unique to 1 haplotype and its length is 51kb! This make the whole visualisation somewhat broken.
- Go below the visualisation pane, in "Visualisation options" and click on "Compressed view". Now nodes are NOT proportionnal to their sequence length, you can have an overview of the whole region. Try to find again the node id 32542, which is our 51kb insertion, you should see this :
GFAviz¶
(OPTIONAL. If you are late, just have a look at the picture below.)
GFAviz seems to not be maintained anymore, but it is relatively useful to produce figures. It is designed with pangenome graphs in mind and it allows a lot of graphical customisation and annotations. But this comes to a cost. Only small graphs can be opened and manipulated (dozen of kilobases / thousands of node at best).
Pros: - lots of graphical customization - allow to highlight paths independantly, node lengths are represented - can be "artistic" and eye-catching if you spend time on it
Cons: - not maintained anymore (?) - restricted to small graphs chunks
# installation you should have done beforehand (debian)
apt-get install qt5-qmake qtbase5-dev qtbase5-dev-tools libqt5svg5-dev libqt5webenginewidgets5 libqt5webchannel5-dev qtwebengine5-dev
git clone https://github.com/ggonnella/gfaviz
cd gfaviz
qmake
make
We will open the TN09 region. On your LOCAL machine, open a terminal and download your GFA to your home.
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa ~/
- Open GFAviz, then menu File -> Open GFA -> select the dowloaded GFA
By default, you see all nodes, all paths, with default colors. Let's select only 3 haplotypes with interesting differences. - Menu Selection -> select all. - On the right -> Style Pane 'P/O/U' -> set width to 0 -> click on graph to update - On the right -> GFA elements -> groups -> select 1st haplotype 'Aus1' with a double click (you should see it as selected in the middle textbox) - On the right -> Style Pane 'P/O/U' -> set width to 2 and choose green color-> click on graph to update (eventually zoom/dezoom to update) - Do the same with 2nd haplotype, but in red and 3rd in Blue.
You can now see more clearly the variants that are specific to these selected haplotypes. Also you can observe that a big segement between 2 loops is not shared by them.
- click on this segment, on top right you can retrieve to which group it belongs (TN09#1#chr05#0) and its sequence (51kb !)
- click on Pane 'S' and display the sequence direction (tickbox) and node_id (Element label -> show) and color it in purple.
You should have this result, that you can export this as a vectorial image for more hand-made annotations. You could also play with more layouts in GFAviz.
Variants analyses¶
Calling from the graph directly¶
Variants can be directly called from the haplotypes contained within the graph using the vg deconstruct
command :
# Calling variants from IPO323 haplotype
## On PGGB graph
vg deconstruct \
-a -e -P IPO323 \
-t $(nproc) -v \
pggb/zt_pggb.V1-0.xg \
> pggb.IPO323.var.vcf
## On MC graph
vg deconstruct \
-a -e -P IPO323 \
-t $(nproc) -v \
MC/zt_mc.V1-1.xg \
> MC.IPO323.var.vcf
Briefly,
-a
indicates that all snarls or bubbles in the graph are considered. The-e
argument restricts VG to only consider path traversals (enabled by default in newer version of VG).
Although Giraffe allows you to directly produce a SAM/BAM file with read alignments projected onto the desired genome, this step can also be done manually from the GAF and is not strictly necessary for the next steps.
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/*.vcf ~/
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/ztchr5.fa* ~/
Load the FASTA file as a genome and open both VCF files as tracks. We can then focus on the region we previously explored visually : IPO323#1#chr05:2,756,591-2,814,735
Short reads mapping onto the graph¶
There are two main tools to map onto a pangenome graph :
- GraphAligner : Long reads
- VG Giraffe : Short reads
In our case, we will map short reads, so we will use giraffe
. GraphAligner is simpler and you can directly map onto the GFA. Giraffe requires the graph to indexed beforehand.
Indexing for Giraffe¶
There are two ways to index the graph : manually or with vg autoindex
. We will use the later but refer to this page for the manual process.
# PGGB graph
vg autoindex \
-p pggb/zt_pggb.V1-0.gfa \
-t $(nproc) \
-g pggb/zt_pggb.V1-0.gfa \
-w giraffe
# MC graph
vg autoindex \
-p MC/zt_mc.V1-1.gfa \
-t $(nproc) \
-g MC/zt_mc.V1-1.gfa \
-w giraffe
The previous commands create three files: - GBZ: a read-only, compressed version of the graph with several index - DIST: a distance index - MIN: a minimizer index
Note that more complex graphs can only be indexed when created with MC, as PGGB produces many loops in repetitive regions, which VG is not suited to handle.
Mapping with Giraffe¶
Currently does not work with vg giraffe 1.48, 1.52, 1.57
# PGGB graph
vg giraffe -t 3 -p \
-Z pggb/zt_pggb.V1-0.gfa.giraffe.gbz \
-d pggb/zt_pggb.V1-0.gfa.dist \
-m pggb/zt_pggb.V1-0.gfa.min \
-o gaf \
-f $DATADIR/SRR16762562.chr5.R1.fastq.gz \
-f $DATADIR/SRR16762562.chr5.R2.fastq.gz \
> pggb/pggb.SRR16762562.gaf
# MC graph
vg giraffe -t 3 -p \
-Z MC/zt_mc.V1-1.gfa.giraffe.gbz \
-d MC/zt_mc.V1-1.gfa.dist \
-m MC/zt_mc.V1-1.gfa.min \
-o gaf \
-f $DATADIR/SRR16762562.chr5.R1.fastq.gz \
-f $DATADIR/SRR16762562.chr5.R2.fastq.gz \
> MC/mc.SRR16762562.gaf
The resulting file is in GAF (Graph Alignment Format) which is specified here.
Briefly, each alignment can be viewed as a path within the corresponding graph. Alignments have not been projected onto any genomes.
Although Giraffe
allows you to directly produce a SAM/BAM file with read alignments projected onto the desired genome, this step can also be done manually from the GAF and is not strictly necessary for the next steps.
To project the graph alignments to a genome alignement, vg surject
can be used :
By default,
vg surject
will project the alignments onto reference genomes only. In our case, we will provide the IPO323 path name to force the projection onto it.
# PGGB graph
vg paths -x pggb/zt_pggb.V1-0.gfa.giraffe.xg -S IPO323 -L > pggb/IPO323.paths.txt
vg surject \
-x pggb/zt_pggb.V1-0.xg \
--interleaved \
-F pggb/IPO323.paths.txt \
-b -N SRR16762562 \
-G pggb/pggb.SRR16762562.gaf | \
samtools sort > pggb/pggb.SRR16762562.IPO323.bam
# MC graph
vg paths -x MC/zt_mc.V1-1.gfa.giraffe.gbz -S IPO323 -L > MC/IPO323.paths.txt
vg surject \
-x MC/zt_mc.V1-1.gfa.giraffe.gbz \
--interleaved \
-F MC/IPO323.paths.txt \
-b -N SRR16762562 \
-G MC/mc.SRR16762562.gaf | \
samtools sort > MC/mc.SRR16762562.IPO323.bam
Visualization¶
# see the mapping
samtools view MC/mc.SRR16762562.bam | more
# specify IPO323
samtools view SRR16762562.IPO323.bam | grep -c -P "IPO323"
2740
# specify TN09
vg paths -x mc.TN09_chr5_2756591_2814735.giraffe.gbz -S TN09 -L > TN09.paths.txt
vg surject -x mc.TN09_chr5_2756591_2814735.giraffe.gbz --interleaved -F TN09.paths.txt -b -N SRR16762562 mc.SRR16762562.gam | samtools sort > SRR16762562.TN09.bam
samtools view SRR16762562.TN09.bam | grep -c -P "TN09"
149686
# SRR16762562 contains more mappings
# visualize alignment on sequence with IGV:
vg paths -x mc.TN09_chr5_2756591_2814735.vg -F -Q IPO323 > IPO323_chr5_2756591_2814735.fasta
vg paths -x mc.TN09_chr5_2756591_2814735.vg -F -Q TN09 > TN09_chr5_2756591_2814735.fasta
samtools index SRR16762562.IPO323.bam
samtools index SRR16762562.TN09.bam
IGV IPO323
IGV TN09
Problem: many bad mapping => mapping on TE (see coverage). We need to perform quality control on mapping. Bias onmapping due to subsample of a specific region.
Mapping on pggb graph¶
# index the gfa
vg autoindex -p pggb.TN09_chr5_2756591_2814735 -w giraffe -g pggb.TN09_chr5_2756591_2814735.gfa
# mapping with giraffe
vg giraffe -t 3 -p -Z pggb.TN09_chr5_2756591_2814735.giraffe.gbz -f $DATADIR/SRR16762562.chr5.R1.fastq.gz -f $DATADIR/SRR16762562.chr5.R2.fastq.gz > mc.SRR16762562.gam
vg giraffe -t 3 -p -Z pggb.TN09_chr5_2756591_2814735.giraffe.gbz -f $DATADIR/SRR16762562.chr5.R1.fastq.gz -f $DATADIR/SRR16762562.chr5.R2.fastq.gz -o BAM > mc.SRR16762562.bam => failed no ref paths in the graph / need to update .gbz with gbz gwbt
add ref path (follow: add ref path)
# validate no ref path
vg paths -M -x pggb.TN09_chr5_2756591_2814735.giraffe.gbz
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
1A5#1#chr5#0 HAPLOTYPE 1A5 1 chr5 0 NO_SUBRANGE
1E4#1#chr5#0 HAPLOTYPE 1E4 1 chr5 0 NO_SUBRANGE
3D1#1#chr5#0 HAPLOTYPE 3D1 1 chr5 0 NO_SUBRANGE
3D7#1#chr5#0 HAPLOTYPE 3D7 1 chr5 0 NO_SUBRANGE
Aus01#1#chr5#0 HAPLOTYPE Aus01 1 chr5 0 NO_SUBRANGE
IPO323#1#chr5#0 HAPLOTYPE IPO323 1 chr5 0 NO_SUBRANGE
TN09#1#chr5#0 HAPLOTYPE TN09 1 chr5 0 NO_SUBRANGE
YEQ92#1#chr5#0 HAPLOTYPE YEQ92 1 chr5 0 NO_SUBRANGE
vg convert -a --ref-sample IPO323 --ref-sample TN09 pggb.TN09_chr5_2756591_2814735.giraffe.gbz > tmp.vg
vg paths -M -x tmp.vg
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
3D1#1#chr5#0 HAPLOTYPE 3D1 1 chr5 0 NO_SUBRANGE
Aus01#1#chr5#0 HAPLOTYPE Aus01 1 chr5 0 NO_SUBRANGE
YEQ92#1#chr5#0 HAPLOTYPE YEQ92 1 chr5 0 NO_SUBRANGE
TN09#1#chr5 REFERENCE TN09 1 chr5 NO_PHASE_BLOCK NO_SUBRANGE
1E4#1#chr5#0 HAPLOTYPE 1E4 1 chr5 0 NO_SUBRANGE
3D7#1#chr5#0 HAPLOTYPE 3D7 1 chr5 0 NO_SUBRANGE
IPO323#1#chr5 REFERENCE IPO323 1 chr5 NO_PHASE_BLOCK NO_SUBRANGE
1A5#1#chr5#0 HAPLOTYPE 1A5 1 chr5 0 NO_SUBRANGE
# reindex in gbz format with reference
vg gbwt -x tmp.vg --index-paths -g pggb.TN09_chr5_2756591_2814735.giraffe.reindex.gbz --gbz-format
# map and export in bam format with new re-indexed graph
vg giraffe -t 3 -p -Z pggb.TN09_chr5_2756591_2814735.giraffe.reindex.gbz -f $DATADIR/SRR16762562.chr5.R1.fastq.gz -f $DATADIR/SRR16762562.chr5.R2.fastq.gz -o BAM > mc.SRR16762562.bam => failed no ref paths in the graph / need to update .gbz with gbz gwbt
Variant calling and genotyping¶
Variant calling from MC¶
vg gamsort mc.SRR16762562.gam > mc.SRR16762562.sorted.gam
vg pack -x mc.TN09_chr5_2756591_2814735.giraffe.gbz -g mc.SRR16762562.sorted.gam -o mc.SRR16762562.pack
vg paths -Mx mc.TN09_chr5_2756591_2814735.giraffe.gbz
vg call mc.TN09_chr5_2756591_2814735.giraffe.gbz -p IPO323#0#chr_5 TN09#0#TN09.chr_5 -k mc.SRR16762562.pack > mc.SRR16762562.vcf
To go further¶
Pan1c View report generated using the same dataset.
References¶
Authors¶
- Most of the content:
- Nicolas Lapalu (nicolas.lapalu[at]inrae.fr)
- Antoine Malet (antoine.malet[at]inrae.fr)
- Additions/corrections from:
- Alexis Mergez (alexis.mergez[at]inrae.fr)
- Benjamin Linard (benjamin.linard[at]inrae.fr)