Skip to content

Tutorial introduction

This is a tutorial written to support the Pangenomics scientific days 2024 Dec. 5 & 6, organized by Infrastucture de Recherche (IR) BioinfOmics INRAE and collaborators. Attendees will learn to create a pangenome graph with different tools and use it for downstream analysis. The expected outcomes of the course are:

  • Pangenome graph construction, statistics
  • Graph exploration / visualisation
  • Short-read mapping and genotyping

This tutorial uses the Genotoul bioinfo platform bioinformatics facility with pre-installed tools thanks to the module system simplyfing software version management.

About IT resources and Data

As we will run this tutorial on the Genotoul Platform bioinformatics cluster, for simplicity we will run the following commands in interactive mode via slurm job manager. So, connect to the cluster ans ask for a node with 4 cpu and 60Go of RAM.

ssh login@genobioinfo.toulouse.inrae.fr
# replace login by your user name / If you do not have an account ask the organizers for a training account
# ask for a node
srun -c 4 --mem 60G --pty bash
# mv to your work and create directory
cd work 
mkdir pangbomics
cd pangbomics

Download data used for this tutorial. More about the data at the dedicated page: Data

wget https://genotoul-bioinfo.pages.mia.inra.fr/pangenome_hackathon/data/ztritici.tar
tar -x -f ztritici.tar
# unzip all genome files
gzip -d ztritici/*.fa.gz
# set the DATADIR path for the tutorial
export DATADIR=$(pwd)/ztritici
echo $DATADIR

Data preparation

For clarity, haplotype sequences will be renamed using PanSN.

NB we call "haplotype" each individual genome assembly. We could have two haplotypes for the same diploïd genome.

Briefly, each sequence ID should be named as follow : <genome_name>#<haplotype_id>#<chr_name or ctg_name>.

for file in ztritici/*.fa; do
    genome=$(basename $file .fa | cut -f1 -d'.')
    echo $genome
    sed -i -e "1 s/^.*$/>$genome#1#chr05/" $file
done

Pangenome Graph Construction

Before running graph constructions

For the next 2 steps (Minigraph-Cactus and PGGB), computations will take some time !

Please launch the corresponding construction commands, then open a second console while they are running. This will allow you to parse / explore the result files we provide.

BE CAREFUL, make sure that the first session created previously (and assigned to 60Gb of memory) is used to run Minigraph-Cactus and PGGB commands. Use this second session only to explore files, with commands such as more or less.

ssh login@genobioinfo.toulouse.inrae.fr
# replace login by your user name / If you do not have an account ask the organizers for a training account
# ask for a node
srun --pty bash

Minigraph-Cactus

Minigraph-Cactus is a Toil workflow that takes the output graph from Minigraph and adds back small SVs using Cactus (if you want to know more about the workflow manager Toil: homepage).

Briefly on Minigraph : The first genome is used as the reference and serves as the backbone of the graph. Then, consecutive genomes will be aligned incrementally, augmenting the graph with their own variations. As a consequence, the order of genome alignement will impact the graph content, and different orders will generate different graphes.

Authors recommend to start from a well-annotated reference, then augement the graph from the closest (in evolutionary terms) genome to the more distant genome. The hope behind this choice is that the most complex variations can be projected in a comprehensive wau to the reference coordinates, for further analyses. In practice, you can give your own phylogenetic tree to guide the order of alignements and graph augmentations.

Note that Minigraph is designed for structural variants (SVs) only graphs, variations smaller than 50 bp will be ignored by default, resulting in a simpler but more scalable graph (comparatively to alternative tools).

All this remains true for Minigraph-Cactus. This extended pipeline will launch minigraph to obtain the SV-only backbone graph, then it will re-align all genomes to this graph with the genome aligner progressive-cactus to augment the graph with all variations, including SNPs and short indels. However, as the minigraph SV graph is the backbone, few new large or complex SV should appear in this last graph augmentation process.

First, we create a seqfile that will tell Minigraph-Cactus where genome sequences are located.

# Create a directory for MC
mkdir MC

# Create the seqfile
for file in $DATADIR/*.fa; do
    echo -e "$(basename $file .fa | cut -f1 -d'.').1\t${file}" >> MC/zt.seqfile
done
Looking at the file, it's simply an index giving the haplotype name and its corresponding FASTA.
#$ cat MC/zt.seqfile
IPO323.1  /work/user/<user>/pangbomics/ztritici/IPO323.chr_5.fa
Aus01.1   /work/user/<user>/pangbomics/ztritici/Aus01.chr_5.fa
1A5.1     /work/user/<user>/pangbomics/ztritici/ST99CH_1A5.chr_5.fa
1E4.1     /work/user/<user>/pangbomics/ztritici/ST99CH_1E4.chr_5.fa
3D1.1     /work/user/<user>/pangbomics/ztritici/ST99CH_3D1.chr_5.fa
3D7.1     /work/user/<user>/pangbomics/ztritici/ST99CH_3D7.chr_5.fa
TN09.1    /work/user/<user>/pangbomics/ztritici/TN09.chr_5.fa
YEQ92.1   /work/user/<user>/pangbomics/ztritici/YEQ92.chr_5.fa

Now, we can run Minigraph-Cactus. It will take around 15min:

# load cactus env
module load bioinfo/Cactus/2.7.1
# info
# minimal command is: cactus-pangenome jobstore zt.seqfile --outDir zt_mc --outName zt_mc --reference IPO323
# we will run the command with several options to obtain the graph in different formats with their indexes

cactus-pangenome jobstore MC/zt.seqfile --outDir MC/zt_mc --outName zt_mc --reference IPO323.1 TN09.1 --gfa clip filter full 

#cactus-pangenome jobstore_full zt.seqfile --outDir MC/zt_mc_full --outName zt_mc --reference IPO323.1 --filter 0 --clip 0 --gbz --gfa

# decompressing outputs
gzip -d MC/zt_mc/*.full.gfa.gz

# Retrieving graphs
ln MC/zt_mc/zt_mc.full.gfa MC/zt_mc.V1-1.gfa
Few words about MC options. MC is a workflow with many optional steps, depending of your input data (genome size, ploidy level, genome complexity, ...). You will find the cactus documentation here.

A good practice is to run with default options and perform tuning if required. Two important options are --filter [0-9] and --clip N. The --filter option creates an Allele Frequency graph with nodes and edges supported by at least the number of haplotypes provided. The --clip N option removes from non-reference haplotypes any sequence longer than N bp that does not align to the underlying minigraph (theoretically, only "parallel" non-aligned sequences are concerned and not mere insertions - see figure below).

mc_clipped

From Hickey et al. 2023

By default, MC set --filter to 2 and --clip to 10000, in the output you will have 3 graphs:

  • zt_mc.d2.gfa.gz (clipped and filtered graph with node supported by at least dX haplotypes)
  • zt_mc.full.gfa.gz (graph with all nodes)
  • zt_mc.gfa.gz (clipped graph)

Authors recommend to use the clipped graph, because it is more scalable to post-analyses AND it retains only the most well-aligned variations (which should avoid false-positives due to alignement errors). In our case, we will focus only on the full graph since it contains complete haplotypes and can be compared to the PGGB graph. However, keep in mind that this is not the default graph used in many studies.

PGGB

The PGGB approach differs from Minigraph and Minigraph-Cactus in that it does not use a reference genome. The entire graph is built through pairwise alignment of all haplotypes using wfmash, followed by graph induction with seqwish. In theory, PGGB graphs are expected to be more complete, but this comes with a computational cost.

PGGB requires a single FASTA, compressed with BGzip and indexed using samtools. As we already renamed our sequences according to PanSN, we will simply merge all FASTAs into a single file.

module load bioinfo/samtools/1.20
mkdir pggb

# Concatenating FASTAs into a single one
cat $DATADIR/*.fa | bgzip > pggb/ztchr5.fa.gz

# Indexing
samtools faidx pggb/ztchr5.fa.gz

We can then run PGGB, it will take 13min.

# load pggb
module purge
module load devel/Miniconda/Miniconda3
module load bioinfo/pggb/0.5.4
pggb -i pggb/ztchr5.fa.gz -o pggb/ztchr5 -n 8 -t 4 -p 90 -s 5k -X -v

# Retrieving final graph
mv pggb/ztchr5/*.gfa pggb/zt_pggb.V1-0.gfa 

There are 3 key parameters with PGGB :
- -p : Mapping identity. This indicate the alignment tool (wfmash) that the expected similarity between our genomes is 90%.
- -s : Segment length. This is the window the alignment tool (wfmash) will use to align sequences. This value should be longer than the expected length of repeats within your genomes.
- -X : Skipping the graph normalization with SmoothXG

The -n parameter corresponds to the number of haplotypes we have in our dataset.

You can notice that we skipped one of the PGGB pipeline step smoothxg. Briefly, PGGB can be decomposed into 3 main steps: 1) pairwise alignment of the assemblies with wfmash, graph construction from the alignments with seqwish and graph normalisation with smoothxg. You could see this normalisation as a topological "simplification", e.g. some variants can be represented by complex and intricate graph bubbles, which could eventually be simplified with some decomposition, while some other regions may be underaligned and need to reshape some variants. More can be found about this complex normalisation in their (recent paper)[https://www.nature.com/articles/s41592-024-02430-3].

Not that because this last smoothxg step involves many ordering and partial realignments, you may obtain different graph output from the same graph input. Graph node ordering in particular involves a stochastic component related to multithreaded gradient descents. Said otherwise the order of execution of the threads, which on most modern architecture cannot be controlled, may results to slightly different graphs normalisation from the same graph input. You can garantee the same output is you assign a single thread to smoothxg, but be ready for many weeks of computation...

Here, we have a relatively simple graph compared to, for instance, some plant pangenome graph. Ending with the seqwish graph output is sufficient. This graph is "complete", in regards to the assemblies alignments. Maybe some variants are just represented with some overintricate but still correct bubbles.

Notes about GFA specifications and formats

GFA format corresponds to a human-readable (text) representation of variation graphs. Several versions of the specification already exist. Today, v1.0 and v1.1 are the most used versions. Full specification is here: GFA specification

Here is a summary of the main differences between v1.0 and v1.1:

H   VN:Z:1.0                                  # header, showing here version=1.0
S   11  ACCTT                                 # a Segment (node), with ID=11, and sequence=ACCTT
S   12  TCAAGG                                # carefull, several programs only accepts integer IDs (no letters/symbols)
S   13  CTTGATT
L   11  +   12  -                             # a Link (edge), that links node 11 (read forward) to node 12 (read as revcomp)
L   12  -   13  +    
L   11  +   13  +   
P   NA12878#1#chr1  11+,12-,13+               # >v1.0 : Path of an haplotype, e.g. a list of node ids and directions
W   NA12878 1   chr1    0   18  >11<12>13     # >v1.1 : Walk of an haplotype, contig, mapped read, ...
                                                        More complex identifiers and coordinates can be associated.

Whatever if you read the path (P) or walk (W), the resulting path (genome or haplotype) is:

11+ ACCTT
12- CCTTGA     # <-- reverse complement !
13+ CTTGATT
NA12878#1#chr1  ACCTTCCTTGACTTGATT

Graph statistics

Loading required tools :

module purge
module load bioinfo/vg/1.57.0
module load devel/Miniconda/Miniconda3
module load compilers/gcc/12.2.0
module load bioinfo/odgi/0.8.6_compil
module load bioinfo/samtools/1.20

General statistics

We will compare both PGGB and Minigraph-Cactus graphs general statistics.

Note: You may stumble on an error warning:[GFAParser] Skipping GFA W line: GFA format error: On pass 1: On line XXX: Duplicate path YYY exists in graph. It seems that vg complains when parsing the GFA v1.1 format. Yet, the warning does not seem to have any effect.

# MC
vg stats -lz MC/zt_mc.V1-1.gfa
nodes   331950
edges   448014
length  7651968

# PGGB
vg stats -lz pggb/zt_pggb.V1-0.gfa
nodes   155544
edges   209508
length  7500395
For MC, you may have slightly different values because as mentionned before for PGGB, there is some stochasticity in graph construction. On PGGB side, as we used the seqwishstep output, you should have the same values.

The length corresponds to the total sequence length contained within the graph. Since haplotypes are complete, the shorter the length, the more compressed the graph is. In our case, PGGB produced the more compressed version of the two. This can have two implications:

  • PGGB was better at identifying corresponding regions between haplotypes, resulting in better compression of the information.
  • PGGB compressed repeats and created many cycles in the graph.

In reality, it is often a combination of both.

As a sanity check, we can insure our haplotypes are complete in the graph :

# MC
vg paths -x MC/zt_mc.V1-1.gfa -E | sort
Aus01#1#chr05#0         3132306
IPO323#1#chr05          2861803
ST99CH_1A5#1#chr05#0    2700053
ST99CH_1E4#1#chr05#0    2763254
ST99CH_3D1#1#chr05#0    3035485
ST99CH_3D7#1#chr05#0    2781580
TN09#1#chr05            3220060
YEQ92#1#chr05#0         3113700

# PGGB
vg paths -x pggb/zt_pggb.V1-0.gfa -E | sort
Aus01#1#chr05#0         3132306
IPO323#1#chr05#0        2861803
ST99CH_1A5#1#chr05#0    2700053
ST99CH_1E4#1#chr05#0    2763254
ST99CH_3D1#1#chr05#0    3035485
ST99CH_3D7#1#chr05#0    2781580
TN09#1#chr05#0          3220060
YEQ92#1#chr05#0         3113700

# Fasta index from ztchr5.fa
cat pggb/ztchr5.fa.gz.fai | cut -f1-2 | sort
Aus01#1#chr05           3132306
IPO323#1#chr05          2861803
ST99CH_1A5#1#chr05      2700053
ST99CH_1E4#1#chr05      2763254
ST99CH_3D1#1#chr05      3035485
ST99CH_3D7#1#chr05      2781580
TN09#1#chr05            3220060
YEQ92#1#chr05           3113700

Here we can see that PGGB, Minigraph-Cactus and the FASTA index show the same lengths for haplotypes.

Sub-graph extraction

We will explore the region of chr5 with a specific cluster of metabolism in TN09. First extract region from provided information in Data page. So we therefore need to find the coordinates of genes jg9858 to jg9867.

zgrep "jg9858" $DATADIR/TN09.chr_5.genes.gff.gz

TN09.chr_5      AUGUSTUS        mRNA    2758591 2763545 .       +       .       ID=jg9858.t1;geneID=TN09.jg9858
TN09.chr_5      AUGUSTUS        CDS     2758591 2758850 0.86    +       0       Parent=jg9858.t1
...

zgrep "jg9866" $DATADIR/TN09.chr_5.genes.gff.gz

TN09.chr_5      AUGUSTUS        mRNA    2810703 2812735 .       +       .       ID=jg9866.t1;geneID=TN09.jg9866
TN09.chr_5      AUGUSTUS        CDS     2810703 2810992 0.60    +       0       Parent=jg9866.t1
...
We unlarge the region by 2kb bases upstream and downstream on TN09 : TN09#1#chr05:2756591-2814735 (~60kbp). We will extract the regions from both graphs.

There are two tool suites that can be used to manipulate graphs. Since there are no established standards, we will provide the commands to extract the region using both tools, but you should only use one during the practice. Each tool has its own binary format, so we will need to convert our graphs into one of these formats. - VG : .vg, .pg, .xg - Odgi : .og

From Minigraph-Cactus graph

We can extract the region from TN09, thanks to --reference option of MC, which indexes specifically both paths, allowing genomic coordinates based queries.

Note that haplotypes that are not 'reference' can be sparse, meaning the region you are looking for is not directly queriable. This is patially due to the fact that some 'not well aligned regions' were filtered (option clip). In this case, you can map (align) the region's sequence onto the graph and extract the corresponding path. This could be achieved by chaining the vg commands index and map (providing the region sequence as parameters -s) and surject (with the alignments file .gam as input parameters -g).
Note that haplotypes that are not 'reference' can be sparse, meaning the region you are looking for is not directly queriable. In this case, you can map (align) the region's sequence onto the graph and extract the corresponding path.

VG
# Extracting the sub-graph
vg find -x MC/zt_mc.V1-1.gfa -p TN09#1#chr05:2756591-2814735 -E > MC/MC.TN09_chr_5_2756591_2814735.pg
# Converting back to GFA
vg convert MC/MC.TN09_chr_5_2756591_2814735.pg -f -W > MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa
Odgi
# Converting GFAv1.1 to GFAv1.0
vg convert -g -f -W MC/zt_mc.V1-1.gfa > MC/zt_mc.V1-0.gfa
# Extracting the sub-graph
odgi extract -i MC/zt_mc.V1-0.gfa -t $(nproc) -o MC/MC.TN09_chr_5_2756591_2814735.V1-0.og -r TN09#1#chr05:2756591-2814735 -E -P
# converting back to GFA
odgi view -i MC/MC.TN09_chr_5_2756591_2814735.V1-0.og -g -t $(nproc) -P > MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa

From PGGB graph

Compared to Minigraph-Cactus, PGGB retains all haplotype sequences in the final graph. Therefore, if a haplotype is present in the input assemblies used by PGGB, the region can be queried in the graph.

VG
# find corresponding nodes 
vg find -x pggb/zt_pggb.V1-0.gfa -p TN09#1#chr05#0:2756591-2814735 -E > pggb/pggb.TN09_chr5_2756591_2814735.pg
# convert in gfa for visualisation with bandage
vg convert pggb/pggb.TN09_chr5_2756591_2814735.pg -f -W > pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa
Odgi
# find corresponding nodes
odgi extract -i pggb/zt_pggb.V1-0.gfa -t $(nproc) -o pggb/pggb.TN09_chr5_2756591_2814735.V1-0.og -r TN09#1#chr05:2756591-2814735 -E -P
# convert in gfa for visualisation with bandage
odgi view -i pggb/pggb.TN09_chr5_2756591_2814735.V1-0.og -g -t $(nproc) -P > pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa

Format conversion

This conversion will enable to visualise the graphs with SequenceTubeMap.

# Sub-graph conversion to XG file format
vg convert -g pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa -x -t $(nproc) > pggb/pggb.TN09_chr5_2756591_2814735.V1-0.xg

vg convert -g MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa -x -t $(nproc) > MC/MC.TN09_chr_5_2756591_2814735.V1-0.xg

# Converting PGGB GFAv1.0 to GFAv1.1
wget "https://forgemia.inra.fr/alexis.mergez/pangetools/-/raw/main/GFAvc.py?ref_type=heads" -O GFAvc.py
awk '{printf("%s\t0\t%s\n", $1, $2)}' pggb/ztchr5.fa.gz.fai > pggb/ztchr5.index
python GFAvc.py -G pggb/zt_pggb.V1-0.gfa -o pggb/zt_pggb.V1-1.gfa -i pggb/ztchr5.index

# Whole graph conversion to XG file format
vg convert -g pggb/zt_pggb.V1-1.gfa -x -t $(nproc) > pggb/zt_pggb.V1-1.xg

vg convert -g MC/zt_mc.V1-1.gfa -x -t $(nproc) > MC/zt_mc.V1-1.xg

Visualisation

We will visualise some of the graphs that we have produced, as well as the paths for some genomes. Overall, complex variations are hard to visualise, whatever the tool. It is just hard to project graph complexity to 2 dimensions and every software has its pros and cons. But visualisation may remain useful to explore a local and simple variation pattern or to do a figure.

All visualisation tools below will have to be launched on you LOCAL machine, not the server. We hope that you installed them beforehand, as in the instructions that were sent to you by email . :)

Bandage

Bandage (or its fork Bandage-NG) is designed to visualise very large graphs (millions of nodes). You will use it if you need to interact with large/complex chunks of graphs, such as the equivalent of a few megabases of a chromosome. It is not designed specifically for pangenome graphs, but it's compatible with GFA format.

Pros:

  • can open very large graphs chunks
  • can zoom/unzoom on complex structural variations
  • sequence search via blast
  • manual path highlight via node lists
  • some annotation possible via CSV inputs

Cons:

  • path information is not handled per say (you cannot select an haplotype / genome)
  • limited in terms of graphical capabilities (no sequence display)

Bandage is not allowed to run on the genotoul cluster, thus you will run it on your own computer.

# download and execution
wget https://github.com/rrwick/Bandage/releases/download/v0.9.0/Bandage_Ubuntu-x86-64_v0.9.0_AppImage.zip
unzip Bandage_Ubuntu-x86-64_v0.9.0_AppImage.zip
./Bandage_Ubuntu-x86-64_v0.9.0.AppImage

First, from your LOCAL console, download the full graph and graph chunk that we extracted previously in GFA format. Create a directory somewhere and go into this directory with your LOCAL console.

# in the following commands replace login by your own login 
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/zt_pggb.V1-0.gfa ./
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa ./
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa ./
  • Load the full PGGB graph (zt_pggb.V1-0.gfa) via menu File -> Load grap -> select the full GFA
  • On the top left, click on "More info" to get some general statistics
  • Click on "Draw graph", note that you can cancel the laying out when you want, it will just display, just with more overlapping node. Wait 2 minutes at most.
  • On the left, set node width to 20, then uniform colors to see more things when zoomed out.
  • You can zoom in & out with CTRL+MOUSE_WHEEL. It will not possible to see the full graph onscreen, you can export it as an image representing the "full pane" to see the same picture as below.

The full graph should look like these 3 large circles. You can immediately notice that 2 regions show a high complexity of structural variations (e.g. where these 3 large circles are joining together). Also, you can notice that structural variations of lower complexity or shorter are actually present all along the genomes (e.g. small hairpins all along the 3 large circles).

pggb_full_graph_bandage

We will know focus on the TN09 region and observe potential differences between PGGB and MC. To do so, we need to load the corresponding graph, then we will highlight each haplotype path. Because Bandage does not know how to handle paths from the GFA, we need to manually select the list of nodes of each path, then we will color them.

  • As before, load the GFA from MC (MC.TN09_chr5_2756591_2814735.V1-0.gfa)
  • Then, list all nodes belonging to 3 of the haplotypes with the following commands.
  • For each command, copy-paste the list of nodes displayed in the console to the top right field of Bandage labelled "Search nodes"
  • Then, set a color for the selected nodes

# load the gfa file in bandage
# find nodes to highlight
# IPO323 nodes (reference genome)
grep -P "IPO323" MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# TN09 nodes
grep -P "TN09" MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# Aus01 nodes 
grep -P "Aus01" MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
Repeat this process for the different haplotypes if you want to obtain all screenshots below.

Can you see a difference between haplotypes ? Between MC and PGGB ? You may see differences, such as :

Number of nodes per path:

  • MC #nodes
    • vg : 633, 526, 902 for IPO323, TN09 and aus01.
    • odgi : 632, 525 and 901 (1 less, which tool had a programming oversight?).
  • PGGB #nodes
  • vg : 228, 365 and 394
  • odgi : 230, 365 and 399

How to explain the difference between vg and odgi? Is one more accurate than the other?

IPO323 path

mc_IPO323

TN09 path

mc_TN09

Aus01 path

mc_Aus01

# load the gfa file in bandage
# find nodes to highlight
# IPO323 nodes (reference genome)
grep -P "IPO323" pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# TN09 nodes
grep -P "TN09" pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# Aus01 nodes 
grep -P "Aus01" pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'

SequenceTubeMap

SequenceTubeMap is a tool still in development. It was built with pangenome graphs in mind. It is a compromise between alignement-like and graph-like views of the pangenome graph. It produces relatively comprehensive figures for regions of low complexity. But be aware that complex regions will not be handled well, and the figure could even lead to false interpretation (figure is correct, but we are not use to read graph projected in 2D space).

Pros:

  • alignement-like projection of the graph
  • compressed/uncompressed view of large region
  • quite "artistic" and eye-catching

Cons:

  • still in development (bugs)
  • installation can be difficult (web server + environment)
  • for large graphs : requires preliminary indexation and registration of indexes in the web server

Go to SequenceTubeMap demo website: https://vgteam.github.io/sequenceTubeMap/

You arrive on the dodemome, which shows by default the result of some reads mapped to a simple graph with 3 SNPs. You can select a few other demos with the top menu "Data".

Let's upload our own graph extraction. We need to retrieve and upload the graph in the "PG" format (it contains the graph and some index).

On your LOCAL machine, open a terminal and download your PG graph to your local working directory.

scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/pggb.TN09_chr5_2756591_2814735.V1-0.xg ./

  • In "Data" choosebox, select "custom"
  • Click on "configure track" that appeared on the right.
  • Click on the "+" that appeared
  • Select "Graph", "upload", "Browse" and select the PG graph (.xg) that you downloaded.
  • On the right, click on the grey wheel, set reference paths in reds and others in grayscale
  • Close windows with the small cross on top right corner of the windows
  • Click on red button "GO"
  • If loaded correctly, you will see a red message saying "Wrong query: missing region"

This tool allow different ways to select a region to display, it can be by node_id or linear coordinates of one selected haplotype. A short manual is available if you click on "?". Some examples are :

  • A coordinate range (e.g. "chr1:1-100")
  • A node ID range (e.g. "node:100-110")
  • A start position and a distance (e.g. "chr1:1+100")
  • A node ID anchor and a distance (e.g. "node:100+10")

Find the first node of the GFA file with

grep '^S' pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | head -1 | cut -f 2

This will be your nodeid. Let's look around this one.

  • In "region" box, set "node:nodeid+100" to select 100 nodes after the nodeid.
  • Click on red button "GO"
  • You can zoom/unzoom with the mouse wheel
  • By scrolling from the start to the right up to node id (nodeid are displayed by letting mouse their top), you can see just after a giant node, that is unique to 1 haplotype and its length is 51kb! This make the whole visualisation somewhat broken.

screenshot sequtubemap 2

  • Go below the visualisation panel, in "Visualisation options" and click on "Compressed view". Now nodes are NOT proportionnal to their sequence length, you can have an overview of the whole region. Try to find again the node id 32542, which is our 51kb insertion, you should see this :

screenshot seqtubemap 1

GFAviz

(OPTIONAL. If you are late or cannot install it, just have a look at the picture below.)

GFAviz seems to not be maintained anymore, but it is relatively useful to produce figures. It is designed with pangenome graphs in mind and it allows a lot of graphical customisation and annotations. But this comes to a cost. Only small graphs can be opened and manipulated (dozen of kilobases / thousands of node at best).

Pros:

  • lots of graphical customization
  • allow to highlight paths independantly, node lengths are represented
  • can be "artistic" and eye-catching if you spend time on it

Cons:

  • not maintained anymore (?)
  • restricted to small graphs chunks
# installation you should have done beforehand (debian)
apt-get install qt5-qmake qtbase5-dev qtbase5-dev-tools libqt5svg5-dev libqt5webenginewidgets5 libqt5webchannel5-dev qtwebengine5-dev
git clone https://github.com/ggonnella/gfaviz
cd gfaviz
qmake
make

Here is what to do on Mac.

Install brew

unset HOMEBREW_BREW_GIT_REMOTE HOMEBREW_CORE_GIT_REMOTE
% /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
ln -s /usr/local/opt/qt@5/bin/qmake /usr/local/bin

Install QT5 and LLVM

brew install qt@5
brew install llvm
CPLUS_INCLUDE_PATH=/usr/local/opt/llvm/include/c++/v1:/Library/Developer/CommandLineTools/SDKs/MacOSX15.sdk/usr/include
Make and compile GFAviz as previously.

We will open the TN09 region. On your LOCAL machine, open a terminal and download your GFA to your home.

scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa ~/

  • Open GFAviz, then menu File -> Open GFA -> select the dowloaded GFA

By default, you see all nodes, all paths, with default colors. Let's select only 3 haplotypes with interesting differences.

  • Menu Selection -> select all.
  • On the right -> Style Pane 'P/O/U' -> set width to 0 -> click on graph to update
  • On the right -> GFA elements -> groups -> select 1st haplotype 'Aus1' with a double click (you should see it as selected in the middle textbox)
  • On the right -> Style Pane 'P/O/U' -> set width to 2 and choose green color-> click on graph to update (eventually zoom/dezoom to update)
  • Do the same with 2nd haplotype, but in red and 3rd in Blue.

You can now see more clearly the variants that are specific to these selected haplotypes. Also you can observe that a big segement between 2 loops is not shared by them.

  • click on this segment, on top right you can retrieve to which group it belongs (TN09#1#chr05#0) and its sequence (51kb !)
  • click on Pane 'S' and display the sequence direction (tickbox) and node_id (Element label -> show) and color it in purple.

You should have this result, that you can export this as a vectorial image for more hand-made annotations. You could also play with more layouts in GFAviz.

screenshot GFAviz

Variant analyses

Calling variants directly from the graph

The graph holds many variations that distinguish the integrated haplotypes and we can extract them as VCF files. By definition, the selection of a reference genome becomes a compulsory step to produce the VCF, because all variants will have to be described relatively to the sequence coordinates of this reference.

In practice, we need to switch from the graph coordinate system to the linear coordinate system of the chosen reference. This operation is named surjection or projection depending on the tool. Because graph construction is not garanteed to keep information from all genome positions (remember the option related to unaligned regions), a correct VCF extraction may be possible only for those genomes where we explicitely asked to keep all positions.

In practice, MC graph garantee to keep all positions from haplotypes set as references at construction time (option --reference). Then, you can produce as many VCF as haplotypes set as references in the graph. With PGGB, because there is no notion of reference compared to non-reference, you should be able to produce all possible VCF. In the litterature using MC, many authors consider their most 'reliable' genome as references (those of highest quality and/or that are well annotated).

You may notice that this is a kind of step back from the philosophy behind pangenome graphs, that aims for no reference-induced biaises. Unfortunately, until the community moves away from the VCFs and create a true reference-coodinates-free format for graph variations, we have to do this step back immediately for any variation extraction.

Here we will extract variants projected to 'IP0323', using the vg deconstruct command :

# Calling variants from IPO323 haplotype
## On PGGB graph
vg deconstruct \
  -a -e \
  -P IPO323 \
  -t $(nproc) -v \
  pggb/zt_pggb.V1-1.xg \
  > pggb/pggb.IPO323.var.vcf

## On MC graph
vg deconstruct \
  -a -e -P IPO323 \
  -t $(nproc) -v \
  MC/zt_mc.V1-1.xg \
  > MC/MC.IPO323.var.vcf

Briefly, -a indicates that all snarls or bubbles (variants) in the graph are considered. By default, nested variants are not considered (ex: SNPs in insertions). The -e argument restricts VG to only consider path traversals (enabled by default in newer version of VG).

scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/*/*.vcf ~/
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/ztchr5.fa* ~/
We can visualise the mapping with a IGV. you should have installed it on your local computer from https://igv.org/doc/desktop/#DownloadPage/.

Then upload the fasta file of yor genome by clicking on 'Genomes/Load Genome' -> 'From File' on the menu, then the VCF by selecting both files in the menu File/Load From File.

Load the FASTA file as a genome and open both VCF files as tracks. We can then focus on the region we previously explored visually : IPO323#1#chr05:2,756,591-2,814,735

screenshot IGV vg deconstruct

The VCF-defined reference is on top, with its coordinate ladder. We zoomed to a region close to the end of the genome, around 2760kb. Other haplotypes are in blue lines. In this screenshot, we can see that the left and center parts of this regions are "relatively" similar (grey+blue zones) and mostly aligned with the selected reference.

However, the right part between 2800kb and 2810kb shows a totally different behaviour between MG and PGGB:

  • In MC graph, for Aus01, ST99CH_307, from coordinate 2800kb, nothing is aligned (white blocks). Haplotypes appear to show deletions when compared to the reference. A large rgion appears specific to YEQ92 at 2800-2805kb. A region of same length, appears in other haplotypes around 2810kb.
  • In PGGB graph, one out of two haplotypes show the same region in 2800-2805kb. The other half shows alignment around in 2801kb.

Overall, this regions is hard to resolve from the point of view of the reference coordinates. We may have for instance, a duplicated region, a region specific to non-reference assemblies or transposons.

But the different behaviour of the different haplotypes in the two tools also hints to a possible artifact of pairwise alignment. It is anyway hard to grasp how real variations may have produced so different variants around 2805kb fro some haplotypes and around 2810kb for others, without a more detailed investigation. This is a reminder that the graph is just a model to encompass whole genome alignments. Where the aligners will fail, the graph will maintain these failures and variant extractions may results to errors. Some authors propose to cross the variants extracted from several methods to build a 'high-quality set' of variants.

Mapping reads to the graph

Currently, the most cited tools to map reads onto a pangenome graph are :

In our case, we will map short reads, so we will use giraffe. GraphAligner can index himself a GFA before the mapping. With Giraffe, we have to index the graph with a dedicated command beforehand.

Indexing for Giraffe

There are several ways to index the graph, we will use the recommended vg autoindex : see page for more information on these index.

# PGGB graph
vg autoindex \
  -p pggb/zt_pggb.V1-1.gfa \
  -t $(nproc) \
  -g pggb/zt_pggb.V1-1.gfa \
  -w giraffe

# MC graph
vg autoindex \
  -p MC/zt_mc.V1-1.gfa \
  -t $(nproc) \
  -g MC/zt_mc.V1-1.gfa \
  -w giraffe

These commands create three files:

  • *.gbz: a read-only, compressed version of the graph with several index included in the archive.
  • *.dist: a distance index
  • *.min: a minimizer index

This indexation can be a long process, in particular with graph that are rich in cycles or complex bubbles. In practice, VG can handle most graphs created with MC (because their topology is less complex). As PGGB produces many loops in repetitive regions and complex topologies (after seqwishstep, at least), it may happen that the VG tool is not able to handle them.

Mapping with Giraffe

Currently does not work with vg giraffe 1.48, 1.52, 1.57

# PGGB graph
vg giraffe -t 3 -p \
  -Z pggb/zt_pggb.V1-1.gfa.giraffe.gbz \
  -d pggb/zt_pggb.V1-1.gfa.dist \
  -m pggb/zt_pggb.V1-1.gfa.min \
  -o gaf \
  -f $DATADIR/SRR16762562.chr5.R1.fastq.gz \
  -f $DATADIR/SRR16762562.chr5.R2.fastq.gz \
  > pggb/pggb.SRR16762562.gaf 

# MC graph
vg giraffe -t 3 -p \
  -Z MC/zt_mc.V1-1.gfa.giraffe.gbz \
  -d MC/zt_mc.V1-1.gfa.dist \
  -m MC/zt_mc.V1-1.gfa.min \
  -o gaf \
  -f $DATADIR/SRR16762562.chr5.R1.fastq.gz \
  -f $DATADIR/SRR16762562.chr5.R2.fastq.gz \
  > MC/mc.SRR16762562.gaf 

The resulting file is in GAF (Graph Alignment Format) and you can find its specification here.

Briefly, each alignment can be viewed as a path within the corresponding graph. Alignments have not been projected onto any genomes. Although Giraffe allows you to directly produce a SAM/BAM file with read alignments projected onto the desired genome, this step can also be done manually from the GAF and is not strictly necessary for the next steps.

To project the graph alignments to a genome alignement, vg surject can be used :

By default, vg surject will project the alignments onto reference genomes only. In our case, we will provide the IPO323 path name to force the projection onto it, in order to do that we use the vg path with the index .xg as a reference.

# PGGB graph
vg paths -x pggb/zt_pggb.V1-1.gfa.giraffe.gbz -S IPO323 -L > pggb/IPO323.paths.txt
vg surject \
  -x pggb/zt_pggb.V1-1.gfa.giraffe.gbz \
  --interleaved \
  -F pggb/IPO323.paths.txt \
  -b -N SRR16762562 \
  -G pggb/pggb.SRR16762562.gaf | \
  samtools sort > pggb/pggb.SRR16762562.IPO323.bam

# MC graph
vg paths -x MC/zt_mc.V1-1.gfa.giraffe.gbz -S IPO323 -L > MC/IPO323.paths.txt
vg surject \
  -x MC/zt_mc.V1-1.gfa.giraffe.gbz \
  --interleaved \
  -F MC/IPO323.paths.txt \
  -b -N SRR16762562 \
  -G MC/mc.SRR16762562.gaf | \
  samtools sort > MC/mc.SRR16762562.IPO323.bam

Visualisation

We will observe into details these read mappings into the reference IP0323.

# observe rapibly raw the mapping results
samtools view MC/mc.SRR16762562.IPO323.bam | more

# specify IPO323
samtools view MC/mc.SRR16762562.IPO323.bam | grep -c -P "IPO323" 
# we have around ~27224 mappings

# surject again againt a non-reference haplotye, specify TN09 
vg paths -x MC/zt_mc.V1-1.gfa.giraffe.gbz -S TN09 -L > MC/TN09.paths.txt
vg surject \
  -x MC/zt_mc.V1-1.gfa.giraffe.gbz \
  --interleaved \
  -F MC/TN09.paths.txt \
  -b -N SRR16762562 \
  -G MC/mc.SRR16762562.gaf | \
  samtools sort > MC/mc.SRR16762562.TN09.bam
samtools view MC/mc.SRR16762562.TN09.bam | grep -c -P "TN09" 
# ~149686
# the sample SRR16762562 shows more mappings to this haplotype

samtools index MC/mc.SRR16762562.IPO323.bam
samtools index MC/mc.SRR16762562.TN09.bam

IGV : mapping to the reference IPO323

Do not try to open IGV in the following steps, it's just for the example. We opened the mapping results in IGV, in one particular region and here is what we see.

mc_IPO323_SRR16762562

After scrolling to the region displayed on the picture, ou can observe a small interval with a coverage drop, and it seems that some reads are partially mapped to its flanking coordinates, and clipped (the colored bases on the flanks). A rapid hypothesis can be that the genome of our mapped sample holds a region which is absent from the reference, e.g. a deletion of this region relatively to the reference.

IGV : mapping to the alternative haplotype TN09

mc_TN09_SRR16762562

We are know displaying the read mapping into the equivalent graph region, but after projection into the coordiinates of the alternative haplotype TN09. In the picture, the extreme left and right bases are the flanking regions from our hypothetical deletion.

We can observe that some segments are mapped with many reads. A rapid blast would confirm that these are putative Transposable Elements. Often, their sequence similarity tends to accumulate MANY transposon reads to the same coordinates, even if said reads are actually related to transposons located at a different genome coordinate.

You can also observe that these putative tranposon segments are interleaved with segment whithout a single read mapping. On tp of this, these reads are clipped and not aligned in and over these segments.

Several possible interpretations :

  • TN09 shows some an insertion (relatively to the reference) in which are some transposons, and there are some segments of DNA between these transposons which are absent from the sample (there would be some mapping in the blank regions otherwise).
  • the genome of our sample has an insertion (relatively to the reference) and this insertion does not show comprehensive homology to TN09. It probably holds transposons (because of the reads overlapping the flanking regions) but also its own DNA variants, currently absent from the graph.

This lack of information may be solved in future graph updates, after we build or augment the graph with a genome closely-related to what was actually present in our sample.

Variant calling

We will note here the vocabulary used by the authors of MC. This is slightly different from classical definitions. See the glossary for clarifications.

  • genotyping: Determines which variants in the graph are present in (each haplotype) of the sample.
  • calling: Determines which variants in the reads are present in (each haplotype) of the sample. These variants may or may not be in the graph.

One strength of pangenome graphs is that they allow Structural Variants (SVs), which are normally different to determine from short reads, to be efficiently genotyped.

vg deconstruct and vg surject were producing VCFs files describing the RAW variants, as described by the graph topology (bubbles or snarls). The command vg call allows to go further and to really take into phasing, with the two alleles. For instance if we had two paths for haplotype named by convetion 'IPO323#1#chr05' and 'IP0323#2#chr05' to differenciate the haplotypes of the same chromosome, this information can be understood by the command vg call.

We will not go into details for this part.

There is a good tutorial here, with a human example based on phased datasets. This allows to do phased genotyping with vg call, as well as alternative tools such as DeepVariant or Pangenies.

# PGGB
bgzip -@ $(nproc) pggb/pggb.SRR16762562.gaf
vg pack -x pggb/zt_pggb.V1-1.gfa.giraffe.gbz -a pggb/pggb.SRR16762562.gaf.gz -o pggb/pggb.SRR16762562.pack -Q5
vg call pggb/zt_pggb.V1-1.gfa.giraffe.gbz \
  -k pggb/pggb.SRR16762562.pack \
  -p IPO323#1#chr05 TN09#1#chr05 \
  -s SRR16762562 \
  -az > pggb/pggb.SRR16762562.vcf

# MC
bgzip -@ $(nproc) MC/mc.SRR16762562.gaf
vg pack -x MC/zt_mc.V1-1.gfa.giraffe.gbz -a MC/mc.SRR16762562.gaf.gz -o MC/mc.SRR16762562.pack -Q5
vg call MC/zt_mc.V1-1.gfa.giraffe.gbz \
  -k MC/mc.SRR16762562.pack \
  -p IPO323#1#chr05 TN09#1#chr05 \
  -s SRR16762562 \
  -az > MC/mc.SRR16762562.vcf

To go further

Other tutorials / Hackathons

Authors

  • Lots of the content:
    • Nicolas Lapalu (nicolas.lapalu[at]inrae.fr)
    • Antoine Malet (antoine.malet[at]inrae.fr)
  • Additions/corrections from:
    • Alexis Mergez (alexis.mergez[at]inrae.fr)
    • Benjamin Linard (benjamin.linard[at]inrae.fr)
    • Ludovic Duvaux (ludovic.duvaux[at]inrae.fr)