Skip to content

Tutorial introduction

This is a tutorial written to support the Pangenomics scientific days 2024 Dec. 5 & 6, organized by Infrastucture de Recherche (IR) BioinfOmics INRAE and collaborators. Attendees will learn to create a pangenome graph with different tools and use it for downstream analysis. The expected outcomes of the course are:

  • Pangenome graph construction, statistics
  • Graph exploration / visualization
  • Short-read mapping and genotyping

This tutorial uses the Genotoul bioinfo platform bioinformatics facility with pre-installed tools thanks to the module system simplyfing software version management.

About IT resources and Data

As we will run this tutorial on the Genotoul Platform bioinformatics cluster, for simplicity we will run the following commands in interactive mode via slurm job manager. So, connect to the cluster ans ask for a node with 4 cpu and 60Go of RAM.

ssh login@genobioinfo.toulouse.inrae.fr
# replace login by your user name / If you do not have an account ask the organizers for a training account
# ask for a node
srun -c 4 --mem 60G --pty bash
# mv to your work and create directory
cd work 
mkdir pangbomics
cd pangbomics

Download data used for this tutorial. More about the data at the dedicated page: Data

wget https://genotoul-bioinfo.pages.mia.inra.fr/pangenome_hackathon/data/ztritici.tar
tar -x -f ztritici.tar
# unzip all genome files
gzip -d ztritici/*.fa.gz
# set the DATADIR path for the tutorial
export DATADIR=$(pwd)/ztritici
echo $DATADIR

Data preparation

For clarity, haplotype sequences will be renamed using PanSN.

NB we call "haplotype" each individual genome assembly. We could have two haplotypes for the same diploïd genome.

Briefly, each sequence ID should be named as follow : <genome_name>#<haplotype_id>#<chr_name or ctg_name>.

for file in ztritici/*.fa; do
    genome=$(basename $file .fa | cut -f1 -d'.')
    echo $genome
    sed -i -e "1 s/^.*$/>$genome#1#chr05/" $file
done

Pangenome Graph Construction

Before running graph constructions

For the next 2 steps (Minigraph-Cactus and PGGB), computations will take some time !

Please launch the corresponding construction commands, then open a second console while they are running. This will allow you to parse / explore the result files we provide.

BE CAREFUL, make sure that the first session created previously (and assigned to 60Gb of memory) is used to run Minigraph-Cactus and PGGB commands. Use this second session only to explore files, with commands such as more or less.

ssh login@genobioinfo.toulouse.inrae.fr
# replace login by your user name / If you do not have an account ask the organizers for a training account
# ask for a node
srun --pty bash
# mv to your work and create directory
cd work/pangbomics
wget https://genotoul-bioinfo.pages.mia.inra.fr/pangenome_hackathon/data/results_graphs.tar

Minigraph-Cactus

Minigraph-Cactus is a Toil workflow that takes the output graph from Minigraph and adds back small SVs using Cactus.

Briefly on Minigraph : The first genome is used as the reference and serves as the backbone of the graph. Variations in the haplotypes are added incrementally, starting from the closest to the farthest in genomic distance. Minigraph removes structural variants (SVs) smaller than 50 bp by default, resulting in a simpler graph compared to alternative tools.

First, we create a seqfile that will tell Minigraph-Cactus where genome sequences are located.

# Create a directory for MC
mkdir MC

# Create the seqfile
for file in $DATADIR/*.fa; do
    echo -e "$(basename $file .fa | cut -f1 -d'.').1\t${file}" >> MC/zt.seqfile
done
Looking at the file, it's simply an index giving the haplotype name and its corresponding FASTA.
#$ cat MC/zt.seqfile
IPO323.1  /work/user/<user>/pangbomics/ztritici/IPO323.chr_5.fa
Aus01.1   /work/user/<user>/pangbomics/ztritici/Aus01.chr_5.fa
1A5.1     /work/user/<user>/pangbomics/ztritici/ST99CH_1A5.chr_5.fa
1E4.1     /work/user/<user>/pangbomics/ztritici/ST99CH_1E4.chr_5.fa
3D1.1     /work/user/<user>/pangbomics/ztritici/ST99CH_3D1.chr_5.fa
3D7.1     /work/user/<user>/pangbomics/ztritici/ST99CH_3D7.chr_5.fa
TN09.1    /work/user/<user>/pangbomics/ztritici/TN09.chr_5.fa
YEQ92.1   /work/user/<user>/pangbomics/ztritici/YEQ92.chr_5.fa

Now, we can run Minigraph-Cactus. It will take around 15min:

# load cactus env
module load bioinfo/Cactus/2.7.1
# info
# minimal command is: cactus-pangenome jobstore zt.seqfile --outDir zt_mc --outName zt_mc --reference IPO323
# we will run the command with several options to obtain the graph in different formats with their indexes

cactus-pangenome jobstore MC/zt.seqfile --outDir MC/zt_mc --outName zt_mc --reference IPO323.1 TN09.1 --gfa clip filter full 

#cactus-pangenome jobstore_full zt.seqfile --outDir MC/zt_mc_full --outName zt_mc --reference IPO323.1 --filter 0 --clip 0 --gbz --gfa

# decompressing outputs
gzip -d MC/zt_mc/*.full.gfa.gz

# Retrieving graphs
ln MC/zt_mc/zt_mc.full.gfa MC/zt_mc.V1-1.gfa
Few words about MC options. MC is a workflow with many optional steps, depending of your input data (genome size, ploidy level, genome complexity, ...). A good practice is to run with default options and perform tuning if required. The most interesting options are:

  • --filter [0-9]: create an Allele Frequency graph with nodes and edges supported by the number of haplotypes provided. By default, MC set --filter to 2 and --clip to 10000, in the output you will have 3 graphs:
    • zt_mc.d2.gfa.gz (clipped and filtered graph with node supported by at least dX haplotypes)
    • zt_mc.full.gfa.gz (graph with all nodes)
    • zt_mc.gfa.gz (clipped graph)

More about cactus, look at the documentation

In our case, we will focus only on the full graph since it contains complete haplotypes and can be compared to the PGGB graph. However, keep in mind that this is not the default graph, and it undermines the purpose of the Minigraph-Cactus approach.

PGGB

The PGGB approach differs from Minigraph and Minigraph-Cactus in that it does not use a reference genome. The entire graph is built through pairwise alignment of all haplotypes using wfmash, followed by graph induction with seqwish. In theory, PGGB graphs are expected to be more complete, but this comes with a computational cost.

PGGB requires a single FASTA, compressed with BGzip and indexed using samtools. As we already renamed our sequences according to PanSN, we will simply merge all FASTAs into a single file.

mkdir pggb

# Concatenating FASTAs into a single one
cat $DATADIR/*.fa | bgzip > pggb/ztchr5.fa.gz

# Indexing
module load bioinfo/samtools/1.20
samtools faidx pggb/ztchr5.fa.gz

We can then run PGGB, it will take 13min.

# load pggb
module purge
module load devel/Miniconda/Miniconda3
module load bioinfo/pggb/0.5.4
pggb -i pggb/ztchr5.fa.gz -o pggb/ztchr5 -n 8 -t 4 -p 90 -s 5k -X -v

# Retrieving final graph
mv pggb/ztchr5/*.gfa pggb/zt_pggb.V1-0.gfa 

There are 3 key parameters with PGGB :
- -p : Mapping identity. This indicate the alignment tool (wfmash) that the expected similarity between our genomes is 90%.
- -s : Segment length. This is the window the alignment tool (wfmash) will use to align sequences. This value should be longer than the expected length of repeats within your genomes.
- -X : Skipping the graph normalization with SmoothXG

The -n parameter corresponds to the number of haplotypes we have in our dataset.

Notes about GFA specifications and formats

GFA format corresponds to a human-readable (text) representation of variation graphs. Several versions of the specification already exist. Today, v1.0 and v1.1 are the most used versions. Full specification is here: GFA specification

Here is a summary of the main differences between v1.0 and v1.1:

H   VN:Z:1.0                                  # header, showing here version=1.0
S   11  ACCTT                                 # a Segment (node), with ID=11, and sequence=ACCTT
S   12  TCAAGG                                # carefull, several programs only accepts integer IDs (no letters/symbols)
S   13  CTTGATT
L   11  +   12  -                             # a Link (edge), that links node 11 (read forward) to node 12 (read as revcomp)
L   12  -   13  +    
L   11  +   13  +   
P   NA12878#1#chr1  11+,12-,13+               # >v1.0 : Path of an haplotype, e.g. a list of node ids and directions
W   NA12878 1   chr1    0   18  >11<12>13     # >v1.1 : Walk of an haplotype, contig, mapped read, ...
                                                        More complex identifiers and coordinates can be associated.

Whatever if you read the path (P) or walk (W), the resulting path (genome or haplotype) is:

11+ ACCTT
12- CCTTGA     # <-- reverse complement !
13+ CTTGATT
NA12878#1#chr1  ACCTTCCTTGACTTGATT

Graph statistics

Loading required tools :

module purge
module load bioinfo/vg/1.57.0
module load devel/Miniconda/Miniconda3
module load compilers/gcc/12.2.0
module load bioinfo/odgi/0.8.6_compil

General statistics

We will compare both PGGB and Minigraph-Cactus graphs general statistics.

Note: You may stumble on an error warning:[GFAParser] Skipping GFA W line: GFA format error: On pass 1: On line XXX: Duplicate path YYY exists in graph. It seems that vg complains when parsing the GFA v1.1 format. Yet, the warning does not seem to have any effect.

# MC
vg stats -lz MC/zt_mc.V1-1.gfa
nodes   331950
edges   448014
length  7651968

# PGGB
vg stats -lz pggb/zt_pggb.V1-0.gfa
nodes   155544
edges   209508
length  7500395

The length corresponds to the total sequence length contained within the graph. Since haplotypes are complete, the shorter the length, the more compressed the graph is. In our case, PGGB produced the more compressed version of the two. This can have two implications:

  • PGGB was better at identifying corresponding regions between haplotypes, resulting in better compression of the information.
  • PGGB compressed repeats and created many cycles in the graph.

In reality, it is often a combination of both.

As a sanity check, we can insure our haplotypes are complete in the graph :

# MC
vg paths -x MC/zt_mc.V1-1.gfa -E | sort
Aus01#1#chr05#0         3132306
IPO323#1#chr05          2861803
ST99CH_1A5#1#chr05#0    2700053
ST99CH_1E4#1#chr05#0    2763254
ST99CH_3D1#1#chr05#0    3035485
ST99CH_3D7#1#chr05#0    2781580
TN09#1#chr05            3220060
YEQ92#1#chr05#0         3113700

# PGGB
vg paths -x pggb/zt_pggb.V1-0.gfa -E | sort
Aus01#1#chr05#0         3132306
IPO323#1#chr05#0        2861803
ST99CH_1A5#1#chr05#0    2700053
ST99CH_1E4#1#chr05#0    2763254
ST99CH_3D1#1#chr05#0    3035485
ST99CH_3D7#1#chr05#0    2781580
TN09#1#chr05#0          3220060
YEQ92#1#chr05#0         3113700

# Fasta index from ztchr5.fa
cat pggb/ztchr5.fa.gz.fai | cut -f1-2 | sort
Aus01#1#chr05           3132306
IPO323#1#chr05          2861803
ST99CH_1A5#1#chr05      2700053
ST99CH_1E4#1#chr05      2763254
ST99CH_3D1#1#chr05      3035485
ST99CH_3D7#1#chr05      2781580
TN09#1#chr05            3220060
YEQ92#1#chr05           3113700

Here we can see that PGGB, Minigraph-Cactus and the FASTA index show the same lengths for haplotypes.

Sub-graph extraction

We will explore the region of chr5 with a specific cluster of metabolism in TN09. First extract region from provided information in Data page. So we therefore need to find the coordinates of genes jg9858 to jg9867.

zgrep "jg9858" $DATADIR/TN09.chr_5.genes.gff.gz

TN09.chr_5      AUGUSTUS        mRNA    2758591 2763545 .       +       .       ID=jg9858.t1;geneID=TN09.jg9858
TN09.chr_5      AUGUSTUS        CDS     2758591 2758850 0.86    +       0       Parent=jg9858.t1
...

zgrep "jg9866" $DATADIR/TN09.chr_5.genes.gff.gz

TN09.chr_5      AUGUSTUS        mRNA    2810703 2812735 .       +       .       ID=jg9866.t1;geneID=TN09.jg9866
TN09.chr_5      AUGUSTUS        CDS     2810703 2810992 0.60    +       0       Parent=jg9866.t1
...
We unlarge the region by 2kb bases upstream and downstream on TN09 : TN09#1#chr05:2756591-2814735 (~60kbp). We will extract the regions from both graphs.

There are two tool suites that can be used to manipulate graphs. Since there are no established standards, we will provide the commands to extract the region using both tools, but you should only use one during the practice. Each tool has its own binary format, so we will need to convert our graphs into one of these formats. - VG : .vg, .pg, .xg - Odgi : .og

From Minigraph-Cactus graph

We can extract the region from TN09, thanks to --reference option of MC, which indexes specifically both paths, allowing genomic coordinates based queries.

Note that haplotypes that are not 'reference' can be sparse, meaning the region you are looking for is not directly queriable. In this case, you can map (align) the region's sequence onto the graph and extract the corresponding path.

VG
# Extracting the sub-graph
vg find -x MC/zt_mc.V1-1.gfa -p TN09#1#chr05:2756591-2814735 -E > MC/MC.TN09_chr_5_2756591_2814735.pg
# Converting back to GFA
vg convert MC/MC.TN09_chr_5_2756591_2814735.pg -f -W > MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa
Odgi
# Converting GFAv1.1 to GFAv1.0
vg convert -g -f -W MC/zt_mc.V1-1.gfa > MC/zt_mc.V1-0.gfa
# Extracting the sub-graph
odgi extract -i MC/zt_mc.V1-0.gfa -t $(nproc) -o MC/MC.TN09_chr_5_2756591_2814735.V1-0.og -r TN09#1#chr05:2756591-2814735 -E -P
# converting back to GFA
odgi view -i MC/MC.TN09_chr_5_2756591_2814735.V1-0.og -g -t $(nproc) -P > MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa

From PGGB graph

Compared to Minigraph-Cactus, PGGB retains all haplotype sequences in the final graph. Therefore, if a haplotype is present in the input assemblies used by PGGB, the region can be queried in the graph.

VG
# find corresponding nodes 
vg find -x pggb/zt_pggb.V1-0.gfa -p TN09#1#chr05#0:2756591-2814735 -E > pggb/pggb.TN09_chr5_2756591_2814735.pg
# convert in gfa for visualization with bandage
vg convert pggb/pggb.TN09_chr5_2756591_2814735.pg -f -W > pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa
Odgi
# find corresponding nodes
odgi extract -i pggb/zt_pggb.V1-0.gfa -t $(nproc) -o pggb/pggb.TN09_chr5_2756591_2814735.V1-0.og -r TN09#1#chr05:2756591-2814735 -E -P
# convert in gfa for visualization with bandage
odgi view -i pggb/pggb.TN09_chr5_2756591_2814735.V1-0.og -g -t $(nproc) -P > pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa

Format conversion

This conversion will enable to visualize the graphs with SequenceTubeMap.

# Sub-graph conversion to XG file format
vg convert -g pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa -x -t $(nproc) > pggb/pggb.TN09_chr5_2756591_2814735.V1-0.xg

vg convert -g MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa -x -t $(nproc) > MC/MC.TN09_chr_5_2756591_2814735.V1-0.xg

# Whole graph conversion to XG file format
vg convert -g pggb/zt_pggb.V1-0.gfa -x -t $(nproc) > pggb/zt_pggb.V1-0.xg

vg convert -g MC/zt_mc.V1-1.gfa -x -t $(nproc) > MC/zt_mc.V1-1.xg

Visualisation

We will visualize some of the graphs that we have produced, as well as the paths for some genomes. Overall, complex variations are hard to visualise, whatever the tool. It is just hard to project graph complexity to 2 dimensions and every software has its pros and cons. But visualisation may remain useful to explore a local and simple variation pattern or to do a figure.

Bandage

Bandage (or its fork Bandage-NG) is designed to visualise very large graphs (millions of nodes). You will use it if you need to interact with large/complex chunks of graphs, such as the equivalent of a few megabases of a chromosome. It is not designed specifically for pangenome graphs, but it's compatible with GFA format.

Pros:

  • can open very large graphs chunks
  • can zoom/unzoom on complex structural variations
  • sequence search via blast
  • manual path highlight via node lists
  • some annotation possible via CSV inputs

Cons:

  • path information is not handled per say (you cannot select an haplotype / genome)
  • limited in terms of graphical capabilities (no sequence display)
# download and execution
wget https://github.com/rrwick/Bandage/releases/download/v0.9.0/Bandage_Ubuntu-x86-64_v0.9.0_AppImage.zip
unzip Bandage_Ubuntu-x86-64_v0.9.0_AppImage.zip
./Bandage_Ubuntu-x86-64_v0.9.0.AppImage

First, from your LOCAL console, download the full graph and graph chunk that we extracted previously in GFA format.

# in the following commands replace login by your own login 
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/zt_pggb.V1-0.gfa ~/
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa ~/
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa ~/
  • Load the full PGGB graph (zt_pggb.V1-0.gfa) via menu File -> Load grap -> select the full GFA
  • On the top left, click on "More info" to get some general statistics
  • Click on "Draw graph", note that you can cancel the laying out when you want, it will just display, just with more overlapping node. Wait 2 minutes at most.
  • On the left, set node length to 20, then uniform colors to see more things when zoomed out.
  • You can zoom in & out with CTRL+MOUSE_WHEEL

The full graph should look like these 3 large circles. You can immediately notice that 2 regions show a high complexity of structural variations (e.g. where these 3 large circles are joining together). Also, you can notice that structural variations of lower complexity or shorter are actually present all along the genomes (e.g. small hairpins all along the 3 large circles).

pggb_full_graph_bandage

We will know focus on the TN09 region and observe potential differences between PGGB and MC. To do so, we need to load the corresponding graph, then we will highlight each haplotype path. Because Bandage does not know how to handle paths from the GFA, we need to manually select the list of nodes of each path, then we will color them.

  • As before, load the GFA from MC (MC.TN09_chr5_2756591_2814735.V1-0.gfa)
  • Then, list all nodes belonging to 3 of the haplotypes with the following commands.
  • For each command, copy-paste the list of nodes displayed in the console to the top right field of Bandage labelled "Search nodes"
  • Then, set a color for the selected nodes

# load the gfa file in bandage
# find nodes to highlight
# IPO323 nodes (reference genome)
grep -P "IPO323" MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# TN09 nodes
grep -P "TN09" MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# Aus01 nodes 
grep -P "Aus01" MC/MC.TN09_chr_5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
Repeat this process for the different haplotypes if you want to obtain all screenshots below.

Can you see a difference between haplotypes ? Between MC and PGGB ?

IPO323 path

mc_IPO323

TN09 path

mc_TN09

Aus01 path

mc_Aus01

# load the gfa file in bandage
# find nodes to highlight
# IPO323 nodes (reference genome)
grep -P "IPO323" pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# TN09 nodes
grep -P "TN09" pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'
# Aus01 nodes 
grep -P "Aus01" pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | cut -f 3 | sed 's/[+|-]//g'

SequenceTubeMap

SequenceTubeMap is a tool still in development. It was built with pangenome graphs in mind. It is a compromise between alignement-like and graph-like views of the pangenome graph. It produces relatively comprehensive figures for regions of low complexity. But be aware that complex regions will not be handled well, and the figure could even lead to false interpretation (figure is correct, but we are not use to read graph projected in 2D space).

Pros:

  • alignement-like projection of the graph
  • compressed/uncompressed view of large region
  • quite "artistic" and eye-catching

Cons:

  • still in development (bugs)
  • installation can be difficult (web server + environment)
  • for large graphs : requires preliminary indexation and registration of indexes in the web server

Go to SequenceTubeMap demo website: https://vgteam.github.io/sequenceTubeMap/

You arrive on the dodemome, which shows by default the result of some reads mapped to a simple graph with 3 SNPs. You can select a few other demos with the top menu "Data".

Let's upload our own graph extraction. We need to retrieve and upload the graph in the "PG" format (it contains the graph and some index).

On your LOCAL machine, open a terminal and download your PG graph to your home.

scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/pggb.TN09_chr5_2756591_2814735.V1-0.xg ~/

  • In "Data" choosebox, select "custom"
  • Click on "configure track" that appeared on the right.
  • Click on the "+" that appeared
  • Select "Graph", "upload" and select your PG graph that you downloded.
  • On the right, click on the grey wheel, set reference paths in reds and others in grays
  • Close windows with the small cross on top right corner of the windows
  • Click on red button "GO"
  • If loaded correctly, you will see a red message saying "Wrong query: missing region"

This tool allow different ways to select a region to display, it can be by node_id or linear coordinates of one selected haplotype. A short manual is available if you click on "?". Some examples are :

  • A coordinate range (e.g. "chr1:1-100")
  • A node ID range (e.g. "node:100-110")
  • A start position and a distance (e.g. "chr1:1+100")
  • A node ID anchor and a distance (e.g. "node:100+10")

Find the first node of the GFA file with

grep '^S' pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa | head -1 | cut -f 2

This will be your nodeid. Let's look around this one.

  • In "region" box, set "node:nodeid+100" to select 100 nodes after the nodeid.
  • Click on red button "GO"
  • You can zoom/unzoom with the mouse wheel
  • By scrolling from the start to the right up to node id (nodeid are displayed by letting mouse their top), you can see just after a giant node, that is unique to 1 haplotype and its length is 51kb! This make the whole visualisation somewhat broken.

screenshot sequtubemap 2

  • Go below the visualisation pane, in "Visualisation options" and click on "Compressed view". Now nodes are NOT proportionnal to their sequence length, you can have an overview of the whole region. Try to find again the node id 32542, which is our 51kb insertion, you should see this :

screenshot seqtubemap 1

GFAviz

(OPTIONAL. If you are late, just have a look at the picture below.)

GFAviz seems to not be maintained anymore, but it is relatively useful to produce figures. It is designed with pangenome graphs in mind and it allows a lot of graphical customisation and annotations. But this comes to a cost. Only small graphs can be opened and manipulated (dozen of kilobases / thousands of node at best).

Pros: - lots of graphical customization - allow to highlight paths independantly, node lengths are represented - can be "artistic" and eye-catching if you spend time on it

Cons: - not maintained anymore (?) - restricted to small graphs chunks

# installation you should have done beforehand (debian)
apt-get install qt5-qmake qtbase5-dev qtbase5-dev-tools libqt5svg5-dev libqt5webenginewidgets5 libqt5webchannel5-dev qtwebengine5-dev
git clone https://github.com/ggonnella/gfaviz
cd gfaviz
qmake
make

We will open the TN09 region. On your LOCAL machine, open a terminal and download your GFA to your home.

scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/pggb.TN09_chr5_2756591_2814735.V1-0.gfa ~/

  • Open GFAviz, then menu File -> Open GFA -> select the dowloaded GFA

By default, you see all nodes, all paths, with default colors. Let's select only 3 haplotypes with interesting differences. - Menu Selection -> select all. - On the right -> Style Pane 'P/O/U' -> set width to 0 -> click on graph to update - On the right -> GFA elements -> groups -> select 1st haplotype 'Aus1' with a double click (you should see it as selected in the middle textbox) - On the right -> Style Pane 'P/O/U' -> set width to 2 and choose green color-> click on graph to update (eventually zoom/dezoom to update) - Do the same with 2nd haplotype, but in red and 3rd in Blue.

You can now see more clearly the variants that are specific to these selected haplotypes. Also you can observe that a big segement between 2 loops is not shared by them.

  • click on this segment, on top right you can retrieve to which group it belongs (TN09#1#chr05#0) and its sequence (51kb !)
  • click on Pane 'S' and display the sequence direction (tickbox) and node_id (Element label -> show) and color it in purple.

You should have this result, that you can export this as a vectorial image for more hand-made annotations. You could also play with more layouts in GFAviz.

screenshot GFAviz

Variants analyses

Calling from the graph directly

Variants can be directly called from the haplotypes contained within the graph using the vg deconstruct command :

# Calling variants from IPO323 haplotype
## On PGGB graph
vg deconstruct \
  -a -e -P IPO323 \
  -t $(nproc) -v \
  pggb/zt_pggb.V1-0.xg \
  > pggb.IPO323.var.vcf

## On MC graph
vg deconstruct \
  -a -e -P IPO323 \
  -t $(nproc) -v \
  MC/zt_mc.V1-1.xg \
  > MC.IPO323.var.vcf

Briefly, -a indicates that all snarls or bubbles in the graph are considered. The -e argument restricts VG to only consider path traversals (enabled by default in newer version of VG).

Although Giraffe allows you to directly produce a SAM/BAM file with read alignments projected onto the desired genome, this step can also be done manually from the GAF and is not strictly necessary for the next steps.

scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/*.vcf ~/
scp login@genobioinfo.toulouse.inrae.fr:~/work/pangbomics/pggb/ztchr5.fa* ~/

Load the FASTA file as a genome and open both VCF files as tracks. We can then focus on the region we previously explored visually : IPO323#1#chr05:2,756,591-2,814,735

screenshot IGV vg deconstruct

Short reads mapping onto the graph

There are two main tools to map onto a pangenome graph :

In our case, we will map short reads, so we will use giraffe. GraphAligner is simpler and you can directly map onto the GFA. Giraffe requires the graph to indexed beforehand.

Indexing for Giraffe

There are two ways to index the graph : manually or with vg autoindex. We will use the later but refer to this page for the manual process.

# PGGB graph
vg autoindex \
  -p pggb/zt_pggb.V1-0.gfa \
  -t $(nproc) \
  -g pggb/zt_pggb.V1-0.gfa \
  -w giraffe

# MC graph
vg autoindex \
  -p MC/zt_mc.V1-1.gfa \
  -t $(nproc) \
  -g MC/zt_mc.V1-1.gfa \
  -w giraffe

The previous commands create three files: - GBZ: a read-only, compressed version of the graph with several index - DIST: a distance index - MIN: a minimizer index

Note that more complex graphs can only be indexed when created with MC, as PGGB produces many loops in repetitive regions, which VG is not suited to handle.

Mapping with Giraffe

Currently does not work with vg giraffe 1.48, 1.52, 1.57

# PGGB graph
vg giraffe -t 3 -p \
  -Z pggb/zt_pggb.V1-0.gfa.giraffe.gbz \
  -d pggb/zt_pggb.V1-0.gfa.dist \
  -m pggb/zt_pggb.V1-0.gfa.min \
  -o gaf \
  -f $DATADIR/SRR16762562.chr5.R1.fastq.gz \
  -f $DATADIR/SRR16762562.chr5.R2.fastq.gz \
  > pggb/pggb.SRR16762562.gaf 

# MC graph
vg giraffe -t 3 -p \
  -Z MC/zt_mc.V1-1.gfa.giraffe.gbz \
  -d MC/zt_mc.V1-1.gfa.dist \
  -m MC/zt_mc.V1-1.gfa.min \
  -o gaf \
  -f $DATADIR/SRR16762562.chr5.R1.fastq.gz \
  -f $DATADIR/SRR16762562.chr5.R2.fastq.gz \
  > MC/mc.SRR16762562.gaf 

The resulting file is in GAF (Graph Alignment Format) which is specified here. Briefly, each alignment can be viewed as a path within the corresponding graph. Alignments have not been projected onto any genomes. Although Giraffe allows you to directly produce a SAM/BAM file with read alignments projected onto the desired genome, this step can also be done manually from the GAF and is not strictly necessary for the next steps.

To project the graph alignments to a genome alignement, vg surject can be used :

By default, vg surject will project the alignments onto reference genomes only. In our case, we will provide the IPO323 path name to force the projection onto it.

# PGGB graph
vg paths -x pggb/zt_pggb.V1-0.gfa.giraffe.xg -S IPO323 -L > pggb/IPO323.paths.txt
vg surject \
  -x pggb/zt_pggb.V1-0.xg \
  --interleaved \
  -F pggb/IPO323.paths.txt \
  -b -N SRR16762562 \
  -G pggb/pggb.SRR16762562.gaf | \
  samtools sort > pggb/pggb.SRR16762562.IPO323.bam

# MC graph
vg paths -x MC/zt_mc.V1-1.gfa.giraffe.gbz -S IPO323 -L > MC/IPO323.paths.txt
vg surject \
  -x MC/zt_mc.V1-1.gfa.giraffe.gbz \
  --interleaved \
  -F MC/IPO323.paths.txt \
  -b -N SRR16762562 \
  -G MC/mc.SRR16762562.gaf | \
  samtools sort > MC/mc.SRR16762562.IPO323.bam

Visualization

# see the mapping 
samtools view MC/mc.SRR16762562.bam | more



# specify IPO323

samtools view SRR16762562.IPO323.bam | grep -c -P "IPO323" 
2740

# specify TN09 
vg paths -x mc.TN09_chr5_2756591_2814735.giraffe.gbz -S TN09  -L > TN09.paths.txt
vg surject -x mc.TN09_chr5_2756591_2814735.giraffe.gbz  --interleaved -F TN09.paths.txt -b -N SRR16762562 mc.SRR16762562.gam | samtools sort > SRR16762562.TN09.bam
samtools view SRR16762562.TN09.bam | grep -c -P "TN09" 
149686

# SRR16762562 contains more mappings

# visualize alignment on sequence with IGV:
vg paths -x mc.TN09_chr5_2756591_2814735.vg -F -Q IPO323 > IPO323_chr5_2756591_2814735.fasta
vg paths -x mc.TN09_chr5_2756591_2814735.vg -F -Q TN09 > TN09_chr5_2756591_2814735.fasta

samtools index SRR16762562.IPO323.bam
samtools index SRR16762562.TN09.bam

IGV IPO323

mc_IPO323_SRR16762562

IGV TN09

mc_TN09_SRR16762562

Problem: many bad mapping => mapping on TE (see coverage). We need to perform quality control on mapping. Bias onmapping due to subsample of a specific region.

Mapping on pggb graph

# index the gfa
vg autoindex -p pggb.TN09_chr5_2756591_2814735 -w giraffe -g pggb.TN09_chr5_2756591_2814735.gfa
# mapping with giraffe
vg giraffe -t 3 -p -Z  pggb.TN09_chr5_2756591_2814735.giraffe.gbz -f $DATADIR/SRR16762562.chr5.R1.fastq.gz -f $DATADIR/SRR16762562.chr5.R2.fastq.gz > mc.SRR16762562.gam 
vg giraffe -t 3 -p -Z  pggb.TN09_chr5_2756591_2814735.giraffe.gbz -f $DATADIR/SRR16762562.chr5.R1.fastq.gz -f $DATADIR/SRR16762562.chr5.R2.fastq.gz -o BAM > mc.SRR16762562.bam  => failed no ref paths in the graph / need to update .gbz with gbz gwbt

add ref path (follow: add ref path)

# validate no ref path
vg paths -M -x pggb.TN09_chr5_2756591_2814735.giraffe.gbz
#NAME   SENSE   SAMPLE  HAPLOTYPE       LOCUS   PHASE_BLOCK     SUBRANGE
1A5#1#chr5#0    HAPLOTYPE       1A5     1       chr5    0       NO_SUBRANGE
1E4#1#chr5#0    HAPLOTYPE       1E4     1       chr5    0       NO_SUBRANGE
3D1#1#chr5#0    HAPLOTYPE       3D1     1       chr5    0       NO_SUBRANGE
3D7#1#chr5#0    HAPLOTYPE       3D7     1       chr5    0       NO_SUBRANGE
Aus01#1#chr5#0  HAPLOTYPE       Aus01   1       chr5    0       NO_SUBRANGE
IPO323#1#chr5#0 HAPLOTYPE       IPO323  1       chr5    0       NO_SUBRANGE
TN09#1#chr5#0   HAPLOTYPE       TN09    1       chr5    0       NO_SUBRANGE
YEQ92#1#chr5#0  HAPLOTYPE       YEQ92   1       chr5    0       NO_SUBRANGE


vg convert -a --ref-sample IPO323 --ref-sample TN09  pggb.TN09_chr5_2756591_2814735.giraffe.gbz > tmp.vg
vg paths -M -x tmp.vg

#NAME   SENSE   SAMPLE  HAPLOTYPE       LOCUS   PHASE_BLOCK     SUBRANGE
3D1#1#chr5#0    HAPLOTYPE       3D1     1       chr5    0       NO_SUBRANGE
Aus01#1#chr5#0  HAPLOTYPE       Aus01   1       chr5    0       NO_SUBRANGE
YEQ92#1#chr5#0  HAPLOTYPE       YEQ92   1       chr5    0       NO_SUBRANGE
TN09#1#chr5     REFERENCE       TN09    1       chr5    NO_PHASE_BLOCK  NO_SUBRANGE
1E4#1#chr5#0    HAPLOTYPE       1E4     1       chr5    0       NO_SUBRANGE
3D7#1#chr5#0    HAPLOTYPE       3D7     1       chr5    0       NO_SUBRANGE
IPO323#1#chr5   REFERENCE       IPO323  1       chr5    NO_PHASE_BLOCK  NO_SUBRANGE
1A5#1#chr5#0    HAPLOTYPE       1A5     1       chr5    0       NO_SUBRANGE

# reindex in gbz format with reference
vg gbwt -x tmp.vg --index-paths -g pggb.TN09_chr5_2756591_2814735.giraffe.reindex.gbz --gbz-format

# map and export in bam format with new re-indexed graph
vg giraffe -t 3 -p -Z  pggb.TN09_chr5_2756591_2814735.giraffe.reindex.gbz -f $DATADIR/SRR16762562.chr5.R1.fastq.gz -f $DATADIR/SRR16762562.chr5.R2.fastq.gz -o BAM > mc.SRR16762562.bam  => failed no ref paths in the graph / need to update .gbz with gbz gwbt

Variant calling and genotyping

Variant calling from MC

vg gamsort mc.SRR16762562.gam > mc.SRR16762562.sorted.gam
vg pack -x mc.TN09_chr5_2756591_2814735.giraffe.gbz -g mc.SRR16762562.sorted.gam -o mc.SRR16762562.pack
vg paths -Mx  mc.TN09_chr5_2756591_2814735.giraffe.gbz
vg call mc.TN09_chr5_2756591_2814735.giraffe.gbz -p IPO323#0#chr_5 TN09#0#TN09.chr_5 -k mc.SRR16762562.pack > mc.SRR16762562.vcf

To go further

Pan1c View report generated using the same dataset.

References

Authors

  • Most of the content:
  • Nicolas Lapalu (nicolas.lapalu[at]inrae.fr)
  • Antoine Malet (antoine.malet[at]inrae.fr)
  • Additions/corrections from:
  • Alexis Mergez (alexis.mergez[at]inrae.fr)
  • Benjamin Linard (benjamin.linard[at]inrae.fr)