Computational Pangenomics – Shonan meeting no. 197

Description of the Meeting

Computational Pangenomics encompasses different research efforts for transitioning the existing paradigm from a sequence-based reference genome to a pan-genome, i.e., an evolutionarily coherent collection of genomes. Such a transition is urgently needed to effectively exploit the data masses produced by the technical advances and the widespread adoption of sequencing technologies. Graph-based representations of collections of genomes and diploid-aware assemblers have been recently proposed, but a large amount of work is still needed to shift to a pan-genomic view into the current research practice. Indeed, the traditional approach considers a single sequence as a reference genome, and that sequence has been obtained starting from sample tissues of unknown donors, and it has been refined through the integration of different samples. So, the human reference genome is actually the fusion of several individuals’ genomes, where the characteristics of each single genome is lost. This approach led to important contributions to our understanding of human physiology and of several pathologies like cancer. However, it was essentially motivated by the limits of the early sequencing technologies and of the associated costs. In the recent years, new sequencing technologies have revolutionized the field by increasing the throughput (i.e., the amount of sequences produced in a single run), by increasing the length of the produced sequences (longer sequences allow to better disambiguate repetitive regions in the genome), by increasing the quality of the base calls (having less errors allows to reliably capture variations among individuals) while costs dramatically decreased (sequencing can be almost considered as a routine task). The resulting wealth of data bears the promise of a new course for precision medicine (i.e., adapting treatments to each individual’s genetic profile). For example, thanks to the advancements of sequencing technologies, it is now possible to characterize the genetic content of a single cell and this has profound implications in the study of the evolution of cancer, where the genetic content of different cancer cells may be different due to the progressive accumulation of mutations occurred during the replication of the cancer cells. Finally, we should note that, even if the human genome was the main focus in the early days of Bioinformatics, we are assisting a spread in the use of sequencing technologies for the characterization of a growing number of species. For example, widespread sequencing efforts of the novel SARS-CoV-2 virus played a central role in the response to the pandemic, since the characterization of virus variations are aiding in tracking the international spread and in the development of the vaccines. From the computational perspective, the core problem is now how to find, to represent, and to query/compare a very large set of genetic variations obtained from large collections of genomes with the ultimate goal of making sense of such a wealth of data both for improving our understanding of the underlying biological mechanisms and for implementing the promises of translational precision medicine. Some initial and promising representations have been proposed—either based on (multi)graphs or on indexes of highly-repetitive collections of strings—but much further work is needed to really perform the transition to novel practical representations of the reference pan-genome and to novel algorithms able to exploit them.

As a consequence, the development of computational pangenomic must be sustained by coordinated efforts of different highly-specialized research areas: starting from research in stringology and indexing (for developing novel and efficient representations of the pangenome) to research in any area of Bioinformatics (for transitioning existing algorithms to the new paradigm) and to research in the data mining area (for further exploring new applications and discovering potential new associations between genetic variations and phenotypic traits). The meeting aims to provide an occasion for researchers of these research areas to present recent advances and to foster the questions that will drive the future research efforts in computational pangenomics.

Link to Shonan Page: https://shonan.nii.ac.jp/seminars/197/

Organizers

List of Participants

Paola Bonizzoni (Organizer) https://algolab.eu/people/bonizzoni/
Alberto Policriti (Organizer) http://users.dimi.uniud.it/~alberto.policriti
Kunihiko Sadakane (Organizer) https://researchmap.jp/sada/
Dominik Koeppl
Nadia Pisanti
Iman Hajirasouliha
Hideo Bannai
Daniel Doerr
Tomas Vinar
Simone Ciccolella
Fereydoun Hormozdiari
Luca Denti
Travis Gagie
Peter Peresini
Yuto Nakashima
Rayan Chikhi
Sung-Hwan Kim
Nicola Prezza
Yoshihiro Shibuya
Sharma Valliyil Thankachan

Talks

Luca Denti – Structural Variation Discovery from sample-specific strings – slides

Abstract

SVDSS (Structural Variation Discovery from sample-specific strings) is a new method for discovery of SVs from PacBio HiFi reads that combines and effectively leverages mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Although the increased accuracy, there is still room for improvements and most limitations are still not fully solved yet. For instance, long read alignment in repetitive regions of the genome and SVs calling in heterozygous regions are still quite inaccurate. During the Shonan meeting, we aimed to improve long read alignment in repetitive regions of the genome by performing an ad-hoc local realignment that prefers higher consistency around potential variation over higher alignment score. To this aim, we formulated a new computational problem and we discussed different methodologies to solve it.

Sung-Hwan Kim – Algorithms for Computing Co-lex Order of Automata – slides

Abstract

With the recent advancement of the sequencing technologies and computational capabilities, now it is required to process a large number of reference sequences at the pangenomic scale. For indexing pangenomic graphs represented as (non)deterministic finite automata, computing the co-lexicographical order is an essential procedure. Despite of some remarkable results on particular special cases, there is still room to be improved for general cases, especially for one efficient both from a theoretical and practical point of view. This talk gives a brief survey on the state-of-the-art algorithms for computing co-lexicographical order of automata. In particular, three main techniques covering several important algorithms for deterministic and non-deterministic automata are discussed as well as the future direction of further improvement.

Nicola Prezza – Indexing regular languages with co-lex orders – slides

Abstract

Abstract: NFAs are inherently unordered objects, but they represent regular languages on which one can very naturally define a total order: for example, the co-lexicographic order in which words are compared alphabetically from right to left. In this talk I will show that interesting things happen when one tries to map this total order to the states of an accepting NFA for the language: the resulting order of the states is a partial pre-order whose width p turns out to be an important parameter for NFAs and regular languages. For example, take the classic powerset determinization algorithm for converting an NFA of size n into an equivalent DFA: while a straightforward analysis shows that the size of the resulting DFA is at most $2^n$, we prove that it is actually at most $(n-p+1)*2^p$. This implies that PSPACE-complete problems such as NFA equivalence or universality are actually easy on NFAs of small width p (the case p=1 – total order – is particularly interesting). Another implication of this theory is that we can compress NFAs to just $O(log p)$ bits per transition while supporting fast membership queries in the substring closure of the language.

Sharma Valliyil Thankachan – Compressibility-Aware Quantum Algorithms on Strings

Abstract

Quantum algorithms have been established for many basic problems on strings. This work shows that new, faster quantum algorithms are possible when the string is highly compressible. We focus on two popular methods for compression — the Lempel-Ziv77 algorithm (LZ77) and the Run-length-encoded Burrows-Wheeler Transform (RL-BWT), and provide optimal quantum algorithms. We also show an efficient way of constructing a (known) compact index with equivalent capabilities as the suffix tree. This data structure is then applied to numerous problems, such as the longest common substring, finding maximal unique matches, lyndon factorization, etc (see arXiv:2302.07235 for a preliminary version).

Tomas Vinar – Analyzing SARS-CoV-2 waste water samples, by Deconstructing a pangenome

Abstract

The genomes of SARS-CoV-2 are classified into variants, some of which are monitored as variants of concern (e.g. the Delta variant B.1.617.2 or Omicron variant B.1.1.529). Proportions of these variants circulating in a human population are typically estimated by large-scale sequencing of individual patient samples. Sequencing a mixture of SARS-CoV-2 RNA molecules from wastewater provides a cost-effective alternative, but requires methods for estimating variant proportions in a mixed sample. From the modeling point of view, a sequenced sample is a pangenome of SARS-CoV-2 strains which needs to be deconstructed into individual genomes. We will briefly explore a solution to this problem and outline limitations of our current approach and some open problems in this area.

Daniel Doer – Investigating Allelic and Non-Allelic Homologous Recombination through Founder Sequences – slides

Abstract

Homologous recombination is a major driver of genetic variation of populations. Massive sequencing efforts enable the study of population genetic variation through large collections of genomic sequences, depending on the context called “haplotype reference panel” or “pangenome”. In search for compact, descriptive, and computationally amendable representations of pangenomes, the theory of founder sequences has recently celebrated a comeback in the form of (elastic) founder graphs that enable linear time construction and indexability. Yet, founder graphs have limited ability to represent structural variation.
The variation graph, an alternative data structure, has gained popularity due to its ability to broadly capture genetic variation, including structural variation. This talk discusses the deep connection between founder and variation graphs with respect to homologous recombination. In particular, we highlight how homologous recombination between non-allelic loci gives rise to structural variation.
Consequently, we propose a computational model that unifies both allelic and non-allelic homologous recombination and discuss open problems arising from this model.

Travis Gagie – Pangenomic FM-indexes – slides

Abstract

DNA alignment has been a killer app for the FM-index, but aligning DNA reads against a single genome can bias research results and medical diagnoses. In the past few years we have found ways to FM-index datasets of thousands of genomes, but researchers want the results expressed in terms of compact representations called pangenome graphs. Hundreds of matches in the dataset may correspond to only one or two matches in the graph. Given a read, therefore, we would like to find which parts of it match well and where they match in the graph, in time depending on the length of the read and the number of matches in the graph but not on the number of matches in the dataset.

Schedule

Sunday 19 February

15.00 onwards	Check-in
19.00-21.00	Welcome banquet

Monday 20 February

07.30-09.00	Breakfast
09.00-09.15	Welcoming address – Paola Bonizzoni, Alberto Policriti, Sadakane Kunihiko
09.15-10.00	Ice-break
10.00-10.30	Presentation of PANGAIA/ALPACA networks
10.30-11.00	Coffee break
11.15-12.00	Talk session 1
12.00-13.30	Lunch
13.30-15.00	Talk session 2
15.00-15.30	Coffee break
15.30-16.30	Talk session 3
16.30-18.00	Free time
18.00-19.30	Dinner

Tuesday 21 February

07.30-09.00	Breakfast
09.00-11.00	Talk session 4
10.30-11.00	Coffee break
11.15-12.00	Plenary session 1
12.00-13.30	Lunch
13.30-15.00	Teamwork session 1
15.00-15.30	Coffee break
15.30-16.30	Teamwork session 2
16.30-18.00	Free time
18.00-19.30	Dinner

Wednesday 22 February

07.30-09.00	Breakfast
09.00-11.00	Plenary session 2
10.30-11.00	Coffee break
11.00-12.00	Teamwork session 3
12.00-13.30	Lunch
13.00-20.45	Excursion and dinner

Thursday 23 February

07.30-09.00	Breakfast
09.00-11.00	Plenary session 3
10.30-11.00	Coffee break
11.15-12.00	Teamwork session 4
12.00-13.30	Lunch
13.30-15.00	Teamwork session 5
15.00-15.30	Coffee break
15.30-16.30	Teamwork session 6
16.30-18.00	Free time
18.00-19.30	Dinner

Friday 24 February

07.30-09.30	Breakfast and check-out
09.30-10.15	Plenary session 4
10.15-11.00	Coffee break
11.00-11.30	Plenary session 5
11.30-12.00	Conclusion and wrap-up
12.00-13.30	Lunch and end of meeting