DISCOVER and VALIDATE

DISCOVER and VALIDATE are two modules in Allelica’s Polygenic Risk Score Software as a Service pipeline.

These modules work together to enable researchers to construct and validate new PRS using Allelica's world-class infrastructure.


DISCOVER: develop new Polygenic Risk Scores

There is an ever-growing list of algorithms that can be used to develop a new PRS. Each has its own benefits and drawbacks, and it is rarely clear in advance which algorithm is best suited to a given disease. It is therefore advisable to try a range of methods and select the one that performs best. However, different algorithms often require different input data formats, and their reference implementations are frequently slow, hard to maintain, and difficult to reuse across projects. The result is that it can take over a year for an experienced bioinformatician to build and test a new PRS.

DISCOVER was built to provide a user-friendly, cloud-based computing solution for building state-of-the-art, publication-ready polygenic risk scores quickly, robustly, and in a standardized way.


Different diseases and phenotypes have different genetic architectures, so finding the PRS method with the best predictive performance requires trying several algorithms. DISCOVER runs a suite of PRS algorithms in parallel to identify the best predictive model for the disease under investigation.

Bioinformatics made simple

By deploying DISCOVER, users can assess ten PRS development algorithms simultaneously, including various flavours of LDpred2, SCT, PRS-CS, SBayesR, lassosum, Support Vector Machine and Clumping & Thresholding. DISCOVER runs these methods in parallel, which substantially reduces total execution time compared with running them one after another, and allows users to build their own PRS from genetic data and summary statistics from a Genome-Wide Association Study (GWAS). We can also easily implement new algorithms as they become available.


Crucially, DISCOVER provides several metrics to help users decide which PRS to take forward. These include the Area Under the Receiver Operating Characteristic Curve (AUC), which measures how well the model separates cases from controls; the Odds Ratio per standard deviation, which measures how well the model captures the gradient of risk of the disease in question; and the percentiles of the dataset that are at 3-fold or greater increased risk relative to the remainder of the population. These are industry-standard metrics for model comparison and give the user the information needed to choose the best-performing PRS.
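As an illustration of the first two metrics, the following is a minimal sketch in plain Python, not DISCOVER's own implementation. The function names `auc` and `odds_ratio_per_sd` are hypothetical, labels are assumed to be coded 1 for cases and 0 for controls, and the Odds Ratio per standard deviation is assumed to come from an unadjusted logistic regression on the standardized score (real analyses typically also adjust for covariates such as age, sex and ancestry components).

```python
import math
from statistics import mean, pstdev


def auc(scores, labels):
    """AUC as the probability that a random case outscores a random control."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    # Count case/control pairs ranked correctly; ties count as half.
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def odds_ratio_per_sd(scores, labels, iterations=25):
    """exp(beta) from a one-variable logistic regression on the z-scored PRS.

    Fitted with Newton-Raphson. Assumes the scores are not constant and that
    cases and controls overlap (no perfect separation).
    """
    mu, sd = mean(scores), pstdev(scores)
    z = [(s - mu) / sd for s in scores]
    b0, b1 = 0.0, 0.0  # intercept and slope on the log-odds scale
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * zi))) for zi in z]
        # Gradient of the log-likelihood.
        g0 = sum(y - pi for y, pi in zip(labels, p))
        g1 = sum((y - pi) * zi for y, pi, zi in zip(labels, p, z))
        # Entries of the 2x2 information matrix X^T W X.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * zi for wi, zi in zip(w, z))
        h11 = sum(wi * zi * zi for wi, zi in zip(w, z))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += inverse(information) * gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return math.exp(b1)
```

An AUC of 0.5 corresponds to a score that ranks cases no better than chance, and an Odds Ratio per standard deviation of 1 corresponds to a score with no association with the phenotype.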


Once a PRS is chosen, its transferability can then be tested with VALIDATE and its predictive power formally assessed on a different dataset before being applied to individuals’ data with the PREDICT module. This checks that PRS are applicable to different populations and quantifies the effects of over-fitting.

VALIDATE: precision validation for new PRS

To validate a PRS, users need a new, independent set of genomic data on which the phenotype of interest has been measured.

Using VALIDATE, researchers can validate their new PRS on an independent dataset, helping them to understand the applicability of their PRS to a new population. This new population can be similar to or different from the one on which the PRS was developed, allowing VALIDATE to provide a robust assessment of the predictive performance of a PRS on new data.

This is important because most published PRS have been developed on genomic datasets from populations of predominantly western European genetic ancestry, since far more data are available from these populations. Several researchers have highlighted this lack of diversity, and Allelica supports all efforts to increase the diversity of datasets in order to produce a more equitable approach to precision medicine. With VALIDATE, users can therefore test the applicability of scores to populations of diverse ancestry.

Measurable predictive performance

The Area Under the Receiver Operating Characteristic Curve (AUC), to measure how well the model separates cases from controls.
The Odds Ratio per standard deviation, to measure how well the model captures the gradient of risk of the disease in question.
The percentiles of the dataset that are at 3-fold or greater increased risk relative to the remainder of the population.
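The third metric can be sketched as follows. This is an illustrative reading rather than VALIDATE's documented method: it assumes that "increased risk" means the ratio of disease prevalence in the top-scoring fraction of the dataset to prevalence in the remainder, and the function name `fraction_at_increased_risk` and the candidate fractions are hypothetical.

```python
def fraction_at_increased_risk(scores, labels, fold=3.0,
                               candidates=(0.01, 0.05, 0.10, 0.20)):
    """Largest candidate top fraction whose prevalence is >= fold x the rest.

    labels are 1 for cases and 0 for controls. Returns 0.0 if no candidate
    fraction reaches the requested fold increase.
    """
    # Rank individuals from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_cases = sum(y for _, y in ranked)
    best = 0.0
    for frac in candidates:
        k = round(n * frac)  # size of the high-risk group
        if k == 0 or k == n:
            continue  # need a non-empty group and a non-empty remainder
        top_cases = sum(y for _, y in ranked[:k])
        rest_cases = total_cases - top_cases
        if rest_cases == 0:
            continue  # prevalence ratio is undefined without cases in the rest
        ratio = (top_cases / k) / (rest_cases / (n - k))
        if ratio >= fold:
            best = max(best, frac)
    return best
```

For example, a return value of 0.05 would mean that the top 5% of the score distribution carries at least three times the disease prevalence of the remaining 95%, under the assumptions stated above.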

Alignment with the latest industry standards

Together, the DISCOVER and VALIDATE modules output a set of PRS and a quantification of their predictive performance. Being able to choose the best PRS from several methods helps users align their research with the Polygenic Risk Score Reporting Standards published in Nature (2021).
