Industrial Maths AI and Health study group – developing machine learning to better understand tumour pathology specimens
Posted on 21/03/2019
This is the fourth of several problems that will be discussed at a three-day study group in Manchester from 26-28 June.
If you are a researcher working in a UK university who would like to work on this problem, please register here.
The organisers, KTN, alongside the Universities of Manchester, are looking for researchers to work on the following conundrum:
Development of a Formalin-Fixed Paraffin-Embedded artefact filter for RNAseq based biomarkers – presented by Almac Diagnostic Services
Formalin-Fixed Paraffin-Embedded (FFPE) tissue is the most common format for archiving solid tumour pathology specimens after surgery, and so provides a potentially invaluable resource for translational clinical research. However, because the formalin fixation process causes degradation and chemical modifications, FFPE samples are typically of poor quality. This causes problems for technologies and protocols such as Next Generation Sequencing (NGS), which generate genomic “big data” to advance the understanding of cancer but struggle to extract data of sufficient quality from FFPE material for effective downstream research. For example, RNASeq uses NGS to measure the quantity of RNA (the gene expression profile) in a tumour sample, which in turn can help researchers understand the genetic drivers of the tumour’s progression and its likely response to treatment. The accuracy of gene expression estimates can be affected by artefacts called ‘PCR duplicates’. These arise when multiple copies of the same RNA molecule are sequenced, biasing the read counts at certain regions of the genome. The issue is more prevalent in poor quality samples such as FFPE and can often preclude further use of their gene expression profiles.
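To make the idea of a duplication rate concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each aligned read is summarised as a (chromosome, start position, strand) tuple and that reads sharing all three values are treated as duplicates of one another. This is a simplified version of the usual coordinate-based convention, not a description of Almac's pipeline, and the function name `duplication_rate` and the toy reads are hypothetical.

```python
from collections import Counter


def duplication_rate(reads):
    """Fraction of reads that are extra copies at an already-seen position.

    `reads` is an iterable of (chromosome, start, strand) tuples. This
    representation is an assumption made for illustration only.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # One read per distinct position is kept as the "original";
    # every further read at that position counts as a duplicate.
    duplicates = total - len(counts)
    return duplicates / total


# Toy example with made-up coordinates (not real data).
toy_reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
    ("chr2", 40, "+"),
]
rate = duplication_rate(toy_reads)
print(f"duplication rate: {rate:.2f}")
```

In RNASeq, reads that start at the same position can also come from distinct molecules of a highly expressed gene, so simply discarding every coordinate-level duplicate can itself distort expression estimates. That is part of why a more careful, learned correction is of interest.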
The objective of the challenge is to develop a machine learning approach that corrects for the high duplication rates seen across FFPE samples and achieves accurate gene expression estimates comparable to those from high quality material. Data already generated at Almac, and in the public domain, will be made available to the study group in order to address this challenge.
The research team will be provided with RNASeq gene expression data generated from 10 matched pairs of samples, with each pair consisting of one good quality (fresh frozen) and one poor quality (FFPE) sample from the same patient. For each gene, expression estimates from the fresh frozen and FFPE samples will be compared and the gene classed as either ‘impacted’ or ‘unchanged’, depending on the difference in expression estimates between the two sample types. The team will also be provided with several gene-level and sample-level features, which can be used as input to train a machine learning approach to distinguish the two gene classes. If the feasibility of this discrete classification approach is established, an additional objective would be to develop a continuous predictor (in effect a regression model rather than a classifier) of the extent to which an individual gene’s expression estimate is impacted by FFPE. This would allow a correction to be applied to the expression estimates to remove the bias introduced by FFPE and therefore provide more accurate gene expression measures, comparable to those from high quality samples.
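The labelling and correction steps described above can be sketched as follows. Everything specific here is an assumption for illustration: the source does not say how the difference between sample types is measured or where the ‘impacted’ cut-off lies, so this sketch uses the mean paired log2 difference (FFPE minus fresh frozen) with a pseudocount of 1 and a hypothetical threshold of one log2 unit. The function names and the toy expression values are likewise made up.

```python
import math
from statistics import mean


def log2_expr(value, pseudocount=1.0):
    """Log2-transform an expression estimate (pseudocount avoids log of zero)."""
    return math.log2(value + pseudocount)


def gene_bias(ffpe_values, frozen_values):
    """Mean paired log2 difference (FFPE minus fresh frozen) for one gene.

    The two lists hold one value per patient, in the same patient order.
    """
    diffs = [
        log2_expr(ffpe) - log2_expr(frozen)
        for ffpe, frozen in zip(ffpe_values, frozen_values)
    ]
    return mean(diffs)


def label_gene(bias, threshold=1.0):
    """Discrete label; the threshold is a hypothetical choice."""
    return "impacted" if abs(bias) >= threshold else "unchanged"


def correct_expression(ffpe_value, predicted_bias, pseudocount=1.0):
    """Remove a predicted log2 bias from a single FFPE estimate."""
    corrected_log = log2_expr(ffpe_value, pseudocount) - predicted_bias
    return max(2 ** corrected_log - pseudocount, 0.0)


# Toy matched pairs (made-up numbers, three patients for gene A, two for gene B).
toy = {
    "GENE_A": {"ffpe": [3, 7, 15], "frozen": [15, 31, 63]},
    "GENE_B": {"ffpe": [7, 15], "frozen": [7, 15]},
}
biases = {g: gene_bias(v["ffpe"], v["frozen"]) for g, v in toy.items()}
labels = {g: label_gene(b) for g, b in biases.items()}

# In practice the bias would be predicted by a trained model from gene-level
# and sample-level features; here the observed bias stands in for it.
corrected = correct_expression(3, biases["GENE_A"])
print(labels, corrected)
```

The labels produced this way would serve as training targets for the discrete classifier, while the per-gene bias values are the kind of continuous target a regression model would try to predict for samples that have no fresh frozen counterpart.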
This would represent a significant step towards fully exploiting FFPE samples, particularly those that would otherwise be discarded using standard approaches, increasing the opportunity to discover novel cancer drug targets and biomarkers from these samples.
Building on this initial support from AI experts in Cardiff and Manchester, Almac will be in a more informed position to incorporate FFPE artefact filtering methodologies into biomarker discovery and development pipelines across multiple RNA expression biomarker projects. It will also enhance the offering of its most recently launched RNA-expression-based claraT report, which is already gaining traction with several clients.
If you’re a university researcher and think you could play a part in helping with this challenge, sign up here.
Almac Diagnostic Services is a stratified medicine company specialising in biomarker-driven clinical trials. Its diagnostic experience spans oncology, immunology, CNS and infective diseases.
Almac’s global laboratories offer tailored solutions for molecular diagnostics from discovery to commercialisation:
• Biomarker discovery
• Custom assay development and validation
• DNA and RNA panel options for basket and umbrella trials
• Flexible CDx development and commercialisation
• Expert regulatory support
• Bioinformatics and software development