Researchers at Dartmouth’s Norris Cotton Cancer Center reported at the Pacific Symposium on Biocomputing that denoising autoencoders (DAs) can be used to extract key biological principles from large gene expression data sets from breast cancer cells.
DAs are a variant of artificial neural networks that learn compact and efficient representations of input data by deliberately adding perturbations (called “noise”) to it. In essence, a DA scrambles part of the data and then attempts to reconstruct the original. Because the scrambling hides some of the signal, the DA has to recognize the features that persist through the noise in order to characterize the input. During training, the network builds a model, repeatedly resamples the corrupted inputs and reconstructs the data, adjusting the model until its reconstructions come as close as possible to the original data. Incorporating noise during training yields robust features. The research team, led by Dr. Casey S. Greene, has applied DAs to a large collection of breast cancer gene expression data.
“Cancers are very complex,” said Dr. Greene in a news release. “Our goal is to measure which genes are being expressed, and to what extent they’re being expressed, and then automatically summarize what the cancer is doing and how we might control it.”
To do so, the team first added noise to the input data and then let a computer learn how to remove it. To succeed at that task, the computer had to learn important features and concepts of breast cancer. “This approach of removing noise makes the models we constructed more generally applicable,” said Dr. Greene.
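The add-noise-then-reconstruct idea can be illustrated with a small Python sketch using NumPy. It is not the team's implementation: the masking noise, the tied weights, the sigmoid units, the learning rate, and the toy "expression" matrix are all assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def corrupt(x, noise_level, rng):
    """Masking noise (an assumption): zero out a random fraction of entries."""
    keep = rng.random(x.shape) >= noise_level
    return x * keep

def train_da(x, n_hidden=4, noise_level=0.2, lr=0.2, epochs=300, seed=0):
    """Train a one-layer denoising autoencoder with tied weights.

    x: array of shape (n_samples, n_genes) with values scaled to [0, 1].
    Returns the weights, both bias vectors, and the loss per epoch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = x.shape
    W = rng.normal(0.0, 0.1, size=(n_genes, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_genes)
    eps = 1e-9
    losses = []
    for _ in range(epochs):
        x_noisy = corrupt(x, noise_level, rng)   # fresh noise every epoch
        h = sigmoid(x_noisy @ W + b_h)           # encode the corrupted input
        x_hat = sigmoid(h @ W.T + b_v)           # decode with the same weights
        # Cross-entropy is measured against the CLEAN input, not the noisy one.
        ce = x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps)
        losses.append(-np.mean(np.sum(ce, axis=1)))
        # Backpropagation for sigmoid outputs with cross-entropy loss.
        d_out = (x_hat - x) / n_samples
        d_hid = (d_out @ W) * h * (1 - h)
        grad_W = x_noisy.T @ d_hid + d_out.T @ h  # encoder + decoder terms
        W -= lr * grad_W
        b_h -= lr * d_hid.sum(axis=0)
        b_v -= lr * d_out.sum(axis=0)
    return W, b_h, b_v, losses

# Toy data (hypothetical): two sample groups, each with its own set of "high" genes.
rng = np.random.default_rng(1)
pattern = np.zeros((60, 20))
pattern[:30, :10] = 1.0
pattern[30:, 10:] = 1.0
x = np.clip(0.1 + 0.8 * pattern + rng.normal(0.0, 0.05, pattern.shape), 0.0, 1.0)

W, b_h, b_v, losses = train_da(x)
print("first loss:", losses[0], "last loss:", losses[-1])
```

The point to notice is that the network only ever sees the corrupted matrix `x_noisy`, yet it is scored on how well it recovers the clean `x`. Memorizing individual values does not help, so the hidden units are pushed toward patterns shared across many genes and samples.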
The team then focused on DAs as a tool to identify and extract information from genomic data. The advantage of DAs is that the computer is trained directly on the data set, without requiring prior knowledge of biological principles. The resulting model is then compared with prior findings, both to check whether it supports them and to identify areas where the data raise new questions.
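One simple way to compare learned features with what is already known is to check how well each feature separates samples by an existing annotation. The sketch below assumes a matrix of hidden-unit activations and a binary label per sample; the function name `rank_features_by_label`, the standardized mean difference used as a score, and the toy data are all hypothetical and are not taken from the study.

```python
import numpy as np

def rank_features_by_label(activations, labels):
    """Rank learned features by how strongly they separate a binary label.

    activations: (n_samples, n_features) hidden-unit activations.
    labels: (n_samples,) array of 0/1 annotations (hypothetical here).
    Returns feature indices from most to least separating, and the scores.
    """
    activations = np.asarray(activations, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = activations[labels], activations[~labels]
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    # Pooled standard deviation; the small constant avoids division by zero.
    pooled_sd = np.sqrt((pos.var(axis=0) + neg.var(axis=0)) / 2.0) + 1e-12
    scores = np.abs(diff) / pooled_sd
    order = np.argsort(scores)[::-1]
    return order, scores

# Toy example: three features, of which only feature 1 tracks the label.
rng = np.random.default_rng(0)
labels = np.array([0, 1] * 20)
activations = rng.random((40, 3))
activations[:, 1] = 0.8 * labels + 0.1 * rng.random(40)

order, scores = rank_features_by_label(activations, labels)
print("features ranked by separation:", order)
```

A feature that scores highly against a known annotation supports the prior finding, while a strong feature that matches no annotation is a candidate for further investigation.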
The team tested DAs on a large collection of breast cancer gene expression data and found that they can indeed identify and extract important information from large genomic data sets, and can construct features that capture both clinical and molecular information about the cancer.
“These techniques and findings will enable others to use the DAs to evaluate gene expression data in a variety of disease sites,” explained Dr. Greene. “While noise in data is usually viewed as a problem, adding noise to data can actually be a good thing because it can help reveal the underlying signal. When we did this to analyze data from breast cancers, we found gene expression features that generalize across studies and represent important clinical factors.”
The next goal is to study more complex models that take several levels of regulation into account. The researchers also want to develop methods that not only model the data but also make available everything such models have learned.