Calculating the Synthesizability of DNA
In this post we'll go through the steps to develop a DNA synthesizability predictor (DNA-SP) in a completely no-code way using the Zenify platform. A DNA-SP is a machine learning (ML) model that aims to predict whether a given DNA sequence is easy or hard to manufacture in a lab. Such a capability is helpful to lab technicians because it drastically reduces the iterative manufacturing cycle, which typically involves synthesising many related chunks of DNA, where a failure to manufacture even a single chunk compromises the final goal (e.g., assembly).
The reason why some DNA chunks are more difficult than others to synthesize lies in the presence of hairpins (structures of contiguous nucleotides belonging to one DNA strand that bind to each other), many duplicated sections, or too many (or too few) G/C nucleotides. To understand how ML can speed up this task, we'll explore the following steps:
Data pre-processing: pre-computing DNA features.
Model training: using the DNA features to train a binary classifier (logistic regression or random forest).
Model testing: using the model to infer DNA synthesizability and suitably characterising its performance.
We'll first show the above steps applied to some basic use cases (workflows in Zenify) and then move on to more complex scenarios.
Data preparation flow: metrics extraction
Data preparation in genomics consists of extracting metrics/features from the raw sequences and is a required step for any downstream processing we intend to perform. Feature extraction is needed because DNA sequences are composed of the characters T, G, C, and A, while mathematical models require numerical inputs that summarise relevant characteristics of the data.
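To make this concrete, here is a minimal sketch of turning a raw DNA string into numeric features. The specific metrics Zenify computes are not public; GC content and the longest homopolymer run are simply two illustrative examples of sequence-to-number summaries.

```python
import re

def dna_features(seq: str) -> dict:
    """Summarise a DNA string as a few illustrative numeric features."""
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    # longest run of identical nucleotides, e.g. "GGGG" -> 4
    longest_run = max(len(m.group()) for m in re.finditer(r"(.)\1*", seq))
    return {"length": len(seq), "gc_content": gc, "max_poly_run": longest_run}

print(dna_features("ATGCGGGGTA"))
# -> {'length': 10, 'gc_content': 0.6, 'max_poly_run': 4}
```

A model never sees the letters themselves, only fixed-length numeric vectors like this one.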
Zenify makes data preparation an easy task! Given a dataset of DNA sequences labelled as "synthesized"/"cancelled" (depending on whether or not they could be synthesized), Zenify offers an easy way to ingest it into a flexible internal library consumable by a workflow. Once made available to the platform, the data can then be "prepared" by calculating a fixed-length feature set that enriches the original data. Let's construct our data preparation workflow as shown below; after all bricks have been dragged, dropped, and configured, click the "Run" button to start execution.
In this workflow, the ingested dataset is pointed to by the DNA Dataset Source brick: originally a CSV file, the data is read here as a dataframe with headers inferred from the first line of the file. When loading data, the brick offers the option to filter items by train, test, or validation partition if the original data have been labelled that way; otherwise, all data is loaded. The Nucleotide metrics brick implements the core functionality of the flow by extracting relevant metrics ("poly_runs", "pattern_runs", "i_motifs", "g_quad_motifs", etc.) from the nucleotide sequences as dataframes, which are finally saved by the CSV sink brick into a CSV file that is now ready for model training.
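The Source → metrics → sink chain can be sketched in a few lines of pandas. The column names ("sequence", "label") and the metrics computed here are assumptions for illustration, not Zenify's actual schema.

```python
import io
import pandas as pd

# stand-in for the ingested dataset; in Zenify this comes from the library
csv_text = "sequence,label\nATGCGGGGTA,synthesized\nAAAAAAATTT,cancelled\n"

# DNA Dataset Source: read the CSV, headers inferred from the first line
df = pd.read_csv(io.StringIO(csv_text))

# Nucleotide metrics: enrich each row with fixed-length numeric features
df["gc_content"] = df["sequence"].map(
    lambda s: (s.count("G") + s.count("C")) / len(s))
df["length"] = df["sequence"].str.len()

# CSV sink: persist the enriched dataframe for the training workflow
df.to_csv("dna_metrics.csv", index=False)
```

The output file keeps the original sequence and label columns alongside the new numeric ones, which is exactly the shape a classifier-training step expects.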
Training a prediction model
Using the nucleotide metrics file with labels produced in the previous workflow, we can now train a binary classifier with the Logistic Regression Training brick. It offers the option to either use all numerical features to predict labels or perform a forward step-wise search, starting from no features and adding the best feature according to the Akaike or Bayesian Information Criterion (the "aic" and "bic" options of the Feature Selection Method parameter, respectively).
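Zenify's implementation is not public, but the forward step-wise idea can be sketched with scikit-learn and a hand-rolled AIC score (AIC = 2k − 2·log-likelihood): repeatedly add whichever remaining feature lowers the criterion most, and stop when no addition improves it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aic(model, X, y):
    """Akaike Information Criterion for a fitted binary classifier."""
    p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = X.shape[1] + 1  # coefficients + intercept
    return 2 * k - 2 * loglik

def forward_select(X, y, names):
    """Greedy forward selection: add the feature that most lowers AIC."""
    chosen, best_aic = [], np.inf
    while len(chosen) < X.shape[1]:
        scores = []
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            cols = chosen + [j]
            m = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
            scores.append((aic(m, X[:, cols], y), j))
        score, j = min(scores)
        if score >= best_aic:
            break  # no remaining feature improves the criterion
        best_aic, chosen = score, chosen + [j]
    return [names[j] for j in chosen]

# toy data: only the first feature actually carries the label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
selected = forward_select(X, y, ["gc_content", "max_poly_run", "noise"])
```

Swapping the AIC for the BIC only changes the penalty term (k·log n instead of 2k), which is presumably what the "bic" option does.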
We are finally ready to use the trained classifier to perform actual predictions on a labelled test set by building the following flow.
After executing the flow, clicking on the Logistic Regression Inference brick visualises a table showing, for each DNA item, the model's confidence that it is synthesizable or not. The highlighted cells represent the predicted label.
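In scikit-learn terms, that per-item table corresponds to the two columns of `predict_proba`, with the "highlighted cell" being the higher-probability class. A minimal sketch (the toy feature values and the cancelled/synthesized encoding are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy training set: one numeric feature, 0 = cancelled, 1 = synthesized
X_train = np.array([[0.30], [0.35], [0.70], [0.75]])
y_train = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X_train, y_train)

# inference on a labelled test set
X_test = np.array([[0.32], [0.72]])
proba = clf.predict_proba(X_test)  # columns: [P(cancelled), P(synthesized)]
pred = proba.argmax(axis=1)        # the "highlighted cell" per row
```

Each row of `proba` sums to 1, and `pred` is the label the table would highlight.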
A global view of how the classifier is behaving is provided instead by the ROC Curve brick, which produces a ROC curve: the higher the area under the curve, the better the performance.
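The same summary can be computed directly from model scores with scikit-learn (this is the generic ROC/AUC computation, not Zenify's brick):

```python
from sklearn.metrics import roc_curve, auc

y_true   = [0, 0, 1, 1]              # ground-truth labels
y_scores = [0.1, 0.4, 0.35, 0.8]     # model confidence in the positive class

fpr, tpr, _ = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)  # area under the curve: 1.0 is perfect, 0.5 is random
```

Plotting `tpr` against `fpr` gives the curve the brick draws; `roc_auc` is the single number to compare models by.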
More advanced flows
Contextual multiple pre-processing
Data preparation may sometimes require that a few different operations be performed on the initial raw dataset besides feature extraction, with their outputs enriching the original data items. For example, we may want to assign each DNA sequence to a train/validation/test partition under the constraint that very similar sequences end up in the same partition. This can be done alongside feature extraction in the same Zenify workflow, as illustrated below.
In this flow, a Zenify library of unlabelled data feeds the DNA Distance Matrix brick, which computes pairwise distances between DNA sequence features based on a distance computation algorithm (e.g., Blast Distance Metric). The specific algorithm to use can be selected through the DNA distance metric parameter of that brick. The same distance metric is then used by the Clustering brick to produce feature clusters, which are allocated to dataset partitions according to a specified ratio.
As we can see, unlike the basic data preparation flow seen before, here we also calculate multiple metrics in parallel, so as to later feed the classifiers with a richer DNA sequence "description" for better training.
Finally, partition assignment, original data, and extracted metrics are all saved into a single csv file by the CSV Sink brick.
Training two classifiers at once
We can now use all the pre-computed features and labels to train two prediction models at once: a logistic regression classifier (as before) and a random forest classifier. Each model will be saved as a separate file, according to its specified file path. The Random Forest Training brick exposes the configurable number_of_estimators parameter to set the desired number of trees in the random forest.
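A rough scikit-learn equivalent of training both classifiers side by side and saving each to its own file. The toy data, file paths, and the mapping of Zenify's number_of_estimators onto scikit-learn's `n_estimators` are assumptions.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# toy feature matrix and labels standing in for the prepared CSV
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

logreg = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# each trained model goes to a separate file, as in the Zenify flow
for path, model in [("logreg.pkl", logreg), ("forest.pkl", forest)]:
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
```

More trees generally mean a smoother, more stable forest at the cost of training and inference time, which is the trade-off the parameter lets you tune.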
Testing two classifiers in one workflow
After having trained the models, we can restore and test them. We can do this all in one workflow where we:
load a set of DNA sequences with their labels (Labelled DNA sequence source brick);
extract features from the sequences (metrics-related bricks) and assemble them together (Horizontal Concatenation brick);
pass these features through the Random Forest Inference and Logistic Regression Inference bricks to get statistical predictions.
The probabilities from each of the two classifiers are converted to actual label predictions (Label Chooser brick) and then analysed together with the ground-truth labels by the ROC Curve Visualizer and Confusion Matrix bricks, and stored in separate CSV files.
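The Label Chooser + Confusion Matrix steps amount to thresholding each model's probabilities and tabulating agreement with the ground truth. The probability values and the 0.5 cutoff below are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 1, 0, 1])               # ground-truth labels
proba_lr = np.array([0.2, 0.9, 0.6, 0.4, 0.3])   # P(synthesized), logistic regr.
proba_rf = np.array([0.1, 0.8, 0.7, 0.2, 0.6])   # P(synthesized), random forest

for name, proba in [("logreg", proba_lr), ("forest", proba_rf)]:
    pred = (proba >= 0.5).astype(int)   # Label Chooser with a 0.5 cutoff
    cm = confusion_matrix(y_true, pred) # rows: true class, cols: predicted
    print(name, cm.tolist())
```

Comparing the two matrices side by side makes it easy to spot which classifier trades false positives for false negatives.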
Taking a close look at some DNA
The synthesizability of DNA chunks is influenced by several factors, such as the frequency of hairpins or repeats, to mention a few. Being able to visualise them therefore plays a critical role in helping researchers make sense of, for example, why some DNA chunks fail to be synthesised.
Zenify addresses this need by providing a collection of bricks that enable relevant genomics-specific visualisations. Let's see one of them in operation in the following workflow.
Here, the Genome Track Sink brick visualises three different types of track motifs (hairpins, repeats, and homopolymers) side by side, so that the co-presence of factors is immediately apparent. Clicking on each instance of a retrieved motif displays its detailed specs.
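Under the hood, locating such motifs boils down to scanning the sequence for patterns and recording their coordinates. A regex-based sketch (the patterns below are rough illustrations; real hairpin detection requires folding-aware algorithms, and the G-quadruplex pattern is a common heuristic, not Zenify's definition):

```python
import re

# illustrative motif patterns, keyed by a track name
MOTIFS = {
    "homopolymer": r"(.)\1{4,}",                   # 5+ identical nucleotides
    "g_quad_motif": r"G{3,}(?:\w{1,7}G{3,}){3}",   # rough G-quadruplex heuristic
}

def scan(seq):
    """Return (track, start, end, text) for every motif hit, sorted by start."""
    hits = []
    for name, pattern in MOTIFS.items():
        for m in re.finditer(pattern, seq.upper()):
            hits.append((name, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])

for hit in scan("ATGGGGGGTTACCCCCA"):
    print(hit)
```

Each tuple is one clickable instance on a track: the motif type, its coordinates, and the matched subsequence, i.e. the "detailed specs" the viewer would display.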
Great! Now you know how to perform a task as complex as training multiple classifiers to predict DNA synthesizability by easily constructing no-code Zenify workflows! Taking inspiration from these, you can build different ones for your specific use case and execute your workflows over and over again on different datasets, with a single click!