ReadMe:

Encoding a DNA Lexicon:
To encode the BabbleBricks that make up a lexicon, we begin with a list of words. We enumerate this list
first in decimal and then in base 4. We convert these numbers into their DNA equivalents using the schema A is 0, T is 1,
G is 2, C is 3, and pad each one to 5 base pairs. Once we have these variable sequences, we must ensure that no illegal
restriction sites can occur, so we add gap sequences. Finally, we append a stop codon region, an error correcting region
(gapped to prevent restriction sites) and the hangs for each BabbleBrick form. For example:

words:     base10: base4:  word coding: restriction gapped:  stop codons:          error correction:    AB and BA hangs:
i-gem       0      0         AAAAA         ACCAAAA           +TTAGCTAATCACTTATGA   +AAGGAATTAAGGAATTAA  +GGAG +CGCT
babbleblock 1      1         AAAAT         ACCAAAT           +TTAGCTAATCACTTATGA   +AAGGATTTAAGGATTTAA  +GGAG +CGCT
babblebrick 2      2         AAAAG         ACCAAAG           +TTAGCTAATCACTTATGA   +AAGGAGTTAAGGAGTTAA  +GGAG +CGCT
edinburgh   3      3         AAAAC         ACCAAAC           +TTAGCTAATCACTTATGA   +AAGGACTTAAGGACTTAA  +GGAG +CGCT
lexicon     4      10        AAATA         ACCAATA           +TTAGCTAATCACTTATGA   +AAGGATTTATGGAATTAA  +GGAG +CGCT
encoding    5      11        AAATT         ACCAATT           +TTAGCTAATCACTTATGA   +AAGGAGTTATGGATTTAA  +GGAG +CGCT
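
The first columns of the table can be reproduced with a short Python sketch. It is only an illustration: the function
names are hypothetical, and the gap rule (inserting "CC" after the first base) is an assumption inferred from the example
rows above rather than a documented rule. The error correcting region is not derived here.

```python
# Mapping from the text: A is 0, T is 1, G is 2, C is 3.
BASES = "ATGC"
WORD_LEN = 5   # every word coding region is padded to 5 bases
GAP = "CC"     # assumption: inferred from the table (AAAAA -> ACCAAAA)


def to_base4(n):
    """Return the base-4 digits of a non-negative integer as a string."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 4))
        n //= 4
    return "".join(reversed(digits))


def word_coding(index):
    """Convert a lexicon index into its padded 5-base word coding region."""
    digits = to_base4(index).rjust(WORD_LEN, "0")
    if len(digits) > WORD_LEN:
        raise ValueError("index does not fit in 5 bases")
    return "".join(BASES[int(d)] for d in digits)


def restriction_gapped(word):
    """Insert the gap after the first base (hypothetical rule matching the table)."""
    return word[0] + GAP + word[1:]


words = ["i-gem", "babbleblock", "babblebrick", "edinburgh", "lexicon", "encoding"]
for index, word in enumerate(words):
    coding = word_coding(index)
    print(word, index, to_base4(index), coding, restriction_gapped(coding))
```

With 5 bases per word coding region, this scheme can address 4^5 = 1024 lexicon entries.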

Encoding a BabbleBlock:
When we assemble our BabbleBricks together to create BabbleBlocks, it's vital that we know what the sequence will be: for verification
purposes, so that we can instruct the user exactly which BabbleBricks to use, and so that we can work out our checksum and address values.
We start by appending our word coding BabbleBricks together:
      5' GGAGACCAAAATAGCTAATCACTTATGAAAGGAATTAAGGAATTAA + GGAGACCAAATTAGCTAATCACTTATGAAAGGATTTAAGGATTTAA
      5' GGAGACCAAAATAGCTAATCACTTATGAAAGGAATTAAGGAATTAAGGAGACCAAATTAGCTAATCACTTATGAAAGGATTTAAGGATTTAA
We then look at the word coding regions, which are spaced at regular intervals, and use them to calculate a checksum as described in the checksum section below.
Finally, we append an address BabbleBlock which acts like a line number telling the decoding program where this BabbleBlock lies
in the overall archive.

Decoding a BabbleBlock:
To decode a BabbleBlock, we first look at our checksum and use it to verify whether error correction needs to be done. If any
mistakes are flagged, we use our error correcting apparatus, as described below, and return its results (what our sequence was before
the change) to the decoding program. The decoding program will then look at each word coding region, convert it back to its numerical value and
use it as an index to look up the information value of that BabbleBrick in the lexicon. Once all the BabbleBlocks in a batch have been decoded,
they are sorted into order using the address found at the end of each sequence before the decoded information is returned to the user.
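
The lookup and sorting steps might look like the following Python sketch. It assumes each BabbleBlock has already been
split into an integer address and a list of 5-base word coding regions; that (address, word codings) layout and the
function names are hypothetical, and checksum verification and error correction are left out.

```python
BASES = "ATGC"  # A is 0, T is 1, G is 2, C is 3


def word_to_index(word):
    """Read a word coding region as a base-4 number."""
    index = 0
    for base in word:
        index = index * 4 + BASES.index(base)
    return index


def decode_batch(blocks, lexicon):
    """Decode (address, word_codings) pairs and return the words in address order.

    The (address, word_codings) layout is a hypothetical, already-parsed form.
    """
    ordered = sorted(blocks, key=lambda block: block[0])
    decoded = []
    for _address, codings in ordered:
        decoded.append([lexicon[word_to_index(coding)] for coding in codings])
    return decoded


lexicon = ["i-gem", "babbleblock", "babblebrick", "edinburgh", "lexicon", "encoding"]
batch = [(1, ["AAATA", "AAATT"]), (0, ["AAAAA", "AAAAC"])]
print(decode_batch(batch, lexicon))
```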

Checksum:
A checksum is a commonly used technique in network engineering that we’ve adapted for use in DNA storage.
In this format the checksum allows the detection of errors in a BabbleBlock (DNA sentence). The checksum holds
the sum of the word coding regions of BabbleBricks (DNA words) in a BabbleBlock in a 4 BabbleBrick region that
can be found towards the end of a BabbleBlock.
When a BabbleBlock is being translated, the checksum is looked at first to see if it matches the sum of the
word coding regions in the DNA we get back from sequencing. If it does, no errors have been detected
and the translation program can proceed as normal. If there is no match, an error has occurred somewhere
in the BabbleBlock and other error correcting mechanisms will have to be used to make sure the stored
information can still be read back.
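
As a rough illustration, the comparison could be sketched in Python as below. It assumes that each word coding region
is read as its base-4 number and that the stored checksum has already been extracted from the BabbleBlock as an integer;
how the sum is laid out across the 4 BabbleBrick region is not covered, and the function names are hypothetical.

```python
BASES = "ATGC"  # A is 0, T is 1, G is 2, C is 3


def word_value(word):
    """Read a word coding region as a base-4 number (assumed meaning of its 'sum' value)."""
    value = 0
    for base in word:
        value = value * 4 + BASES.index(base)
    return value


def checksum(word_codings):
    """Sum the values of all word coding regions in one BabbleBlock."""
    return sum(word_value(word) for word in word_codings)


def needs_error_correction(word_codings, stored_checksum):
    """Return True when the sequenced words do not match the stored checksum."""
    return checksum(word_codings) != stored_checksum


stored = checksum(["AAAAT", "AAAAG", "AAATT"])    # values 1 + 2 + 5
sequenced = ["AAAAT", "AAAAC", "AAATT"]           # second word mutated from G to C
print(stored, needs_error_correction(sequenced, stored))
```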

Optimal Rectangular Code:
The Optimal Rectangular Code (ORC) is a type of error correcting code that is present in all of our BabbleBricks.
When the checksum detects an error, the next action is to check the ORCs in each word and use them to correct any
errors that have taken place, as follows:

                A:0, T:1, G:2, C:3
                BabbleBrick Word: ATCGA
                1. Add an 'A' giving ATCGAA
                2. Arrange in rectangular form
                3. Calculate row and column sums
                A   T   C   4
                G   A   A   2
                2   1   3
                4. Store values in a vector ORC:{4, 2, 2, 1, 3}
                5. This is the ORC for this word stored in the DNA

                If the strand changes to TTCGA (for example),
                the word's ORC will not match during decoding.
                New ORC: {5, 2, 3, 1, 3}
                T   T   C   5
                G   A   A   2
                3   1   3
                A simple formula detects and corrects the error
                Fix = new column sum - stored column sum = 3 - 2 = 1
                Correct = T - Fix = 1 - 1 = 0
                A = 0 giving ATCGA
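
The worked example above can be sketched in Python as follows. This is a minimal illustration, not the project's
implementation: the function names are hypothetical, words are assumed to be exactly 5 bases, and the stored ORC is
assumed to have been read back without errors of its own.

```python
BASES = "ATGC"  # A is 0, T is 1, G is 2, C is 3
ROWS, COLS = 2, 3


def to_grid(word):
    """Pad a 5-base word with 'A' and arrange its values in a 2 x 3 grid."""
    if len(word) != 5:
        raise ValueError("expected a 5-base word")
    values = [BASES.index(base) for base in word + "A"]
    return [values[r * COLS:(r + 1) * COLS] for r in range(ROWS)]


def orc(word):
    """Return the row sums followed by the column sums."""
    grid = to_grid(word)
    row_sums = [sum(row) for row in grid]
    col_sums = [sum(grid[r][c] for r in range(ROWS)) for c in range(COLS)]
    return row_sums + col_sums


def correct_single_error(word, stored_orc):
    """Fix one substituted base using the stored ORC, or raise if that is not possible."""
    observed = orc(word)
    bad_rows = [r for r in range(ROWS) if observed[r] != stored_orc[r]]
    bad_cols = [c for c in range(COLS) if observed[ROWS + c] != stored_orc[ROWS + c]]
    if not bad_rows and not bad_cols:
        return word  # ORCs match, nothing to fix
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single-base error")
    r, c = bad_rows[0], bad_cols[0]
    fix = observed[ROWS + c] - stored_orc[ROWS + c]
    position = r * COLS + c
    if fix != observed[r] - stored_orc[r] or position >= len(word):
        raise ValueError("not a single-base error")
    corrected = BASES.index(word[position]) - fix
    if not 0 <= corrected <= 3:
        raise ValueError("correction is out of range")
    return word[:position] + BASES[corrected] + word[position + 1:]


stored = orc("ATCGA")
print(stored, orc("TTCGA"), correct_single_error("TTCGA", stored))
```

The mismatching row and column together locate the changed base, and the size of the mismatch gives the amount to subtract.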

Unfortunately, the ORC can only locate and fix single base pair errors. However, the properties of the ORC
facilitate a further level of error correction.

Beyond Optimal Rectangular Codes - The SuperFixer Function and Levenshtein Distance:
If the Optimal Rectangular Code (ORC) detects more than the single base pair error it can fix using its
normal actions, the SuperFixer function is called. The SuperFixer uses the ORC and works out all the possible
words in the lexicon that fit this pattern. For any given ORC this is a maximum of 4 words, and indeed there
are many words that have unique ORCs. Once the SuperFixer has narrowed down the possible words, the list
of possibilities is compared to what appears in the actual DNA and the sequence with the lowest Levenshtein
distance is used in the translation. Levenshtein distance is a metric which shows the number of deletions,
insertions or substitutions it will take to turn one string (or sequence) into another.
By taking the minimum Levenshtein distance we always work under the assumption that the smallest number of
errors is most probable. This assumption gives the best performance across errors of different magnitudes,
because the robustness of DNA as a molecule makes small errors considerably more likely than larger ones.
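
The selection step described above might be sketched in Python as below. The names super_fixer and levenshtein are
hypothetical stand-ins, the ORC is computed as in the 2 x 3 worked example, and the stored ORC is assumed to be intact.

```python
BASES = "ATGC"  # A is 0, T is 1, G is 2, C is 3


def orc(word):
    """Row sums then column sums of the 'A'-padded word in a 2 x 3 grid."""
    values = [BASES.index(base) for base in word + "A"]
    rows = [values[0:3], values[3:6]]
    return [sum(row) for row in rows] + [rows[0][c] + rows[1][c] for c in range(3)]


def levenshtein(a, b):
    """Number of insertions, deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def super_fixer(observed, stored_orc, lexicon_codings):
    """Pick the lexicon word that fits the stored ORC and is closest to the observed DNA."""
    candidates = [word for word in lexicon_codings if orc(word) == stored_orc]
    if not candidates:
        return None
    return min(candidates, key=lambda word: levenshtein(observed, word))


# ATCGA and TACTT share the ORC [4, 2, 2, 1, 3]; AAAAA does not.
lexicon_codings = ["AAAAA", "ATCGA", "TACTT"]
print(super_fixer("TTCCA", [4, 2, 2, 1, 3], lexicon_codings))
```

Here the observed word TTCCA differs from ATCGA in two positions and from TACTT in three, so ATCGA is chosen.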

