Data Challenge - Message Decoder

During the Summer of 2017 I was invited by a member of the Clinical Developers Club to participate in a data programming challenge. The data, stored in FASTA format, describing approximately 121 000 base pairs in which an ASCII-encoded message was hidden within an intermediary binary-encoded format was provided to participants.

The challenge was to implement a message decoder revealing the hidden message contained within the FASTA data file. An implementation using the GNU Bourne Again Shell (BASH) programming language augmented with a few basic *nix utilities successfully decoded the DNA sequence. Subsequently, a solution implemented with the Beginner’s All-Purpose Symbolic Instruction Code (BASIC) programming language verified the message.

Scenario Statement

You are working with a team of genetics researchers. You’re told the exciting and confusing news that there may be a message from God hidden in our DNA! Through science and prayer your team has been able to work out the following:

  • The message is hidden somewhere in the attached sequence of ~120,000 base pairs
  • God writes in stochastic binary, with 0 represented by either A or C, and 1 represented by either T or G
  • God encodes in ASCII

Can you find the message?

The scenario statement is taken from the original text of the data programming challenge sponsored by the Clinical Developers Club.

Methodology

A synopsis of the transformations necessary to convert a DNA sequence into a representation expressing the DNA as 8-bit ASCII characters has been crafted.

Algorithm

  1. Extract the FASTA sequence identifier and determine DNA sequence length
  2. Convert DNA bases (A, C, G, T) to their binary (0, 1) representation (A : 0, C : 0, G: 1, T : 1)
  3. Add ASCII 0x30 (zero) characters to the beginning of the DNA sequence making its length a multiple of 8 (8 bits equals 1 byte in ASCII alphabet)
  4. Translate DNA sequence in its binary form to its ASCII character equivalent
    1. For each iteration adjust the offset of the starting position by 0, 1, 2, 3, 4, 5, and 6, simulating how the beginning of a DNA base-pair is determined
    2. Extract 8 bytes from the simulated bit string
    3. Convert binary value to decimal value
    4. Convert decimal value to ASCII character (equivalent of a lookup table)

Before attempting to implement a message decoding solution, however, requires understanding how a computer encodes the data upon which its instructions operate.

alt >ext

A computer’s instructions and data are represented as bytes, composed of 8 bits each, for the total of 256 unique values (0-255).

alt >ext

Converting a byte’s value from binary to decimal, or even hexadecimal, makes it easier to lookup the corresponding ASCII character. Since each DNA letter represents a single bit and 8 bits form a byte, our algorithm must handle this transformation.

alt >ext

With the decimal value of a sequence of 8 DNA letters the appropriate ASCII character can be determined. Only the printable character set between 32 (0x20) and 126 (0x7E) is of interest.

alt >ext

In this example the decimal value 43 maps to the character ‘+’ (plus symbol). The binary string ‘00101011’ or ‘AAGAGAGG’ in a FASTA data file are equivalent to ASCII character code 43 in the base-10 number system.

Implementation

The message decoders presented in this section decode the hidden message contained within the FASTA data file. Retrieve the processing code and data files using the following command:

$ git clone \
  https://gitlab.com/gregorydhorne/data-challenge-message-decoder

GNU BASH and *NIX Command-Line Tools

A Docker container has been built to develop the GNU BASH version of the message decoder script. In general, the *nix shell scripts are written to avoid shell specific idioms. Therefore, the script might work with other *nix shells with little or no modification.

To retrieve and use BASH-in-the-Box execute the command:

$ docker pull gdhorne/bash-in-the-box

The message decoder script can optionally capture generated output for subsequent review. The message can be decoded with the following commands:

$ cd data-challenge-message-decoder
$ docker run --rm -it -v ${PWD}:/opt bash \
  ./code/message_decoder.sh [> ./data/message.log]

where [> ./data/message.log] is the optional capture file; omit the brackets, or using the BASH shell if installed on your own computer and it is the default shell:

$ cd data-challenge-message-decoder
$ ./code/message_decoder.sh [> ../data/message.log]

If you do not redirect the output to a file, by default it is displayed on the terminal.

BASIC

For simplicity a BASIC interpeter implemented with JavaScript has been used to develop the BASIC version of the message decoder programme. In general, the programme is written to avoid dialect specific idioms so any BASIC interpreter adhering to the standard language definition should be able to execute it.

The message decoder programme, message_decoder.bas, can be uploaded via the Load button and executed via the Run button.

Due to variations in file I/O statements between dialects of the BASIC programming language the output is not captured for review. To view the decoded message use the Show Output button which opens a viewport; the default console is too narrow to present the decoded message in an easy to read format.

If you want to rebuild the message decoder programme written in the BASIC programming language, use the following commands:

$ cd data-challenge-message-decoder
$ docker run --rm -it -v ${PWD}:/opt bash \
  ./code/prepare_data_statements.sh
$ docker run --rm -it -v ${PWD}:/opt bash \
  ./code/combine_code_and_data.sh

or using the BASH shell if installed on your own computer and it is the default shell:

$ cd data-challenge-message-decoder
$ ./code/prepare_data_statements.sh
$ ./code/combine_code_and_data.sh

The freshly generated script, message_decoder.bas, will be in the code subdirectory.

On May 01, 2014, the BASIC programming language celebrated its 50th anniversary since its arrival at Dartmouth College.

Results & Analysis

A relatively simple algorithm was sufficient to solve this data programming challenge. For a much shorter DNA sequence it is possible to manually apply the algorithm with minimal effort.

During the fourth frame shift (frame shift 3) a 30 line message in the English language was found. In totality 7 frame shifts were applied to the FASTA data file. A log file captured the message to preserve a record of the sleuthing.


Exit Servant
Is this a dagger which I see before me,
The handle toward my hand? Come, let me clutch thee.
I have thee not, and yet I see thee still.
Art thou not, fatal vision, sensible
To feeling as to sight? or art thou but
A dagger of the mind, a false creation,
Proceeding from the heat-oppressed brain?
I see thee yet, in form as palpable
As this which now I draw.
Thou marshall'st me the way that I was going;
And such an instrument I was to use.
Mine eyes are made the fools o' the other senses,
Or else worth all the rest; I see thee still,
And on thy blade and dudgeon gouts of blood,
Which was not so before. There's no such thing:
It is the bloody business which informs
Thus to mine eyes. Now o'er the one halfworld
Nature seems dead, and wicked dreams abuse
The curtain'd sleep; witchcraft celebrates
Pale Hecate's offerings, and wither'd murder,
Alarum'd by his sentinel, the wolf,
Whose howl's his watch, thus with his stealthy pace.
With Tarquin's ravishing strides, towards his design
Moves like a ghost. Thou sure and firm-set earth,
Hear not my steps, which way they walk, for fear
Thy very stones prate of my whereabout,
And take the present horror from the time,
Which now suits with it. Whiles I threat, he lives:
Words to the heat of deeds too cold breath gives.

The decision to use *nix command-line tools with a heritage dating over 50 years, and a programming language of similar vintage was undertaken as a novelty as well as to demonstrate their current utility. It was uncertain a solution with a reasonable running time could be designed and constructed using either of these approaches, particularly in the case of BASIC. Initially, a solution written in less than 80 lines of explanatory comments and processing code with traditional *nix tools (shell programming language and a few command-line utilities: GNU BASH, bc, cat, cut, head, nl, and seq), decoded the DNA-encoded message. As a verification, a second implementation of the algorithm written in slightly less than 80 lines of BASIC comments and processing code confirmed the presence of a human-readable message.