Problem Set 7 - Due Fri Dec 11 at 23:59

Task 3:

The task 3 rubric shows how this task will be graded, and can be used as a checklist to make sure you are done with the task.

Put all of your work for this task into the file ``.

This task involves writing five fruitful recursive functions (that is, recursive functions that return a value) that operate on strings representing DNA sequences. Genetic information can be represented as a string containing the letters 'A', 'T', 'G', and 'C', which represent the four natural DNA components, called "bases." A single strand of DNA is one such string, while normally two complementary strings are paired together. For each 'A' in one string, there is a matching 'T' in the other, and vice versa, while for each 'G', there is a matching 'C' (and again vice versa). So for example, the strings:

GATTACA
CTAATGT

represent two complementary DNA sequences that could be matched together. The five functions you have to write are described in Parts A through E below.

For this task you are provided with a function matchingBase which, given one base, returns the matching base (e.g., given 'A' it returns 'T').
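
To make the behavior of matchingBase concrete, here is a minimal stand-in you could use for experimenting. It is only a sketch based on the pairing rules described above; the provided function may be written differently, and raising an error for an unknown letter is an assumption made here.

```python
def matchingBase(base):
    """Return the base that pairs with the given base.

    Hypothetical stand-in for the provided function, based only on
    the pairing rules in the task description (A-T and G-C).
    """
    if base == 'A':
        return 'T'
    elif base == 'T':
        return 'A'
    elif base == 'G':
        return 'C'
    elif base == 'C':
        return 'G'
    else:
        # Assumption: the task does not say what happens for other letters.
        raise ValueError("not a DNA base: " + repr(base))
```

With this stand-in, matchingBase('G') returns 'C' and matchingBase('T') returns 'A'.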

All parts of this task must be accomplished using fruitful recursive functions; you are not allowed to use any loops in this task.
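
As a reminder of the general shape of a fruitful recursive function on strings, here is a sketch for an unrelated problem: reversing a string. The function reverseString is a hypothetical example and is not part of this task. It handles the empty string as the base case, and otherwise combines the first character with the result of a recursive call on the rest of the string, with no loops.

```python
def reverseString(text):
    """Return the characters of text in reverse order (no loops)."""
    # Base case: the empty string reversed is still empty.
    if text == "":
        return ""
    # Recursive case: reverse everything after the first character,
    # then put the first character at the end.
    return reverseString(text[1:]) + text[0]
```

For example, reverseString("GATTACA") returns 'ACATTAG'.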

Part A: countBases

countBases takes two arguments: the sequence of bases to inspect, and the base to look for. It returns how many copies of that base exist within the given sequence (an integer).

Examples:

>>> countBases("GATTACA", "A")
3
>>> countBases("TATATA", "T")
3
>>> countBases("TATATA", "G")
0
>>> countBases("T", "T")
1
>>> countBases("", "T")
0

Part B: symmetricStrand

symmetricStrand accepts just one argument: the sequence to create a complementary strand for. It returns a string of the same length representing the complementary strand, where each 'A' has been swapped for a 'T', each 'T' for an 'A', each 'G' for a 'C', and each 'C' for a 'G'.

Examples:

>>> symmetricStrand("GACT")
'CTGA'
>>> symmetricStrand("ATC")
'TAG'
>>> symmetricStrand("CATAGAG")
'GTATCTC'

Part C: onlyTA

onlyTA accepts one argument: the sequence to process. It returns a new string where each 'G' and 'C' base has been removed, leaving only the 'A' and 'T' bases.

Examples:

>>> onlyTA("ATGCACTA")
'ATATA'
>>> onlyTA("AT")
'AT'
>>> onlyTA("GCG")
''
>>> onlyTA("GAGCTCG")
'AT'

Part D: unmatchedCount

unmatchedCount takes two arguments representing two sequences of DNA (which will always have the same length). Its job is to return the number of positions where the two sequences are mismatched, meaning that the base from one sequence is not the correct matching base for the base from the other sequence.

Examples:

>>> unmatchedCount("AAA", "TTT")
0
>>> unmatchedCount("AAA", "TTC")
1
>>> unmatchedCount("AAA", "GGG")
3
>>> unmatchedCount("GATTACA", "CTAATGT")
0
>>> unmatchedCount("GATTACA", "CGATTGT")
2
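
To clarify what "mismatched" means at a single position, the sketch below checks one pair of bases. The helper isMismatch is a hypothetical name, not something you are required to write, and the matchingBase shown here is only a stand-in for the provided function so that the example runs on its own.

```python
def matchingBase(base):
    """Stand-in for the provided function (assumed A-T and G-C pairing)."""
    pairs = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return pairs[base]


def isMismatch(base1, base2):
    """Return True if base2 is not the correct match for base1."""
    return matchingBase(base1) != base2
```

Here isMismatch('A', 'T') is False because the bases pair correctly, while isMismatch('A', 'G') and isMismatch('T', 'T') are both True.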

Part E: cutOut

cutOut takes two arguments: a string representing a sequence of DNA bases, and a (usually shorter) target sequence. It should find all places in the first sequence where the target sequence occurs and remove them, returning what remains of the first sequence.

Examples:

>>> cutOut("GATTACA", "ATTA")
'GCA'
>>> cutOut("TAGAGCGAT", "AG")
'TCGAT'
>>> cutOut("ATTGCCAG", "C")
'ATTGAG'
>>> cutOut("TATATATAT", "TAT")
'AAT'

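Note: as implied by the last example, we remove copies of the target sequence one by one, working from the start of the sequence, so overlapping copies are not always removed. For that sequence, the removals are:
"TATATATAT" → "TAT | ATATAT", "ATATAT" → "A | TAT | AT".
The first "TAT" is removed, leaving "ATATAT"; then the 'A' is kept and the next "TAT" is removed, leaving "AT", so the result is 'AAT'.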

Hint: The .startswith method of strings can be useful for solving this part.
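
As a quick illustration of the hint, the snippet below shows how .startswith behaves on the sequence from the last example, along with slicing to skip past a match. The variable names are just for illustration.

```python
sequence = "TATATATAT"
target = "TAT"

# True: the sequence begins with the target.
print(sequence.startswith(target))

# Slicing off len(target) characters skips past the matched copy.
rest = sequence[len(target):]
print(rest)  # ATATAT

# False: what remains begins with 'A', so there is no match at the start.
print(rest.startswith(target))
```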


When you are done with this task, you should submit it via the Ocean server. Remember to follow the submission instructions when you are done with the whole problem set.