Mapper and reducer functions in Python

I want to know whether there is anything wrong with my mapper and reducer functions below. They are part of a project in Udacity's Intro to Data Science course.
def mapper():
    dic={}
    for line in sys.stdin:
        data=line.strip().split(" ")
        for i in data:
            dic[i]=1
    for key, value in dic.iteritems():
        print key,'\t', value
Here the values are input as a string of words separated by spaces, and the function returns a dictionary with each word of the string as the key and its count, 1, as the value.
def reducer():
    dic={}
    for line in sys.stdin:
        data=line.strip().split('\t')
        if data[0] in dic.keys():
            dic[data[0]]+=1
        else:
            dic[data[0]]=data[1]
    for key, value in dic.iteritems():
        print key,'\t',value
Here the input is a string consisting of the word and the count 1, separated by a tab. The two functions are run separately. I'm not getting the correct output.

It would help if you told us what output you expect. That said, in dic[data[0]]=data[1] the value data[1] is a string, so you won't be able to add a number such as 1 to it.
Also, the point of a reducer is that it may run multiple times, and the input count isn't always going to be 1, so you may want to add the actual value rather than just incrementing.
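import collections
import sys
def reducer():
    # Missing keys start at 0, so no membership test is needed.
    dic=collections.defaultdict(int)
    for line in sys.stdin:
        key, value=line.strip().split('\t')
        # Convert the count to an int and add it, rather than incrementing by 1.
        dic[key] += int(value)
    for key, value in dic.iteritems():
        print key,'\t',value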
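For comparison, here is a minimal Python 3 sketch of the same word-count idea. It is not the course's reference solution. It assumes the two functions take any iterable of lines (a hypothetical signature chosen so the example can run without sys.stdin), and that the mapper should emit one record per word occurrence rather than storing each word once in a dictionary.

```python
import io
from collections import defaultdict


def mapper(lines):
    """Yield one 'word<TAB>1' record for every word occurrence."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(records):
    """Sum the counts for each word from 'word<TAB>count' records."""
    totals = defaultdict(int)
    for record in records:
        word, count = record.rstrip("\n").split("\t")
        # Add the actual count, which may be larger than 1.
        totals[word] += int(count)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sample input standing in for sys.stdin.
    text = io.StringIO("the cat sat\nthe dog sat\n")
    print(reducer(mapper(text)))
```

In a real streaming job each function would read from sys.stdin and print its records instead of yielding or returning them.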

Related

How do I replace lines after using counter on them?

I have been struggling to figure out a way to remove characters that I do not want in the final output of a collections Counter. Here is the code:
from collections import Counter
print("Enter lines that you need counted.")
contents = []
def amount():
    while True:
        try:
            line = input()
        except EOFError:
            break
        contents.append(line)
    return
amount()
count = Counter(contents)
print(count)
When you input anything, let's say the number 4, it comes out like "Counter({'4': 1})". I am trying to remove some of the characters in the output, such as "{" and "'". I tried count = [item.replace('{', " ") for item in count], but then the counter no longer works. Does anyone have any ideas?
Write your own formatting code by looping over the items in the counter.
print(', '.join(f'{key}: {value}' for key, value in count.items()))
count.items() returns a sequence of (key, value) pairs from count.
f'{key}: {value}' is an f-string that formats the key and value with : between them. This is used in a generator that returns the formatted string for each pair returned by count.items().
', '.join() concatenates the values in the sequence of formatted strings with , delimiters between them.
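The steps above can be put together in a small runnable sketch. The helper name format_counts and the sample input are hypothetical and only serve the illustration.

```python
from collections import Counter


def format_counts(counter):
    """Return 'key: value' pairs joined by ', ' with no braces or quotes."""
    return ", ".join(f"{key}: {value}" for key, value in counter.items())


# Hypothetical sample lines standing in for the user's input.
count = Counter(["4", "4", "7"])
print(format_counts(count))  # prints: 4: 2, 7: 1
```

The Counter itself is left unchanged, so it can still be used for counting after it has been printed.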

How to sort a Python dictionary by a substring contained in the keys, according to the order set in a list?

I'm very new to Python and I'm stuck on a task. First I read a file containing a number of FASTA sequences with sequence names into a dictionary, then managed to select only those I want, based on substrings in the keys that are defined in the list "flu_genes".
Now I'm trying to reorder the items in this dictionary based on the order of the substrings defined in the list "flu_genes". I'm completely stuck; I found a way of reordering based on the key order in a list, but that does not fit my case, as the order is defined not by the keys but by a substring within the keys.
I should also add that in this case the substring is at the end, with the format "_GENE". However, it could be in the middle of the string with the same format, perhaps "GENE", so I'd rather not rely on code that finds the substring at the end of the string.
I hope this is clear enough and thanks in advance for any help!
"full_genome.fasta"
>A/influenza/1/1_NA
atgcg
>A/influenza/1/1_NP
ctgat
>A/influenza/1/1_FluB
agcta
>A/influenza/1/1_HA
tgcat
>A/influenza/1/1_FluC
agagt
>A/influenza/1/1_M
tatag
consensus = {}
flu_genes = ['_HA', '_NP', '_NA', '_M']
with open("full_genome.fasta", 'r') as myseq:
    for line in myseq:
        line = line.rstrip()
        if line.startswith('>'):
            key = line[1:]
        else:
            if key in consensus:
                consensus[key] += line
            else:
                consensus[key] = line
flu_fas = {key : val for key, val in consensus.items() if any(ele in key for ele in flu_genes)}
print("Dictionary after removal of keys : " + str(flu_fas))
>>>Dictionary after removal of keys : {'>A/influenza/1/1_NA': 'atgcg', '>A/influenza/1/1_NP': 'ctgat', '>A/influenza/1/1_HA': 'tgcat', '>A/influenza/1/1_M': 'tatag'}
#reordering by keys order (not going to work!) as in: https://try2explore.com/questions/12586065
reordered_dict = {k: flu_fas[k] for k in flu_genes}
Since Python 3.7, dictionaries are guaranteed to preserve insertion order (in CPython 3.6 this was only an implementation detail), and you're not going to change anything later, so you can do what you're doing.
The problem is, of course, that you're not working with the actual keys. So let's just set up a list of the keys, and sort that according to your criteria. Then you can do the other thing you did, except using the actual keys.
flu_genes = ['_HA', '_NP', '_NA', '_M']
def get_gene_index(k):
    for index, gene in enumerate(flu_genes):
        if k.endswith(gene):
            return index
    raise ValueError('I thought you removed those already')
reordered_keys = sorted(flu_fas.keys(), key=get_gene_index)
reordered_dict = {k: flu_fas[k] for k in reordered_keys}
for k, v in reordered_dict.items():
    print(k, v)
A/influenza/1/1_HA tgcat
A/influenza/1/1_NP ctgat
A/influenza/1/1_NA atgcg
A/influenza/1/1_M tatag
Normally I wouldn't scan the whole flu_genes list for every key during the sort, but I'm assuming the number of lines in the data file is much larger than the number of flu_genes, which makes that scan essentially a fixed constant.
This may or may not be the best data structure for your application, but I'll leave that to code review.
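The question also mentions that the gene tag might not always be at the end of the key. As a hedged variation on the same idea, the sketch below ranks keys by a plain substring test instead of endswith. The helper name gene_rank is hypothetical and the dictionary is sample data taken from the question. One caveat under this assumption: a substring test would also match a longer tag that starts the same way (for example "_M" inside a hypothetical "_M2"), so the tags need to be unambiguous.

```python
flu_genes = ["_HA", "_NP", "_NA", "_M"]

# Sample data standing in for the filtered flu_fas dictionary.
flu_fas = {
    "A/influenza/1/1_NA": "atgcg",
    "A/influenza/1/1_NP": "ctgat",
    "A/influenza/1/1_HA": "tgcat",
    "A/influenza/1/1_M": "tatag",
}


def gene_rank(key):
    """Return the position in flu_genes of the first gene tag found in key."""
    for index, gene in enumerate(flu_genes):
        if gene in key:  # matches the tag anywhere in the key
            return index
    raise ValueError(f"no known gene tag in {key!r}")


# Sort the real keys by rank, then rebuild the dictionary in that order.
reordered_dict = {k: flu_fas[k] for k in sorted(flu_fas, key=gene_rank)}
for k, v in reordered_dict.items():
    print(k, v)
```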
It's because you are trying to reorder it with non-existent dictionary keys. Your keys are
['>A/influenza/1/1_NA', '>A/influenza/1/1_NP', '>A/influenza/1/1_HA', '>A/influenza/1/1_M']
which don't match the list
['_HA', '_NP', '_NA', '_M']
You first need to transform the keys so that they match. Since we know the pattern (the gene name is at the end of the string, after an underscore), we can split at underscores and take the last piece.
consensus = {}
flu_genes = ['_HA', '_NP', '_NA', '_M']
with open("full_genome.fasta", 'r') as myseq:
    for line in myseq:
        line = line.rstrip()
        if line.startswith('>'):
            sequence = line
            gene = line.split('_')[-1]
            key = f"_{gene}"
        else:
            consensus[key] = {
                'sequence': sequence,
                'data': line
            }
flu_fas = {key : val for key, val in consensus.items() if any(ele in key for ele in flu_genes)}
print("Dictionary after removal of keys : " + str(flu_fas))
reordered_dict = {k: flu_fas[k] for k in flu_genes}

Print certain amount of tuples in a list without name

My exercise was to print the 10 most common words in a text file.
Assume I have opened the file and created a dictionary that maps each separate word to its count.
Normally, I do this:
li=list()
for key,value in d.items():
    tpl=(value,key)
    li.append(tpl)
li=sorted(li,reverse=True)
for key,value in li[:10]:
    print('Ten most common words: ',value,key)
But my professor gave me a single line of code that can replace almost all those lines:
print(sorted([(value,key) for key,value in d.items()],reverse=True))
However, I can't find a way to print only 10 tuples. Since the list has no name, I can't use a for loop to print them. Can you help me out?
Separate the list creation from the print():
li = sorted([(value, key) for key, value in d.items()], reverse=True)
Now you can iterate through li.
for item in li[:10]:
    print(item)
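As a side note, the result of sorted() is already a list, so it can also be sliced directly without being given a name first. The sketch below is one way to do that. The function name top_n and the sample dictionary are hypothetical.

```python
def top_n(d, n):
    """Return the n (value, key) pairs with the largest values."""
    # sorted(...) returns a list, so it can be sliced straight away.
    return sorted(((value, key) for key, value in d.items()), reverse=True)[:n]


# Hypothetical word counts standing in for the dictionary d.
d = {"the": 12, "of": 7, "and": 9, "to": 5}
for value, key in top_n(d, 2):
    print(key, value)
```

The standard library's collections.Counter also offers a most_common(n) method that returns the n highest counts as (key, count) pairs, which may be convenient if the counts are built with a Counter in the first place.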

summing dictionary values from reading another file

My assignment is to:
Read the protein sequence from FILE A and calculate the molecular weight of
this protein using the dictionary created above.
So far, I have the code below:
import pprint
my_dict= {'A':'089Da', 'R':'174Da','N':'132Da','D':'133Da','B':'133Da','C':'121Da','Q':'146Da','E':'147Da',
          'Z':'147Da','G':'075Da','H':'155Da','I':'131Da','L':'131Da','K':'146Da','M':'149Da',
          'F':'165Da','P':'115Da','S':'105Da','T':'119Da','W':'204Da','Y':'181Da','V':'117Da'}
new=sorted(my_dict.items(), key=lambda x:x[1])
print("AA", " ", "MW")
for key,value in new:
    print(key, " ", value)
with open('lysozyme.fasta','r') as content:
    fasta = content.read()
for my_dict in fasta:
The top part of the code creates my dictionary. The task is, for example, to open the file, read 'MWAAAA' from it, and then sum up the values associated with those keys using the dictionary I created. I'm not sure how to proceed after the for loop. Do I use an append function? I would appreciate any advice, thanks!
After reading your file, you can go through it character by character:
for char in fasta:
    print(char)
output:
M
W
A
A
A
A
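Then use each character as a key to retrieve its value from your dict. The values are strings such as '089Da', so convert the numeric part to an int before adding it to a running total that starts at 0:
summ += int(my_dict[char][:3])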
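A minimal end-to-end sketch of this approach follows. It makes a few assumptions that are not in the original post: the weights are stored as integers rather than strings like '089Da', only three entries of the table are included, the FASTA text is passed in as a string, and the header line ">example" is made up. The helper names read_sequence and molecular_weight are hypothetical.

```python
# Subset of the weight table from the question, kept as integers (in Da)
# so the values can be added directly.
weights = {"A": 89, "M": 149, "W": 204}


def read_sequence(fasta_text):
    """Join the non-header lines of FASTA text into one sequence string."""
    return "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )


def molecular_weight(sequence, table):
    """Sum the table weights of the residues in sequence."""
    return sum(table[residue] for residue in sequence)


# Hypothetical file contents: one header line and the sequence MWAAAA.
sequence = read_sequence(">example\nMWAAAA\n")
print(molecular_weight(sequence, weights))  # prints: 709
```

Stripping the lines matters because content.read() also returns the newline characters, which are not keys in the dictionary and would raise a KeyError.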

How to count the frequency of words repeated in a text file?

I am supposed to use functions. Basically, the task consists of copying all the words from a text file into a dictionary and counting the number of times each one is repeated.
So if the key, which is the word, is already in the dictionary, we increment its count; otherwise we add it to the dictionary with a count of 1.
Here is the code I tried. However, nothing prints:
def wordCount(file1):
    file1 = open('declarationofInd.txt','r')
    mydict = {}
    file1.strip()
    mydict[key] = file1
    mydict.keys()
    print mydict
I think you want to count the number of times a word appears in a text doc.
file=open('yourfilehere')
text=file.read().split()
mydict={}
for word in text:
    if word not in mydict.keys():
        mydict[word]=1
    else:
        count=mydict[word]
        mydict[word]=count+1
print(mydict)
If this is what you intend to create, it should work. If you are not running this in IDLE or at a command prompt, you should call the function, preferably from a new file.
By the way, I would advise you to make your question clearer as well as research the topic more before posting.
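Since the question asks for functions, here is one possible sketch that wraps the same counting logic in a function and uses collections.Counter from the standard library instead of a hand-written if/else. The names word_count and word_count_file are hypothetical, and the sample sentence is made up for the illustration.

```python
from collections import Counter


def word_count(text):
    """Return a Counter mapping each whitespace-separated word to its count."""
    return Counter(text.split())


def word_count_file(path):
    """Count the words in the file at path (the path is supplied by the caller)."""
    with open(path, "r") as handle:
        return word_count(handle.read())


# The function has to be called for anything to be printed.
counts = word_count("we hold these truths we hold")
print(counts)
```

This sketch splits on whitespace only, so "truths" and "truths," would be counted as different words. Stripping punctuation or lowercasing would be a separate step.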
