I have been struggling to figure out a way to replace lines from the collections counter that I do not want in the final outcome. Here is the coding:
from collections import Counter

print("Enter lines that you need counted.")
contents = []

def amount():
    while True:
        try:
            line = input()
        except EOFError:
            break
        contents.append(line)
    return

amount()
count = Counter(contents)
print(count)
When you input anything, let's say the number 4, it comes out like "Counter({'4': 1})". I am trying to remove some of the characters in the output, such as "{" and "'". I tried count = [item.replace('{', " ") for item in count] but it seems to make it so the counter no longer works. Does anyone have any ideas?
Write your own formatting code by looping over the items in the counter.
print(', '.join(f'{key}: {value}' for key, value in count.items()))
count.items() returns a sequence of (key, value) pairs from count.
f'{key}: {value}' is an f-string that formats the key and value with : between them. It is used in a generator expression that yields the formatted string for each pair returned by count.items().
', '.join() concatenates the formatted strings with ', ' between them.
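For example (with hypothetical input), if the user had entered 4, 4 and hello, the formatted output would look like this:

from collections import Counter

count = Counter(['4', '4', 'hello'])     # hypothetical input lines
print(', '.join(f'{key}: {value}' for key, value in count.items()))
# prints: 4: 2, hello: 1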
My assignment is to:
Read the protein sequence from FILE A and calculate the molecular weight of
this protein using the dictionary created above.
So far, I have the code below:
import pprint

my_dict = {'A':'089Da', 'R':'174Da', 'N':'132Da', 'D':'133Da', 'B':'133Da', 'C':'121Da', 'Q':'146Da', 'E':'147Da',
           'Z':'147Da', 'G':'075Da', 'H':'155Da', 'I':'131Da', 'L':'131Da', 'K':'146Da', 'M':'149Da',
           'F':'165Da', 'P':'115Da', 'S':'105Da', 'T':'119Da', 'W':'204Da', 'Y':'181Da', 'V':'117Da'}

new = sorted(my_dict.items(), key=lambda x: x[1])
print("AA", " ", "MW")
for key, value in new:
    print(key, " ", value)

with open('lysozyme.fasta', 'r') as content:
    fasta = content.read()

for my_dict in fasta:
The top part of the code is the dictionary I created. The task is to open the file, read e.g. 'MWAAAA' from it, and then sum up the values associated with those keys using the dictionary I created. I'm not sure how to proceed after the for loop. Do I use an append function? Would appreciate any advice, thanks!
After reading your file, you can check it char by char:
for char in fasta:
    print(char)
output:
M
W
A
A
A
A
Then use each char as a key to retrieve the corresponding value from your dict:
summ += my_dict[char]
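Putting it together, a minimal sketch of the summing step, under two assumptions: header lines starting with '>' should be skipped, and the weights (stored as strings like '089Da') are converted to integers before adding them up. my_dict is the dictionary from the question.

summ = 0
with open('lysozyme.fasta') as content:
    for line in content:
        if line.startswith('>'):             # skip FASTA header lines (assumption)
            continue
        for char in line.strip():
            summ += int(my_dict[char][:3])   # '089Da' -> 89; my_dict is from the question

print(summ)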
I want to know whether there is anything wrong with my mapper and reducer functions below. It is part of a project in Udacity's Intro to Data Science course.
import sys

def mapper():
    dic = {}
    for line in sys.stdin:
        data = line.strip().split(" ")
        for i in data:
            dic[i] = 1
    for key, value in dic.iteritems():
        print key, '\t', value
Here the values are input as a string with words separated by spaces, and the function builds a dictionary with each word of the string as the 'key' and its count 1 as the 'value'.
def reducer():
    dic = {}
    for line in sys.stdin:
        data = line.strip().split('\t')
        if data[0] in dic.keys():
            dic[data[0]] += 1
        else:
            dic[data[0]] = data[1]
    for key, value in dic.iteritems():
        print key, '\t', value
Here the values are input as strings consisting of a word and the count 1 separated by a tab. The two functions are run separately. I'm not getting the correct output.
It would help if you told us something about the output you expect, but in dic[data[0]] = data[1] the value data[1] is a string, so you won't be able to add a number such as 1 to it.
Also, surely the point of a reducer is that it may run over input where the count isn't always going to be 1, so you may want to add the actual value rather than just incrementing.
import sys
import collections

def reducer():
    dic = collections.defaultdict(int)
    for line in sys.stdin:
        key, value = line.strip().split('\t')
        dic[key] += int(value)
    for key, value in dic.iteritems():
        print key, '\t', value
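As a quick standalone sanity check of that logic (hypothetical mapper output, hard-coded here instead of reading stdin):

import collections

sample_lines = ["foo\t1", "bar\t1", "foo\t1"]   # hypothetical mapper output
dic = collections.defaultdict(int)
for line in sample_lines:
    key, value = line.strip().split('\t')
    dic[key] += int(value)

print(dict(dic))   # {'foo': 2, 'bar': 1}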
I have a file in which I am trying to replace parts of a line with another word.
It looks like bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212
I need to delete everything but bob123#bobscarshop.com, but I need to match 23rh32o3hro2rh2 with 23rh32o3hro2rh2:poniacvibe from a different text file and place poniacvibe in front of bob123#bobscarshop.com,
so it would look like this: bob123#bobscarshop.com:poniacvibe
I've had a hard time trying to go about doing this, but I think I would have to split bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212 with data.split(":"), but some of the lines have a ":" in a spot where I don't want the line to be split, if that makes any sense...
If anyone could help I would really appreciate it.
OK, it looks to me like you are using a colon : to separate your strings.
In this case you can use .split(":") to break your strings into their component substrings,
e.g.:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
print(firststring.split(":"))
would give:
['bobkeiser', 'bob123#bobscarshop.com', '0.0.0.0.0', '23rh32o3hro2rh2', '234212']
and assuming your substrings will always be in the same order, and there is the same number of substrings in the main string, you could then do:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
firstdata = firststring.split(":")

secondstring = "23rh32o3hro2rh2:poniacvibe"
seconddata = secondstring.split(":")

if firstdata[3] == seconddata[0]:
    outputdata = firstdata
    outputdata.insert(1, seconddata[1])
    outputstring = ""
    for item in outputdata:
        if outputstring == "":
            outputstring = item
        else:
            outputstring = outputstring + ":" + item
What this does is:
extract the bits of the strings into lists
see if the "23rh32o3hro2rh2" string can be found in the second list
find the corresponding part of the second list
create a list to contain the output data and put the first list into it
insert the "poniacvibe" string before "bob123#bobscarshop.com"
stitch the outputdata list back into a string using the colon as the separator
The reason your strings need to have the same structure (the same number of parts, in the same order) is that the index is being used to find the relevant substrings rather than some form of string matching (which gets much more complex).
If you can keep your data in this form it gets much simpler.
To protect against malformed data (lists that are too short) you can explicitly test for them before you start, using len() to see how many elements a list has, as in the sketch below.
Or you could let it run and catch the exception; however, in this case you could end up with unintended results, as it may try to match the wrong elements from the list.
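A minimal sketch of that explicit length check, reusing the example string from above (the expected format has 5 colon-separated fields):

firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
firstdata = firststring.split(":")

if len(firstdata) >= 5:
    searchstring = firstdata[3]          # the part we want to match on
else:
    print("malformed line, skipping:", firststring)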
hope this helps
James
EDIT:
OK, so if you are trying to match up a long list of strings from files, you would probably want something along the lines of:
firstfile = open("firstfile.txt", mode="r")
secondfile = open("secondfile.txt", mode="r")

first_raw_data = firstfile.readlines()
firstfile.close()
second_raw_data = secondfile.readlines()
secondfile.close()

first_data = []
for item in first_raw_data:
    first_data.append(item.replace("\n", "").split(":"))

second_data = []
for item in second_raw_data:
    second_data.append(item.replace("\n", "").split(":"))

output_strings = []
for item in first_data:
    searchstring = item[3]
    for entry in second_data:
        if searchstring == entry[0]:
            output_data = item
            output_string = ""
            output_data.insert(1, entry[1])
            for data in output_data:
                if output_string == "":
                    output_string = data
                else:
                    output_string = output_string + ":" + data
            output_strings.append(output_string)
            break

for entry in output_strings:
    print(entry)
This should achieve what you're after and, as a proof of concept, will print the resulting list of strings for you.
if you have any questions feel free to ask.
James
Second edit:
To make this output the results into a file, change the last two lines to:
outputfile = open("outputfile.txt", mode="w")
for entry in output_strings:
    outputfile.write(entry + "\n")
outputfile.close()
The program is supposed to read a given file, count the occurrence of each word with a dictionary, then create a file called report.txt and output the list of words and their frequencies
infile = open('text file.txt', 'r')
dictionary = {}

# count words' frequency
for i in range(1, 14):
    temp = infile.readline().strip().split()
    for item in temp:
        if dictionary.has_key(item) == False:
            dictionary[item] = 1
        elif dictionary.has_key:
            temp2 = dictionary.get(item)
            dictionary[item] = temp2 + 1

infile.close()

outfile = open('report.txt', 'w')
outfile.write( for words in dictionary:
    print '%15s :' %words, dictionary[words])
Everything works at the counting part, but right at the last part, writing the output, I realize I can't put a for loop inside the write method.
You need to put the write inside the for loop:
for words in dictionary:
    outfile.write('%15s : %s\n' % (words, dictionary[words]))
Alternatively you can use a comprehension, but they're a bit ninja and can be harder to read:
outfile.write('\n'.join(['%15s : %s' % key_value for key_value in dictionary.items()]))
As has been said already in the accepted answer, you need the write inside the for loop. However, when using files it is also good practice to perform your actions within a with context as this will automatically handle the closing of the file. e.g.
with open('report.txt', 'w') as outfile:
    for words in dictionary:
        outfile.write('%15s : %s\n' % (words, dictionary[words]))
Your code contains several deficiencies:
Don't use has_key, and don't compare to True / False directly - it is redundant and bad style (in any language):
if dictionary.has_key(item) == False:
should be
if item not in dictionary
It is worth mentioning that testing the positive case first will be more efficient, because you'll probably have more than one occurrence of most words in the file.
dictionary.has_key (without parentheses) returns a reference to the has_key method, which is always truthy, so the elif condition is True whenever the first one is False - your code only works by accident. A simple else would be enough.
The last two statements in that branch can then be rewritten as
dictionary[item] += 1
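Putting those points together, the counting part could look like this (a sketch only, using the file name from the question):

dictionary = {}
with open('text file.txt') as infile:
    for line in infile:
        for item in line.split():
            if item in dictionary:       # positive test first
                dictionary[item] += 1
            else:
                dictionary[item] = 1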
That said, you may use collections.Counter to count the words:
from collections import Counter

dictionary = Counter()
for line in source_file:
    dictionary.update(line.split())
(BTW, strip before split is redundant)
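An end-to-end sketch of the Counter approach, assuming the same 'text file.txt' input and the report format from the accepted answer:

from collections import Counter

counts = Counter()
with open('text file.txt') as source_file:
    for line in source_file:
        counts.update(line.split())

with open('report.txt', 'w') as outfile:
    for word, count in counts.items():
        outfile.write('%15s : %s\n' % (word, count))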
I have a file that has lots of sequences of letters.
Some of these sequences might be equal, so I would like to compare them, all to all.
I'm doing something like this, but it isn't exactly what I wanted:
for line in fl:
    line = line.split()
    for elem in line:
        if '>' in elem:
            pass
        else:
            for el in line:
                if elem == el:
                    print elem, el
example of the file:
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
>2
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
So what I want is to know whether any sequence is totally equal to sequence 1, or to 2, and so on.
If the goal is simply to group like sequences together, then sorting the data will do the trick. Here is a solution that uses Biopython to parse the input FASTA file, sorts the collection of sequences, uses the standard Python itertools.groupby function to merge ids for equal sequences, and outputs a new FASTA file:
from itertools import groupby
from Bio import SeqIO

records = list(SeqIO.parse(file('spoo.fa'), 'fasta'))

def seq_getter(s): return str(s.seq)

records.sort(key=seq_getter)

for seq, equal in groupby(records, seq_getter):
    ids = ','.join(s.id for s in equal)
    print '>%s' % ids
    print seq
Output:
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>2,5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
In general for this type of work you may want to investigate Biopython which has lots of functionality for parsing and otherwise dealing with sequences.
However, your particular problem can be solved using a dict, an example of which Manoj has given you.
Comparing long sequences of letters is going to be pretty inefficient. It will be quicker to compare a hash of the sequences. Python offers two built-in data types that use hashing: set and dict. It's best to use a dict here, as we can store the line numbers of all the matches.
I've assumed the file has identifiers and sequences on alternate lines, so if we split the file text on newlines we can take one line as the id and the next as the sequence to match.
We then use a dict with the sequence as the key. The corresponding value is a list of ids which have this sequence. By using defaultdict from collections we can easily handle the case of a sequence not being in the dict; if the key hasn't been used before, defaultdict will automatically create a value for us, in this case an empty list.
So when we've finished working through the file, the values of the dict will effectively be a list of lists, each entry containing the ids which share a sequence. We can then use a list comprehension to pull out the interesting values, i.e. entries where more than one id uses the same sequence.
from collections import defaultdict

lines = filetext.split("\n")
sequences = defaultdict(list)

while lines:
    id = lines.pop(0)
    data = lines.pop(0)
    sequences[data].append(id)

results = [match for match in sequences.values() if len(match) > 1]
print results
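A self-contained check of the grouping idea, hard-coding three of the example sequences (the ids keep their '>' prefix because the id line is used as-is):

from collections import defaultdict

filetext = (">2\nGTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA\n"
            ">5\nGTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA\n"
            ">6\nGTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG")

sequences = defaultdict(list)
lines = filetext.split("\n")
while lines:
    id = lines.pop(0)
    data = lines.pop(0)
    sequences[data].append(id)

print([match for match in sequences.values() if len(match) > 1])
# prints: [['>2', '>5']]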
The following script will return a count of the sequences. It returns a dictionary with the individual, distinct sequences as keys and the line numbers where these sequences occur as values.
#!/usr/bin/python
import sys
from collections import defaultdict

def count_sequences(filename):
    result = defaultdict(list)
    with open(filename) as f:
        for index, line in enumerate(f):
            sequence = line.replace('\n', '')
            line_number = index + 1
            result[sequence].append(line_number)
    return result

if __name__ == '__main__':
    filename = sys.argv[1]
    for sequence, occurrences in count_sequences(filename).iteritems():
        print "%s: %s, found in %s" % (sequence, len(occurrences), occurrences)
Sample output:
etc#etc:~$ python ./fasta.py /path/to/my/file
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA: 1, found in ['4']
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA: 1, found in ['3']
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA: 2, found in ['2', '5']
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA: 1, found in ['7']
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA: 1, found in ['1']
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG: 1, found in ['6']
Update
Changed code to use defaultdict and a for loop. Thanks @KennyTM.
Update 2
Changed code to use append rather than +. Thanks @Dave Webb.