def frequencies(data):
    data.sort()
    count = 0
    previous = data[0]
    print("data\tfrequency")  # '\t' is the TAB character
    for d in data:
        if d == previous:
            # same as the previous, so just increment the count
            count += 1
        else:
            # we've found a new item so print out the old and reset the count
            print(str(previous) + "\t" + str(count))
            count = 1
            previous = d
So I have this frequency code, but it's leaving off the last number in my list every time.
It may have something to do with where I initialize previous, or possibly with where I reset previous to d at the end of the loop.
The last group of elements is never printed, because you only print a group when you find something different after it, and nothing comes after the last group. You need to repeat the printout once more after your loop.
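For illustration, here is a minimal sketch of that fix (same logic as your function, with one extra print after the loop; data is assumed to be non-empty):
from collections import List  # not needed; standard list is fine

def frequencies(data):
    data.sort()
    count = 0
    previous = data[0]
    print("data\tfrequency")
    for d in data:
        if d == previous:
            count += 1
        else:
            print(str(previous) + "\t" + str(count))
            count = 1
            previous = d
    # the final group is only known once the loop has ended, so print it here
    print(str(previous) + "\t" + str(count))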
That said, this is rather academic; in the real world you would be much more likely to use Counter:
from collections import Counter

counter = Counter(data)
for key in counter:
    print("%s\t%d" % (key, counter[key]))
You can count items in a list/sequence using the count method. So your code can be simplified to look like this:
def frequencies(data):
    unique_items = set(data)
    for item in unique_items:
        print('%s\t%s' % (item, data.count(item)))
The substring has to be 6 characters long. The number I'm getting is smaller than it should be.
First I wrote code to get the sequences from a file and put them in a dictionary, then I wrote three nested for loops: the first iterates over the dictionary and gets one sequence per iteration. The second takes each sequence and extracts a 6-character substring from it, increasing the start index into the long sequence by 1 on each iteration. The third loop takes each substring from the second loop and counts how many times it appears in each long sequence.
I tried rewriting the code many times and I think I got very close. I checked that the loops actually do their iterations, and they do. I even checked manually whether the counts for a substring in random sequences match what the program gives, and they do. Any idea? Maybe a different approach? What debugger do you use for Python?
I added a file with 3 shortened sequences for testing. Maybe try a smaller substring, say 3 characters instead of 6: rep_len = 3
The code
matches = []
count = 0
final_count = 0
rep_len = 6
repeat = ''
pos = 0
seq_count = 0
seqs = {}
f = open(r"file.fasta")

# inserting each sequence from the file into a dictionary
for line in f:
    line = line.rstrip()
    if line[0] == '>':
        seq_count += 1
        name = seq_count
        seqs[name] = ''
    else:
        seqs[name] += line

for key, seq in seqs.items():  # getting one sequence in each iteration
    for pos in range(len(seq)):  # setting an index and increasing it by 1 in each iteration
        if pos <= len(seq) - rep_len:  # checking no substring from the end of the sequence are selected
            repeat = seq[pos:pos + rep_len]  # setting a substring
            if repeat not in matches:  # checking if the substring was already scanned
                matches.append(repeat)  # adding the substring to previously checked substrings' list
                for key1, seq2 in seqs.items():  # iterating over each sequence
                    count += seq2.count(repeat)  # counting the substring's repetitions
                if count > final_count:  # if the count is greater than the previously saved greatest number
                    final_count = count  # the new value is saved
                count = 0

print('repetitions: ', final_count)  # printing
The code is not very clear, so it is a bit difficult to debug; I suggest rewriting it.
For now, I have just noted one small mistake:
if pos < len(seq) - rep_len:
Should be
if pos <= len(seq) - rep_len:
Currently, the last character in each sequence is ignored.
EDIT:
Here is a rewrite of your code that is clearer and might help you track down the errors:
rep_len = 6
seq_count = 0
seqs = {}
filename = "dna2.txt"

# Extract the data into a dictionary
with open(filename, "r") as f:
    for line in f:
        line = line.rstrip()
        if line[0] == '>':
            seq_count += 1
            name = seq_count
            seqs[name] = ''
        else:
            seqs[name] += line

# Store all the information, so that you can reuse it later
counter = {}
for key, seq in seqs.items():
    for pos in range(len(seq) - rep_len):
        repeat = seq[pos:pos + rep_len]
        if repeat in counter:
            counter[repeat] += 1
        else:
            counter[repeat] = 1

# Sort the counter to have max occurrences first
sorted_counter = sorted(counter.items(), key=lambda item: item[1], reverse=True)

# Display the 5 max occurrences
for i in range(5):
    key, rep = sorted_counter[i]
    print("{} -> {}".format(key, rep))
# GCGCGC -> 11
# CCGCCG -> 11
# CGCCGA -> 10
# CGCGCG -> 9
# CGTCGA -> 9
It might be easier to use Counter from the collections module in Python. Also check out the NLTK library.
An example:
from collections import Counter
from nltk.util import ngrams

sequence = "cggttgcaatgagcgtcttgcacggaccgtcatgtaagaccgctacgcttcgatcaacgctattacgcaagccaccgaatgcccggctcgtcccaacctg"

def reps(substr):
    "Counts repeats in a substring"
    return sum([i for i in Counter(substr).values() if i > 1])

def make_grams(sent, n=6):
    "splits a sentence into n-grams"
    return ["".join(seq) for seq in ngrams(sent, n)]

grams = make_grams(sequence)  # splits string into substrings
max_length = max(list(map(reps, grams)))  # gets maximum repeat count
result = [dna for dna in grams if reps(dna) == max_length]
print(result)
Output: ['gcgtct', 'cacgga', 'acggac', 'tgtaag', 'agaccg', 'gcttcg', 'cgcaag', 'gcaagc', 'gcccgg', 'cccggc', 'gctcgt', 'cccaac', 'ccaacc']
And if the question is to look for the string with the most repeated character:
repeat_count = [max(Counter(a).values()) for a in result] # highest character repeat count
result_dict = {dna:ct for (dna,ct) in zip(result, repeat_count)}
another_result = [dna for dna in result_dict.keys() if result_dict[dna] == max(repeat_count)]
print(another_result)
Output: ['cccggc', 'cccaac', 'ccaacc']
I am doing basic Python challenges and this is one of them. All I needed to do was read through a file and print out the frequency of letters in decreasing order. I am able to do this, but I wanted to enhance the program by also printing the frequency percentage alongside the letter and count, i.e. letter - frequency - freq%. Something like this: o - 46 - 10.15%
This is what I did so far:
def exercise11():
    import string
    while True:
        try:
            fname = input('Enter the file name -> ')
            fop = open(fname)
            break
        except:
            print('This file does not exist. Please try again!')
            continue
    counts = {}
    for line in fop:
        line = line.translate(str.maketrans('', '', string.punctuation))
        line = line.translate(str.maketrans('', '', string.whitespace))
        line = line.translate(str.maketrans('', '', string.digits))
        line = line.lower()
        for ltr in line:
            if ltr in counts:
                counts[ltr] += 1
            else:
                counts[ltr] = 1
    lst = []
    countlst = []
    freqlst = []
    for ltrs, c in counts.items():
        lst.append((c, ltrs))
        countlst.append(c)
    totalcount = sum(countlst)
    for ec in countlst:
        efreq = (ec / totalcount) * 100
        freqlst.append(efreq)
    freqlst.sort(reverse=True)
    lst.sort(reverse=True)
    for ltrs, c in lst:
        print(c, '-', ltrs)

exercise11()
As you can see, I am able to calculate and sort the freq% in a separate list, but I am not able to include it in the (count, letter) tuples of lst alongside the letter and frequency. Is there any way to solve this problem?
Also, if you have any other suggestions for my code, please do mention them.
Modification
Applying a simple modification as mentioned by wwii, I got the desired output. All I had to do was add one more parameter to the print statement while iterating over lst. Previously I had tried to build another list for the freq%, sort it, and then insert it into the letter-count tuples, which didn't work out.
for ltrs, c in lst:
    print(c, '-', ltrs, '-', round(ltrs / totalcount * 100, 2), '%')
Your count data is in a dictionary of {letter:count} pairs.
You can use the dictionary to calculate the total count like this:
total_count = sum(counts.values())
Then don't calculate the percentage till you are iterating over the counts...
for letter, count in counts.items():
    print(f'{letter} - {count} - {100*count/total_count}')  # Python 3.6+
    # print('{} - {} - {}'.format(letter, count, 100*count/total_count))  # Python versions before 3.6
Or if you want to put it all in a list so you can sort it:
data = []
for letter, count in counts.items():
    data.append((letter, count, 100*count/total_count))
Using operator.itemgetter for the sort key function can help code readability.
import operator
letter = operator.itemgetter(0)
count = operator.itemgetter(1)
frequency = operator.itemgetter(2)
# sort by whichever field you like, e.g.
data.sort(key=letter)     # alphabetical
data.sort(key=count)      # by count
data.sort(key=frequency)  # by frequency
Tuples are immutable, which is probably the issue you are running into. The other issue is the simple form of the sort call; a more advanced key function will serve you well. See below.
Because tuples are immutable and lists are mutable, changing lst from a list of tuples to a list of lists is a valid approach. Then, since each element of lst is a list of [letter, count, frequency%], sort with a lambda key can order by whichever index you'd like. The following is to be inserted after your for line in fop: loop.
lst = []
for ltrs, c in counts.items():
    lst.append([ltrs, c])
totalcount = sum([x[1] for x in lst])  # sum all 'count' values in a list comprehension
for elem in lst:
    elem.append((elem[1] / totalcount) * 100)  # each element of 'lst' is a mutable list, so the calculated frequency can be appended to it
lst.sort(reverse=True, key=lambda lst: lst[2])  # sort in-place in reverse order by index 2
The items in freqlst, countlst, and lst are related to each other only by their position; if any one of them is sorted, that relationship is lost.
Zipping the lists together before sorting maintains the relationship.
I will pick up from your list-initialization lines.
lst = []
countlst = []
freqlst = []
for ltr, c in counts.items():
    # change here: lst now only contains letters
    lst.append(ltr)
    countlst.append(c)
totalcount = sum(countlst)
for ec in countlst:
    efreq = (ec / totalcount) * 100
    freqlst.append(efreq)
# New stuff here. Note: this only works in Python 3+
zipped = zip(lst, countlst, freqlst)
zipped = sorted(zipped, key=lambda x: x[1])
for ltr, c, freq in zipped:
    print("{} - {} - {}%".format(ltr, c, freq))  # love me the format method :)
Basically, zip combines the lists into one list of tuples. Then you can use a lambda key function to sort those tuples (a very common Stack Overflow question).
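As a tiny illustration with made-up letters and counts (not your data), zipping and then sorting by the count field keeps each letter, count, and percentage together:
letters = ['a', 'b', 'c']
counts = [9, 2, 5]
freqs = [56.25, 12.5, 31.25]
rows = sorted(zip(letters, counts, freqs), key=lambda x: x[1], reverse=True)
print(rows)  # [('a', 9, 56.25), ('c', 5, 31.25), ('b', 2, 12.5)]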
I think I was able to achieve what you wanted by using lists instead of tuples. Tuples cannot be modified, but if you really want to know how, click here.
(I also added the possibility to quit the program.)
Important: Never forget to comment your code
The code:
def exercise11():
    import string
    while True:
        try:
            fname = input('Enter the file name -> ')
            print('Press 0 to quit the program')  # give the user the option to quit the program easily
            if fname == '0':
                break
            fop = open(fname)
            break
        except:
            print('This file does not exist. Please try again!')
            continue
    counts = {}
    for line in fop:
        line = line.translate(str.maketrans('', '', string.punctuation))
        line = line.translate(str.maketrans('', '', string.whitespace))
        line = line.translate(str.maketrans('', '', string.digits))
        line = line.lower()
        for ltr in line:
            if ltr in counts:
                counts[ltr] += 1
            else:
                counts[ltr] = 1
    lst = []
    countlst = []
    freqlst = []
    for ltrs, c in counts.items():
        # add a zero as a placeholder &
        # use square brackets so you get a list that you can modify
        lst.append([c, ltrs, 0])
        countlst.append(c)
    totalcount = sum(countlst)
    for ec in countlst:
        efreq = (ec / totalcount) * 100
        freqlst.append(efreq)
    freqlst.sort(reverse=True)
    lst.sort(reverse=True)
    # count the total of the letters
    counter = 0
    for ltrs in lst:
        counter += ltrs[0]
    # calculate the percentage for each letter
    for letter in lst:
        percentage = (letter[0] / counter) * 100
        letter[2] += float(format(percentage, '.2f'))
    for i in lst:
        print('The letter {} is repeated {} times, which is {}%'.format(i[1], i[0], i[2]))

exercise11()
<?php
$fh = fopen("text.txt", 'r') or die("File does not exist");
$line = fgets($fh);
$words = count_chars($line, 1);
foreach ($words as $key => $value)
{
    echo "The character <b>' " . chr($key) . " '</b> was found <b>$value</b> times. <br>";
}
?>
def countFrequency(L):
    fdict = {}
    for x in range(0, len(L)):
        for key, value in fdict:
            if L[x] == fdict[str(x)]:
                value = value + 1
            else:
                fdict[L[x]] = 1
    return fdict
I'm trying to count the frequency of occurrences of each symbol in a given string and create a dictionary from the result. For some reason, the function just returns an empty dictionary. I think the problem is with adding a new value to the dictionary, but I'm not sure how to troubleshoot or fix it.
input: countFrequency('MISSISSIPPI')
output: {}
Do this:
def countFrequency(L):
    fdict = {}
    for x in range(0, len(L)):
        if str(L[x]) in fdict.keys():
            fdict[str(L[x])] = fdict[str(L[x])] + 1
        else:
            fdict[str(L[x])] = 1
    return fdict
The code in the inner loop:
if L[x] == fdict[str(x)]:
    value = value + 1
else:
    fdict[L[x]] = 1
is never executed, because the dictionary is empty initially, so the inner for loop has nothing to iterate over and you never add any value to fdict.
You can do it like this:
def countFrequency(L):
    fdict = {}
    for symbol in L:  # gets each letter in L
        if symbol in fdict:  # check if the dictionary already has a value for the symbol
            fdict[symbol] += 1
        else:
            fdict[symbol] = 1
    return fdict
Another issue is that value is only a local copy: incrementing it never writes anything back into the dictionary, so even if the loop body were reached, no count in fdict would ever be updated.
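As an aside, a minimal alternative sketch using collections.Counter from the standard library, which does this bookkeeping for you in one pass:
from collections import Counter

def countFrequency(L):
    # Counter builds the {symbol: occurrences} mapping in a single pass over L
    return dict(Counter(L))

print(countFrequency('MISSISSIPPI'))  # {'M': 1, 'I': 4, 'S': 4, 'P': 2}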
So I need to save the results of a loop and I'm having some difficulty. I want to record my results in a new list, but I get "string index out of range" and other errors. The end goal is to record the products of digits 1-5, 2-6, 3-7, etc., eventually keeping the highest product.
def product_of_digits(number):
    d = str(number)
    for integer in d:
        s = 0
        k = []
        while s < (len(d)):
            j = (int(d[s]) * int(d[s+1]) * int(d[s+2]) * int(d[s+3]) * int(d[s+4]))
            s += 1
            k.append(j)
        print(k)

product_of_digits(n)
A similar question came up some time ago. Hi Chauxvive!
This happens because you let s run up to the last index of d and then access d[s+4] and so on, which goes past the end of the string. Instead, change your while loop to:
while s < (len(d) - 4):
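For illustration, here is one way the fixed function might look with that bound applied, also keeping the highest product as you described (the 5-digit window is assumed from your code):
def product_of_digits(number):
    d = str(number)
    k = []
    s = 0
    while s < (len(d) - 4):  # stop early enough that d[s+4] is still a valid index
        j = int(d[s]) * int(d[s+1]) * int(d[s+2]) * int(d[s+3]) * int(d[s+4])
        k.append(j)
        s += 1
    print(k)
    return max(k) if k else None  # highest product, or None if the number has fewer than 5 digits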
I have a list with approximately 177071007 items,
and I'm trying to perform the following operations:
a) get the first and last occurrence of each unique item in the list;
b) get the number of occurrences.
def parse_data(file, op_file_test):
    ins = csv.reader(open(file, 'rb'), delimiter='\t')
    pc = list()
    rd = list()
    deltas = list()
    reoccurance = list()
    try:
        for row in ins:
            pc.append(int(row[0]))
            rd.append(int(row[1]))
    except:
        print row
        pass
    unique_pc = set(pc)
    unique_pc = list(unique_pc)
    print "closing file"
    # takes a long time from here!
    for a in range(0, len(unique_pc)):
        index_first_occurance = pc.index(unique_pc[a])
        index_last_occurance = len(pc) - 1 - pc[::-1].index(unique_pc[a])
        delta_rd = rd[index_last_occurance] - rd[index_first_occurance]
        deltas.append(int(delta_rd))
        reoccurance.append(pc.count(unique_pc[a]))
        print unique_pc[a], delta_rd, reoccurance[a]
    print "printing to file"
    map_file = open(op_file_test, 'a')
    for a in range(0, len(unique_pc)):
        print >>map_file, "%d, %d, %d" % (unique_pc[a], deltas[a], reoccurance[a])
    map_file.close()
However, each of the lookups inside the loop (index, count, and the reversed-list index) is on the order of O(n), and they run once per unique value, so the loop is extremely slow.
Is there a way to make the for loop run fast? By that I mean: do you think yielding would make it faster, or is there some other way? Unfortunately, I don't have numpy.
Try the following:
from collections import defaultdict

# Keep a dictionary of our rd and pc values, with the value as a list of the line numbers each occurs on
# e.g. {'10': [1, 45, 79]}
pc_elements = defaultdict(list)
rd_elements = defaultdict(list)

with open(file, 'rb') as f:
    line_number = 0
    csvin = csv.reader(f, delimiter='\t')
    for row in csvin:
        try:
            pc_elements[int(row[0])].append(line_number)
            rd_elements[int(row[1])].append(line_number)
            line_number += 1
        except ValueError:
            print("Not a number")
            print(row)
            line_number += 1
            continue

for pc, indexes in pc_elements.iteritems():
    print("pc {0} appears {1} times. First on row {2}, last on row {3}".format(
        pc,
        len(indexes),
        indexes[0],
        indexes[-1]
    ))
This works by building a dictionary while reading the TSV, with the pc value as the key and a list of the line numbers it occurs on as the value. Dict keys are unique by nature, so we avoid the separate set, and the list values simply record the rows each key occurs on.
Example:
pc_elements = {10: [4, 10, 18, 101], 8: [3, 12, 13]}
would output:
"pc 10 appears 4 times. First on row 4, last on row 101"
"pc 8 appears 3 times. First on row 3, last on row 13"
As you scan items from your input file, put them into a collections.defaultdict(list) where the key is the item and the value is a list of occurrence indices. It takes linear time to read the file and build this data structure, then constant time to get the first and last occurrence index of an item, and constant time to get the number of occurrences of an item.
Here's how it might work
mydict = collections.defaultdict(list)
for item, index in itemfilereader:  # O(n)
    mydict[item].append(index)

# first occurrence of item, O(1)
mydict[item][0]
# last occurrence of item, O(1)
mydict[item][-1]
# number of occurrences of item, O(1)
len(mydict[item])
Maybe it's worth changing the data structure used. I'd use a dict keyed by pc, with the occurrences as the values.
lookup = {}
counter = 0
for line in ins:
    values = lookup.setdefault(int(line[0]), [])
    values.append((counter, int(line[1])))
    counter += 1

for key, val in lookup.iteritems():
    value_of_first_occurrence = val[0][1]
    value_of_last_occurrence = val[-1][1]
    first_occurrence = val[0][0]
    last_occurrence = val[-1][0]
    number_of_occurrences = len(val)
Try replacing the lists with dicts; lookup in a dict is much faster than in a long list.
That could be something like this:
def parse_data(file, op_file_test):
    ins = csv.reader(open(file, 'rb'), delimiter='\t')
    # Dict of pc -> [rd first occurrence, rd last occurrence, list of occurrences]
    occurences = {}
    for i, row in enumerate(ins):
        try:
            pc = int(row[0])
            rd = int(row[1])
        except:
            print row
            continue
        if pc not in occurences:
            occurences[pc] = [rd, rd, i]
        else:
            occurences[pc][1] = rd
            occurences[pc].append(i)
    # (Remove the sorted if you don't need them sorted but need them faster)
    for value in sorted(occurences.keys()):
        print "value: %d, delta: %d, occurences: %s" % (
            value, occurences[value][1] - occurences[value][0],
            ", ".join(str(x) for x in occurences[value][2:]))