How to count the longest run of consecutive occurrences of a substring in a string? - python

I'm doing an exercise (CS50 - DNA) where I have to count specific consecutive substrings (STRs) mimicking DNA sequences. I'm finding myself overcomplicating my code, and I'm having a hard time figuring out how to proceed.
I have a list of substrings:
strs = ['AGATC', 'AATG', 'TATC']
And a String with a random sequence of letters:
AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
For each entry in strs, I want to count the longest run of consecutive repeats of that entry in the sequence.
So:
'AGATC' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
'AATG' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
'TATC' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
resulting in [4, 1, 5]
(Note that this isn't the best example, since there are no random repeating patterns scattered around, but I think it illustrates what I'm looking for.)
I know that I should be using something like re.match(rf"({strs}){2,}", string), because str.count(strs) counts ALL occurrences, consecutive or not.
My code so far:
#!/usr/bin/env python3
import csv
import sys
from cs50 import get_string
# sys.exit to terminate the program
# sys.exit(2) UNIX default for wrong args
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(2)
# open file, make it into a list, get STRS, remove header
with open(sys.argv[1], "r") as database:
    data = list(csv.reader(database))
    STRS = data[0]
    data.pop(0)
    # remove "name" so only thing remaining are STRs
    STRS.pop(0)
# open file to compare against db
with open(sys.argv[2], "r") as seq:
    sequence = seq.read()
sequenceCount = []
# for each STR count the occurrences
# sequence.count(s) returns all
for s in STRS:
    sequenceCount.append(sequence.count(s))
print(STRS)
print(sequenceCount)
"""
sequenceCount = {}
# for each STR count the occurrences
for s in STRS:
    sequenceCount[s] = sequence.count(s)
for line in data:
    print(line)
    for item in line[1:]:
        continue
# rf"({STRS}){2,}"
"""

A regular expression for finding a repeated string looks like r"(AGATC)+".
For example,
import re
sequence = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG"
pattern = "AGATC"
r = re.search(r"({})+".format(pattern), sequence)
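if r:
    print("start at", r.start())
    print("end at", r.end())
If a match is found, you can get its starting and ending positions with the .start() and .end() methods. Dividing the matched length (end minus start) by the length of the pattern gives the number of repetitions. Note that re.search only returns the first match.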
If you need to find all matches in the sequence, then you can use re.finditer, which gives you match objects iteratively.
You can loop over target patterns and find the longest one.
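As a minimal sketch of the approach described above (the helper name longest_run and the shortened sample sequence are made up for illustration), the repetition count of each match is its length divided by the pattern length, and the answer is the maximum over all matches. Note that re.finditer returns non-overlapping matches, so this assumes the pattern cannot overlap with itself in a way that hides a longer run. Also, if the pattern is built with an f-string, literal braces such as {2,} would need to be doubled ({{2,}}).

```python
import re

def longest_run(pattern, sequence):
    """Return the largest number of back-to-back copies of pattern in sequence."""
    # re.escape guards against regex metacharacters in the pattern.
    run_regex = re.compile(r"(?:{})+".format(re.escape(pattern)))
    longest = 0
    for match in run_regex.finditer(sequence):
        # Each match is one run; its length is a multiple of len(pattern).
        repeats = (match.end() - match.start()) // len(pattern)
        longest = max(longest, repeats)
    return longest

# Hypothetical, shortened sequence (not the one from the question).
sample = "GG" + "AGATC" * 4 + "TATC" * 5 + "AATG" + "AGATC"
strs = ["AGATC", "AATG", "TATC"]
print([longest_run(s, sample) for s in strs])  # [4, 1, 5]
```

A pattern that never occurs simply yields 0, because the loop body never runs.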

This uses two for loops: one to take each string from strs, and another to walk along the dna strand and compare that string against it. When a match is found, a while loop keeps looking for consecutive (back-to-back) matches. Inline comments briefly explain each step.
dna = 'AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATAGATCTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG'
strs = ['AGATC', 'AATG', 'TATC']
def seq_finder(sequence, dna):
    counter = [0] * len(sequence)  # Create a list of zeros to store the longest run of each entry
    for idx, seq in enumerate(sequence):  # Iterate over every entry in our sequence "strs"
        k = len(seq)
        for i in range(len(dna)):  # For each entry, try every start position in our "dna" strand
            holder = 0  # A temporary holder that stores the number of *consecutive* matches starting at i
            while dna[i + holder * k:i + (holder + 1) * k] == seq:  # While another copy follows back to back:
                holder += 1  # Increment our holder by 1
            if holder > counter[idx]:
                counter[idx] = holder  # Only replace counter if new holder > old holder
    return counter

Related

cs50 dna pset6 Works on small database but not on large

This is my solution to the CS50 pset6 DNA problem in Python. It works fine on the small database, but on the large one it fails with:
IndexError: list index out of range
I tried adding a print to see where the error is, and the large database prints out as well. I'm not sure what to do next.
import csv
import sys
def main():
    # TODO: Check for command-line usage
    if len(sys.argv) != 3:
        print("Usage: python dna.py database.csv sequence.txt")
        sys.exit(1)
    # TODO: Read database file into a variable
    dna_database = []
    with open(sys.argv[1], "r") as dna_data_file:
        reader = csv.DictReader(dna_data_file)
        for row in reader:
            dna_database.append(row)
    # TODO: Read DNA sequence file into a variable
    with open(sys.argv[2], "r") as load_sequence:
        sequence = load_sequence.read()
    # TODO: Find longest match of each STR in DNA sequence
    STR = list(dna_database[0].keys())[1:]
    STR_match = {}
    for i in range(len(dna_database)):
        # print(dna_database)
        STR_match[STR[i]] = longest_match(sequence, STR[i])
    # TODO: Check database for matching profiles
    for i in range(len(dna_database)):
        matches = 0
        for j in range(len(STR)):
            if int(STR_match[STR[j]]) == int(dna_database[i][STR[j]]):
                matches += 1
        if matches == len(STR):
            print(dna_database[i]['name'])
            sys.exit(0)
    print("No Match")
    return
def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""
    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)
    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):
        # Initialize count of consecutive runs
        count = 0
        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:
            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length
            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1
            # If there is no match in the substring
            else:
                break
        # Update most consecutive matches found
        longest_run = max(longest_run, count)
    # After checking for runs at each character in sequence, return longest run found
    return longest_run
main()
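Judging only from the code above, one likely cause of the IndexError is the loop that fills STR_match: it runs over range(len(dna_database)), which is the number of people, but indexes STR, which holds the STR names. That happens to work when both counts are equal and fails once the database has more rows than STR columns. A minimal sketch of iterating over the STR names instead follows; the rows, the sequence, and the helper find_profile are hypothetical stand-ins, and longest_match is a compact restatement of the helper above.

```python
def longest_match(sequence, subsequence):
    """Compact restatement of the question's helper: longest run of subsequence."""
    longest_run = 0
    step = len(subsequence)
    for i in range(len(sequence)):
        count = 0
        while sequence[i + count * step:i + (count + 1) * step] == subsequence:
            count += 1
        longest_run = max(longest_run, count)
    return longest_run

# Hypothetical stand-in for the rows csv.DictReader would produce:
# three people but only two STR columns.
dna_database = [
    {"name": "Alice", "AGATC": "2", "AATG": "1"},
    {"name": "Bob", "AGATC": "4", "AATG": "1"},
    {"name": "Charlie", "AGATC": "3", "AATG": "2"},
]
sequence = "AGATC" * 4 + "TT" + "AATG"

STR = list(dna_database[0].keys())[1:]
# Iterate over the STR names themselves, not over range(len(dna_database)).
STR_match = {name: longest_match(sequence, name) for name in STR}

def find_profile(database, str_match):
    """Return the first name whose STR counts all match, else 'No Match'."""
    for row in database:
        if all(int(row[name]) == count for name, count in str_match.items()):
            return row["name"]
    return "No Match"

print(find_profile(dna_database, STR_match))  # Bob
```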

How to extract words from repeating strings

Here I have a string in a list:
['aaaaaaappppppprrrrrriiiiiilll']
I want to get the word 'april' from the list, but not just once: as many times as the word 'april' actually occurs in the string.
The output should be something like:
['aprilaprilapril']
Because the word 'april' occurred three times in that string.
Well, the word itself didn't actually occur three times; each of its characters did. So I want to arrange these characters into 'april' as many times as they all appear in the string.
My idea is basically to extract words from arbitrary strings: not just the word once, but every copy of it that the string can supply, with the characters ordered the way I want.
But there are some annoying conditions. You can't delete all the elements in the list and then replace them with the word 'april' (you can't replace the whole string with 'april'); you can only extract 'april' from the string. You also can't delete the list that holds the string. Think of the string as very important data: we only want some of it, in a particular order, and we need to delete everything that doesn't fit our "data chain" (the word 'april'). If you delete the whole string you lose all the important data, and since you don't know how to make another of these "data chains", you can't just put the word 'april' back in the list.
If anyone knows how to solve my weird problem, please help me out. I am a beginner Python programmer. Thank you!
One way is to use itertools.groupby, which groups identical consecutive characters. Unpacking the groups into zip then iterates n times, where n is the number of characters in the smallest group:
from itertools import groupby
text = 'aaaaaaappppppprrrrrriiiiiilll'
result = ''
for each in zip(*[list(g) for k, g in groupby(text)]):
    result += ''.join(each)
# result = 'aprilaprilapril'
Another possible solution is to create a custom counter that counts each unique run of characters. (Note that this relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 and is an implementation detail of CPython 3.6; in older versions the order is not guaranteed.)
def getCounts(strng):
    if not strng:
        return [], 0
    counts = {}
    current = strng[0]
    for c in strng:
        if c in counts.keys():
            if current==c:
                counts[c] += 1
        else:
            current = c
            counts[c] = 1
    return counts.keys(), min(counts.values())
result = ''
counts=getCounts('aaaaaaappppppprrrrrriiiiiilll')
for i in range(counts[1]):
    result += ''.join(counts[0])
# result = 'aprilaprilapril'
How about using regex?
import re
word = 'april'
text = 'aaaaaaappppppprrrrrriiiiiilll'
regex = "".join(f"({c}+)" for c in word)
match = re.match(regex, text)
if match:
    # Find the lowest amount of character repeats
    lowest_amount = min(len(g) for g in match.groups())
    print(word * lowest_amount)
else:
    print("no match")
Outputs:
aprilaprilapril
Works like a charm
Here is a more native approach, with plain iteration.
It has a time complexity of O(n).
It first skips any characters that come before the first occurrence of the key's first letter. An outer loop then iterates over the characters in the search key, and an inner while loop consumes all consecutive occurrences of that character in the search string while maintaining a counter. Once all consecutive occurrences of the current letter have been consumed, it updates minLetterCount to be the minimum of its previous value and this new count. Once we have iterated over all letters in the key, we return this accumulated minimum.
def countCompleteSequenceOccurences(searchString, key):
    left = 0
    minLetterCount = 0
    letterCount = 0
    # Skip anything before the first occurrence of the key's first letter
    while key and left < len(searchString) and searchString[left] != key[0]:
        left += 1
    for i, searchChar in enumerate(key):
        while left < len(searchString) and searchString[left] == searchChar:
            letterCount += 1
            left += 1
        minLetterCount = letterCount if i == 0 else min(minLetterCount, letterCount)
        letterCount = 0
    return minLetterCount
Testing:
testCasesToOracles = {
"aaaaaaappppppprrrrrriiiiiilll": 3,
"ppppppprrrrrriiiiiilll": 0,
"aaaaaaappppppprrrrrriiiiii": 0,
"aaaaaaapppppppzzzrrrrrriiiiiilll": 0,
"pppppppaaaaaaarrrrrriiiiiilll": 0,
"zaaaaaaappppppprrrrrriiiiiilll": 3,
"zzzaaaaaaappppppprrrrrriiiiiilll": 3,
"aaaaaaappppppprrrrrriiiiiilllzzz": 3,
"zzzaaaaaaappppppprrrrrriiiiiilllzzz": 3,
}
key = "april"
for case, oracle in testCasesToOracles.items():
    result = countCompleteSequenceOccurences(case, key)
    assert result == oracle
Usage:
key = "april"
result = countCompleteSequenceOccurences("aaaaaaappppppprrrrrriiiiiilll", key)
print(result * key)
Output:
aprilaprilapril
A word will only occur as many times as the minimum letter recurrence. To account for the possibility of repeated letters in the word (for example, appril), you need to factor this count out. Here is one way of doing this using collections.Counter:
from collections import Counter
def count_recurrence(kernel, string):
    # we need to count both strings
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)
    # now get effective count by dividing the occurrence in string by occurrence
    # in kernel
    effective_counter = {
        k: string_counter.get(k, 0) // v
        for k, v in kernel_counter.items()
    }
    # min occurrence of kernel is min of effective counter
    min_recurring_count = min(effective_counter.values())
    return kernel * min_recurring_count

Find longest substring in alphabetical order

I want to write a program that prints the longest substring in alphabetical order.
And in case of ties, it prints the first substring.
Here is what I wrote
import sys
s1 = str(sys.argv[1])
alpha = "abcdefghijklmnopqrstuvwxyz"
def longest_substring(s1):
    for i in range(len(alpha)):
        for k in range(len(alpha)):
            if alpha[i:k] in s1:
                return alpha[i:k]
print("Longest substring in alphabetical order:", longest_substring(s1))
However, it does not work and I do not know how to do the second part.
Can you help me, please?
Here is what your code should look like to achieve what you want:
#!/usr/bin/env python3.6
import sys
s1 = str(sys.argv[1])
alpha = "abcdefghijklmnopqrstuvwxyz"
subs = []
def longest_substring(s1):
    for i in range(len(alpha)):
        for k in range(len(alpha)):
            if alpha[i:k] in s1:
                subs.append(alpha[i:k])
    return max(subs, key=len)
print("Longest substring in alphabetical order:", longest_substring(s1))
You were returning right out of the function on the first alphabetically ordered substring you found. In my code, we add them to a list then print out the longest one.
Assume that a substring must contain 2 or more characters in alphabetical order. You should then not return the first occurrence but collect all of them and find the longest. I tried to keep your idea the same, but this is not the most efficient way:
def longest_substring(s1):
    res = []
    for i in range(len(alpha) - 2):
        for k in range(i + 2, len(alpha)):
            if alpha[i:k] in s1:
                res.append(alpha[i:k])
    return max(res, key=len)
You can rewrite a version of itertools.takewhile so that it takes a binary comparison function instead of a unary predicate.
def my_takewhile(predicate, starting_value, iterable):
    last = starting_value
    for cur in iterable:
        if predicate(last, cur):
            yield cur
            last = cur
        else:
            break
Then you can lowercase the word (since "Za" isn't in alphabetical order, but any [A-Z] compares lexicographically before any [a-z]) and get all the substrings.
i = 0
substrings = []
while i < len(alpha):
    it = iter(alpha[i:])
    # join the yielded characters; str() on a generator would not give the substring
    substring = ''.join(my_takewhile(lambda x,y: x<y, chr(0), it))
    i += len(substring)
    substrings.append(substring)
Then just find the longest substring in substrings.
result = max(substrings, key=len)
Instead of building a list of all possible substring slices and then checking which one exists in the string, you can build a list of all consecutive substrings, and then take the one with the maximum length.
This is easily done by grouping the characters using the difference between the ord of that character and an increasing counter; successive characters will have a constant difference. itertools.groupby is used to perform the grouping:
from itertools import groupby, count
alpha = "abcdefghijklmnopqrstuvwxyz"
c = count()
lst_substrs = [''.join(g) for _, g in groupby(alpha, lambda x: ord(x)-next(c))]
substr = max(lst_substrs, key=len)
print(substr)
# abcdefghijklmnopqrstuvwxyz
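To see what this grouping does on an input other than alpha, here is a small sketch (the function name longest_consecutive_run and the sample strings are made up for illustration). It groups characters whose code points increase by exactly one per position, so it finds runs of letters that are consecutive in the alphabet, such as 'defgh'. That is narrower than "in alphabetical order": 'abd' is split into 'ab' and 'd'.

```python
from itertools import count, groupby

def longest_consecutive_run(text):
    """Longest substring whose letters are consecutive in the alphabet."""
    counter = count()
    # ord(ch) minus the position stays constant along a run like 'defgh'.
    runs = ["".join(group)
            for _, group in groupby(text, lambda ch: ord(ch) - next(counter))]
    # max() returns the first longest run, which handles ties as the question asks.
    return max(runs, key=len) if runs else ""

print(longest_consecutive_run("abcxdefghab"))  # defgh
```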
As @AdamSmith commented, the above assumes the characters are always in alphabetical order. In the case they may not be, one can enforce the order by checking that items in the group are alphabetical:
from itertools import groupby, count, tee
lst = []
c = count()
for _, g in groupby(alpha, lambda x: ord(x)-next(c)):
    a, b = tee(g)
    try:
        if ord(next(a)) - ord(next(a)) == -1:
            lst.append(''.join(b))
    except StopIteration:
        pass
    lst.extend(b) # add each chr from non-alphabetic iterator (could be empty)
substr = max(lst, key=len)
Back up and look at this problem again.
1. You are looking for a maximum, so you should basically do this (pseudocode):
set max to ""
loop through sequences
    if the new sequence is longer than max, then replace max
2. To find the sequences, you can be more efficient if you only step through the input characters once.
Here is a version of this:
def longest_substring(s1):
    max_index, max_len = 0, 1 # keep track of the longest sequence here
    last_c = s1[0] # previous char
    start, seq_len = 0, 1 # tracking current sequence
    for i, c in enumerate(s1[1:]):
        if c >= last_c: # can we extend sequence in alpha order
            seq_len += 1
            if seq_len > max_len: # found longer
                max_index, max_len = start, seq_len
        else: # this char starts new sequence
            seq_len = 1 # the new sequence already contains this char
            start = i + 1
        last_c = c
    return s1[max_index:max_index+max_len]
s = 'azcbobobegghakl'
def max_alpha_subStr(s):
    '''
    INPUT: s, a string of lowercase letters
    OUTPUT: longest substring of s in which the
    letters occur in alphabetical order
    '''
    longest = s[0] # set variables 'longest' and 'current' as 1st letter in s
    current = s[0]
    for i in s[1:]: # begin iteration from 2nd letter to the end of s
        if i >= current[-1]: # if the 'current' letter is bigger
                             # than the letter before it
            current += i # add that letter to the 'current' letter(s) and
            if len(current) > len(longest): # check if the 'current' length of
                                            # letters are longer than the letters in 'longest'
                longest = current # if 'current' is the longest, make 'longest'
                                  # now equal 'current'
        else: # otherwise the current letter is lesser
              # than the letter before it and
            current = i # restart evaluating from the point of iteration
    print("Longest substring in alphabetical order is: ", longest)
    return longest
max_alpha_subStr(s)

10 most frequent words in a string Python

I need to display the 10 most frequent words in a text file, from the most frequent to the least as well as the number of times it has been used. I can't use the dictionary or counter function. So far I have this:
import urllib
cnt = 0
i=0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i<len(uniques):
        i+=1
        if word in uniques:
            cnt += 1
print cnt
Now I think I should look for every word in the array 'uniques' and see how many times it is repeated in this file and then add that to another array that counts the instance of each word. But this is where I am stuck. I don't know how to proceed.
Any help would be appreciated. Thank you
This can be done easily with Python's collections module.
Here is a solution:
from collections import Counter
data_set = "Welcome to the world of Geeks " \
    "This portal has been created to provide well written well " \
    "thought and well explained solutions for selected questions " \
    "If you like Geeks for Geeks and would like to contribute " \
    "here is your chance You can write article and mail your article " \
    " to contribute at geeksforgeeks org See your article appearing on " \
    "the Geeks for Geeks main page and help thousands of other Geeks. "
# split() returns list of all the words in the string
split_it = data_set.split()
# Pass the split_it list to instance of Counter class.
Counters_found = Counter(split_it)
#print(Counters)
# most_common() produces k frequently encountered
# input values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)
You're on the right track. Note that the algorithm below is quite slow, because for each unique word it iterates over all of the words. A much faster approach without hashing would involve building a trie.
# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()
# Get the set of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)
# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0 # Initialize the count to zero.
    for word in words: # Iterate over the words.
        if word == unique: # Is this word equal to the current unique?
            count += 1 # If so, increment the count
    counts.append((count, unique))
counts.sort() # Sorting the list puts the lowest counts first.
counts.reverse() # Reverse it, putting the highest counts first.
# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
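A faster alternative that still avoids dict and Counter is sketched below. It is not the trie mentioned above but a simpler swapped-in technique: sort the words so that equal words become adjacent, then count each run in a single pass. The function name top_words and the sample sentence are made up for illustration.

```python
def top_words(words, n=10):
    """Return up to n (word, count) pairs, most frequent first, without dict or Counter."""
    counts = []          # list of [word, count] pairs, one per distinct word
    previous = None
    for word in sorted(words):
        if word == previous:
            counts[-1][1] += 1        # same word as before: extend its run
        else:
            counts.append([word, 1])  # new word: start a new run
            previous = word
    # Stable sort: words with equal counts stay in alphabetical order.
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return [(word, count) for word, count in counts[:n]]

sample = "the cat and the hat and the bat".split()
print(top_words(sample, 3))  # [('the', 3), ('and', 2), ('bat', 1)]
```

Sorting costs O(n log n), compared with one full pass over the words for every unique word in the version above.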
from string import punctuation #you will need it to strip the punctuation
import urllib
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
counter = {}
for line in txtFile:
    words = line.split()
    for word in words:
        k = word.strip(punctuation).lower() #the The or you You counted only once
        # you still have words like I've, you're, Alice's
        # you could change re to are, ve to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k,]
        #now the tally
        for k in ks:
            counter[k] = counter.get(k, 0) + 1
#and sorting the counter by the value which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print word, "\t", counter[word]
import urllib
import operator
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile) # this with .readlines() replaces new lines with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace()) # removes everything that's not alphanumeric or spaces.
word_counter = {}
for word in txtFile.split(" "): # split in every space.
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter: # if 'word' not in word_counter, add it, and set value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1 # if 'word' already in word_counter, increment it by 1
for i,word in enumerate(sorted(word_counter,key=word_counter.get,reverse=True)[:10]):
    # sorts the dict by the values, from top to bottom, takes the 10 top items,
    print "%s: %s - %s"%(i+1,word,word_counter[word])
output:
1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338
This method ensures that only alphanumeric characters and spaces reach the counter. It doesn't matter that much, though.
Personally I'd make my own implementation of collections.Counter. I assume you know how that object works, but if not I'll summarize:
text = "some words that are mostly different but are not all different not at all"
words = text.split()
resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We can certainly sort that based on frequency by using the key keyword argument of sorted, and return the first 10 items in that list. However that doesn't much help you because you don't have Counter implemented. I'll leave THAT part as an exercise for you, and show you how you might implement Counter as a function rather than an object.
def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d
Not difficult, actually. Go through each element of an iterable. If that element is NOT in d, add it to d with a value of 1. If it IS in d, increment that value. It's more easily expressed by:
def counter(iterable):
    d = {}
    for element in iterable:
        # setdefault returns the current value, or 0 for a new element
        d[element] = d.setdefault(element, 0) + 1
    return d
Note that in your use case, you probably want to strip out the punctuation and possibly casefold the whole thing (so that someword gets counted the same as Someword rather than as two separate words). I'll leave that to you as well, but I will point out str.strip takes an argument as to what to strip out, and string.punctuation contains all the punctuation you're likely to need.
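For readers who want to check their own attempt at the parts left as an exercise, here is one possible sketch that puts the pieces together. The function name most_frequent and the sample text are made up for illustration, and the counter function is restated so the block runs on its own.

```python
from string import punctuation

def counter(iterable):
    """Function version of the tally described above."""
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

def most_frequent(text, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    # Strip surrounding punctuation and lowercase so 'Words,' and 'words' match.
    words = [w.strip(punctuation).lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    tally = counter(words)
    # Sort by count using the key argument of sorted, highest first.
    return sorted(tally.items(), key=lambda item: item[1], reverse=True)[:n]

text = "Some words, some WORDS; other words."
print(most_frequent(text, 2))  # [('words', 3), ('some', 2)]
```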
You can also do it with pandas DataFrames and get the result in a convenient form, as a table of words and their frequencies, ordered by frequency.
import pandas as pn
def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(pn.unique(words_list))
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in pn.Series.tolist(words_df_unique.unique):
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i+=1
    res = words_df_unique.sort_values('count', ascending = False)
    return(res)
To do the same operation on a pandas DataFrame, you can use Counter from collections:
from collections import Counter
cnt = Counter()
for text in df['text']:
    for word in text.split():
        cnt[word] += 1
# Find most common 10 words from the Pandas dataframe
cnt.most_common(10)

How to find total number of positive and negative words from a text?

I want to find the total number of positive and negative words matched in a given text. I have a list of positive words in positive.txt and a list of negative words in negative.txt. If a word matches the positive word list, I want a simple integer variable to be incremented by 1, and the same for a negative match. My code below extracts the text under @class="story-hed". This is the text that I want to compare with the lists of positive and negative words, along with the total word count. My code is:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dawn.items import DawnItem
class dawnSpider(BaseSpider):
    name = "dawn"
    allowed_domains = ["dawn.com"]
    start_urls = [
        "http://dawn.com/"
    ]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//h3[@class="story-hed"]//a/text()').extract()
        items=[]
        for site in sites:
            item=DawnItem()
            item['title']=site
            items.append(item)
        return items
The standalone code below could do the trick:
from collections import Counter
def readwords( filename ):
    f = open(filename)
    words = [ line.rstrip() for line in f.readlines()]
    return words
positive = readwords('positive.txt')
negative = readwords('negative.txt')
paragraph = 'this is really bad and in fact awesome. really awesome.'
count = Counter(paragraph.split())
pos = 0
neg = 0
for key, val in count.iteritems():
    key = key.rstrip('.,?!\n') # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val
print pos, neg
Here is what I have in the two input files:
positive.txt:
good
awesome
negative.txt:
bad
ugly
and the output is:
2 1
To implement this in scrapy, you might want to use an item pipeline http://doc.scrapy.org/en/latest/topics/item-pipeline.html
The line "for key, val in count.iteritems():" only works in Python 2. If you're using Python 3, use items() instead:
for key, val in count.items():
    key = key.rstrip('.,?!\n') # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val
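Under Python 3, the whole counting step might be sketched as follows. The function name count_sentiment is made up for illustration, and the word lists are passed in directly rather than read from positive.txt and negative.txt. Converting the lists to sets makes each membership test cheap, and punctuation is stripped before counting so that 'awesome.' and 'awesome' share one key.

```python
from collections import Counter

def count_sentiment(paragraph, positive, negative):
    """Return (positive_count, negative_count) for the words in paragraph."""
    positive, negative = set(positive), set(negative)
    # Normalise each token before counting it.
    words = [w.strip('.,?!\n').lower() for w in paragraph.split()]
    count = Counter(w for w in words if w)
    pos = sum(val for key, val in count.items() if key in positive)
    neg = sum(val for key, val in count.items() if key in negative)
    return pos, neg

paragraph = 'this is really bad and in fact awesome. really awesome.'
print(count_sentiment(paragraph, ['good', 'awesome'], ['bad', 'ugly']))  # (2, 1)
```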
First you may want to read the files. Assuming you have one word per line, you can read all the words with the following code:
positive = [l.strip() for l in open("positive.txt")]
Once done, you can create a dict that holds each word as key and its count as value. To initialise the counts to zero you can use:
positive_count = dict.fromkeys(positive, 0)
Finally, you just iterate over all the items and increment the count when a word is found:
for item in items:
    if item in positive_count:
        positive_count[item] += 1
And finally you can print the results with:
for item, value in positive_count.iteritems():
    print "Word %s count %d" % (item, value)
The negative words work the same way; they are omitted to keep the answer short.
This depends on the size of the word lists. If they are smallish (less than a few kB), then read them into a list:
with open(positive_wordlist_file_name) as fd:
    positive_words = [line.strip() for line in fd]
Once you have two word lists, you can then go through the text with them, line by line if you can. Split each line into words, and then use the "in" operator to check them against the lists. I'd use a couple of co-routines in a class for it:
class WordCounter:
    # You can probably read word lists and store them here
    # (the code below assumes self.positive_words holds the positive word list)
    def positive_word_counter(self):
        """Co-routine that will count positive words. I'll leave it to reader
        to make a similar negative word one"""
        self.positive_count = 0
        while True:
            words = yield
            matched = [word for word in words if word in self.positive_words]
            self.positive_count += len(matched)
    def read_text(self, text):
        """Text - some iterable of lines - a file handle, or list or whatever."""
        #expand on this split with other word separators - or use re.split with the word boundary instead
        line_words = (line.strip().split(' ') for line in text)
        #Create and prime coroutines
        positive_counter = self.positive_word_counter()
        next(positive_counter)
        negative_counter = self.negative_word_counter()
        next(negative_counter)
        #Now fire it in
        for words in line_words:
            positive_counter.send(words)
            negative_counter.send(words)
        #You should now be able to read positive/negative counts from this object
