counting the unique words in a text file - python

Some of the unique words in the text file are not counted, and I have no idea what's wrong with my code.
file = open('tweets2.txt', 'r')
unique_count = 0
lines = file.readlines()
line = lines[3]
per_word = line.split()
for i in per_word:
    if line.count(i) == 1:
        unique_count = unique_count + 1
print(unique_count)
file.close()
Here is the text file:
"I love REDACTED and Fiesta and all but can REDACTED host more academic-related events besides strand days???"
The output of this code is:
16
The expected output for this line, based on the text file, should be:
17
"i will crack a raw egg on my head if REDACTED move the resumption of classes to Jan 7. im not even kidding."
The output of this code is:
20
The expected output for this line, based on the text file, should be:
23

If you want to count the number of unique whitespace-delimited tokens (case-sensitive) in the entire file, then:
with open('myfile.txt') as infile:
    print(len(set(infile.read().split())))

count() matches substrings (characters), not whole words. Instead, use the Pythonic way with the set() function to drop duplicated words:
per_word = set(line.split())
print(len(per_word))

You are counting each word as a substring in the whole line because you do:
for i in per_word:
    if line.count(i) == 1:
So some words repeat as substrings, but not as words. For example, the first word is "i": line.count("i") gives 7 (the "i" also appears in "if", "im", etc.), so you don't count it as a unique word (even though it is). If you do:
for i in per_word:
    if per_word.count(i) == 1:
then you will count each word as a whole word and get the output you need.
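A quick illustration of the difference, on a shortened version of the example line:
line = "i will crack a raw egg if im not kidding"
per_word = line.split()
print(line.count("i"))      # substring count: also matches the "i" inside "will", "if", "im", "kidding"
print(per_word.count("i"))  # whole-token count: 1, since "i" occurs once as a word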
Anyway, this is very inefficient (O(n^2)): you iterate over each word, and then count() iterates over the whole list again to count it. Either use a set, as suggested in other answers, or use a Counter:
from collections import Counter

unique_count = 0
line = "i will crack a raw egg on my head if REDACTED move the resumption of classes to Jan 7. im not even kidding."
per_word = line.split()
counter = Counter(per_word)
for count in counter.values():
    if count == 1:
        unique_count += 1

# Or simply
unique_count = sum(count == 1 for count in counter.values())
print(unique_count)
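For the example line above, both variants print 23, matching the expected output.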


String searching in text file and dict values combinations

I'm a total beginner to Python; I'm studying it at university and the professor gave us some work to do before the exam. I've currently been stuck on this program for almost 2 weeks. The rule is that we can't use any library.
Basically, I have a dictionary with several possible translations from an ancient language to English, a dictionary from English to Italian (only 1 key - 1 value pairs), a text file in the ancient language, and another text file in Italian. So far I scan the ancient-language file and search for strings matching the dictionary (using the .strip(".,:;?!") method), and I have saved the matching strings that contain at least 2 words in a list of strings.
Now comes the hard part: basically, I need to try all possible combinations of translations (values from ancient language to English), then take those translations from English to Italian through the other dictionary, and check whether the resulting string exists in the Italian file. If yes, I save the result and the paragraph where it was found (a result spread over different paragraphs doesn't count; it must be in the same one. I've made a small piece of code to count the paragraphs).
I'm having issues here for the following reasons:
In the strings that I've found, how am I supposed to replace the words and keep the punctuation? The returned result must contain all the punctuation, otherwise the output will be wrong.
If the string is contained but spans 2 different lines of the text, how should I proceed to make it work? For example, with a string of 5 words, the first 2 words match at the end of one line, but the remaining 3 words are the first 3 words of the next line.
As mentioned before, the dict from the ancient language to English is huge and can have up to 7 values (translations) for each key (ancient language). Is there any efficient way to try all the combinations while checking whether the string exists in a text file? This is probably the hardest part.
Probably the best way is to scan word by word every time and, in case the sequence is broken, reset it somehow and keep scanning the text file...
Any idea?
Here is commented code of what I've managed to do so far:
k = 2  # Random value; the whole program will be a function and the "k" value will differ each time
file = [line.strip().split(';') for line in open('lexicon-GR-EN.csv', encoding="utf8").readlines()]  # Opening CSV file with possible translations from ancient Greek to English
gr_en = {words[0]: tuple(words[1:]) for words in file}  # Creating a dictionary with the several translations (values)
file = open('lexicon-EN-IT.csv', encoding="utf8")  # Opening 2nd CSV file
en_it = {}  # Initializing dictionary
for row in file:  # Scanning each row of the CSV file (from English to Italian)
    L = row.rstrip("\n").split(';')  # Clearing newline char and splitting the words
    x = L[0]
    t1 = L[1]
    en_it[x] = t1  # Since in this CSV file all the words are 1 - 1, no length check is necessary (len(L) is always 2)
file = open('odyssey.txt', encoding="utf8")  # Opening text file
result = ()  # Empty tuple
spacechecker = 0  # Tells whether I'm on an even or odd line; odd lines are scanned normally, otherwise word order and words are reversed
wordcount = 0  # Counter of how many words have been found
paragraph = 0  # Paragraph counter, starts at 0
paragraphspace = 0  # Another paragraph variable; needed to prevent a double space from counting as a paragraph
string = ""  # Empty string to store matching sequences
foundwords = []  # Empty list to store words that have been found
completed_sequences = []  # Empty list; all completed sequences of words will be stored here
completed_paragraphs = []  # Shows in which paragraph each sequence of completed_sequences was found
for index, line in enumerate(file.readlines()):  # Starting line-by-line scan of the txt file
    words = line.split()  # Splitting words
    if not line.isspace() and index == 0:  # Since I know nothing about the "secret tests" that will be run on this program, this check for the start of the first paragraph prevents errors: first line is not space
        paragraph += 1  # Add +1 to paragraph counter
        spacechecker += 1  # Add +1 to spacechecker
    elif not line.isspace() and paragraphspace == 1:  # Checking if the previous line was space and the current one is not
        paragraphspace = 0  # Resetting the paragraphspace (previous line was space) value
        spacechecker += 1  # Increasing spacechecker by 1
        paragraph += 1  # This means we're in a new paragraph, so +1 to paragraph
    elif line.isspace() and paragraphspace == 1:  # Checking if the current line is space and the previous line was space too
        continue  # Do nothing and cycle again
    elif line.isspace():  # Checking if the current line is space
        paragraphspace += 1  # Increase paragraphspace (previous line was space) by 1
        continue
    else:
        spacechecker += 1  # In any other case, increase spacechecker by 1
    if spacechecker % 2 == 1:  # Check if spacechecker is odd
        for i in range(len(words)):  # If yes, scan the words in normal order
            if words[i].strip(",.!?:;-") in gr_en != "[unavailable]":  # If words[i] without any special char is in the dictionary
                currword = words[i]  # If yes, we call it "currword"
                foundwords.append(currword)  # Add currword to the foundwords list
                wordcount += 1  # Increase wordcount by 1
            elif (words[i].strip(",.!?:;-") in gr_en == "[unavailable]" and wordcount >= k) or (currword not in gr_en and wordcount >= k):  # Elif: it's not in the dictionary but wordcount has reached k
                string = " ".join(foundwords)  # Put the foundwords list in a string
                completed_sequences.append(string)  # And add this string to the list of completed_sequences
                completed_paragraphs.append(paragraph)  # Then add that string's paragraph to the list of completed_paragraphs
                result = list(zip(completed_sequences, completed_paragraphs))  # This is the required output format: a tuple with the string and its paragraph
                wordcount = 0
                foundwords.clear()  # Clearing the foundwords list
            else:  # If none of the above happened (word is not in the dictionary and wordcount still isn't >= k)
                wordcount = 0  # Reset wordcount to 0
                foundwords.clear()  # Clear the foundwords list
                continue  # Do nothing and cycle again
    else:  # The case of spacechecker being even
        words = words[::-1]  # Reverse the word order
        for i in range(len(words)):  # Scanning the row of words
            currword = words[i][::-1]  # currword is reversed here, since words on even lines are written in reverse
            if currword.strip(",.!?:;-") in gr_en != "[unavailable]":  # If currword without any special char is in the dictionary
                foundwords.append(currword)  # Append it to the foundwords list
                wordcount += 1  # Increase wordcount by 1
            elif (currword.strip(",.!?:;-") in gr_en == "[unavailable]" and wordcount >= k) or (currword.strip(",.!?:;-") not in gr_en and wordcount >= k):  # Elif: it's not in the dictionary but wordcount has reached k
                string = " ".join(foundwords)  # Add the words that have been found to the string
                completed_sequences.append(string)  # Append the string to the completed_sequences list
                completed_paragraphs.append(paragraph)  # Append the string's paragraph to the completed_paragraphs list
                result = list(zip(completed_sequences, completed_paragraphs))  # Adding to the result the tuple combination of strings and corresponding paragraphs
                wordcount = 0  # Reset wordcount
                foundwords.clear()  # Clear the foundwords list
            else:  # In case none of the above happened
                wordcount = 0  # Reset wordcount to 0
                foundwords.clear()  # Clear the foundwords list
                continue  # Do nothing and cycle again
I'd probably take the following approach to solving this:
Try to collapse down the 2 word dictionaries into one (ancient_italian below), removing English from the equation. For example, if ancient->English has {"canus": ["dog","puppy", "wolf"]} and English->Italian has {"dog":"cane"} then you can create a new dictionary {"canus": "cane"}. (Of course if the English->Italian dict has all 3 English words, you need to either pick one, or display something like cane|cucciolo|lupo in the output).
Come up with a regular expression that can distinguish between words and separators (punctuation), and output them in order into a list (word_list below), i.e. something like ['ecce', '!', ' ', 'magnus', ' ', 'canus', ' ', 'esurit', '.']
Step through this list, generating a new list. Something like:
translation = []
for item in word_list:
    if item.isalpha():
        # It's a word - translate it and add to the list
        translation.append(ancient_italian[item])
    else:
        # It's a separator - add to the list as-is
        translation.append(item)
Finally join the list back together: ''.join(translation)
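A minimal sketch of steps 2 and 3 together (the regex and the tiny ancient_italian dict here are illustrative assumptions, not the real data):
import re

# illustrative stand-in for the collapsed ancient->Italian dictionary
ancient_italian = {"ecce": "ecco", "magnus": "grande", "canus": "cane", "esurit": "ha fame"}

text = "ecce! magnus canus esurit."
# alternate runs of letters (words) and single non-letter characters (separators)
word_list = re.findall(r"[A-Za-z]+|[^A-Za-z]", text)
# -> ['ecce', '!', ' ', 'magnus', ' ', 'canus', ' ', 'esurit', '.']

# translate words, pass separators through unchanged
translation = "".join(ancient_italian.get(item, item) for item in word_list)
print(translation)  # ecco! grande cane ha fame.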
I'm unable to reply to your comment on the answer by match, but this may help:
It's not the most elegant approach, but it should work:
GR_IT = {}
for greek, eng in GR_EN.items():
    for word in eng:
        try:
            GR_IT[greek] = EN_IT[word]
        except KeyError:
            pass
If there's no translation for a word, it will be ignored, though.
To get a list split into words and punctuation, try this:
def repl_punc(s):
    punct = ['.', ',', ':', ';', '?', '!']
    for p in punct:
        s = s.replace(p, ' ' + p + ' ')
    return s

repl_punc(s).split()
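For example, repl_punc('ecce! magnus canus esurit.').split() returns ['ecce', '!', 'magnus', 'canus', 'esurit', '.']; note that, unlike the regex approach, the split() collapses whitespace, so exact spacing can't be reconstructed from the result.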

KeyError on the same word

I am trying to generate a sentence in the style of the Bible. But whenever I run it, it stops with a KeyError on the same exact word. This is confusing, as it only uses its own keys, and it is the same word in the error every time, despite the random.choice.
This is the txt file if you want to run it: ftp://ftp.cs.princeton.edu/pub/cs226/textfiles/bible.txt
import random

files = []
content = ""
output = ""
words = {}
files = ["bible.txt"]
sentence_length = 200
for file in files:
    file = open(file)
    content = content + " " + file.read()
content = content.split(" ")
for i in range(100):  # I didn't want to go through every word in the bible, so I'm just going through 100 words
    words[content[i]] = []
    words[content[i]].append(content[i+1])
word = random.choice(list(words.keys()))
output = output + word
for i in range(int(sentence_length)):
    word = random.choice(words[word])
    output = output + word
print(output)
The KeyError happens on this line:
word = random.choice(words[word])
It always happens for the word "midst". How? "midst" is the 100th word in the text, and that position is the first time it is seen. The consequence is that "midst" itself was never put into words as a key. Hence the KeyError.
Why does the program reach this word so fast? Partly because of a bug here:
for i in range(100):
    words[content[i]] = []
    words[content[i]].append(content[i+1])
The bug here is the words[content[i]] = [] statement: every time you see a word, you recreate an empty list for it. And the word before "midst" is "the". It's a very common word, so many words in the text are followed by "the". And since words["the"] ends up as just ["midst"] (each reset threw away the earlier successors), the problem tends to happen a lot, despite the randomness.
You can fix the bug in the creation of words:
for i in range(100):
    if content[i] not in words:
        words[content[i]] = []
    words[content[i]].append(content[i+1])
Then, when you select words randomly, I suggest adding an if word in words condition to handle the corner case of the last word in the input.
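A minimal sketch of that guard, reusing the question's variables (restarting from a random key on a dead end is one policy among several):
word = random.choice(list(words.keys()))
output = word
for _ in range(sentence_length):
    if word not in words:
        # dead end (e.g. the last scanned word): restart from a random key
        word = random.choice(list(words.keys()))
    else:
        word = random.choice(words[word])
    output = output + " " + word
print(output)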
"midst" is the 101st word in your source text and it is the first time it shows up. When you do this:
words[content[i]].append(content[i+1])
you are making a key:value pair, but you aren't guaranteed that the value also exists as a key. So when you use that value to look up a key, the key doesn't exist and you get a KeyError.
If you change your range to 101 instead of 100, you will see that your program almost works. That is because the 102nd word is "of", which has already occurred in your source text.
It's up to you how you want to deal with this edge case. You could do something like this:
if i == (100-1):
    words[content[i]].append(content[0])
else:
    words[content[i]].append(content[i+1])
which basically loops back around to the beginning of the source text when you get to the end.

output differences between 2 texts when lines are dissimilar

I am relatively new to Python, so apologies in advance for sounding a bit ditzy sometimes. I'll try to google and attempt your tips as much as I can before asking even more questions.
Here is my situation: I am working with R and stylometry to find out the (likely) authorship of a text. What I'd like to do is see whether there is a difference in the stylometry of a novel's second edition, published after one of the (assumed) co-authors died and therefore could not have contributed. In order to research that I need
Text edition 1
Text edition 2
and for python to output
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
And I would like to have the words each time they appear, so not just 'the' once, but every time the program encounters it where it differs from the first edition (yep, I know I'm asking for a lot, sorry).
I have tried approaching this via
file1 = open("FRANKENST18.txt", "r")
file2 = open("FRANKENST31.txt", "r")
file3 = open("frankoutput.txt", "w")
list1 = file1.readlines()
list2 = file2.readlines()
file3.write("here: \n")
for i in list1:
for j in list2:
if i==j:
file3.write(i)
but of course this doesn't work, because the texts are two giant balls of text and not separate lines that can be compared; plus, the first text has far more lines than the second one. Is there a way to go from lines to 'words', or to the text in general, to overcome that? Can I put an entire novel in a string lol? I assume not.
I have also attempted to use difflib, but I've only started coding a few weeks ago and I find it quite complicated. For example, I used fraxel's script as a base for:
from difflib import Differ

s1 = open("FRANKENST18.txt", "r")
s2 = open("FRANKENST31.txt", "r")

def appendBoldChanges(s1, s2):
    #"Adds <b></b> tags to words that are changed"
    l1 = s1.split(' ')
    l2 = s2.split(' ')
    dif = list(Differ().compare(l1, l2))
    return " ".join(['<b>' + i[2:] + '</b>' if i[:1] == '+' else i[2:]
                     for i in dif if not i[:1] in '-?'])

print appendBoldChanges(s1.read(), s2.read())
but I couldn't get it to work.
So my question is: is there any way to output the differences between texts that are not similar line by line, like this? It sounded quite doable, but I've greatly underestimated how difficult I'd find Python haha.
Thanks for reading, any help is appreciated!
EDIT: posting my current code just in case it might help fellow learners that are googling for answers:
file1 = open("1stein.txt")
originaltext1 = file1.read()
wordlist1={}
import string
text1 = [x.strip(string.punctuation) for x in originaltext1.split()]
text1 = [x.lower() for x in text1]
for word1 in text1:
if word1 not in wordlist1:
wordlist1[word1] = 1
else:
wordlist1[word1] += 1
for k,v in sorted(wordlist1.items()):
#print "%s %s" % (k, v)
col1 = ("%s %s" % (k, v))
print col1
file2 = open("2stein.txt")
originaltext2 = file2.read()
wordlist2={}
import string
text2 = [x.strip(string.punctuation) for x in originaltext2.split()]
text2 = [x.lower() for x in text2]
for word2 in text2:
if word2 not in wordlist2:
wordlist2[word2] = 1
else:
wordlist2[word2] += 1
for k,v in sorted(wordlist2.items()):
#print "%s %s" % (k, v)
col2 = ("%s %s" % (k, v))
print col2
what I still hope to edit and output is something like this:
using the dictionaries' key and value system (applied to col1 and col2): {apple 3, bridge 7, chair 5} - {apple 1, bridge 9, chair 5} = {apple 2, bridge -2, chair 0}?
You want to output:
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
Interesting. A set difference is what you need.
import re

s1 = open("FRANKENST18.txt", "r").read()
s2 = open("FRANKENST31.txt", "r").read()
words_s1 = re.findall("[A-Za-z]+", s1)
words_s2 = re.findall("[A-Za-z]+", s2)
set_s1 = set(words_s1)
set_s2 = set(words_s2)
words_in_s1_but_not_in_s2 = set_s1 - set_s2
words_in_s2_but_not_in_s1 = set_s2 - set_s1
words_in_s1 = '\n'.join(words_in_s1_but_not_in_s2)
words_in_s2 = '\n'.join(words_in_s2_but_not_in_s1)
with open("s1_output", "w") as s1_output:
    s1_output.write(words_in_s1)
with open("s2_output", "w") as s2_output:
    s2_output.write(words_in_s2)
Let me know if this isn't exactly what you're looking for, but it seems like you want to iterate through the lines of a file, which you can do very easily in Python. Here's an example, where I omit the newline character at the end of each line and add the lines to a list:
f = open("filename.txt", 'r')
lines = []
for line in f:
lines.append(f[:-1])
Hope this helps!
I'm not completely sure whether you're trying to compare the differences in words as they occur or lines as they occur; however, one way you could do this is with a dictionary. If you want to see which lines change, you could split the lines on periods by doing something like:
text = 'this is a sentence. this is another sentence.'
sentences = text.split('.')
This will split the string you have (which contains the entire text I assume) on the periods and will return an array (or list) of all the sentences.
You can then create a dictionary with dict = {}, loop over each sentence in the previously created array, make it a key in the dictionary with a corresponding value (could be anything since most sentences probably don't occur more than once). After doing this for the first version you can go through the second version and check which sentences are the same. Here is some code that will give you a start (assuming version1 contains all the sentences from the first version):
for sentence in version1:
    dict[sentence] = 1  # put a counter for each sentence
You can then loop over the second version and check if the same sentence is found in the first, with something like:
for sentence in version2:
    if sentence in dict:  # if the sentence is in the dictionary
        pass
        # or do whatever you want here
    else:  # if the sentence isn't
        print(sentence)
Again not sure if this is what you're looking for but hope it helps
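For the per-word count difference sketched in the question's EDIT, collections.Counter can do the bookkeeping. A minimal sketch (Python 3 here; the file names are taken from the question):
import string
from collections import Counter

def word_counts(path):
    # lowercase tokens with surrounding punctuation stripped, then count them
    with open(path) as f:
        words = (w.strip(string.punctuation).lower() for w in f.read().split())
        return Counter(w for w in words if w)

c1 = word_counts("1stein.txt")
c2 = word_counts("2stein.txt")
# Counter subtraction drops zero and negative counts, so build the signed
# difference explicitly over the union of the two vocabularies
diff = {w: c1[w] - c2[w] for w in set(c1) | set(c2)}
for w in sorted(diff):
    print(w, diff[w])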

Python counting occurrences across multiple lines using loops

I want a quick, Pythonic method to give me a count in a loop. I am actually too embarrassed to post my solutions, which are currently not working.
Given a sample from a text file structured as follows:
script7
BLANK INTERRUPTION
script2
launch4.VBS
script3
script8
launch3.VBS
script5
launch1.VBS
script6
I want a count of all the times a script[y] is followed by a launch[X]. launch has a range of values from 1-5, whilst script has a range of 1-15.
Using script3 as an example, I would need a count for each of the following in a given file:
script3
launch1
#count this
script3
launch2
#count this
script3
launch3
#count this
script3
launch4
#count this
script3
launch4
#count this
script3
launch5
#count this
I think the sheer number of loops involved here has surpassed my knowledge of Python. Any assistance would be greatly appreciated.
Why not use a multi-line regex - then the script becomes:
import re

# read all the text of the file, and clean it up
with open('counts.txt', 'rt') as f:
    alltext = '\n'.join(line.strip() for line in f)

# find all occurrences of a script line followed by a launch line
cont = re.findall(r'(?mi)^script(\d+)\nlaunch(\d+)\.VBS$', alltext)

# accumulate the counts of each launch number for each script number
# into nested dictionaries
scriptcounts = {}
for scriptnum, launchnum in cont:
    # if we haven't seen this script number before, create the dictionary for it
    if scriptnum not in scriptcounts:
        scriptcounts[scriptnum] = {}
    # if we haven't seen this launch number with this script number before,
    # initialize its count to 0
    if launchnum not in scriptcounts[scriptnum]:
        scriptcounts[scriptnum][launchnum] = 0
    # increment the count for this combination of script and launch number
    scriptcounts[scriptnum][launchnum] += 1

# produce the output in order of increasing scriptnum/launchnum
for scriptnum in sorted(scriptcounts.keys()):
    for launchnum in sorted(scriptcounts[scriptnum].keys()):
        print "script%s\nlaunch%s.VBS\n# count %d\n" % (scriptnum, launchnum, scriptcounts[scriptnum][launchnum])
The output (in the format you requested) is, for example:
script2
launch1.VBS
# count 1
script2
launch4.VBS
# count 1
script5
launch1.VBS
# count 1
script8
launch3.VBS
# count 3
re.findall() returns a list of all the matches; each match is a tuple of the () parts of the pattern. The (?mi) is a directive telling the regular-expression matcher to work across line ends (\n) and to match case-insensitively. As the pattern stands, the fragment script(\d+) pulls only the digits following the script/launch keyword into the match; it could just as easily include the keyword by being (script\d+), and similarly (launch\d+\.VBS), in which case only the printing would need modification to handle this variation.
HTH
barny
Here is my solution using defaultdict with Counters and regex with lookahead.
import re
from collections import Counter, defaultdict

with open('in.txt', 'r') as f:
    # make sure we have only \n as line end and no leading or trailing whitespace;
    # this makes the regex less complex
    alltext = '\n'.join(line.strip() for line in f)

# find the keyword script\d+ and capture it, then lazily expand and capture everything,
# with a lookahead so that we stop as soon as (and only if) the next word is 'script'
# or the end of the string
scriptPattern = re.compile(r'(script\d+)(.*?)(?=script|\n?$)', re.DOTALL)
# just find everything that matches launch\d+
launchPattern = re.compile(r'launch\d+')
# create a defaultdict with a counter for every entry
scriptDict = defaultdict(Counter)
# go through all matches
for match in scriptPattern.finditer(alltext):
    script, body = match.groups()
    # update the counter of this script
    scriptDict[script].update(launchPattern.findall(body))
# print the results
for script in sorted(scriptDict):
    counter = scriptDict[script]
    if len(counter):
        print('{} launches:'.format(script))
        for launch in sorted(counter):
            count = counter[launch]
            print('\t{} {} time(s)'.format(launch, count))
    else:
        print('{} launches nothing'.format(script))
Using the sample string from the question, I get the following result:
script2 launches:
launch4 1 time(s)
script3 launches nothing
script5 launches:
launch1 1 time(s)
script6 launches nothing
script7 launches nothing
script8 launches:
launch3 1 time(s)
Here's an approach which uses nested dictionaries. Please tell me if you would like the output to be in a different format:
#!/usr/bin/env python3
import re

script_dict = {}
with open('infile.txt', 'r') as infile:
    scriptre = re.compile(r"^script\d+$")
    for line in infile:
        line = line.rstrip()
        if scriptre.match(line) is not None:
            script_dict[line] = {}
    infile.seek(0)  # go back to the beginning
    launchre = re.compile(r"^launch\d+\.[vV][bB][sS]$")
    current = None
    for line in infile:
        line = line.rstrip()
        if line in script_dict:
            current = line
        elif launchre.match(line) is not None and current is not None:
            if line not in script_dict[current]:
                script_dict[current][line] = 1
            else:
                script_dict[current][line] += 1
print(script_dict)
You could use the setdefault method.
code:
dic = {}
with open("a.txt") as inp:
    check = 0
    key_string = ""
    for line in inp:
        if check:
            if line.strip().startswith("launch") and int(line.strip()[6]) < 6:
                print "yes"
                dic[key_string] = dic.setdefault(key_string, 0) + 1
            check = 0
        if line.strip().startswith("script"):
            key_string = line.strip()
            check = 1
For the script3 example input above, the output would be:
output:
{"script3":6}

How to count the Chinese word frequency in a tokenized list?

I am using Python 2.7.
I would like to count the frequency of words in Chinese.
How can I do this with my tokenized list? In the next step I would like to locate where the sentences are.
So, hopefully, I can count the word frequency and also get the starting point and the ending point of each word at the same time.
I tried to count the word frequency from the input file, which has nothing to do with my tokenization, but it also gives me a wrong result.
For the counter part, it shows me this :
Counter({u'\u7684': 1}) , but my expected result is Counter({'的': 27})
#coding=UTF-8
import codecs
import jieba
from collections import Counter

userinput = raw_input('Enter the name of a file')
f = codecs.open(userinput, "r", "UTF-8")
str = f.read()
f.close()

result = jieba.tokenize(str)
for tk in result:
    print "word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2])

with open(userinput) as inf:
    cnt = Counter()
    for word in [u'的']:
        cnt[word] += 1
    print(cnt)
This is not correct:
for word in [u'的']:
    cnt[word] += 1
You need to run your loop over the words in the file:
for word in open(userinput, 'r').read().split():
    cnt[word] += 1
for word in [u'的']:
    cnt[word] += 1
This is the entirety of your accumulation loop. You are looping over the single character u'的'. I assume that's not what you want to do.
Counter works best when you feed it an iterable. Forget this cnt[word] += 1 stuff; that's slow and treats the Counter like a defaultdict. Feed it an entire iterable at once:
cnt = Counter(inf.read().split())
It also seems like you are needlessly opening up this file a second time; since you already tokenized it above into result, why not just:
cnt = Counter(tk[0] for tk in result)
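An end-to-end sketch putting the pieces together (Python 3 here, with jieba assumed installed; note that jieba.tokenize() returns a generator, so it is materialized with list() before being reused):
import jieba
from collections import Counter

with open("input.txt", encoding="utf-8") as f:  # file name assumed
    text = f.read()

tokens = list(jieba.tokenize(text))  # (word, start, end) triples
for word, start, end in tokens:
    print("word %s\t\t start: %d \t\t end: %d" % (word, start, end))

# count every token exactly once, reusing the same token list
cnt = Counter(word for word, start, end in tokens)
print(cnt)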
