How can I use a list comprehension for this example?
l_words = list_word(domain)
words = list(l_words)
l_lines = [line for count, line in enumerate(open(file_found, "rb"))]
lines = list(l_lines)
# from here
for bline in lines:
    for word in words:
        try:
            aline = bline.decode()
            line = return_only_ascii(aline)
            if line == "\n":
                break
            if line.find(word) != -1:
                login += 1
        except UnicodeDecodeError:  # skip lines that cannot be decoded
            continue
# to here
I need to find 83 words in each .txt file; for each line, I need to check whether each word appears in that line.
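For the loop between the markers, a comprehension-style rewrite can replace the manual counting. A minimal, self-contained sketch, with toy `words` and `lines` standing in for the output of the question's `list_word()` helper and the decoded file lines:

```python
# Toy data standing in for the question's word list and decoded file lines
words = ["liberty", "oppression", "duty"]
lines = [
    "he that would make his own liberty secure",
    "must guard even his enemy from oppression",
    "for if he violates this duty",
]

# sum() over a generator expression replaces the nested for-loops:
# one hit is counted for every word that appears in a line
login = sum(1 for line in lines for word in words if word in line)
print(login)  # 3
```

The generator expression keeps the same nested iteration order as the explicit loops, so the count matches what the original `login += 1` would produce.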
Related
def inverted_index(doc):
    words = word_count(doc)
    ln = 0
    for word in words:
        temp = []
        with open(doc) as file:
            for line in file:
                ln += 1
                li = line.split()
                if word in li:
                    temp.append(ln)
        words[word] = temp
    return words
I am trying to create an inverted index from a text file, where words is a dictionary of all ~19,000 unique words in the file. The text file has around 5,000+ lines. I want to iterate through the file and the dictionary to build an inverted index mapping each word to the line numbers on which it appears, but it is taking too long to run because of the nested for loops. Is there a more efficient way to do this?
Here is my approach; please read the notes below the code for some pragmatic tips.
def inverted_index(doc):
    # this will open the file
    file = open(doc, encoding='utf8')
    f = file.read()
    file.seek(0)
    # Get the number of lines in the file
    lines = 1
    for ch in f:  # iterate over characters, counting the newlines
        if ch == '\n':
            lines += 1
    print("Number of lines in file is:", lines)  # just for debugging, please remove in the PROD version
    d = {}
    for i in range(lines):
        line = file.readline()
        l = line.lower().split(' ')
        for item in l:
            if item not in d:
                d[item] = [i + 1]
            else:
                d[item].append(i + 1)
    file.close()
    return d

print(inverted_index('file.txt'))
I would suggest removing stopwords before creating the inverted index for any meaningful analysis. You can use the nltk package for that.
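For the efficiency question itself, a single pass over the file avoids re-reading it once per word, which is where the nested-loop version loses its time. A sketch, assuming the input can be any iterable of strings (an open file object works the same way):

```python
from collections import defaultdict

def inverted_index(lines):
    # One pass over the input: O(total words) instead of
    # O(unique words * lines) for the nested-loop version
    index = defaultdict(list)
    for ln, line in enumerate(lines, start=1):
        # set() records each word at most once per line
        for word in set(line.lower().split()):
            index[word].append(ln)
    return index

doc = ["the cat sat", "the dog ran", "cat and dog"]
print(inverted_index(doc)["cat"])  # [1, 3]
```

Because each word is appended at most once per line and lines are visited in order, every posting list comes out already sorted by line number.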
I am trying to read a quote from a text file and find any duplicated words that appear next to each other. The following is the quote:
"He that would make his own liberty liberty secure,
must guard even his enemy from oppression;
for for if he violates this duty, he
he establishes a precedent that will reach to himself."
-- Thomas Paine
The output should be the following:
Found word: "Liberty" on line 1
Found word: "for" on line 3
Found word: "he" on line 4
I have written the code to read the text from the file, but I am having trouble with the code to identify the duplicates. I have tried enumerating each word in the file and checking whether the word at one index is equal to the word at the following index. However, I am getting an index error because the loop continues past the end of the list. Here's what I've come up with so far:
import string

file_str = input("Enter file name: ")
input_file = open(file_str, 'r')
word_list = []
duplicates = []
for line in input_file:
    line_list = line.split()
    for word in line_list:
        if word != "--":
            word_list.append(word)
for idx, word in enumerate(word_list):
    print(idx, word)
    if word_list[idx] == word_list[idx + 1]:
        duplicates.append(word)
Any help with the current method I'm trying would be appreciated, or suggestions for another method.
When you record the word_list you are losing information about which line the word is on.
Perhaps better would be to determine duplicates as you read the lines.
line_number = 1
for line in input_file:
    line_list = line.split()
    previous_word = None
    for word in line_list:
        if word != "--":
            word_list.append(word)
            if word == previous_word:
                duplicates.append([word, line_number])
            previous_word = word
    line_number += 1
This should do the trick, OP. The for loop over the word list now only goes up to the second-to-last element. This won't keep track of the line numbers though; I would use Phillip Martin's solution for that.
import string

file_str = input("Enter file name: ")
input_file = open(file_str, 'r')
word_list = []
duplicates = []
for line in input_file:
    line_list = line.split()
    for word in line_list:
        if word != "--":
            word_list.append(word)
# Here is the change I made > <
for idx, word in enumerate(word_list[:-1]):
    print(idx, word)
    if word_list[idx] == word_list[idx + 1]:
        duplicates.append(word)
print(duplicates)
Here's another approach.
from itertools import tee
from collections import defaultdict

dups = defaultdict(set)
with open('file.txt') as f:
    for no, line in enumerate(f, 1):
        it1, it2 = tee(line.split())
        next(it2, None)
        for word, follower in zip(it1, it2):
            if word != '--' and word == follower:
                dups[no].add(word)
which yields
>>> dups
defaultdict(<class 'set'>, {1: {'liberty'}, 3: {'for'}})
which is a dictionary that holds a set of pair-duplicates for each line, e.g.
>>> dups[3]
{'for'}
(I don't know why you expect "he" to be found on line four, it is certainly not doubled in your sample file.)
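That cross-line case (the "he" ending line 3 and repeated at the start of line 4) can be caught by carrying the last word of one line over into the next, instead of resetting the previous word per line. A self-contained sketch over the sample quote:

```python
# The sample quote, with the duplicated "he" split across lines 3 and 4
lines = [
    'He that would make his own liberty liberty secure,',
    'must guard even his enemy from oppression;',
    'for for if he violates this duty, he',
    'he establishes a precedent that will reach to himself.',
]

duplicates = []
previous_word = None  # carried across line breaks, not reset per line
for line_number, line in enumerate(lines, 1):
    for word in line.split():
        if word == "--":
            continue
        if word == previous_word:
            duplicates.append((word, line_number))
        previous_word = word

print(duplicates)  # [('liberty', 1), ('for', 3), ('he', 4)]
```

The line number recorded for a cross-line pair is the line on which the second occurrence appears, matching the expected output in the question.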
I open a dictionary file and pull specific lines; the lines are specified using a list, and at the end I need to print a complete sentence on one line.
I want to open a dictionary file that has one word on each line,
then print a sentence on one line with a space between the words:
N = ['19','85','45','14']
file = open("DICTIONARY", "r")
my_sentence = #?????????
print(my_sentence)
If your DICTIONARY is not too big (i.e. can fit your memory):
N = [19, 85, 45, 14]
with open("DICTIONARY", "r") as f:
    words = f.readlines()
my_sentence = " ".join([words[i].strip() for i in N])
EDIT: A small clarification, the original post didn't use space to join the words, I've changed the code to include it. You can also use ",".join(...) if you need to separate the words by a comma, or any other separator you might need. Also, keep in mind that this code uses zero-based line index so the first line of your DICTIONARY would be 0, the second would be 1, etc.
UPDATE: If your dictionary is too big for your memory, or you just want to consume as little memory as possible (if that's the case, why would you go for Python in the first place? ;)), you can extract only the words you're interested in:
N = [19, 85, 45, 14]
words = {}
word_indexes = set(N)
counter = 0
with open("DICTIONARY", "r") as f:
    for line in f:
        if counter in word_indexes:
            words[counter] = line.strip()
        counter += 1
my_sentence = " ".join([words[i] for i in N])
You can use linecache.getline to fetch the specific line numbers you want:
import linecache

sentence = []
for line_number in N:
    word = linecache.getline('DICTIONARY', line_number)
    sentence.append(word.strip('\n'))
sentence = " ".join(sentence)
Here's a simple one with a more basic approach:
n = ['2', '4', '7', '11']
file = open("DICTIONARY")
counter = 1  # 1 if you count lines in DICTIONARY from 1, else use 0
output = ""
for line in file:
    line = line.rstrip()  # rstrip() deletes the \n character; without it, every word prints from a new line
    if str(counter) in n:
        output += line + " "
    counter += 1
print(output[:-1])  # slicing deletes the trailing space after the last word (optional)
I am relatively new to Python and I am currently working on a compression program that uses a list of the positions of words in a sentence and a list of the words that make up the sentence. So far I have written my program as two functions. The first function, 'compress', gets the words that make up the sentence and the positions of those words. My second function, 'recreate', uses the lists to recreate the sentence. The recreated sentence is then stored in a file called recreate.txt. My issue is that the positions and the words are not being written to their respective files, and the recreate.txt file is not being created or written to. Any help would be greatly appreciated. Thanks :)
sentence = input("Input the sentence that you wish to be compressed")
sentence.lower()
sentencelist = sentence.split()
d = {}
plist = []
wds = []

def compress():
    for i in sentencelist:
        if i not in wds:
            wds.append(i)
    for i, j in enumerate(sentencelist):
        if j in d:
            plist.append(d[j])
        else:
            plist.append(i)
    print(plist)
    tsk3pos = open("tsk3pos.txt", "wt")
    for item in plist:
        tsk3pos.write("%s\n" % item)
    tsk3pos.close()
    tsk3wds = open("tsk3wds.txt", "wt")
    for item in wds:
        tsk3wds.write("%s\n" % item)
    tsk3wds.close()
    print(wds)

def recreate(compress):
    compress()
    num = list()
    wds = list()
    with open("tsk3wds.txt", "r") as txt:
        for line in txt:
            words += line.split()
    with open("tsk3pos.txt", "r") as txt:
        for line in txt:
            num += [int(i) for i in line.split()]
    recreate = ' '.join(words[pos] for pos in num)
    with open("recreate.txt", "wt") as txt:
        txt.write(recreate)
UPDATED
I have fixed all the other problems except the recreate function, which will not create the recreate.txt file and will not recreate the sentence from the words, although it does recreate the sentence of positions.
def recreate(compress):  # function that will be used to recreate the compressed sentence
    compress()
    num = list()
    wds = list()
    with open("words.txt", "r") as txt:  # with statement opening the word text file
        for line in txt:  # iterating over each line in the text file
            words += line.split()  # turning the text file into a list
    with open("tsk3pos.txt", "r") as txt:
        for line in txt:
            num += [int(i) for i in line.split()]
    recreate = ' '.join(wds[pos] for pos in num)
    with open("recreate.txt", "wt") as txt:
        txt.write(recreate)
    main()

def main():
    print("Do you want to compress an input or recreate a compressed input?")
    user = input("Type 'a' if you want to compress an input. Type 'b' if you want to recreate an input").lower()
    if user not in ("a", "b"):
        print("That's not an option. Please try again")
        main()
    elif user == "a":
        compress()
    elif user == "b":
        recreate(compress)

main()
A simpler (yet less efficient) approach:
recreate_file_object = open("C:/FullPathToWriteFolder/recreate.txt", "w")
recreate_file_object.write(recreate)
recreate_file_object.close()
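On the remaining recreate bug itself: the updated function reads the file into `words` but joins with `wds`, so a NameError stops it before recreate.txt is ever written. A sketch using one consistent name throughout (the filenames follow the question's compress() output):

```python
def recreate():
    # One consistent list name; the question's version read into
    # `words` but indexed `wds`, which raises a NameError before
    # the output file is written
    words = []
    positions = []
    with open("tsk3wds.txt") as txt:
        for line in txt:
            words += line.split()
    with open("tsk3pos.txt") as txt:
        for line in txt:
            positions += [int(i) for i in line.split()]
    sentence = " ".join(words[pos] for pos in positions)
    with open("recreate.txt", "w") as txt:
        txt.write(sentence)
    return sentence
```

Returning the sentence as well as writing it makes the function easy to check interactively.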
I would like to define a function scaryDict() which takes one parameter (a text file name) and returns the words from the text file in alphabetical order (basically produce a dictionary), but does not print any one- or two-letter words.
Here is what I have so far... it isn't much, but I don't know the next step:
def scaryDict(fileName):
    inFile = open(fileName, 'r')
    lines = inFile.read()
    line = lines.split()
    myDict = {}
    for word in inFile:
        myDict[words] = []
    # I am not sure what goes between the line above and below
    for x in lines:
        print(word, end='\n')
You are doing fine up to line = lines.split(). But your for loop must iterate over the line list, not over inFile.
for word in line:
    if len(word) > 2:  # make sure to check the word length!
        myDict[word] = 'something'
I'm not sure what you want in the dictionary (maybe the word count?), but once you have it, you can get the words you added to it with
allWords = list(myDict.keys())  # so allWords is now a list of words
And then you can sort allWords to get them in alphabetical order.
allWords.sort()
I would store all of the words into a set (to eliminate dups), then sort that set:
#!/usr/bin/python3
def scaryDict(fileName):
    with open(fileName) as inFile:
        return sorted(set(word
                          for line in inFile
                          for word in line.split()
                          if len(word) > 2))

scaryWords = scaryDict('frankenstein.txt')
print('\n'.join(scaryWords))
Also keep in mind that, as of Python 2.5, file objects support the with statement (via their enter and exit methods), which can prevent some issues (such as the file never getting closed):
with open(...) as f:
    for line in f:
        <do something with line>
Build a unique set, sort the set, and now you can put it all together.
Sorry that I am 3 years late :) Here is my version:
def scaryDict():
    infile = open('filename', 'r')
    content = infile.read()
    infile.close()
    table = str.maketrans('.`/()|,\';!:"?=-', 15 * ' ')
    content = content.translate(table)
    words = content.split()
    new_words = list()
    for word in words:
        if len(word) > 2:
            new_words.append(word)
    new_words = list(set(new_words))
    new_words.sort()
    for word in new_words:
        print(word)