I want to store the words of a text file in a dictionary.
My code is
word = 0
char = 0
i = 0
a = 0
d = {}
with open("m.txt", "r") as f:
    for line in f:
        w = line.split()
        d[i] = w[a]
        i = i + 1
        a = a + 1
        word = word + len(w)
        char = char + len(line)
print(word, char)
print(d)
My text file is:
jdfjdnv dj g gjv,kjvbm
But the problem is that the dictionary stores only the first word of the text file. How do I store the rest of the words? Please help.
How many lines does your text file have? If it only has one line, your loop executes only once: it splits the whole line into separate words, then saves just one of them in the dict. If you want to save all the words from a one-line file, you need an inner loop over the words, like this:
for word in line.split():
    d[i] = word
    i += 1
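Folded into the original program, the fix looks like this (a sketch; the helper name words_to_dict is mine, not from the question):

```python
def words_to_dict(lines):
    """Number every word across all lines, not just one word per line."""
    d = {}
    i = 0
    for line in lines:
        for word in line.split():
            d[i] = word
            i += 1
    return d

# with open("m.txt") as f:
#     print(words_to_dict(f))
```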
You only store the first word because you only have one line in the file, and your only for loop is over the lines.
Generally, if you are going to key the dictionary by index, you can just use the list you are already making:
w = []
char = 0
with open("m.txt", "r") as f:
    for line in f:
        char += len(line)
        w.extend(line.split())
word = sum(map(len, w))
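If you still need an index-keyed dict afterwards, you can build it straight from that list with enumerate (the values here are the sample line from the question):

```python
w = ["jdfjdnv", "dj", "g", "gjv,kjvbm"]  # the word list built above
d = dict(enumerate(w))
print(d)  # {0: 'jdfjdnv', 1: 'dj', 2: 'g', 3: 'gjv,kjvbm'}
```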
Hello I am struggling with this:
I have a txt file that once open looks like this:
Jim, task1\n
Marc, task3\n
Tom, task4\n
Jim, task2\n
Jim, task6\n
And I want to check how many duplicate names there are. I am interested only in the first field (i.e. the person's name).
I tried to look for an answer on this website, but I could not find anything that helped: in my case I don't know in advance which name is duplicated, since this txt file will be updated frequently.
As I am new to Python/programming, is there a simple way to solve this without using any dictionaries or list comprehensions, and without importing modules?
Thank you
same_word_count = 0
with open('tasks.txt', 'r') as file2:
    content = file2.readlines()
for line in content:
    split_data = line.split(', ')
    user = split_data[0]
    word = user
    if word == user:
        same_word_count -= 1
print(same_word_count)
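The attempt above always compares a value with itself (word = user, then word == user), so the counter just moves once per line. Here is a sketch that actually counts duplicate first fields, using only lists as the question requires (the function name count_duplicate_names is mine):

```python
def count_duplicate_names(lines):
    """Count lines whose first comma-separated field was already seen."""
    duplicates = 0
    seen = []  # a plain list, per the no-dictionaries constraint
    for line in lines:
        if not line.strip():
            continue
        name = line.split(',')[0].strip()
        if name in seen:
            duplicates += 1
        else:
            seen.append(name)
    return duplicates

# with open('tasks.txt') as f:
#     print(count_duplicate_names(f))
```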
You can do the following.
word = "Word"  # word you want to count
count = 0
with open("temp.txt", 'r') as f:
    for line in f:
        words = line.split()
        for i in words:
            if i == word:
                count = count + 1
print("Occurrences of the word", word, ":", count)
Or you can count the occurrences of every word:
# Open the file in read mode; the with statement closes it automatically
with open("sample.txt", "r") as text:
    # Create an empty dictionary
    d = dict()
    # Loop through each line of the file
    for line in text:
        # Remove the leading spaces and newline character
        line = line.strip()
        # Convert the characters in line to
        # lowercase to avoid case mismatch
        line = line.lower()
        # Split the line into words
        words = line.split(" ")
        # Iterate over each word in line
        for word in words:
            # Check if the word is already in dictionary
            if word in d:
                # Increment count of word by 1
                d[word] = d[word] + 1
            else:
                # Add the word to dictionary with count 1
                d[word] = 1
# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])
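For what it's worth, the standard library already implements this counting pattern; a shorter equivalent with collections.Counter, sketched on an iterable of lines:

```python
from collections import Counter

def count_words(lines):
    """Case-insensitive word counts across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# with open("sample.txt") as f:
#     for word, n in count_words(f).most_common():
#         print(word, ":", n)
```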
I have a dictionary dict with some words (2000) and I have a huge text, like Wikipedia corpus, in text format. For each word that is both in the dictionary and in the text file, I would like to replace it with word_1.
with open("wiki.txt", 'r') as original, open("new.txt", 'w') as mod:
    for line in original:
        new_line = line
        for word in line.split():
            if dict.get(word.lower()) is not None:
                new_line = new_line.replace(word, word + "_1")
        mod.write(new_line)
This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.
This works for short files, but for the longer that I am using as input, it "freezes" my computer.
Is there a more efficient way to do that?
Edit for Adi219:
Your code seems to work, but there is a problem:
if a line is like Albert is a friend of Albert and my dictionary contains Albert, after the for loop the line becomes Albert_1_1 is a friend of Albert_1. How can I replace only the exact word I want, to avoid repetitions like _1_1_1_1?
Edit2:
To solve the previous problem, I changed your code:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                mod.write(word + "_1 ")
            else:
                mod.write(word + " ")
        mod.write("\n")
Now everything should work
A few things:
You could remove the declaration of new_line. Then, change the new_line = new_line.replace(...) line to line = line.replace(...). You would also have to write(line) afterwards.
You could add words = line.split() and use for word in words: for the for loop, as this removes a call to .split() for every iteration through the words.
You could (manually(?)) split your large .txt file into multiple smaller files and have multiple instances of your program running on each file, and then you could combine the multiple outputs into one file. Note: You would have to remember to change the filename for each file you're reading/writing to.
So, your code would look like:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                line = line.replace(word, word + "_1")
        mod.write(line)
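Separately, the _1_1 problem discussed above comes from substring replacement with str.replace. A sketch of a whole-word approach with re, which makes a single pass per line (the names tag_known_words and vocab are mine; vocab stands in for your dict's keys):

```python
import re

def tag_known_words(text, vocab):
    """Append '_1' to each whole word whose lowercase form is in vocab."""
    def repl(match):
        word = match.group(0)
        return word + "_1" if word.lower() in vocab else word
    # \w+ matches whole word tokens, so "Albert" is only tagged once
    return re.sub(r"\w+", repl, text)

print(tag_known_words("Albert is a friend of Albert", {"albert"}))
# Albert_1 is a friend of Albert_1
```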
I am trying to stem words from a file that contains about 90000 lines (each line has three to several hundred words).
I want to append the lines to a list after stemming the words. I was able to insert the stemmed words into a list, which contains one line. I want to insert the words into the list while keeping the 90000 lines. Any ideas?
clean_sentence = []
with open(folder_path + text_file_name, 'r', encoding='utf-8') as f:
    for line in f:
        sentence = line.split()
        for word in sentence:
            if word.endswith('er'):
                clean_sentence.append(word[:-2])
            else:
                clean_sentence.append(word)
x = ' '.join(clean_sentence)
with open('StemmingOutFile.txt', 'w', encoding="utf8") as StemmingOutFile:
    StemmingOutFile.write(x)
The file is not in English, but here is an example that illustrates the issue at hand: current code yields:
why don't you like to watch TV? are there any more fruits? why not?
I want the output file to be:
why don't you like to watch TV?
are there any more fruits?
why not?
Read the file into lines:
with open('file.txt', 'r') as f:
    lines = f.read().splitlines()
and then do the stemming:
new_lines = []
for line in lines:
    new_lines.append(' '.join(stemmed(word) for word in line.split()))
where stemmed is a function as follows:
def stemmed(word):
    return word[:-2] if word.endswith('er') else word
Then write each line of new_lines in StemmingOutFile.txt.
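Put together, the whole answer can be sketched like this (stem_lines is my name; the stemming rule is the naive trailing-'er' one from the question):

```python
def stemmed(word):
    # naive rule from the question: drop a trailing 'er'
    return word[:-2] if word.endswith('er') else word

def stem_lines(lines):
    """Stem every word while keeping the original line structure."""
    return [' '.join(stemmed(word) for word in line.split()) for line in lines]

# with open('file.txt') as f, open('StemmingOutFile.txt', 'w') as out:
#     out.write('\n'.join(stem_lines(f.read().splitlines())))
```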
I am attempting to count the number of 'difficult words' in a file, which requires me to count the number of letters in each word. For now, I am only trying to get single words, one at a time, from a file. I've written the following:
file = open('infile.txt', 'r+')
fileinput = file.read()
for line in fileinput:
    for word in line.split():
        print(word)
Output:
t
h
e
o
r
i
g
i
n
.
.
.
It seems to be printing one character at a time instead of one word at a time. I'd really like to know more about what is actually happening here. Any suggestions?
Use splitlines():
fopen = open('infile.txt', 'r+')
fileinput = fopen.read()
for line in fileinput.splitlines():
    for word in line.split():
        print(word)
fopen.close()
Without splitlines():
You can also use a with statement to open the file. It closes the file automagically:
with open('infile.txt', 'r+') as fopen:
    for line in fopen:
        for word in line.split():
            print(word)
A file object supports the iteration protocol, which for bigger files is much better than reading the whole content into memory in one go:
with open('infile.txt', 'r+') as f:
    for line in f:
        for word in line.split():
            print(word)
Assuming you are going to define a filter function, you could do something along the lines of:
def is_difficult(word):
    return len(word) > 5

with open('infile.txt', 'r+') as f:
    words = (w for line in f for w in line.split() if is_difficult(w))
    for w in words:
        print(w)
which, with an input file of
ciao come va
oggi meglio di domani
ieri peggio di oggi
produces
meglio
domani
peggio
Your code gives you single characters because you called .read(), which stores the whole content as a single string, so when you do for line in fileinput you are iterating over that string char by char. There is no good reason to use read() plus splitlines(), since you can simply iterate over the file object; if you do want a list of lines, call readlines().
If you want to group words by length, use a dict with the length of the word as the key. You will also want to remove punctuation from the words, which you can do with str.strip:
def words(n, fle):
    from collections import defaultdict
    from string import punctuation
    d = defaultdict(list)
    with open(fle) as f:
        for line in f:
            for word in line.split():
                word = word.strip(punctuation)
                _len = len(word)
                if _len >= n:
                    d[_len].append(word)
    return d
Your dict will contain all the words in the file grouped by length and all at least n characters long.
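The same grouping logic on an in-memory list of lines, so the result is easy to see (words_by_length is my name for this variant):

```python
from collections import defaultdict
from string import punctuation

def words_by_length(n, lines):
    """Group words of at least n characters by their length."""
    d = defaultdict(list)
    for line in lines:
        for word in line.split():
            word = word.strip(punctuation)  # drop surrounding punctuation
            if len(word) >= n:
                d[len(word)].append(word)
    return d

print(dict(words_by_length(5, ["the origin, of species!"])))
# {6: ['origin'], 7: ['species']}
```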
I have a two file: the first one includes terms and their frequency:
table 2
apple 4
pencil 89
The second file is a dictionary:
abroad
apple
bread
...
I want to check whether the first file contains any words from the second file. For example, both the first file and the second file contain "apple".
I am new to Python.
I tried something but it does not work. Could you help me? Thank you.
for line in dictionary:
    words = line.split()
    print words[0]
    for line2 in test:
        words2 = line2.split()
        print words2[0]
Something like this:
with open("file1") as f1, open("file2") as f2:
    words = set(line.strip() for line in f1)  # create a set of words from dictionary file
    # why sets? sets provide an O(1) lookup, so overall complexity is O(N)
    # now loop over each line of other file (word, freq file)
    for line in f2:
        word, freq = line.split()  # fetch word, freq
        if word in words:  # if word is found in words set then print it
            print word
output:
apple
This may help you:
file1 = set(line.strip() for line in open('file1.txt'))
file2 = set(line.strip() for line in open('file2.txt'))
for line in file1 & file2:
    if line:
        print line
Here's what you should do:
First, you need to put all the dictionary words in some place where you can easily look them up. If you don't do that, you'd have to read the whole dictionary file every time you want to check one single word in the other file.
Second, you need to check if each word in the file is in the words you extracted from the dictionary file.
For the first part, you need either a list or a set. The difference between the two is that a list keeps the items in the order you inserted them, while a set is unordered, so it doesn't matter which word you read first from the dictionary file. A set is also faster to look items up in, because that is what it is built for.
To see if an item is in a set, you can do: item in my_set which is either True or False.
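A tiny illustration of that membership test, using words from the dictionary file above:

```python
dictionary_words = {"abroad", "apple", "bread"}  # built from the dictionary file
print("apple" in dictionary_words)  # True
print("table" in dictionary_words)  # False
```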
I have your first double list in try.txt and the single list in try_match.txt
f = open('try.txt', 'r')
f_match = open('try_match.txt', 'r')
dictionary = []
for line in f:
    a, b = line.split()
    dictionary.append(a)
for line in f_match:
    if line.split()[0] in dictionary:
        print line.split()[0]