Reading a text file and replacing each word with its value in a dictionary - python

I have a dictionary made in Python. I also have a text file where each line is a different word. I want to check each line of the text file against the keys of the dictionary, and if the line matches a key I want to write that key's value to an output file. Is there an easy way to do this? Is this even possible?
For example, I am reading my file in like this:
import os
test = open(os.path.expanduser("~/Documents/testfile.txt")).read()
After tokenising it, I want to look up each word token in a dictionary. My dictionary is set up like this:
dic = {"a": ["ah0", "ey1"], "a's": ["ey1 z"], "a.": ["ey1"], "a.'s": ["ey1 z"]}
If I come across the letter 'a' in my file, I want it to output ["ah0", "ey1"].

You can try:
with open("testfile.txt") as f:
    all_lines = f.readlines()
for line in all_lines:
    word = line.strip()  # drop the trailing newline
    if word in dic:
        print(dic[word])
This looks at every line in the file and, if the line matches a key in dic, prints the list associated with that key. Note that readlines() has to be called on the file object (not on the string returned by read()) to get all the lines as a list. dic[word] gives the list assigned to the key, e.g. ["ah0", "ey1"], so you do not have to just print it; you can use it elsewhere, for example to write it to an output file.

You can give this a try:
# Dictionary whose keys are matched against the words in the text file
dic = {"a": ["ah0", "ey1"], "a's": ["ey1 z"], "a.": ["ey1"], "a.'s": ["ey1 z"]}
# Read the words from the text file, one per line
with open('sampletext.txt', 'r') as in_file:
    lines = in_file.readlines()
# Look up each word; if it is a key in the dictionary, write its list to the output file
with open('outputfile.txt', 'w') as out_file:
    for line in lines:
        word = line.strip()  # remove the trailing newline "\n"
        if word in dic:
            out_file.write(str(dic[word]) + '\n')
Note: each line read from the file ends with a newline "\n", which has to be stripped before the lookup. The output file is opened once, outside the loop, because opening it in 'w' mode for every match would overwrite the earlier results.
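The question also mentions tokenising the text rather than reading one word per line. A minimal sketch of that variant follows; the function name lookup_tokens, the sample text, and the choice to lowercase each token before the lookup are assumptions for illustration, not part of the original posts.

```python
dic = {"a": ["ah0", "ey1"], "a's": ["ey1 z"], "a.": ["ey1"], "a.'s": ["ey1 z"]}

def lookup_tokens(text, mapping):
    """Return (token, value) pairs for each whitespace-separated token that is a key.

    Assumption: tokens are lowercased before the lookup, so "A" matches the key "a".
    """
    pairs = []
    for token in text.split():
        key = token.lower()
        if key in mapping:
            pairs.append((token, mapping[key]))
    return pairs

# Hypothetical file contents, standing in for the string returned by read()
sample_text = "A cat\na's\nunknown\n"
matches = lookup_tokens(sample_text, dic)
for token, phones in matches:
    print(token, phones)
```

Splitting on whitespace keeps tokens such as "a's" and "a." intact, which matters here because the dictionary keys contain punctuation.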

Related

How to make a dictionary from a given txt file?

Task: given a txt file with adjective \t synonym, synonym, synonym, etc. in a line, several lines are given. I need to create a dictionary, where adjective will be a key and synonyms - a value. My code:
#necessary for command line + regex
import sys
import re
#open file for reading
filename = sys.argv[1]
infile = open(filename, "r")
#a
#create a dictionary, where an adjective in a line is a key
#and synonyms are the value
dictionary = {}
#for each line in infile
for line in infile:
    #creating a list with keys, a key is everything before the tab
    adjectives = re.findall(r"w+\t$", line)
    print(adjectives)
    #creating a list of values, a value is everything after the tab
    synonyms = re.findall(r"^\tw+\n$", line)
    print(synonyms)
    #combining both lists into a dictionary, where adj are keys, synonyms - values
    dictionary = dict(zip(adjectives, synonyms))
    print(dictionary)
#close the file
infile.close()
The output shows only empty brackets. Could someone help me fix this?
Instead of regular expressions, use split() to split strings using delimiters. First split it using \t to separate the adjective from the synonyms, then split the synonyms into a list using ,.
Then you need to add a new key in the dictionary, not replace the entire dictionary.
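for line in infile:
    line = line.strip()  # remove the trailing newline
    if not line:
        continue  # skip blank lines
    adjective, synonyms = line.split("\t", 1)
    # split on commas and drop the space that follows each comma
    dictionary[adjective] = [s.strip() for s in synonyms.split(",")]
print(dictionary)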
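For anyone who still prefers regular expressions: the original patterns return empty lists because w+ (without a backslash) matches only the literal letter "w", and \t$ requires the tab to be the last character on the line. A minimal sketch of a working pattern is below; the names LINE_PATTERN and parse_line and the sample line are made up for illustration.

```python
import re

# One group for everything before the first tab, one for everything after it
LINE_PATTERN = re.compile(r"^([^\t]+)\t(.*)$")

def parse_line(line):
    """Return (adjective, [synonyms]) for a tab-separated line, or None if no tab is present."""
    match = LINE_PATTERN.match(line.rstrip("\n"))
    if match is None:
        return None
    adjective, rest = match.groups()
    return adjective, [s.strip() for s in rest.split(",")]

# Hypothetical input line in the "adjective \t synonym, synonym" format
parsed = parse_line("happy\tglad, joyful, cheerful\n")
print(parsed)
```

Each parsed pair can then be stored with dictionary[adjective] = synonyms, exactly as in the split() version.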

JSON File: Separate Word Count for Different Objects with Python

For a current research project, I am planning to count the unique words of different objects in a JSON file. Ideally, the output file should show separate word count summaries (counting the occurrence of unique words) for the texts in "Text Main", "Text Pro" and "Text Con". Is there any smart tweak to make this happen?
At the moment, I am receiving the following error message:
File "index.py", line 10, in <module>
text = data["Text_Main"]
TypeError: list indices must be integers or slices, not str
The JSON file has the following structure:
[
{"Stock Symbol":"A",
"Date":"05/11/2017",
"Text Main":"Text sample 1",
"Text Pro":"Text sample 2",
"Text Con":"Text sample 3"}
]
And the corresponding code looks like this:
# Import relevant libraries
import string
import json
import csv
import textblob
# Open JSON file and slice by object
file = open("Glassdoor_A.json", "r")
data = json.load(file)
text = data["Text_Main"]
# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Remove the punctuation marks from the line
    line = line.translate(line.maketrans("", "", string.punctuation))
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1
# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])
# Save results as CSV
with open('Glassdoor_A.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["Word", "Occurrences", "Percentage"])
    writer.writerows([key, d[key]] for key in d)
Well, firstly the key should be "Text Main" and secondly you need to access the first dict in the list. So just extract the text variable like this:
text = data[0]["Text Main"]
This should fix the error message.
Your JSON file has an object inside a list. To access the content you want, first access the object via data[0]; then you can access the string field. Note that the key is "Text Main" (with a space), not "Text_Main". I would change the code to:
# Open JSON file and slice by object
with open("Glassdoor_A.json", "r") as file:
    data = json.load(file)
json_obj = data[0]
text = json_obj["Text Main"]
Or you can access that field in a single line with text = data[0]["Text Main"], as quamrana stated.
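For the separate summaries the question asks about, one possible approach is to keep one collections.Counter per field. This is only a sketch under assumptions: the function name count_words_per_field is made up, the sample data is the structure shown in the question, and every record is assumed to be a dict with string values.

```python
import json
import string
from collections import Counter

FIELDS = ["Text Main", "Text Pro", "Text Con"]

def count_words_per_field(records, fields=FIELDS):
    """Return {field: Counter of lowercased, punctuation-free words} over all records."""
    counts = {field: Counter() for field in fields}
    strip_punct = str.maketrans("", "", string.punctuation)
    for record in records:
        for field in fields:
            # A missing field is treated as empty text
            text = record.get(field, "").lower().translate(strip_punct)
            counts[field].update(text.split())
    return counts

# The JSON structure from the question, inlined here instead of read from Glassdoor_A.json
sample_json = """[
  {"Stock Symbol": "A",
   "Date": "05/11/2017",
   "Text Main": "Text sample 1",
   "Text Pro": "Text sample 2",
   "Text Con": "Text sample 3"}
]"""
word_counts = count_words_per_field(json.loads(sample_json))
for field, counter in word_counts.items():
    print(field, dict(counter))
```

Each Counter can then be written to its own CSV file, or to one file with an extra column naming the field.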

For any user-given text file, the program will read, analyze, and write each word with the line numbers where the word is found in an output file

For any user-given text file, the program will read, analyze, and write each word with the line numbers where the word is found in an output file. A word may appear in multiple lines. If a word appears more than once on a line, the line number is recorded only once.
Ask the user to enter the name of a text file, using try/except to handle invalid input. The program then reads the contents of the text file and creates a dictionary in which the key-value pairs are as follows:
Key. The keys are the individual words found in the file.
Value. Each value is a list that contains the line numbers in the file where the word (the key) is found. Be aware that a list may have only one element.
Once the dictionary has been built, the program should create another text file, named “word_index.txt”. Next, write the contents of the dictionary to the file as an alphabetical listing of the words that are stored as keys in the dictionary (sorting the keys), along with the line numbers where the words appear in the original file.
See my code below:
import string
fname = input('Enter a file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
counts = dict()
L_N=0
for line in fhand:
    line= line.rstrip()
    line = line.translate(line.maketrans(' ', ' ',string.punctuation))
    line = line.lower()
    words = line.split()
    L_N+=1
    for word in words:
        if word not in counts:
            counts[word]= [L_N]
        else:
            if L_N not in counts[word]:
                counts[word].append(L_N)
for h in range(len(counts)):
    print(counts)
out_file = open('word_index.txt', 'w')
out_file.write('Text file being analyzed is: '+str(fname)+ '\n\n')
out_file.close()
The outcome should print the results once but I am having an issue where it is printing multiple times at once.
Since you print the whole dictionary inside a for loop, it is printed n times, where n is the length of the dictionary. So you just need to remove this loop:
for h in range(len(counts)):
    print(counts)
And add this instead:
# here we just loop through the dictionary to get every key-value pair
for key, value in counts.items():
    print('key: ', key, 'value: ', value)
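The remaining part of the task, the alphabetical listing, could be sketched roughly as follows. The function names build_word_index and format_index, the "word: line numbers" output layout, and the sample lines are assumptions; the assignment text does not fix an exact format.

```python
import string

def build_word_index(lines):
    """Map each lowercased, punctuation-free word to the line numbers it appears on."""
    index = {}
    strip_punct = str.maketrans("", "", string.punctuation)
    for line_number, line in enumerate(lines, start=1):
        for word in line.lower().translate(strip_punct).split():
            numbers = index.setdefault(word, [])
            # Record each line number only once per word
            if line_number not in numbers:
                numbers.append(line_number)
    return index

def format_index(index):
    """Return one 'word: n1 n2 ...' string per word, in alphabetical order."""
    return [f"{word}: {' '.join(map(str, index[word]))}" for word in sorted(index)]

# Hypothetical file contents, standing in for the lines of the user's file
sample_lines = ["The cat sat.\n", "The dog sat; the dog barked.\n"]
index = build_word_index(sample_lines)
report = format_index(index)
for entry in report:
    print(entry)
```

Writing the report to word_index.txt is then a matter of calling out_file.write(entry + '\n') for each entry instead of print.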

Replace words of a long document in Python

I have a dictionary dict with some words (2000) and I have a huge text, like Wikipedia corpus, in text format. For each word that is both in the dictionary and in the text file, I would like to replace it with word_1.
with open("wiki.txt",'r') as original, open("new.txt",'w') as mod:
    for line in original:
        new_line = line
        for word in line.split():
            if (dict.get(word.lower()) is not None):
                new_line = new_line.replace(word,word+"_1")
        mod.write(new_line)
This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.
This works for short files, but for the longer file that I am using as input, it "freezes" my computer.
Is there a more efficient way to do that?
Edit for Adi219:
Your code seems to work, but there is a problem:
If a line looks like this: Albert is a friend of Albert, and my dictionary contains Albert, then after the for loop the line will look like this: Albert_1_1 is a friend of Albert_1. How can I replace only the exact word that I want, to avoid repetitions like _1_1_1_1?
Edit2:
To solve the previous problem, I changed your code:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                mod.write(word+"_1 ")
            else:
                mod.write(word+" ")
        mod.write("\n")
Now everything should work
A few things:
You could remove the declaration of new_line. Then, change new_line = new_line.replace(...) line with line = line.replace(...). You would also have to write(line) afterwards.
You could add words = line.split() and use for word in words: for the for loop. This is mostly a readability change, since line.split() in the loop header is evaluated only once either way.
You could (manually(?)) split your large .txt file into multiple smaller files and have multiple instances of your program running on each file, and then you could combine the multiple outputs into one file. Note: You would have to remember to change the filename for each file you're reading/writing to.
So, your code would look like:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                line = line.replace(word, word + "_1")
        mod.write(line)
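Another option, swapped in here as an alternative to str.replace, is a single re.sub pass per line with a replacement function. Every word is visited exactly once, so the repeated _1_1 suffix cannot occur, and the original spacing of the line is preserved. This is a sketch under assumptions: the names vocabulary, WORD and tag_line are made up, and the small set stands in for the 2000-word dictionary.

```python
import re

# Hypothetical stand-in for the dictionary from the question; only its keys matter here
vocabulary = {"albert", "friend"}

WORD = re.compile(r"\w+")

def tag_line(line, vocab):
    """Append '_1' to every word of the line whose lowercase form is in vocab."""
    def add_suffix(match):
        word = match.group(0)
        return word + "_1" if word.lower() in vocab else word
    return WORD.sub(add_suffix, line)

tagged = tag_line("Albert is a friend of Albert\n", vocabulary)
print(tagged, end="")
```

In the file loop this would become mod.write(tag_line(line, dict)), since a membership test on a dict checks its keys. Note that \w+ treats punctuation as a word boundary, so "Albert," is matched as "Albert", unlike the split()-based versions.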

Reading text file into dic file results in incomplete dic file

The following file 2016_01_22_Reps.txt is a list of expanded contractions that I want to put into a python dic file;
“can't":"cannot","could've":"could have","could've":"could have","didn't":"did not","doesn't":"does not", “don't":"do not"," hadn't":"had not", "hasn't":"has not","haven't":"have not","I'll":"I will","I'm":"I am","I've":"I have","isn't":"is not","I'll":"I
Note that the contents are a single line, not multiple lines.
My code is as follows;
reps = open('2016_01_22_Reps.txt', 'r')
Reps1dic={}
for line in reps:
    x=line.split(",")
    a=x[0]
    b=x[1]
    c=len(b)-1
    b=b[0:c]
    Reps1dic[a]=b
print (Reps1dic)
The output to Reps1dic stops after first two pairs of contractions. Contents are as follows;
{‘2016_01_22Reps = {“can\’t”:”cannot”‘ : ‘”could\’ve”:”could have’}
Instructions and explanation of why the complete file contents are not written to the dic file will be most appreciated.
The problem is that your values are all on one line, so your for line in reps loop only goes through one iteration. Do something like this:
with open('2016_01_22_Reps.txt', 'r') as reps:
    Reps1dic={}
    contents = reps.read()
    pairs = contents.split(',')
    for pair in pairs:
        parts = pair.split(':')
        a = parts[0].replace('"', '').strip()
        b = parts[1].replace('"', '').strip()
        Reps1dic[a] = b
print(Reps1dic)
Here you split the single line into pairs and then iterate over that list instead of over the lines in the file. I also used the with keyword to open your file; it's much better practice.
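The sample in the question also contains curly opening quotes (“) and a stray space inside one pair of quotes. A variant that strips those as well might look like the sketch below; the function name parse_pairs and the shortened sample string are assumptions for illustration.

```python
def parse_pairs(contents):
    """Parse '"key":"value","key":"value",...' text into a dict."""
    # Characters to trim from both ends: space, straight quote, curly quotes, newline
    quote_chars = ' "\u201c\u201d\n'
    result = {}
    for pair in contents.split(","):
        key, sep, value = pair.partition(":")
        if not sep:
            continue  # skip fragments that have no colon
        result[key.strip(quote_chars)] = value.strip(quote_chars)
    return result

# Shortened sample in the style of the file from the question (\u201c is the curly quote)
sample = '\u201ccan\'t":"cannot","could\'ve":"could have"," hadn\'t":"had not"'
contractions = parse_pairs(sample)
print(contractions)
```

Using partition(":") instead of split(':') avoids an IndexError when a fragment has no colon.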
