Task: given a txt file where each line contains an adjective, then a tab, then comma-separated synonyms, across several lines, I need to create a dictionary where the adjective is the key and the synonyms are the value. My code:
#necessary for command line + regex
import sys
import re
#open file for reading
filename = sys.argv[1]
infile = open(filename, "r")
#a
#create a dictionary, where an adjective in a line is a key
#and synonyms are the value
dictionary = {}
#for each line in infile
for line in infile:
    #creating a list with keys, a key is everything before the tab
    adjectives = re.findall(r"w+\t$", line)
    print(adjectives)
    #creating a list of values, a value is everything after the tab
    synonyms = re.findall(r"^\tw+\n$", line)
    print(synonyms)
    #combining both lists into a dictionary, where adj are keys, synonyms - values
    dictionary = dict(zip(adjectives, synonyms))
    print(dictionary)
#close the file
infile.close()
The output shows me empty brackets... Could someone help me fix this?
Instead of regular expressions, use split() to split strings using delimiters. First split it using \t to separate the adjective from the synonyms, then split the synonyms into a list using ,.
Then you need to add a new key in the dictionary, not replace the entire dictionary.
for line in infile:
    line = line.strip()  # remove newline
    adjective, synonyms = line.split("\t")
    synonyms = synonyms.split(",")
    dictionary[adjective] = synonyms
print(dictionary)
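Building on that, a complete sketch that also strips the spaces after each comma, so the synonym lists come out clean (the sample line is made up for illustration):

```python
# Build {adjective: [synonyms]} from lines like "big\tlarge, huge, great"
def build_synonym_dict(lines):
    dictionary = {}
    for line in lines:
        line = line.strip()  # drop the trailing newline
        if not line:
            continue         # skip blank lines
        adjective, synonyms = line.split("\t")
        # split on commas and strip whitespace around each synonym
        dictionary[adjective] = [s.strip() for s in synonyms.split(",")]
    return dictionary

print(build_synonym_dict(["big\tlarge, huge, great\n"]))
# → {'big': ['large', 'huge', 'great']}
```

The same function works on an open file object, since iterating a file yields its lines.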
This is what I currently have. Once the word Fail is found, it needs to search upward from the Fail line and display everything up until it hits another string I set, and then ask the user for a string to search for within that area. Am I correct in creating a new text file to store the lines that contain the string from the user?
from typing import Any
from collections import Counter
errorList = []
with open('File1.txt', 'r') as f:
    data = f.readlines()
for line in data:
    if 'Fail ' in line:
        errorList.append(line)
errorList = [i[30:56] for i in errorList]
print("Failed are = ", errorList)
string = input("What string do you like to search for in this test case? ")
f = open('File1.txt')
f1 = open('StringSearch.txt', 'a')
for line in f.readlines():
    if string in line:
        f1.write(line)
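One concrete improvement regardless of the overall design: the second pass never closes `f` or `f1`. A `with` block closes both automatically. A minimal self-contained sketch of that search-and-append step (the sample file contents and the fixed search string stand in for the real log and the `input()` call):

```python
# Create a small sample input so the sketch is self-contained
with open('File1.txt', 'w') as f:
    f.write("step 1 ok\nstep 2 Fail timeout\nstep 3 ok\n")

string = "Fail"  # stands in for the input() call in the question
# Append every line containing the search string to StringSearch.txt;
# both files are closed automatically when the with block exits
with open('File1.txt') as f, open('StringSearch.txt', 'a') as f1:
    for line in f:
        if string in line:
            f1.write(line)
```

Iterating the file object directly (`for line in f:`) also avoids loading the whole file into memory the way `readlines()` does.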
I have a dictionary made in Python. I also have a text file where each line is a different word. I want to check each line of the text file against the keys of the dictionary, and if the line matches a key, I want to write that key's value to an output file. Is there an easy way to do this? Is this even possible?
for example I am reading my file in like this:
test = open("~/Documents/testfile.txt").read()
tokenising it, and for each word token I want to look it up in a dictionary. My dictionary is set up like this:
dic = {"a": ["ah0", "ey1"], "a's": ["ey1 z"], "a.": ["ey1"], "a.'s": ["ey1 z"]}
If I come across the letter 'a' in my file, I want it to output ["ah0", "ey1"].
you can try:
for line in all_lines:
    for val in dic:
        if line.count(val) > 0:
            print(dic[val])
This will look through all the lines in the file, and if a line contains a key from dic, it will print the items associated with that key. Since test was read in with .read(), it is a single string, so you will have to do something like all_lines = test.splitlines() to get the lines as a list. dic[val] gives the list assigned to that key, e.g. ["ah0", "ey1"], so you do not just have to print it; you can use it in other places too.
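A runnable version of that idea, collecting the matches into a list instead of only printing them (the two-line sample text stands in for the file contents):

```python
dic = {"a": ["ah0", "ey1"], "a's": ["ey1 z"], "a.": ["ey1"], "a.'s": ["ey1 z"]}

test = "a\nb\n"  # stands in for open("testfile.txt").read()
matches = []
for line in test.splitlines():
    for val in dic:
        if val in line:              # the line contains this dictionary key
            matches.append(dic[val]) # keep the value list for later use
print(matches)  # → [['ah0', 'ey1']]
```

Note that substring matching like this will also fire when a key appears inside a longer word; exact matching would compare `line == val` after stripping instead.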
You can give this a try:
# dictionary to match keys against words in the text file
dic = {"a": ["ah0", "ey1"], "a's": ["ey1 z"], "a.": ["ey1"], "a.'s": ["ey1 z"]}
# Read from the text file
open_file = open('sampletext.txt', 'r')
lines = open_file.readlines()
open_file.close()
# search each word extracted from the text file; if found in the dictionary, write its list to the output file
write_to_file = open('outputfile.txt', 'w')  # opened once; opening with 'w' inside the loop would overwrite the file on every match
for word in lines:
    word = word.strip()  # drop the newline "\n" that readlines() keeps on each word
    if word in dic:
        write_to_file.write(str(dic[word]) + "\n")
write_to_file.close()
Note: the strip() call is needed because when the text file has multiple lines, readlines() keeps the trailing "\n" on each word, and the dictionary lookups would fail without it.
I'm trying to take text files, count the usage of each word as key-value pairs in a dictionary, and write each dictionary to its own file. Then I want to add all of the dictionaries together into one master dictionary and write that to its own text file. When I run the program, I keep getting a TypeError in the save_the_dictionary function, since it's being passed a dictionary instead of a string; however, I thought that my save_the_dictionary function changes each key-value pair into strings before they are written to the file, but that doesn't seem to be the case. Any help with this would be greatly appreciated. Here is my code:
import os
from nltk.tokenize import sent_tokenize, word_tokenize
class Document:
    def tokenize(self, text):
        dictionary = {}
        for line in text:
            all_words = line.upper()
            words = word_tokenize(all_words)
            punctuation = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
            cleaned_words = []
            for word in words:
                if word not in punctuation:
                    cleaned_words.append(word)
            for word in cleaned_words:
                if word in dictionary:
                    dictionary[word] += 1
                else:
                    dictionary[word] = 1
        return dictionary

    def save_the_dictionary(self, dictionary, filename):
        # This save function writes a new file, turning each key and its corresponding value
        # into strings and writing them into a text file. It also adds formatting by tabbing
        # over after the key, writing the value, and then making a new line. Then it closes the file.
        newfile = open(filename, "w")
        for key, value in dictionary.items():
            newfile.write(str(key) + "/t" + str(value) + "/n")
        file.close()
# The main idea of this method is that it first converts all the text to uppercase and strips all of
# the formatting from the file that it is reading, then it splits the text into a list, using both
# whitespace and the characters above as delimiters. After that, it goes through the entire list
# pulled from the text file and sees if each word is in the dictionary variable. If it is in there,
# it adds 1 to the value associated with that key. If it is not found within the dictionary variable,
# it adds it as a key to the dictionary variable and sets its value to 1.
#The above document class will only be used within the actual vectorize function.
def vectorize(filepath):
    all_files = os.listdir(filepath)
    full_dictionary = {}
    for file in all_files:
        doc = Document()
        full_path = filepath + "\\" + file
        textfile = open(full_path, "r", encoding="utf8")
        text = textfile.read()
        compiled_dictionary = doc.tokenize(text)
        final_path = filepath + "\\final" + file
        doc.save_the_dictionary(final_path, compiled_dictionary)
        for line in text:
            all_words = line.upper()
            words = word_tokenize(all_words)
            punctuation = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
            cleaned_words = []
            for word in words:
                if word not in punctuation:
                    cleaned_words.append(word)
            for word in cleaned_words:
                if word in dictionary:
                    full_dictionary[word] += 1
                else:
                    full_dictionary[word] = 1
    Document().save_the_dictionary(filepath + "\\df.txt", full_dictionary)
vectorize("C:\\Users\\******\\Desktop\\*******\\*****\\*****\\Text files")
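A likely cause of the TypeError: the method is defined as save_the_dictionary(self, dictionary, filename), but it is called as doc.save_the_dictionary(final_path, compiled_dictionary), so the path string lands in the dictionary slot and the dictionary lands in filename, and open() then rejects the dictionary. Two smaller bugs sit in the same method: the escapes should be "\t" and "\n" (not "/t" and "/n"), and it closes file instead of newfile. A corrected sketch of just that method, shown as a plain function here for brevity (the sample dictionary and file name are illustrative):

```python
def save_the_dictionary(dictionary, filename):
    # one "word\tcount" line per entry; \t and \n (not /t and /n) are the escapes
    with open(filename, "w") as newfile:  # with closes the right file automatically
        for key, value in dictionary.items():
            newfile.write(str(key) + "\t" + str(value) + "\n")

# the argument order must match the signature: dictionary first, then the path
save_the_dictionary({"THE": 2, "CAT": 1}, "df.txt")
```

With the arguments in the defined order, the call inside vectorize would be doc.save_the_dictionary(compiled_dictionary, final_path).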
For a current research project, I am planning to count the unique words of different objects in a JSON file. Ideally, the output file should show separate word count summaries (counting the occurrence of unique words) for the texts in "Text Main", "Text Pro" and "Text Con". Is there any smart tweak to make this happen?
At the moment, I am receiving the following error message:
File "index.py", line 10, in <module>
text = data["Text_Main"]
TypeError: list indices must be integers or slices, not str
The JSON file has the following structure:
[
{"Stock Symbol":"A",
"Date":"05/11/2017",
"Text Main":"Text sample 1",
"Text Pro":"Text sample 2",
"Text Con":"Text sample 3"}
]
And the corresponding code looks like this:
# Import relevant libraries
import string
import json
import csv
import textblob

# Open JSON file and slice by object
file = open("Glassdoor_A.json", "r")
data = json.load(file)
text = data["Text_Main"]

# Create an empty dictionary
d = dict()

# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Remove the punctuation marks from the line
    line = line.translate(line.maketrans("", "", string.punctuation))
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1

# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])

# Save results as CSV
with open('Glassdoor_A.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["Word", "Occurences", "Percentage"])
    writer.writerows([key, d[key])
Well, firstly the key should be "Text Main" and secondly you need to access the first dict in the list. So just extract the text variable like this:
text = data[0]["Text Main"]
This should fix the error message.
Your JSON file has an object inside a list. In order to access the content you want, you first have to access the object via data[0]. Then you can access the string field, keeping in mind that the key in the file is "Text Main", not "Text_Main". I would change the code to:
# Open JSON file and slice by object
file = open("Glassdoor_A.json", "r")
data = json.load(file)
json_obj = data[0]
text = json_obj["Text Main"]
or you can access that field in a single line with text = data[0]["Text Main"] as quamrana stated.
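To get the separate per-field summaries the question asks for, the same counting loop can run once per key. A minimal sketch using collections.Counter, which does the tallying in one call (the inlined sample record mirrors the JSON structure above; the real code would keep json.load):

```python
import string
from collections import Counter

# stands in for json.load(open("Glassdoor_A.json"))
data = [{"Stock Symbol": "A",
         "Date": "05/11/2017",
         "Text Main": "Text sample 1",
         "Text Pro": "Text sample 2",
         "Text Con": "Text sample 3"}]

counts = {}
for field in ("Text Main", "Text Pro", "Text Con"):
    text = data[0][field].lower()
    # strip punctuation, then count whitespace-separated words
    text = text.translate(text.maketrans("", "", string.punctuation))
    counts[field] = Counter(text.split())

print(counts["Text Main"])
```

Each counts[field] is then a separate word-frequency mapping that can be written out with csv.writer row by row.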
I have a dictionary dict with some words (2000) and I have a huge text, like Wikipedia corpus, in text format. For each word that is both in the dictionary and in the text file, I would like to replace it with word_1.
with open("wiki.txt",'r') as original, open("new.txt",'w') as mod:
    for line in original:
        new_line = line
        for word in line.split():
            if (dict.get(word.lower()) is not None):
                new_line = new_line.replace(word,word+"_1")
        mod.write(new_line)
This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.
This works for short files, but for the longer that I am using as input, it "freezes" my computer.
Is there a more efficient way to do that?
Edit for Adi219:
Your code seems to work, but there is a problem:
if a line is like this: Albert is a friend of Albert, and in my dictionary I have Albert, then after the for cycle the line will look like this: Albert_1_1 is a friend of Albert_1. How can I replace only the exact word that I want, to avoid repetitions like _1_1_1_1?
Edit2:
To solve the previous problem, I changed your code:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                mod.write(word+"_1 ")
            else:
                mod.write(word+" ")
        mod.write("\n")
Now everything should work
A few things:
You could remove the declaration of new_line. Then, change the new_line = new_line.replace(...) line to line = line.replace(...). You would also have to write(line) afterwards.
You could add words = line.split() and use for word in words: for the for loop, as this removes a call to .split() for every iteration through the words.
You could (manually(?)) split your large .txt file into multiple smaller files and have multiple instances of your program running on each file, and then you could combine the multiple outputs into one file. Note: You would have to remember to change the filename for each file you're reading/writing to.
So, your code would look like:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                line = line.replace(word, word + "_1")
        mod.write(line)
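For the repeated _1_1 problem raised in the question's edit, a word-boundary regex is another option: re.sub with the pattern \w+ visits each whole word exactly once, so nothing gets tagged twice, and it also preserves the line's original spacing and punctuation, which the split-and-rejoin version loses. A sketch (the one-entry lookup dictionary stands in for the 2000-word dictionary):

```python
import re

lookup = {"albert": True}  # stands in for the real 2000-word dictionary

def tag_words(line):
    # append _1 to every whole word found in the dictionary;
    # \w+ matches each word exactly once, so no _1_1 repetition
    def repl(match):
        word = match.group(0)
        return word + "_1" if word.lower() in lookup else word
    return re.sub(r"\w+", repl, line)

print(tag_words("Albert is a friend of Albert"))
# → Albert_1 is a friend of Albert_1
```

Applied line by line inside the existing with block, this replaces both the replace() call and the manual re-spacing.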