For a current research project, I am planning to count the unique words of different objects in a JSON file. Ideally, the output file should show separate word-count summaries (counting the occurrence of each unique word) for the texts in "Text Main", "Text Pro" and "Text Con". Is there any smart tweak to make this happen?
At the moment, I am receiving the following error message:
File "index.py", line 10, in <module>
text = data["Text_Main"]
TypeError: list indices must be integers or slices, not str
The JSON file has the following structure:
[
  {
    "Stock Symbol": "A",
    "Date": "05/11/2017",
    "Text Main": "Text sample 1",
    "Text Pro": "Text sample 2",
    "Text Con": "Text sample 3"
  }
]
And the corresponding code looks like this:
# Import relevant libraries
import string
import json
import csv
import textblob

# Open JSON file and slice by object
file = open("Glassdoor_A.json", "r")
data = json.load(file)
text = data["Text_Main"]

# Create an empty dictionary
d = dict()

# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Remove the punctuation marks from the line
    line = line.translate(line.maketrans("", "", string.punctuation))
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1

# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])

# Save results as CSV
with open('Glassdoor_A.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["Word", "Occurrences", "Percentage"])
    writer.writerows([key, d[key]] for key in d)
Well, firstly, the key should be "Text Main" (with a space, as in the JSON file), and secondly, you need to access the first dict in the list. So just extract the text variable like this:
text = data[0]["Text Main"]
This should fix the error message.
Your JSON file has an object inside a list. In order to access the content you want, you first have to access that object via data[0]. Then you can access the string field (note the key is "Text Main", with a space, not "Text_Main"). I would change the code to:

# Open JSON file and slice by object
file = open("Glassdoor_A.json", "r")
data = json.load(file)
json_obj = data[0]
text = json_obj["Text Main"]

or you can access that field in a single line with text = data[0]["Text Main"], as quamrana stated.
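To get the separate word-count summaries the question asks for (one per field), a minimal sketch using collections.Counter could look like this. The sample data is inlined from the JSON shown above instead of being read from Glassdoor_A.json, so the snippet is self-contained:

```python
import string
from collections import Counter

# Sample data in the same shape as the Glassdoor_A.json shown in the question
data = [
    {"Stock Symbol": "A",
     "Date": "05/11/2017",
     "Text Main": "Text sample 1",
     "Text Pro": "Text sample 2",
     "Text Con": "Text sample 3"},
]

# One word-count summary per field, aggregated over every object in the list
summaries = {}
for field in ("Text Main", "Text Pro", "Text Con"):
    counts = Counter()
    for obj in data:
        cleaned = obj[field].lower()
        cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
        counts.update(cleaned.split())
    summaries[field] = counts

for field, counts in summaries.items():
    print(field, dict(counts))
```

Each summary can then be written to its own CSV with the same csv.writer pattern as above.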
Related
Currently this is what I have. Once I find the word "Fail", it needs to search above the "Fail" string and display up until it hits another string I set, and then ask the user for a string to search within that area. Am I correct in creating a new text file to store the lines that contain the string from the user?
errorList = []
with open('File1.txt', 'r') as f:
    data = f.readlines()

for line in data:
    if 'Fail ' in line:    # "in" is the idiomatic form of __contains__
        errorList.append(line)
errorList = [i[30:56] for i in errorList]
print("Failed are = ", errorList)

string = input("What string do you like to search for in this test case? ")
with open('File1.txt') as f, open('StringSearch.txt', 'a') as f1:
    for line in f:
        if string in line:
            f1.write(line)
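One hedged sketch of the "search above the fail line" idea: collect the lines between a start marker and each line containing 'Fail ', then search only inside that region. The marker 'Test Start' and the sample lines are placeholders, not from the original post:

```python
# Hypothetical sketch: keep only the lines between the most recent start
# marker and the first 'Fail ' line, then search inside that region.
def region_before_fail(lines, start_marker):
    """Return the lines between the last start_marker and the first 'Fail ' line."""
    region = []
    for line in lines:
        if start_marker in line:
            region = []          # a new block begins: discard earlier lines
        elif "Fail " in line:
            return region        # stop at the failure line
        else:
            region.append(line)
    return region

# Placeholder file contents; in the real script these come from File1.txt
lines = [
    "Test Start\n",
    "step one ok\n",
    "step two ok\n",
    "Result: Fail \n",
]
area = region_before_fail(lines, "Test Start")
wanted = "step two"              # in the real script: input("What string ...? ")
matches = [l for l in area if wanted in l]
print(matches)
```

Writing `matches` to a new text file, as the post suggests, is a reasonable way to keep the results.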
I'm trying to get this code to work and keep getting
AttributeError: 'str' object has no attribute 'txt'
My code is written below. I am new to this, so any help would be greatly appreciated; I cannot for the life of me figure out what I am doing wrong.
def countFrequency(alice):
    # Open file for reading
    file = open(alice.txt, "r")
    # Create an empty dictionary to store the words and their frequency
    wordFreq = {}
    # Read file line by line
    for line in file:
        # Split the line into words
        words = line.strip().split()
        # Iterate through the list of words
        for i in range(len(words)):
            # Remove punctuations and special symbols from the word
            for ch in '!"#$%&()*+,-./:;<=>?<#[\\]^_`{|}~':
                words[i] = words[i].replace(ch, "")
            # Convert the word to lowercase
            words[i] = words[i].lower()
            # Add the word to the dictionary with a frequency of 1 if it is not already in the dictionary
            if words[i] not in wordFreq:
                wordFreq[words[i]] = 1
            # Increase the frequency of the word by 1 in the dictionary if it is already in the dictionary
            else:
                wordFreq[words[i]] += 1
    # Close the file
    file.close()
    # Return the dictionary
    return wordFreq

if __name__ == "__main__":
    # Call the function to get frequency of the words in the file
    wordFreq = countFrequency("alice.txt")
    # Open file for writing
    outFile = open("most_frequent_alice.txt", "w")
    # Write the number of unique words to the file
    outFile.write("Total number of unique words in the file: " + str(len(wordFreq)) + "\n")
    # Write the top 20 most used words and their frequency to the file
    outFile.write("\nTop 20 most used words and their frequency:\n\n")
    outFile.write("{:<20} {}\n".format("Word", "Frequency"))
    wordFreq = sorted(wordFreq.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    for i in range(20):
        outFile.write("{:<20} {}\n".format(wordFreq[i][0], str(wordFreq[i][1])))
    # Close the file
    outFile.close()
file = open("alice.txt", "r")
You missed the quotation marks, and you might need to give the correct path to that text file too. Better still, since the filename is already passed in as the parameter, you can write open(alice, "r").
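A minimal sketch of that second option, opening the parameter itself instead of hardcoding the name (the punctuation handling is simplified here with str.strip, so this is an illustration rather than the original function):

```python
import string

def countFrequency(filename):
    """Count word frequencies in the file named by the parameter."""
    wordFreq = {}
    with open(filename, "r") as file:   # filename is already a string: no quotes needed
        for line in file:
            for word in line.strip().split():
                # Lowercase and trim surrounding punctuation (simplified cleanup)
                word = word.lower().strip(string.punctuation)
                if word:
                    wordFreq[word] = wordFreq.get(word, 0) + 1
    return wordFreq
```

Calling countFrequency("alice.txt") then works because the string flows through the parameter into open().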
I have a dictionary made in Python. I also have a text file where each line is a different word. I want to check each line of the text file against the keys of the dictionary, and if the line matches a key, I want to write that key's value to an output file. Is there an easy way to do this? Is this even possible?
For example, I am reading my file in like this:
test = open("~/Documents/testfile.txt").read()
tokenising it, and for each word token I want to look it up in a dictionary. My dictionary is set up like this:
dic = {"a": ["ah0", "ey1"], "a's": ["ey1 z"], "a.": ["ey1"], "a.'s": ["ey1 z"]}
If I come across the letter 'a' in my file, I want it to output ["ah0", "ey1"].
you can try:

for line in all_lines:
    for val in dic:
        if val in line:
            print(dic[val])

This will look through all the lines in the file, and if a line contains a key from dic, it will print the items associated with that key in the dictionary. (You will have to do something like all_lines = test.readlines() to get all the lines in a list, which means keeping the file object instead of calling .read() on it.) dic[val] gives the list assigned to the key, e.g. ["ah0", "ey1"], so you do not have to just print it; you can use it in other places.
you can give this a try:

# dictionary to match keys against words in the text file
lookup = {"a": ["ah0", "ey1"], "a's": ["ey1 z"], "a.": ["ey1"], "a.'s": ["ey1 z"]}

# read from the text file
open_file = open('sampletext.txt', 'r')
lines = open_file.readlines()
open_file.close()

# search for each word extracted from the text file; if it is found in the
# dictionary, write its list to the output file
write_to_file = open('outputfile.txt', 'w')
for word in lines:
    word = word.strip()    # drop the trailing "\n" before the lookup
    if word in lookup:
        write_to_file.writelines(str(lookup[word]))
write_to_file.close()

Note: the dictionary is named lookup so it does not shadow the built-in dict, the output file is opened once before the loop (reopening it with 'w' inside the loop would overwrite earlier matches), and the newline "\n" is stripped from each line before the lookup, since lines read with readlines() keep it.
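A self-contained sketch of the combined fix, with the file contents inlined as placeholders so the matching logic is easy to verify:

```python
# Placeholder dictionary and lines standing in for the real inputs
lookup = {"a": ["ah0", "ey1"], "a's": ["ey1 z"]}
lines = ["a\n", "b\n", "a's\n"]      # stand-in for sampletext.txt contents

found = []
for raw in lines:
    word = raw.strip()               # remove "\n" so the key matches exactly
    if word in lookup:
        found.append(lookup[word])

print(found)
```

Writing `found` out at the end, rather than inside the loop, keeps every match in the output file.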
Task: given a txt file with adjective \t synonym, synonym, synonym, etc. on each line (several lines are given), I need to create a dictionary where the adjective is the key and the synonyms are the value. My code:
# necessary for command line + regex
import sys
import re

# open file for reading
filename = sys.argv[1]
infile = open(filename, "r")

# a
# create a dictionary, where an adjective in a line is a key
# and synonyms are the value
dictionary = {}

# for each line in infile
for line in infile:
    # creating a list with keys, a key is everything before the tab
    adjectives = re.findall(r"w+\t$", line)
    print(adjectives)
    # creating a list of values, a value is everything after the tab
    synonyms = re.findall(r"^\tw+\n$", line)
    print(synonyms)
    # combining both lists into a dictionary, where adj are keys, synonyms - values
    dictionary = dict(zip(adjectives, synonyms))
    print(dictionary)

# close the file
infile.close()
The output shows me only empty brackets... Could someone help me fix this?
Instead of regular expressions, use split() to split strings using delimiters. First split it using \t to separate the adjective from the synonyms, then split the synonyms into a list using ,.
Then you need to add a new key in the dictionary, not replace the entire dictionary.
for line in infile:
    line = line.strip()  # remove newline
    adjective, synonyms = line.split("\t")
    synonyms = synonyms.split(",")
    dictionary[adjective] = synonyms
print(dictionary)
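Run on one sample line in the described format, this approach yields the expected key-value pair. The sample line is a placeholder; the spaces after the commas are stripped here as well, since the task shows "synonym, synonym":

```python
# One placeholder line in "adjective\tsyn, syn, syn" format
dictionary = {}
line = "happy\tglad, joyful, cheerful\n"

line = line.strip()                          # remove the trailing newline
adjective, synonyms = line.split("\t")       # key is everything before the tab
synonyms = [s.strip() for s in synonyms.split(",")]  # values, spaces trimmed
dictionary[adjective] = synonyms

print(dictionary)
```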
I'm trying to take text files, count the word usage of each word as key-value pairs in dictionaries, and write each dictionary to its own file. Then I want to add all of the dictionaries together into one master dictionary and write that to its own text file. When I run the program, I keep getting a TypeError from the save_the_dictionary function, since it is being passed a dictionary where it expects a string. I thought my save_the_dictionary function changes each key-value pair into strings before they are written to the file, but that doesn't seem to be the case. Any help with this would be greatly appreciated. Here is my code:
import os
from nltk.tokenize import sent_tokenize, word_tokenize

class Document:
    def tokenize(self, text):
        dictionary = {}
        for line in text:
            all_words = line.upper()
            words = word_tokenize(all_words)
            punctuation = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
            cleaned_words = []
            for word in words:
                if word not in punctuation:
                    cleaned_words.append(word)
            for word in cleaned_words:
                if word in dictionary:
                    dictionary[word] += 1
                else:
                    dictionary[word] = 1
        return dictionary

    # This save function writes a new file, and turns each key and its corresponding
    # value into strings and writes them into a text file; it also adds formatting by
    # tabbing over after the key, writing the value, and then making a new line.
    # Then it closes the file.
    def save_the_dictionary(self, dictionary, filename):
        newfile = open(filename, "w")
        for key, value in dictionary.items():
            newfile.write(str(key) + "/t" + str(value) + "/n")
        file.close()

# The main idea of this method is that it first converts all the text to uppercase,
# strips all of the formatting from the file that it is reading, then splits the text
# into a list, using both whitespace and the characters above as delimiters. After
# that, it goes through the entire list pulled from the text file and checks whether
# each word is in the dictionary variable. If it is in there, it adds 1 to the value
# associated with that key. If it is not found within the dictionary variable, it
# adds it as a key to the dictionary variable and sets its value to 1.

# The above Document class will only be used within the actual vectorize function.
def vectorize(filepath):
    all_files = os.listdir(filepath)
    full_dictionary = {}
    for file in all_files:
        doc = Document()
        full_path = filepath + "\\" + file
        textfile = open(full_path, "r", encoding="utf8")
        text = textfile.read()
        compiled_dictionary = doc.tokenize(text)
        final_path = filepath + "\\final" + file
        doc.save_the_dictionary(final_path, compiled_dictionary)
        for line in text:
            all_words = line.upper()
            words = word_tokenize(all_words)
            punctuation = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
            cleaned_words = []
            for word in words:
                if word not in punctuation:
                    cleaned_words.append(word)
            for word in cleaned_words:
                if word in dictionary:
                    full_dictionary[word] += 1
                else:
                    full_dictionary[word] = 1
    Document().save_the_dictionary(filepath + "\\df.txt", full_dictionary)

vectorize("C:\\Users\\******\\Desktop\\*******\\*****\\*****\\Text files")
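The TypeError comes from the call doc.save_the_dictionary(final_path, compiled_dictionary): the signature is save_the_dictionary(self, dictionary, filename), so the path string lands in the dictionary parameter and the dictionary in the filename parameter. The function also writes "/t" and "/n" (forward slashes) instead of the "\t" and "\n" escapes, and closes file instead of newfile. A minimal sketch of the corrected save step, reduced to a plain function with a placeholder dictionary for illustration:

```python
# Sketch of the corrected save: arguments in signature order, real escape
# sequences, and closing the file that was actually opened.
def save_the_dictionary(dictionary, filename):
    with open(filename, "w") as newfile:     # closed automatically by "with"
        for key, value in dictionary.items():
            # "\t" and "\n" (backslashes), not "/t" and "/n"
            newfile.write(str(key) + "\t" + str(value) + "\n")

counts = {"THE": 3, "CAT": 1}                # placeholder word counts
save_the_dictionary(counts, "df_test.txt")   # dictionary first, then the path
```

With the method kept as-is instead, the fix is simply to swap the arguments at the call sites: doc.save_the_dictionary(compiled_dictionary, final_path).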