I am creating a main function that loops through a dictionary that has one key for all the values associated with it. I am having trouble because I cannot get the dictionary to be all lowercase; I have tried using .lower() but to no avail. Also, the program should look at the words of the sentence, determine whether it has seen more of those words in sentences that the user has previously labeled "happy", "sad", or "neutral" (based on the three dictionaries), and make a guess as to which label to apply to the sentence.
An example output would look like this:
Sentence: i started screaming incoherently about 15 mins ago, this is B's attempt to calm me down.
0 appear in happy
0 appear in neutral
0 appear in sad
I think this is sad.
You think this is: sad
Okay! Updating.
CODE:
import csv
def read_csv(filename, col_list):
    """This function expects the name of a CSV file and a list of strings
    representing a subset of the headers of the columns in the file, and
    returns a dictionary of the data in those columns, as described below."""
    with open(filename, 'r') as f:
        # Better to convert the reader to a list (items represent every row)
        reader = list(csv.DictReader(f))
    dict1 = {}
    for col in col_list:
        dict1[col] = []
        # Go through every row of the file
        for row in reader:
            # Append this row's value for this column to the list
            dict1[col].append(row[col])
    return dict1
def main():
    dictx = read_csv('words.csv', ['happy'])
    dicty = read_csv('words.csv', ['sad'])
    dictz = read_csv('words.csv', ['neutral'])
    dictxcounter = 0
    dictycounter = 0
    dictzcounter = 0
    a = str(raw_input("Sentence: ")).split(' ')
    for word in a:
        for keys in dictx['happy']:
            if word == keys:
                dictxcounter = dictxcounter + 1
        for values in dicty['sad']:
            if word == values:
                dictycounter = dictycounter + 1
        for words in dictz['neutral']:
            if word == words:
                dictzcounter = dictzcounter + 1
    print dictxcounter
    print dictycounter
    print dictzcounter
Remove this line from your code:
dict1 = dict((k, v.lower()) for k,v in col_list)
It overwrites the dictionary that you built in the loop.
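If the goal is simply a case-insensitive comparison, one possible fix (a sketch in Python 2 to match the code above, assuming words.csv really has columns named 'happy', 'sad', and 'neutral') is to lowercase both the stored words and the user's sentence before comparing, then guess the label with the highest count:

# Minimal sketch, reusing the read_csv function from the question.
words = read_csv('words.csv', ['happy', 'sad', 'neutral'])
sentence = raw_input("Sentence: ").lower().split(' ')
counts = {}
for label in ('happy', 'neutral', 'sad'):
    # Lowercase the stored words so the comparison is case-insensitive
    seen = [w.lower() for w in words[label]]
    counts[label] = sum(1 for word in sentence if word in seen)
    print counts[label], 'appear in', label
# Guess the label whose dictionary matched the most words
print "I think this is %s." % max(counts, key=counts.get)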
We have two CSV files, new.csv and old.csv.
old.csv contains four rows:
abc done
xyz done
pqr done
rst pending
The new.csv contains four new rows:
abc pending
xyz not_done
pqr pending
rst done
I need to count two things, without using pandas:
count1 = number of entries changed from done to pending = 2 (abc, pqr)
count2 = number of entries changed from done to not_done = 1 (xyz)
CASE 1: CSV Files are in the same order
Firstly import the two files into python lists:
oldcsv = []
with open("old.csv") as f:
    for line in f:
        oldcsv.append(line.strip().split(","))

newcsv = []
with open("new.csv") as f:
    for line in f:
        newcsv.append(line.strip().split(","))
Now you would simply iterate through both lists simultaneously, using zip(). I am assuming that both CSV files list the entries in the same order.
count1 = 0
count2 = 0

for oldentry, newentry in zip(oldcsv, newcsv):
    assert oldentry[0] == newentry[0]  # Throw an error if entry names do not match
    if oldentry[1] == "done":
        if newentry[1] == "pending":
            count1 += 1
        elif newentry[1] == "not_done":
            count2 += 1
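After the loop, count1 and count2 hold the two totals and can be printed directly:

print(count1)  # 2 for the sample data (abc, pqr)
print(count2)  # 1 for the sample data (xyz)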
CASE 2: CSV Files are in arbitrary order
Here, since you are going to need to look up entries by their names, I would use a dictionary rather than a list to store the old.csv data, mapping the entry names to their values:
# Load old.csv data into a dictionary mapping entry_name: entry_value
old_values = {}
with open("old.csv") as f:
    for line in f:
        old_entry = line.strip().split(",")
        entry_name, old_entry_value = old_entry[0], old_entry[1]
        old_values[entry_name] = old_entry_value

count1 = 0
count2 = 0

with open("new.csv") as f:
    for line in f:
        # For each entry in new.csv, look up the corresponding old entry in old_values and compare their values
        new_entry = line.strip().split(",")
        entry_name, new_entry_value = new_entry[0], new_entry[1]
        old_entry_value = old_values.get(entry_name)  # The old value for this entry (None if there is no old entry)
        # Essentially the same code as before:
        print(f"{entry_name}: old entry status is {old_entry_value} and new entry status is {new_entry_value}")
        if old_entry_value == "done":
            if new_entry_value == "pending":
                print("Incrementing count1")
                count1 += 1
            elif new_entry_value == "not_done":
                print("Incrementing count2")
                count2 += 1

print(count1)
print(count2)
This should work, as long as the input data is properly formatted. I am assuming each .csv file has one entry per line, and each line begins with the entry name (e.g. "abc"), then a comma, then the entry value (e.g. "done","not_done").
Here is a straightforward pure-Python implementation:
import csv

with open("old.csv") as old_fl, open("new.csv") as new_fl:
    old = csv.reader(old_fl)
    new = csv.reader(new_fl)
    old_rows = [row for row in old]
    new_rows = [row for row in new]

# see if this is really needed
assert len(old_rows) == len(new_rows)
n = len(old_rows)

# assume that the left key is identical,
# and in the same order in both files
assert all(old_rows[i][0] == new_rows[i][0] for i in range(n))

# once the data is guaranteed to align,
# just count what you want
done_to_pending = [
    f"row[{i}]( {old_rows[i][0]} )"
    for i in range(n)
    if old_rows[i][1] == "done" and new_rows[i][1] == "pending"
]
done_to_notdone = [
    f"row[{i}]( {old_rows[i][0]} )"
    for i in range(n)
    if old_rows[i][1] == "done" and new_rows[i][1] == "not_done"
]
It uses Python's native csv reader, so you don't need to parse the CSV yourself. Note that there are various assumptions (the assert statements) throughout the code; you might need to adjust it to handle more cases.
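The two counts the question asks for are then just the lengths of those lists:

count1 = len(done_to_pending)   # entries that changed from done to pending
count2 = len(done_to_notdone)   # entries that changed from done to not_done
print(count1, count2)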
I am trying to look up dictionary indices for thousands of strings and this process is very, very slow. There are package alternatives, like KeyedVectors from gensim.models, which does what I want to do in about a minute, but I want to do what the package does more manually and to have more control over what I am doing.
I have two objects: (1) a dictionary that contains key : values for word embeddings, and (2) my pandas dataframe with my strings that need to be transformed into the index value found for each word in object (1). Consider the code below -- is there any obvious improvement to speed or am I relegated to external packages?
I would have thought that key lookups in a dictionary would be blazing fast.
Object 1
import numpy as np

embeddings_dictionary = dict()
glove_file = open('glove.6B.200d.txt', encoding="utf8")
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = np.asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
Object 2 (The slowdown)
no_matches = []
glove_tokenized_data = []
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            # the line below is the problem
            idx = list(embeddings_dictionary.keys()).index(word)
        except:
            idx = 400000  # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)
You've got a mapping of word -> np.array. It appears you want a quick way to map word to its location in the key list. You can do that with another dict.
no_matches = []
glove_tokenized_data = []
word_to_index = dict(zip(embeddings_dictionary.keys(), range(len(embeddings_dictionary))))
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            idx = word_to_index[word]
        except KeyError:
            idx = 400000  # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)
In the line you marked as a problem, you are first creating a list from the keys and then looking up the word in that list. You're doing this inside the loop, so the first thing you could do is move that logic to the top of the block (outside the loop) to avoid the repeated work; second, all this searching is now happening on a list, not a dictionary, so every lookup is a linear scan rather than a constant-time hash lookup.
Why not create another dictionary like this at the top of the file:
reverse_lookup = { word: index for index, word in enumerate(embeddings_dictionary.keys()) }
and then use this dictionary to look up the index of your word. Something like this:
for word in doc:
    if word in reverse_lookup:
        ints.append(reverse_lookup[word])
    else:
        no_matches.append(word)
        ints.append(400000)  # keep the placeholder index for unknown words, as in the original
I am trying to complete my homework assignment for my introductory programming class. I believe I am approaching a viable solution with my code, but I am struggling to understand how to switch between adding data to a dictionary and then displaying that data as a list of tuples. How do I display the frequency of phrases as a list of tuples when I have placed the data into a dictionary?
Here is my homework prompt, associated test cases, and the current code I have written:
Write a function called phrase_freq that takes as input two file name strings. The
first file name refers to a file containing a book. The second file name contains
phrases, one phrase to a line. The function uses a dictionary to map each
phrase in the second file to an integer representing the number of times the phrase
appears in the first file. Phrases should be counted regardless of their capitalization
(e.g., if the phrase is "The Goops", then the phrases "The goops", "the Goops", and
"the goops", should be counted). The function returns a list of tuples containing the phrase and the number of times the phrase appears, sorted from largest number of appearances
to the smallest.
#test cases
"""
>>> phrase_freq("goops.txt","small_phrase.txt")
[('The Goops', 11)]
>>> phrase_freq("goops.txt","large_phrase.txt")
[('The Goops', 11), ('a Goop', 9), ('your mother', 5), ('you know', 3), ('your father', 3), ('never knew', 2), ('forget it', 1)]
"""
def phrase_freq(file1, file2):
    text_list = []
    text_dict = {}
    in_file = open(file1)
    for data in in_file:
        a_dict += data
    in_file.close()
    in_file_2 = open(file2)
    key = data
    val = 0
    for other_data in in_file_2:
        if key in file2:
            a_dict[key] = a_dict[key] + 1
        else:
            a_dict[key] = 1
    in_file_2.close()
    return a_dict
Thank you so much!
EDIT:
I have corrected code to this. I had some variable name inconsistencies, some indentation errors, and edited my return statement to format the output properly.
(Ignore the comments for myself)
def phrase_freq(file1, file2):
    #1. Initialize a dictionary.
    a_dict = {}
    #2. Open the file containing the phrases from the book.
    in_file = open(file1)
    #3. Iterate through the data in the phrases file.
    for data in in_file:
        #4. Add this data into the dictionary.
        a_dict += data
    #5. Close this file.
    in_file.close()
    #6. Open the file containing text from the book.
    in_file_2 = open(file2)
    key = data
    val = 0
    for other_data in in_file_2:
        if key in file2:
            a_dict[key] = a_dict[key] + 1
        else:
            a_dict[key] = 1
    in_file_2.close()
    return list(a_dict.items())
You can use the .items() method of the dict class to get the key-value pairs from a dictionary. This returns a view object, so you need to build a list from it with the list() constructor:
freq_dict = {'the goops': 11, 'you know': 3}
freq_as_tuples = list(freq_dict.items())
print(freq_as_tuples)
Output:
[('the goops', 11), ('you know', 3)]
Assuming your counting logic is correct, this is how you can get tuples from a dict.
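Since the assignment wants the list sorted from the largest number of appearances to the smallest, a sorted() call with a key gets you the rest of the way (a sketch using made-up counts, not your actual counting logic):

freq_dict = {'forget it': 1, 'The Goops': 11, 'you know': 3}
# Sort the (phrase, count) pairs by count, descending
freq_as_tuples = sorted(freq_dict.items(), key=lambda pair: pair[1], reverse=True)
print(freq_as_tuples)

Output:

[('The Goops', 11), ('you know', 3), ('forget it', 1)]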
I have a very large file that looks like this:
[original file][1]
field number 7 (info) contains ~100 pairs of X=Y separated by ';'.
I first want to split all X=Y pairs.
Next I want to scan one pair at a time, and if X is one of 4 titles and Y is an int, I want to put them in a dictionary.
After finishing going through the pairs I want to check if the dictionary contains all 4 of my titles, and if so, I want to calculate something and write it into a new file.
This is the part of my code which is supposed to do that:
for row in reader:
    m = re.split(';', row[7])  # split the info field by ';'
    d = {}
    nl = []
    for c in m:  # for each pair, split by '='; if it is one of the 4 wanted fields and the value is an int, add it to a dict
        t = re.split('=', c)
        if (t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE') and type(t[1])==int:
            d[t[0]] = t[1]
    if 'AC_MALE' in d and 'AC_FEMALE' in d and 'AN_MALE' in d and 'AN_FEMALE' in d:  # if the dict contains all 4 wanted fields, make a new line for the final file
        total_ac = int(d['AC_MALE']) + int(d['AC_FEMALE'])
        total_an = int(d['AN_MALE']) + int(d['AN_FEMALE'])
        ac_an = total_ac/total_an
        nl.extend([row[0], row[1], row[3], row[4], total_ac, total_an, ac_an])
        writer.writerow(nl)
The code runs with no errors but isn't writing anything to the file.
Can someone figure out why?
Thanks!
type(t[1])==int is never true. t[1] is always a string, because you just split it from another string. It doesn't matter here that the string may contain only digits and could be converted to an int.
Test if you can convert your string to an integer, and if that fails, just move on to the next. If it succeeds, add the value to your dictionary:
for c in m:
    t = re.split('=', c)
    if t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE':
        try:
            d[t[0]] = int(t[1])
        except ValueError:
            # string could not be converted, so move on
            pass
Note that you don't need to use re.split(); use the standard str.split() method instead. You also don't need to test whether all keys are present in the dictionary afterwards; just test whether the dictionary contains 4 elements, i.e. has a length of 4. You can simplify the key-name test as well:
for row in reader:
    d = {}
    for key_value in row[7].split(';'):
        key, value = key_value.split('=')
        if key in {'AC_MALE', 'AC_FEMALE', 'AN_MALE', 'AN_FEMALE'}:
            try:
                d[key] = int(value)
            except ValueError:
                pass
    if len(d) == 4:
        total_ac = d['AC_MALE'] + d['AC_FEMALE']
        total_an = d['AN_MALE'] + d['AN_FEMALE']
        ac_an = total_ac / total_an
        writer.writerow([
            row[0], row[1], row[3], row[4],
            total_ac, total_an, ac_an])
I have this CSV file which contains lots of information. I have coded a program that is able to count what is inside the 'Feedback' column and the frequency of each item.
My problem now is that, after I have produced the items inside the 'Feedback' column, I want to specifically bring out another column that tallies with the 'Feedback' column.
Some example of the CSV file is as follow:
Feedback      Description   Status
Others        Fire Proct    Complete
Complaints    Grass         Complete
Compliment    Wall          Complete
...           ...           ...
With the frequency of the 'Feedback' column in hand, I now want to show, let's say if I select 'Complaints', everything from 'Description' that tallies with 'Complaints'.
Something like this:
Complaints Grass
Complaints Table
Complaints Door
... ...
Following is the code I have so far:
import csv, sys, os, shutil
from collections import Counter

reader = csv.DictReader(open('data.csv'))
result = {}
for row in reader:
    for column, value in row.iteritems():
        result.setdefault(column, []).append(value)

list = []
for items in result['Feedback']:
    if items == '':
        items = items
    else:
        newitem = items.upper()
        list.append(newitem)

unique = Counter(list)
for k, v in sorted(unique.items()):
    print k.ljust(30), ' : ', v
This is only the part that counts what's inside the 'Feedback' column and its frequency.
You could also store a defaultdict() holding a list of entries for each category as follows:
import csv
from collections import Counter, defaultdict

with open('data.csv', 'rb') as f_csv:
    csv_reader = csv.DictReader(f_csv)
    result = {}
    feedback = defaultdict(list)
    for row in csv_reader:
        for column, value in row.iteritems():
            result.setdefault(column, []).append(value)
        feedback[row['Feedback'].upper()].append(row['Description'])

data = []
for items in result['Feedback']:
    if items == '':
        items = items
    else:
        newitem = items.upper()
        data.append(newitem)

unique = Counter(data)
for k, v in sorted(unique.items()):
    print "{:20} : {:5} {}".format(k, v, ', '.join(feedback[k]))
This would display your output as:
COMPLAINTS           :     2 Grass, Door
COMPLIMENT           :     2 Wall, Table
OTHERS               :     1 Fire Proct
Or on multiple lines if instead you used:
print "{:20} : {:5}".format(k, v)
print ' ' + '\n '.join(feedback[k])
When using the csv library, you should open your file with rb in Python 2.x. Also avoid using list as a variable name as this overwrites the Python list() function.
Note: It is easier to use format() when printing aligned data.
You can do it with the code at the very end of this snippet, which is derived from the code in your question. I modified how the file is read by using a with statement, which ensures that it is closed when it's no longer needed. I also changed the name of the variable you had named list, because it hides the name of the built-in type and is considered by most to be poor programming practice. See PEP 8 - Style Guide for Python Code for more on this and related topics.
For testing purposes, I also added a couple more rows of 'Complaints' type of 'Feedback' items.
import csv
from collections import Counter

with open('information.csv') as csvfile:
    result = {}
    for row in csv.DictReader(csvfile):
        for column, value in row.iteritems():
            result.setdefault(column, []).append(value)

items = [item.upper() for item in result['Feedback']]
unique = Counter(items)
for k, v in sorted(unique.items()):
    print k.ljust(30), ' : ', v
print

for i, feedback in enumerate(result['Feedback']):
    if feedback == 'Complaints':
        print feedback, ' ', result['Description'][i]
Output:
COMPLAINTS : 3
COMPLIMENT : 1
OTHERS : 1
Complaints Grass
Complaints Table
Complaints Door