Table/Data manipulation with Python Dictionary - python

I need help finishing up this Python script. I'm an intern at a company, and this is my first week. I was asked to develop a Python script that will take a .csv and append any related columns into one column, so that only the 15 or so necessary columns remain with the data in them. For example, if there are zip4, zip5, or postal code columns, they want those to all be underneath the zip code column.
I just started learning Python this week as I was doing this project, so please excuse my noobish question and vocabulary. I'm not looking for you guys to do this for me. I'm just looking for some guidance. In fact, I want to learn more about Python, so anyone who could lead me in the right direction, please help.
I'm using dictionary keys and values. The keys are the column names in the first row. The values of each key are the remaining rows (second through 3000ish). Right now, I'm only getting one key:value pair: a single key, with only the final row as its array of values. I'm also getting a KeyError, so my keys aren't being identified correctly. My code so far is below. I'm gonna keep working on this, and any help is immensely appreciated! Hopefully, I can buy the person who helps me a beer and pick their brain a little :)
Thanks for your time
# To be able to read csv formatted files, we will first have to import the csv module
import csv
# cols = line.split(',')# each column is split by a comma
#read the file
CSVreader = csv.reader(open('N:/Individual Files/Jerry/2013 customer list qc, cr, db, gb 9-19-2013_JerrysMessingWithVersion.csv', 'rb'), delimiter=',', quotechar='"')
# define open dictionary
SLSDictionary={}# no empty dictionary. Need column names to compare to.
i=0
#top row are your keys. All other rows are your values
#adjust loop
for row in CSVreader:
    # multiple loops needed here
    if i == 0:
        key = row[i]
    else:
        [values] = [row[1:]]
        SLSDictionary = dict({key: [values]}) # Dictionary is keys and array of values
    i=i+1
#print Dictionary to check errors and make sure dictionary is filled with keys and values
print SLSDictionary
# SLSDictionary has key of zip/phone plus any characters
#SLSDictionary.has_key('zip.+')
SLSDictionary.has_key('phone.+')
#value of key are set equal to x. Values of that column set equal to x
#[x]=value
#IF SLSDictionary has the key of zip plus any characters, move values to zip key
#if true:
# SLSDictionary['zip'].append([x])
#SLSDictionary['phone_home'].append([value]) # I need to append the values of the specific column, not all columns
#move key's values to correct, corresponding key
SLSDictionary['phone_home'].append(SLSDictionary[has_key('phone.+')])#Append the values of the key/column 'phone plus characters' to phone_home key/column in SLSDictionary
#if false:
# print ''
# go to next key
SLSDictionary.has_value('')
if true:
    print 'Error: No data in column'
# if there's no data in rows 1-?. Delete column
#if value <= 0:
# del column
print SLSDictionary

Found a couple of errors just quickly looking at it. One thing you need to watch out for is that you're assigning a new value to the existing dictionary every time:
SLSDictionary = dict({key: [values]})
You're re-assigning a new value to your SLSDictionary every time it enters that loop. Thus at the end you only have the bottom-most entry. To add a key to the dictionary you do the following:
SLSDictionary[key] = values
Also you shouldn't need the brackets in this line:
[values] = [row[1:]]
Which should instead just be:
values = row[1:]
Most importantly, you will only ever have one key: key = row[i] runs only while i == 0, so it picks up just the first cell of the header row, and every later row is assigned to that single key. Without a sample of the CSV I can't say exactly how the loop should be restructured to catch all the keys.
Assuming your CSV is like this as you've described:
Col1, Col2, Col3, Col4
Val1, Val2, Val3, Val4
Val11, Val22, Val33, Val44
Val111, Val222, Val333, Val444
Then you probably want something like this:
dummy = [["col1", "col2", "col3", "col4"],
         ["val1", "val2", "val3", "val4"],
         ["val11", "val22", "val33", "val44"],
         ["val111", "val222", "val333", "val444"]]
column_index = []
SLSDictionary = {}
# The first row holds the column names: start one empty list per column.
for each in dummy[0]:
    column_index.append(each)
    SLSDictionary[each] = []
# Every remaining row contributes one value to each column's list.
for each in dummy[1:]:
    for i, every in enumerate(each):
        try:
            if column_index[i] in SLSDictionary.keys():
                SLSDictionary[column_index[i]].append(every)
        except IndexError:
            # the row has more cells than there are column names
            pass
print SLSDictionary
Which Yields...
{'col4': ['val4', 'val44', 'val444'], 'col2': ['val2', 'val22', 'val222'], 'col3': ['val3', 'val33', 'val333'], 'col1': ['val1', 'val11', 'val111']}
If you want the columns to stay in order, use collections.OrderedDict instead of a plain dict (from Python 3.7 on, a plain dict already preserves insertion order).
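For the original goal of folding related columns (zip4, zip5, postal code) into one, a minimal sketch could look like the following. It assumes the columns are already held in a dictionary of lists as in the answer; the helper name merge_columns, the sample column names, and the rule of matching on a lowercase name prefix are illustrative assumptions, not something taken from the asker's file.

```python
from collections import OrderedDict


def merge_columns(columns, target, prefixes):
    """Append the values of every column whose name starts with one of
    `prefixes` to the `target` column, then drop the merged columns.

    `columns` maps a column name to a list of cell values.
    Hypothetical helper, for illustration only.
    """
    merged = OrderedDict()
    merged[target] = list(columns.get(target, []))
    for name, values in columns.items():
        if name == target:
            continue
        if name.lower().startswith(tuple(prefixes)):
            # Skip empty cells so blanks do not pile up in the target.
            merged[target].extend(v for v in values if v != "")
        else:
            merged[name] = list(values)
    return merged


# Made-up sample data in the same shape as SLSDictionary.
columns = OrderedDict([
    ("name", ["Ann", "Bob"]),
    ("zip", ["12345", ""]),
    ("zip5", ["", "54321"]),
    ("postal code", ["", ""]),
])

result = merge_columns(columns, "zip", ["zip", "postal"])
print(result["zip"])
```

Note that appending one column under another, as the question describes, loses the row alignment with the other columns. If each value has to stay on its own row, the merge would need to pick the first non-empty cell per row instead.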

Related

How do you modify a dictionary from a text file when you only need to get specific values?

Say we have a file with about 6 columns and 6 rows. If I wanted to read specific columns from each line and use them to update a dictionary I already have, how would I approach that?
The output should be all the data, with the key being the second column and the two values being the first and fourth columns.
Can somebody please help me start this off?
I've tried using:
for line in file:
    (key, val) = line.split()
    data[int(key)] = val
print (data)
But, obviously this'll fail, since this expects only 2 values. I need 1 key value, and 2 value values.
split returns a list. Use the list instead of expanding into named variables.
data = {}
for line in file:
    row = line.strip().split()
    data[int(row[1])] = row[0], row[3]
print (data)
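To see the answer's approach run end to end, here is a small sketch that swaps the real file for an in-memory io.StringIO object. The sample rows are invented; only the column positions (second column as key, first and fourth as values) come from the question.

```python
import io

# Invented whitespace-separated sample standing in for the real file.
sample = io.StringIO(
    "a1 101 c1 d1 e1 f1\n"
    "a2 102 c2 d2 e2 f2\n"
    "a3 103 c3 d3 e3 f3\n"
)

data = {}
for line in sample:
    row = line.split()
    if len(row) < 4:
        continue  # skip blank or short lines instead of raising IndexError
    # Second column is the key; first and fourth columns form the value.
    data[int(row[1])] = (row[0], row[3])

print(data)
```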

unhashable type: 'dict'

I am new here and want to ask about removing duplicate entries. I'm working on a face recognition project and am stuck on removing duplicate rows from the data I send to Google Sheets. This is the code that I use:
if(confidence <100):
    id = names[id]
    confidence = "{0}%".format (round(100-confidence))
    row = (id,datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    index = 2
    sheet.insert_row (row,index)
    data = sheet.get_all_records()
    result = list(set(data))
    print (result)
The error message is "unhashable type: 'dict'".
I want each result to be posted to the Google Sheet only once.
You can't add dictionaries to sets, because dictionaries are not hashable.
What you can do is convert each dictionary to a tuple of its items and add those tuples to the set (this works as long as the values themselves are hashable):
s = set(tuple(d.items()) for d in data)
If you need to convert this back to a list of dictionaries afterwards, you can do:
unique_rows = [dict(t) for t in s]
According to the gspread documentation, get_all_records() returns a list of dicts, where each dict uses the header row as keys and the cell values as values. So you need to iterate through this list and compare the ids to find and drop the repeated items. Sample code:
visited = set()
filtered = []
for row in data:
    if row['id'] not in visited:
        # first time this id is seen: remember it and keep the row
        visited.add(row['id'])
        filtered.append(row)
Now filtered contains only the first row for each id. Instead of 'id', use the name of the column that contains the repeating value.
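The same idea can be wrapped in a small reusable function. This is only a sketch: the name unique_by and the sample records are made up, and the 'id' column is an assumption about how the sheet's header row is labelled.

```python
def unique_by(records, column):
    """Return the records whose value in `column` has not been seen
    before, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for record in records:
        value = record[column]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


# Invented rows in the shape the answer describes: one dict per sheet row.
data = [
    {"id": "alice", "time": "2020-01-01 09:00:00"},
    {"id": "bob", "time": "2020-01-01 09:01:00"},
    {"id": "alice", "time": "2020-01-01 09:02:00"},
]

filtered = unique_by(data, "id")
print([row["id"] for row in filtered])
```

A similar membership check could also be done before calling sheet.insert_row, so that a duplicate never reaches the sheet in the first place.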

Using two columns in an existing SQLite database to create a third column using Python

I have created a database with multiple columns and am wanting to use the data stored in two of the columns (named 'cost' and 'Mwe') to create a new column 'Dollar_per_KWh'. I have created two lists, one contains the rowid and the other contains the new value that I want to populate the new Dollar_per_KWh column. As it iterates through all the rows, the two lists are zipped together into a dictionary containing tuples. I then try to populate the new sqlite column. The code runs and I do not receive any errors. When I print out the dictionary it looks correct.
Issue: the new column in my database is not being updated with the new data and I am not sure why. The values in the new column are showing 'NULL'
Thank you for your help. Here is my code:
conn = sqlite3.connect('nuclear_builds.sqlite')
cur = conn.cursor()
cur.execute('''ALTER TABLE Construction
ADD COLUMN Dollar_per_KWh INTEGER''')
cur.execute('SELECT _rowid_, cost, Mwe FROM Construction')
data = cur.fetchall()
dol_pr_kW = dict()
key = list()
value = list()
for row in data:
    id = row[0]
    cost = row[1]
    MWe = row[2]
    value.append(int((cost*10**6)/(MWe*10**3)))
    key.append(id)
    dol_pr_kW = list(zip(key, value))
cur.executemany('''UPDATE Construction SET Dollar_per_KWh = ? WHERE _rowid_ = ?''', (dol_pr_kW[1], dol_pr_kW[0]))
conn.commit()
Not sure why it isn't working. Have you tried just doing it all in SQL?
conn = sqlite3.connect('nuclear_builds.sqlite')
cur = conn.cursor()
cur.execute('''ALTER TABLE Construction
ADD COLUMN Dollar_per_KWh INTEGER;''')
# multiplying by 1000.0 forces floating-point division even if cost and MWe are stored as integers
cur.execute('''UPDATE Construction SET Dollar_per_KWh = cast(cost * 1000.0 / MWe as integer);''')
conn.commit()
It's a lot simpler just doing the calculation in SQL than pulling data to Python, manipulating it, and pushing it back to the database.
If you need to do this in Python for some reason, testing whether this works will at least give you some hints as to what is going wrong with your current code.
Update: I see a few more problems now.
First I see you are creating an empty dictionary dol_pr_kW before the for loop. This isn't necessary as you are re-defining it as a list later anyway.
Then you are trying to create the list dol_pr_kW inside the for loop. This has the effect of over-writing it for each row in data.
I'll give a few different ways to solve it. It looks like you were trying a few different things at once (using dict and list, building two lists and zipping into a third list, etc.) that is adding to your trouble, so I am simplifying the code to make it easier to understand. In each solution I will create a list called data_to_insert. That is what you will pass at the end to the executemany function.
The first option is to create your list before the for loop, then append to it for each row.
dol_pr_kW = list()
for row in data:
    id = row[0]
    cost = row[1]
    MWe = row[2]
    val = int((cost*10**6)/(MWe*10**3))
    dol_pr_kW.append((id, val))  # append takes a single argument, so pass one tuple
# you can do this, or instead change the step above to dol_pr_kW.append((val, id))
data_to_insert = [(r[1],r[0]) for r in dol_pr_kW]
The second way would be to zip the key and value lists AFTER the for loop.
key = list()
value = list()
for row in data:
    id = row[0]
    cost = row[1]
    MWe = row[2]
    value.append(int((cost*10**6)/(MWe*10**3)))
    key.append(id)
dol_pr_kW = list(zip(key,value))
#you can do this or instead change above step to dol_pr_kW=list(zip(value,key))
data_to_insert = [(r[1],r[0]) for r in dol_pr_kW]
Third, if you would rather keep it as an actual dict you can do this.
dol_pr_kW = dict()
for row in data:
    id = row[0]
    cost = row[1]
    MWe = row[2]
    val = int((cost*10**6)/(MWe*10**3))
    dol_pr_kW[id] = val
# convert to list
data_to_insert = [(dol_pr_kW[id], id) for id in dol_pr_kW]
Then, to execute, call:
cur.executemany('''UPDATE Construction SET Dollar_per_KWh = ? WHERE _rowid_ = ?''', data_to_insert)
conn.commit()
I prefer the first option since it's easiest for me to understand what's happening at a glance. Each iteration of the for loop just adds a (id, val) to the end of the list. It's a little more cumbersome to build two lists independently and zip them together to get a third list.
Also note that even if the dol_pr_kW list had been created correctly, passing (dol_pr_kW[1], dol_pr_kW[0]) to executemany does not reverse (key, value) to (value, key): it passes the second and first rows of the list, each still in (key, value) order, so the rowid ends up in the SET placeholder and the computed value in the WHERE placeholder, and the UPDATE most likely matches no rows. You need a list comprehension to do the swap in one line of code. I did this on a separate line and assigned it to the variable data_to_insert for readability.
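To check the whole round trip without touching the real file, here is a minimal sketch against a throwaway in-memory database. The table and column names come from the question; the two sample rows and their units (cost in millions of dollars, capacity in MW) are assumptions made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the demo
cur = conn.cursor()
cur.execute("CREATE TABLE Construction (cost REAL, Mwe REAL)")
# Invented sample values, not taken from the asker's data.
cur.executemany("INSERT INTO Construction (cost, Mwe) VALUES (?, ?)",
                [(5000.0, 1000.0), (3000.0, 1200.0)])
cur.execute("ALTER TABLE Construction ADD COLUMN Dollar_per_KWh INTEGER")

cur.execute("SELECT _rowid_, cost, Mwe FROM Construction")
data_to_insert = []
for rowid, cost, mwe in cur.fetchall():
    val = int((cost * 10**6) / (mwe * 10**3))
    # Tuple order must match the placeholders: value first, then rowid.
    data_to_insert.append((val, rowid))

cur.executemany(
    "UPDATE Construction SET Dollar_per_KWh = ? WHERE _rowid_ = ?",
    data_to_insert)
conn.commit()  # commit is a method of the connection, not the cursor

cur.execute("SELECT Dollar_per_KWh FROM Construction ORDER BY _rowid_")
results = [r[0] for r in cur.fetchall()]
print(results)
conn.close()
```

Reading the column back after the commit is a quick way to confirm that the rows were actually matched, which is the step that silently failed in the question.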

Reading and writing to/from csv files

I want my program to read 2 columns (the first and the second one) and add them to an array. They depend on each other, so they need to be stored alongside each other: both columns of the first row together, then the second row, and so on.
I have managed to write the first column (containing the names) to the array, however have not managed to write the second column to the array.
rownum=1
array=[]
for row in reader:
    if row[1] != '' and row[1] != 'Score':
        array.append(row[1])
        rownum=rownum+1
    if rownum==11:
        break
I attempted to append more than one value, but it returns the error message 'only accepts one argument'.
Any ideas how I can do this so I can look up the score for each name from the csv file?
Try using a dictionary.
d = {} #curly braces denote an empty dictionary
for row in reader:
    d[row[0]] = row[1]
d, in this case, would be a dictionary with the first column of your csv file as the keys and the second column as the corresponding values.
You can access it much like you access a list. Say you had Brian,80 as one of the entries in your csv file: d["Brian"] would return '80' (the csv module reads every value as a string, so convert with int() if you need a number).
EDIT
OP has requested (in the comments) for a more complete version of the code. Assuming OP's code already works, I'll modify that code so it works with a dictionary:
rownum=1
d={} #denotes an empty dictionary
for row in reader:
    if row[1] != '' and row[1] != 'Score':
        d[row[0]]=row[1] #first column is the key/index, second column is the value
        rownum=rownum+1
    if rownum==11:
        break
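A self-contained version of the dictionary approach might look like this. It is a sketch only: the file is replaced by an in-memory io.StringIO object, the 'Score' header and the Brian,80 row come from the thread, and the other rows are invented.

```python
import csv
import io

# Invented sample standing in for the real csv file.
sample = io.StringIO("Name,Score\nBrian,80\nAlice,92\nCarol,\n")

scores = {}
for row in csv.reader(sample):
    if len(row) < 2 or row[1] in ("", "Score"):
        continue  # skip the header, short rows and rows without a score
    scores[row[0]] = int(row[1])  # csv yields strings, so convert to int

print(scores["Brian"])
```

Converting to int at load time means later code can compare or sum the scores directly, at the cost of raising ValueError if a cell holds something that is not a whole number.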

Disturbing odd behavior/bug in Python itertools groupby?

I am using itertools.groupby to parse a short tab-delimited text file. The file has several columns, and all I want to do is group all the entries that have a particular value x in a particular column. The code below does this for a column called name2, looking for the value in variable x. I tried to do this using csv.DictReader and itertools.groupby. In the table there are 8 rows that match this criterion, so 8 entries should be returned. Instead groupby returns two sets of entries, one with a single entry and another with 7, which seems like the wrong behavior. I do the matching manually below on the same data and get the right result:
import itertools, operator, csv
col_name = "name2"
x = "ENSMUSG00000002459"
print "looking for entries with value %s in column %s" %(x, col_name)
print "groupby gets it wrong: "
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
for name, entries in itertools.groupby(data, key=operator.itemgetter(col_name)):
    if name == "ENSMUSG00000002459":
        wrong_result = [e for e in entries]
        print "wrong result has %d entries" %(len(wrong_result))
print "manually grouping entries is correct: "
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
correct_result = []
for row in data:
    if row[col_name] == "ENSMUSG00000002459":
        correct_result.append(row)
print "correct result has %d entries" %(len(correct_result))
The output I get is:
looking for entries with value ENSMUSG00000002459 in column name2
groupby gets it wrong:
wrong result has 7 entries
wrong result has 1 entries
manually grouping entries is correct:
correct result has 8 entries
What is going on here? If groupby is really grouping, it seems like I should only get one set of entries per x, but instead it returns two. I cannot figure this out. EDIT: Ah, got it: the data should be sorted first.
You're going to want to change your code to force the data to be in key order, and then group the sorted list rather than the original reader:
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
sorted_data = sorted(data, key=operator.itemgetter(col_name))
for name, entries in itertools.groupby(sorted_data, key=operator.itemgetter(col_name)):
    pass # whatever
groupby is mainly useful when the dataset is large and already in key order. If you would have to sort first anyway, collecting the rows in a defaultdict is more efficient:
from collections import defaultdict
name_entries = defaultdict(list)
for row in data:
    name_entries[row[col_name]].append(row)
According to the documentation, groupby() groups only consecutive occurrences of the same key.
I don't know what your data looks like, but my guess is it's not sorted. groupby only gives one group per key when the data is sorted by that key.
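The consecutive-run behaviour is easy to reproduce on a tiny example. In this sketch the rows are invented; only the column name name2 is borrowed from the question.

```python
from itertools import groupby
from operator import itemgetter

# Invented rows; the key "x" appears in two separate runs.
rows = [
    {"name2": "x", "n": 1},
    {"name2": "y", "n": 2},
    {"name2": "x", "n": 3},
]
key = itemgetter("name2")

# Grouping the rows as they come: each run of equal keys is its own group.
unsorted_groups = [(k, len(list(g))) for k, g in groupby(rows, key=key)]

# Sorting by the same key first merges the runs into one group per key.
sorted_groups = [(k, len(list(g)))
                 for k, g in groupby(sorted(rows, key=key), key=key)]

print(unsorted_groups)  # "x" shows up twice
print(sorted_groups)    # one group per key
```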
