Python Dictionary to CSV Issue - python

I put together a Python script to clean CSV files. The reformatting works, but the data rows the writer writes to the new CSV file are wrong. I build a list of row dictionaries before writing it out with writer.writerows(). When I check with print statements, the correct data is being appended to the list; however, after appending, the list holds the wrong values.
import csv

data = []
with open(r'C:\\Data\\input.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    street_fields = []  # Store new field names in list
    street_fields.append("startdate")
    street_fields.append("starttime")
    street_fields.append("sitecode")
    street_fields.append("recordtime")
    street_fields.append("direction")
    street_fields.append("turnright")
    street_fields.append("wentthrough")
    street_fields.append("turnleft")
    street_fields.append("pedestrians")
    for row in csv_reader:  # Read input rows
        if line_count == 0:
            startdate = row[1]  # Get Start Date from B1
            line_count += 1
        elif line_count == 1:
            starttime = row[1]  # Get Start Time from B2
            line_count += 1
        elif line_count == 2:
            sitecode = str(row[1])  # Get Site code from B3
            line_count += 1
        elif line_count == 3:
            street_count = len(row) - 3  # Determine number of streets in report
            streetnames = []
            i = 1
            while i < street_count:
                streetnames.append(row[i])  # Add streets to list
                i += 4
            line_count += 1
        elif line_count > 4:
            street_values = {}  # Create dictionary to store new row values
            n = 1
            for street in streetnames:
                turnright = 0 + n
                wentthrough = 1 + n
                turnleft = 2 + n
                pedestrians = 3 + n
                street_values["startdate"] = startdate
                street_values["starttime"] = starttime
                street_values["sitecode"] = sitecode
                street_values["recordtime"] = row[0]
                street_values["direction"] = street
                street_values["turnright"] = int(row[turnright])
                street_values["wentthrough"] = int(row[wentthrough])
                street_values["turnleft"] = int(row[turnleft])
                street_values["pedestrians"] = int(row[pedestrians])
                data.append(street_values)  # Append row dictionary to list
                #print(street_values) ### UNCOMMENT TO SEE CORRECT ROW DATA ###
                #print(data) ### UNCOMMENT TO SEE INCORRECT ROW DATA ###
                n += 4
            line_count += 1
        else:
            line_count += 1

with open(r'C:\\Data\\output.csv', 'w', newline='', encoding="utf-8") as w_scv_file:
    writer = csv.DictWriter(w_scv_file, fieldnames=street_fields)
    writer.writerow(dict((fn, fn) for fn in street_fields))  # Write headers to new CSV
    writer.writerows(data)  # Write data from list of dictionaries
An example of the list of dictionaries created (JSON):
[
    {
        "startdate": "11/9/2017",
        "starttime": "7:00",
        "sitecode": "012345",
        "recordtime": "7:00",
        "direction": "Cloud Dr. From North",
        "turnright": 0,
        "wentthrough": 2,
        "turnleft": 11,
        "pedestrians": 0
    },
    {
        "startdate": "11/9/2017",
        "starttime": "7:00",
        "sitecode": "012345",
        "recordtime": "7:00",
        "direction": "Florida Blvd. From East",
        "turnright": 4,
        "wentthrough": 433,
        "turnleft": 15,
        "pedestrians": 0
    },
    {
        "startdate": "11/9/2017",
        "starttime": "7:00",
        "sitecode": "012345",
        "recordtime": "7:00",
        "direction": "Cloud Dr. From South",
        "turnright": 15,
        "wentthrough": 4,
        "turnleft": 6,
        "pedestrians": 0
    },
    {
        "startdate": "11/9/2017",
        "starttime": "7:00",
        "sitecode": "012345",
        "recordtime": "7:00",
        "direction": "Florida Blvd. From West",
        "turnright": 2,
        "wentthrough": 219,
        "turnleft": 2,
        "pedestrians": 0
    },
    {
        "startdate": "11/9/2017",
        "starttime": "7:00",
        "sitecode": "012345",
        "recordtime": "7:15",
        "direction": "Cloud Dr. From North",
        "turnright": 1,
        "wentthrough": 3,
        "turnleft": 8,
        "pedestrians": 0
    }
]
What actually writes to the CSV:
Note the Direction field and data rows are incorrect. For some reason, when it loops through the streetnames list, only the last street name and its corresponding row values end up in every row for a given record time.
Do I need to delete my variables before re-assigning them values?

It looks like you are appending the same dictionary to the list over and over.
In general, when appending a number of separate dictionaries to a list, I would use mylist.append(mydict.copy()). Otherwise, when you later assign new values through the same dictionary name, you are really just updating your old dictionary, including the entries in your list that still point to it (see mutable vs immutable objects in Python).
In short: if you want the dictionary in the list to be a separate entity from the new one, append a copy made with dict.copy(); a shallow copy is enough here because the values are plain strings and ints.
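As a rough sketch of that fix applied to the question's inner loop (same variable names as the question; untested against the real input), either create the dictionary inside the loop or append a copy, not both:

for street in streetnames:
    # creating the dict inside the loop gives every appended row its own object
    street_values = {
        "startdate": startdate,
        "starttime": starttime,
        "sitecode": sitecode,
        "recordtime": row[0],
        "direction": street,
        "turnright": int(row[n]),
        "wentthrough": int(row[n + 1]),
        "turnleft": int(row[n + 2]),
        "pedestrians": int(row[n + 3]),
    }
    data.append(street_values)  # or data.append(street_values.copy()) if the same dict is reused
    n += 4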


finding the same words in two files and leaving out non-repeated ones in python

I have to write a program that correlates smoking with lung cancer risk. For that I have data in two files.
My code currently pairs up values that sit on the same line number in each file (e.g. America,23.3 with Spain,77.9 and Italy,24.2 with Russia,60.8).
How can I modify my code so that it matches the numbers by country and leaves out the countries that occur in only one file (it shouldn't compute Germany, France, China, Korea because they are only in one file)?
Thank you so much for your help in advance:)
smoking file:
Country, Percent Cigarette Smokers Data
America,23.3
Italy,24.2
Russia,23.7
France,14.9
England,17.9
Spain,17
Germany,21.7
second file:
Cases Lung Cancer per 100000
Spain,77.9
Russia,60.8
Korea,61.3
America,73.3
China,66.8
Vietnam,64.5
Italy,43.9
and my code:
import math

def readFiles(smoking_datafile, cancer_datafile):
    '''
    Reads the data from the provided file objects smoking_datafile
    and cancer_datafile. Returns a list of the data read from each
    in a tuple of the form (smoking_datafile, cancer_datafile).
    '''
    # init
    smoking_data = []
    cancer_data = []
    empty_str = ''
    # read past file headers
    smoking_datafile.readline()
    cancer_datafile.readline()
    # read data files
    eof = False
    while not eof:
        # read line of data from each file
        s_line = smoking_datafile.readline()
        c_line = cancer_datafile.readline()
        # check if at end-of-file of both files
        if s_line == empty_str and c_line == empty_str:
            eof = True
        # check if end of smoking data file only
        elif s_line == empty_str:
            raise OSError('Unexpected end-of-file for smoking data file')
        # check if at end of cancer data file only
        elif c_line == empty_str:
            raise OSError('Unexpected end-of-file for cancer data file')
        # append line of data to each list
        else:
            smoking_data.append(s_line.strip().split(','))
            cancer_data.append(c_line.strip().split(','))
    # return list of data from each file
    return (smoking_data, cancer_data)

def calculateCorrelation(smoking_data, cancer_data):
    '''
    Calculates and returns the correlation value for the data
    provided in lists smoking_data and cancer_data
    '''
    # init
    sum_smoking_vals = sum_cancer_vals = 0
    sum_smoking_sqrd = sum_cancer_sqrd = 0
    sum_products = 0
    # calculate intermediate correlation values
    num_values = len(smoking_data)
    for k in range(0, num_values):
        sum_smoking_vals = sum_smoking_vals + float(smoking_data[k][1])
        sum_cancer_vals = sum_cancer_vals + float(cancer_data[k][1])
        sum_smoking_sqrd = sum_smoking_sqrd + float(smoking_data[k][1]) ** 2
        sum_cancer_sqrd = sum_cancer_sqrd + float(cancer_data[k][1]) ** 2
        sum_products = sum_products + float(smoking_data[k][1]) * float(cancer_data[k][1])
    # calculate and display correlation value
    numer = (num_values * sum_products) - (sum_smoking_vals * sum_cancer_vals)
    denom = math.sqrt(abs(
        ((num_values * sum_smoking_sqrd) - (sum_smoking_vals ** 2)) *
        ((num_values * sum_cancer_sqrd) - (sum_cancer_vals ** 2))
    ))
    return numer / denom
Let's just focus on getting the data into a format that is easy to work with. The code below will get you a dictionary of the form ...
smokers_cancer_data = {
    'America': {
        'smokers': 23.3,
        'cancer': 73.3
    },
    'Italy': {
        'smokers': 24.2,
        'cancer': 43.9
    },
    ...
}
Once you have this you can get any values you need and perform your calculations. See the code below.
def read_data(filename: str) -> dict:
    with open(filename, 'r') as file:
        next(file)  # Skip the header
        data = dict()
        for line in file:
            cleaned_line = line.rstrip()
            # Skip blank lines
            if cleaned_line:
                data_item = cleaned_line.split(',')
                data[data_item[0]] = float(data_item[1])
    return data

# Load data into python dictionaries
smokers_data = read_data('smokersData.txt')
cancer_data = read_data('lungCancerData.txt')

# Build one dictionary that is easy to work with
smokers_cancer_data = dict()
for (key, value) in smokers_data.items():
    if key in cancer_data:
        smokers_cancer_data[key] = {
            'smokers': smokers_data[key],
            'cancer': cancer_data[key]
        }

print(smokers_cancer_data)
For example, if you want to calculate the sum of the smoker and cancer values.
smokers_total = 0
cancer_total = 0
for (key, value) in smokers_cancer_data.items():
    smokers_total += value['smokers']
    cancer_total += value['cancer']
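From there, the correlation the assignment asks for can be computed with the usual Pearson formula; a minimal sketch over the combined dictionary (it assumes the float values produced by read_data above):

import math

n = len(smokers_cancer_data)
xs = [v['smokers'] for v in smokers_cancer_data.values()]
ys = [v['cancer'] for v in smokers_cancer_data.values()]
sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

numer = n * sum_xy - sum_x * sum_y
denom = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(numer / denom)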
This will return a list of all the countries that appear in both files, along with their data:
l3 = []
with open('smoking.txt', 'r') as f1, open('cancer.txt', 'r') as f2:
    l1, l2 = f1.readlines(), f2.readlines()
    for s1 in l1:
        for s2 in l2:
            if s1.split(',')[0] == s2.split(',')[0]:
                cty = s1.split(',')[0]
                smk = s1.split(',')[1].strip()
                cnr = s2.split(',')[1].strip()
                l3.append(f"{cty}: smoking: {smk}, cancer: {cnr}")
print(l3)
Output:
['America: smoking: 23.3, cancer: 73.3', 'Italy: smoking: 24.2, cancer: 43.9', 'Russia: smoking: 23.7, cancer: 60.8', 'Spain: smoking: 17, cancer: 77.9']
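As a design note, the nested loops rescan the cancer file for every smoking line; reading each file into a dict keyed by country and intersecting the key sets avoids that. A rough sketch under the same file-name assumptions (smoking.txt, cancer.txt):

common = {}
with open('smoking.txt') as f1, open('cancer.txt') as f2:
    next(f1), next(f2)  # skip the header lines
    smoking = dict(line.strip().split(',') for line in f1 if line.strip())
    cancer = dict(line.strip().split(',') for line in f2 if line.strip())
for country in smoking.keys() & cancer.keys():
    common[country] = {'smoking': float(smoking[country]), 'cancer': float(cancer[country])}
print(common)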

combine two for loops to fill the same dictionary

I am trying to get two different merchants from a list of dictionaries, with priority given to merchants that have prices. If no two different merchants with prices are found, merchant 1 or 2 is filled with data from the rest of the list; if the list is not enough, merchant 1 or 2 should be None.
I.e. the loop should return two merchants, preferring merchants that have prices; if that is not enough to fill merchants 1 and 2, it takes merchants with no prices; finally, if merchant 1 or 2 is still not set, it is filled with an empty value.
Here is the code I have so far. It does the job, but I believe it can be combined in a more Pythonic way.
import csv

dummy_list = []  # rows read from the CSV (this initialisation is missing in the original snippet)
with open('/home/timmy/testing/example/example/test.csv') as csvFile:
    reader = csv.DictReader(csvFile)
    for row in reader:
        dummy_list.append(row)

item = dict()
index = 1
for merchant in dummy_list:
    if merchant['price']:
        if index == 2:
            if item['merchant_1'] == merchant['name']:
                continue
        item['merchant_%d' % index] = merchant['name']
        item['merchant_%d_price' % index] = merchant['price']
        item['merchant_%d_stock' % index] = merchant['stock']
        item['merchant_%d_link' % index] = merchant['link']
        if index == 3:
            break
        index += 1
for merchant in dummy_list:
    if index == 3:
        break
    if index < 3:
        try:
            if item['merchant_1'] == merchant['name']:
                continue
        except KeyError:
            pass
        item['merchant_%d' % index] = merchant['name']
        item['merchant_%d_price' % index] = merchant['price']
        item['merchant_%d_stock' % index] = merchant['stock']
        item['merchant_%d_link' % index] = merchant['link']
        index += 1
while index < 3:
    item['merchant_%d' % index] = ''
    item['merchant_%d_price' % index] = ''
    item['merchant_%d_stock' % index] = ''
    item['merchant_%d_link' % index] = ''
    index += 1
print(item)
here is the contents of the csv file:
price,link,name,stock
,https://www.samsclub.com/sams/donut-shop-100-ct-k-cups/prod19381344.ip,Samsclub,
,https://www.costcobusinessdelivery.com/Green-Mountain-Original-Donut-Shop-Coffee%2C-Medium%2C-Keurig-K-Cup-Pods%2C-100-ct.product.100297848.html,Costcobusinessdelivery,
,https://www.costco.com/The-Original-Donut-Shop%2C-Medium-Roast%2C-K-Cup-Pods%2C-100-count.product.100381350.html,Costco,
57.99,https://www.target.com/p/the-original-donut-shop-regular-medium-roast-coffee-keurig-k-cup-pods-108ct/-/A-13649874,Target,Out of Stock
10.99,https://www.target.com/p/the-original-donut-shop-dark-roast-coffee-keurig-k-cup-pods-18ct/-/A-16185668,Target,In Stock
,https://www.homedepot.com/p/Keurig-Kcup-Pack-The-Original-Donut-Shop-Coffee-108-Count-110030/204077166,Homedepot,Undertermined
As you only want to keep at most 2 merchants, I would process the CSV only once, keeping a list of merchants with prices and a separate list of merchants without prices, and stopping as soon as 2 merchants with prices have been found.
After that loop, I would concatenate those two lists plus a list of two empty merchants and take the first 2 elements of the result. That is enough to guarantee your requirement of 2 distinct merchants, with priority given to those having prices. Finally, I would use that to fill the item dict.
Code would be:
import csv

with open('/home/timmy/testing/example/example/test.csv') as csvFile:
    reader = csv.DictReader(csvFile)
    names_price = set()
    names_no_price = set()
    merchant_price = []
    merchant_no_price = []
    item = {}
    for merchant in reader:
        if merchant['price']:
            if not merchant['name'] in names_price:
                names_price.add(merchant['name'])
                merchant_price.append(merchant)
                if len(merchant_price) == 2:
                    break
        else:
            if not merchant['name'] in names_no_price:
                names_no_price.add(merchant['name'])
                merchant_no_price.append(merchant)
    void = {k: '' for k in reader.fieldnames}
    merchant_list = (merchant_price + merchant_no_price + [void, void.copy()])[:2]
    for index, merchant in enumerate(merchant_list, 1):
        item['merchant_%d' % index] = merchant['name']
        item['merchant_%d_price' % index] = merchant['price']
        item['merchant_%d_stock' % index] = merchant['stock']
        item['merchant_%d_link' % index] = merchant['link']

Why is the output the last line of the data file test.txt?

f_user = []
twitter_dict = {}
twitter_test = {}
fields = "bid,uid,username"
a = fields.split(',')
with open('twitter_test.txt', 'r') as f:
    f_text = f.readlines()
for i in range(len(f_text)):
    f_text[i] = f_text[i].split(',')
    for j in range(len(a)):
        twitter_dict[a[j]] = f_text[i][j][1:-1]
    twitter_test[i] = twitter_dict
for i in range(len(twitter_test)):
    print twitter_test[i]['username']
for i in range(len(f_text)):
    ...
    twitter_test[i] = twitter_dict
Every single entry of twitter_test is a reference to the same dictionary, twitter_dict, so every entry ends up holding whatever was written last. The assignment doesn't store a copy of twitter_dict, which is what would make each entry different.
It's only natural, then, that a loop over twitter_test prints the same thing over and over.
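A minimal sketch of the fix, keeping the question's structure: build a fresh dictionary on each pass of the outer loop (or store a copy), so every entry of twitter_test points at a different object.

for i in range(len(f_text)):
    f_text[i] = f_text[i].split(',')
    row_dict = {}  # a new dictionary per input line
    for j in range(len(a)):
        row_dict[a[j]] = f_text[i][j][1:-1]
    twitter_test[i] = row_dict  # or: twitter_test[i] = dict(twitter_dict)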

Data Analysis using Python

I have 2 CSV files. One has city name, population and humidity; in the second, cities are mapped to states. I want to get the state-wise total population and average humidity. Can someone help? Here is an example:
CSV 1:
CityName,population,humidity
Austin,1000,20
Sanjose,2200,10
Sacramento,500,5
CSV 2:
State,city name
Ca,Sanjose
Ca,Sacramento
Texas,Austin
I would like to get this output (population sum and average humidity per state):
Ca,2700,7.5
Texas,1000,20
The above solution doesn't work because a dictionary will only hold one value per key. I gave up and finally used a loop. The code below is working; the input is included too.
CSV 1:
state_name,city_name
CA,sacramento
utah,saltlake
CA,san jose
Utah,provo
CA,sanfrancisco
TX,austin
TX,dallas
OR,portland
CSV 2:
city_name population humidity
sacramento 1000 1
saltlake 300 5
san jose 500 2
provo 100 7
sanfrancisco 700 3
austin 2000 4
dallas 2500 5
portland 300 6
import csv
from pandas import read_csv  # assumption: read_csv here is pandas' read_csv, given the .loc usage below

def mapping_within_dataframe(self, file1, file2, file3):
    self.csv1 = file1
    self.csv2 = file2
    self.outcsv = file3
    one_state_data = 0
    outfile = csv.writer(open(self.outcsv, 'w'), delimiter=',')
    state_city = read_csv(self.csv1)
    city_data = read_csv(self.csv2)
    all_state = list(set(state_city.state_name))
    for one_state in all_state:
        one_state_cities = list(state_city.loc[state_city.state_name == one_state, "city_name"])
        one_state_data = 0
        for one_city in one_state_cities:
            one_city_data = city_data.loc[city_data.city_name == one_city, "population"].sum()
            one_state_data = one_state_data + one_city_data
        print one_state, one_state_data
        outfile.writerows(whatever)
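For what it's worth, pandas can produce both aggregates (population sum and humidity mean per state) in one go with a merge and a groupby. A minimal sketch, assuming the two inputs above are saved as states.csv (comma-separated) and cities.csv (whitespace-separated); the file names are placeholders:

import pandas as pd

states = pd.read_csv('states.csv')              # columns: state_name, city_name
cities = pd.read_csv('cities.csv', sep=r'\s+')  # columns: city_name, population, humidity

merged = states.merge(cities, on='city_name')
summary = merged.groupby('state_name').agg(
    population=('population', 'sum'),
    humidity=('humidity', 'mean'),
)
print(summary)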
def output(file1, file2):
    f = lambda x: x.strip()  # strips newline and white space characters
    with open(file1) as cities:
        with open(file2) as states:
            states_dict = {}
            cities_dict = {}
            for line in states:
                line = line.split(',')
                states_dict[f(line[0])] = f(line[1])
            for line in cities:
                line = line.split(',')
                cities_dict[f(line[0])] = (int(f(line[1])), int(f(line[2])))
            for state, city in states_dict.iteritems():
                try:
                    print state, cities_dict[city]
                except KeyError:
                    pass

output(CSV1, CSV2)  # these are the names of the files
This gives the output you wanted. Just make sure the names of cities in both files are the same in terms of capitalization.

How can I count different values per same key with Python?

I have code which is able to give me a list like this:
Name id number week number
Piata 4 6
Mali 2 20,5
Goerge 5 4
Gooki 3 24,64,6
Mali 5 45,9
Piata 6 1
Piata 12 2,7,8,27,16 etc..
with the below code:
import csv
import datetime
from datetime import date
from collections import defaultdict

datedict = defaultdict(set)
with open('d:/info.csv', 'r') as csvfile:
    filereader = csv.reader(csvfile, 'excel')
    # passing the header
    read_header = False
    start_date = date(year=2009, month=1, day=1)
    # print((seen_date - start_date).days)
    tdic = {}
    for row in filereader:
        if not read_header:
            read_header = True
            continue
        # reading the rest rows
        name, id, firstseen = row[0], row[1], row[3]
        try:
            seen_date = datetime.datetime.strptime(firstseen, '%d/%m/%Y').date()
            deltadays = (seen_date - start_date).days
            deltaweeks = deltadays / 7 + 1
            key = name, id
            currentvalue = tdic.get(key, set())
            currentvalue.add(deltaweeks)
            tdic[key] = currentvalue
        except ValueError:
            print('Date value error')
            pass
Right now I want to convert this into a list that gives me the number of ids for each name and its week numbers, like the list below:
Name number of ids weeknumbers
Mali 2 20,5,45,9
Piata 3 1,6,2,7,8,27,16
Goerge 1 4
Gooki 1 24,64,6
Can anyone help me with writing the code for this part?
Since it looks like your csv file has headers (which you are currently ignoring) why not use a DictReader instead of the standard reader class? If you don't supply fieldnames the DictReader will assume the first line contains them, which will also save you from having to skip the first line in your loop.
This seems like a great opportunity to use defaultdict and Counter from the collections module.
import csv
import datetime
from datetime import date
from collections import defaultdict, Counter

datedict = defaultdict(set)
namecounter = Counter()

with open('d:/info.csv', 'r') as csvfile:
    filereader = csv.DictReader(csvfile)
    start_date = date(year=2009, month=1, day=1)
    for row in filereader:
        name, id, firstseen = row['name'], row['id'], row['firstseen']
        try:
            seen_date = datetime.datetime.strptime(firstseen, '%d/%m/%Y').date()
        except ValueError:
            print('Date value error')
            continue  # skip rows with a bad date instead of falling through
        deltadays = (seen_date - start_date).days
        deltaweeks = deltadays / 7 + 1
        datedict[name].add(deltaweeks)
        namecounter.update([name])  # Without putting name into a list, update would index each character
This assumes that (name, id) is unique. If this is not the case then you can use another defaultdict for namecounter. I've also moved the try-except statement so it is more explicit about what you are testing.
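As a rough sketch (using the datedict and namecounter built above, under the same one-row-per-(name, id) assumption), the requested summary could then be printed like this:

for name in datedict:
    weeks = ','.join(str(week) for week in sorted(datedict[name]))
    print('%s %d %s' % (name, namecounter[name], weeks))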
Given that:
tdict = {('Mali', 5): set([9, 45]), ('Gooki', 3): set([24, 64, 6]), ('Goerge', 5): set([4]), ('Mali', 2): set([20, 5]), ('Piata', 4): set([4]), ('Piata', 6): set([1]), ('Piata', 12): set([8, 16, 2, 27, 7])}
then to output the result above:
names = {}
for ((name, id), more_weeks) in tdict.items():
    (ids, weeks) = names.get(name, (0, set()))
    ids = ids + 1
    weeks = weeks.union(more_weeks)
    names[name] = (ids, weeks)

for (name, (id, weeks)) in names.items():
    print("%s, %s, %s" % (name, id, weeks))
