I am new to Python, and this is sample code I found online.
I have two big CSV files: one exported from the database and one containing the company metadata. I would like to compare specific columns in both tables and generate a new CSV file that shows me which records are missing from the metadata. Keep in mind that the two CSV files do not have the same number of columns, and I only want to analyse specific columns in each.
These are the two csv files:
csv1 copied from excel sheet
start_time                  end_time                    aitechid   hh_village  grpdetails1/farmername  grpdetails1/farmermobile
2016-11-26T14:01:47.329+03  2016-11-26T14:29:05.042+03  AI00001    2447        KahsuGebru              919115604
2016-11-26T19:34:42.159+03  2016-11-26T20:39:27.430+03  936891238  2473        Moto Aleka              914370833
2016-11-26T12:13:23.094+03  2016-11-26T14:25:19.178+03  914127382  2390        Hagos                   914039654
2016-11-30T14:31:28.223+03  2016-11-30T14:56:33.144+03  920784222  384         Mohammed Ali            923456788
2016-11-30T14:22:38.631+03  2016-11-30T15:06:44.199+03  912320358  378         Habtamu Nuru            913856087
2016-11-29T03:41:36.532+03  2016-11-29T16:33:12.632+03  914763134  2301        Are gaining Giday       0
2016-11-29T16:21:05.012+03  2016-11-29T16:37:27.934+03  914763134  2290        G                       912345678
2016-11-30T17:23:34.145+03  2016-11-30T18:00:32.142+03  914763134  2291        Haile tesfu             0
2016-11-30T20:37:54.657+03  2016-11-30T20:56:16.472+03  914763134  2300        Negative Abay           933082495
2016-11-30T21:00:22.063+03  2016-11-30T21:18:44.478+03  914763134  2291        Niguel Amare            914270455
csv2 copied from excel sheet
farmermobile
941807851
946741296
9
920212218
915
939555303
961579437
919961811
100004123
972635273
918166831
961579437
I have tried this code but I am not getting the expected output:
import csv

def get_key(row):
    return row["!Sample_title"], row["!Sample_geo_accession"]

def load_csv(filename):
    """Put csv data into a dict that maps title/geo to the complete row."""
    d = {}
    with open(filename) as f:
        for row in csv.DictReader(f, delimiter=","):
            key = get_key(row)
            assert key not in d
            d[key] = row
    return d

def diffs(old, new):
    yield from added_or_removed("ADDED", new.keys() - old.keys(), new)
    yield from added_or_removed("REMOVED", old.keys() - new.keys(), old)
    yield from changed(old, new)

def compare_row(key, old, new):
    i = -1
    for i, line in enumerate(diffs(old, new)):
        if not i:
            print("/".join(key))
        print("    " + line)
    if i >= 0:
        print()

def added_or_removed(state, keys, d):
    items = sorted((key, d[key]) for key in keys)
    for key, value in items:
        yield "{:10}: {:30} | {:30}".format(state, key, value)

def changed(old, new):
    common_columns = old.keys() & new.keys()
    for column in sorted(common_columns):
        oldvalue = old[column]
        newvalue = new[column]
        if oldvalue != newvalue:
            yield "{:10}: {:30} | {:30} | {:30}".format(
                "CHANGED",
                column,
                oldvalue.ljust(30),
                newvalue.ljust(30))

if __name__ == "__main__":
    oldcsv = load_csv("/media/dmogaka/DATA/week4/combine201709.csv")
    newcsv = load_csv("/media/dmogaka/DATA/week4/combinedmissingrecords.csv")
    # title/geo pairs that occur in both files:
    common = oldcsv.keys() & newcsv.keys()
    for key in sorted(common):
        compare_row(key, oldcsv[key], newcsv[key])
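For the farmer-mobile comparison described above, a much smaller sketch may be closer to what is needed: load just the key column from each file as a set, take the difference, and write it out. The filenames and the exact column headers here are assumptions based on the data shown; adjust them to match your real files.

```python
import csv

def load_column(filename, column):
    """Return the set of values found in one column of a CSV file."""
    with open(filename, newline="") as f:
        return {row[column].strip() for row in csv.DictReader(f)}

def write_missing(db_file, meta_file, out_file):
    """Write mobile numbers present in the database CSV but absent
    from the metadata CSV to out_file, one per row.
    Column names are assumptions taken from the sample data above."""
    db_numbers = load_column(db_file, "grpdetails1/farmermobile")
    meta_numbers = load_column(meta_file, "farmermobile")
    with open(out_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["farmermobile"])
        for number in sorted(db_numbers - meta_numbers):
            writer.writerow([number])
```

Because the comparison works on sets, the two files can have completely different column layouts; only the one column you name from each file matters.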
I am attempting to create a data dictionary that does not include all of the columns in the source csv file. I have managed to create one that does include all the columns, but want to exclude some of them.
The code I am using is this:
input_file = csv.DictReader(open(DATA_FILE))
fieldnames = input_file.fieldnames
data_large_countries = {fn: [] for fn in fieldnames}
for line in input_file:
    for k, v in line.items():
        if v == '':
            v = 0
        try:
            data_large_countries[k].append(int(v))
        except ValueError:
            try:
                data_large_countries[k].append(float(v))
            except ValueError:
                data_large_countries[k].append(v)
for k, v in data_large_countries.items():
    data_large_countries[k] = np.array(v)
print(data_large_countries.keys())
with the output:
dict_keys(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases', 'new_cases_smoothed', 'total_deaths', 'new_deaths', 'new_deaths_smoothed', 'total_cases_per_million', 'new_cases_per_million', 'new_cases_smoothed_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients', 'hosp_patients_per_million', 'weekly_icu_admissions', 'weekly_icu_admissions_per_million', 'weekly_hosp_admissions', 'weekly_hosp_admissions_per_million', 'total_tests', 'new_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'total_boosters', 'new_vaccinations', 'new_vaccinations_smoothed', 'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred', 'new_vaccinations_smoothed_per_million', 'new_people_vaccinated_smoothed', 'new_people_vaccinated_smoothed_per_hundred', 'stringency_index', 'population', 'population_density', 'median_age', 'aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty', 'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers', 'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand', 'life_expectancy', 'human_development_index', 'excess_mortality_cumulative_absolute', 'excess_mortality_cumulative', 'excess_mortality', 'excess_mortality_cumulative_per_million'])
I only need 6 of these keys in my data dictionary. How do I amend my code to get only the keys I want?
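One way (a sketch; the column names in `WANTED` below are placeholders for the six you actually need) is to restrict the dict comprehension to a wanted set, so only those columns are ever collected:

```python
import csv

# Placeholder subset of columns; replace with the six keys you need.
WANTED = {"iso_code", "continent", "location", "date", "total_cases", "new_deaths"}

def read_columns(filename, wanted):
    """Build a column -> list-of-values dict restricted to `wanted` columns."""
    with open(filename, newline="") as f:
        reader = csv.DictReader(f)
        # Only create entries for columns we actually want.
        data = {fn: [] for fn in reader.fieldnames if fn in wanted}
        for row in reader:
            for k in data:
                data[k].append(row[k])
        return data
```

The inner loop then iterates over the six kept keys instead of all sixty-odd columns, and the numeric conversion / `np.array` steps from your existing code can be applied to the smaller dict unchanged.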
I have to write a program that correlates smoking with lung cancer risk. The data is in two files.
My code currently pairs up whatever values happen to be on the same line (e.g. America,23.3 with Spain,77.9 and Italy,24.2 with Russia,60.8).
How do I modify my code so that it matches the numbers by country and leaves out the countries that occur in only one file? (It shouldn't use Germany, France, China, or Korea, because each appears in only one file.)
Thank you so much for your help in advance. :)
smoking file:
Country, Percent Cigarette Smokers Data
America,23.3
Italy,24.2
Russia,23.7
France,14.9
England,17.9
Spain,17
Germany,21.7
second file:
Cases Lung Cancer per 100000
Spain,77.9
Russia,60.8
Korea,61.3
America,73.3
China,66.8
Vietnam,64.5
Italy,43.9
and my code:
import math

def readFiles(smoking_datafile, cancer_datafile):
    '''
    Reads the data from the provided file objects smoking_datafile
    and cancer_datafile. Returns a list of the data read from each
    in a tuple of the form (smoking_data, cancer_data).
    '''
    # init
    smoking_data = []
    cancer_data = []
    empty_str = ''
    # read past file headers
    smoking_datafile.readline()
    cancer_datafile.readline()
    # read data files
    eof = False
    while not eof:
        # read line of data from each file
        s_line = smoking_datafile.readline()
        c_line = cancer_datafile.readline()
        # check if at end-of-file of both files
        if s_line == empty_str and c_line == empty_str:
            eof = True
        # check if end of smoking data file only
        elif s_line == empty_str:
            raise OSError('Unexpected end-of-file for smoking data file')
        # check if at end of cancer data file only
        elif c_line == empty_str:
            raise OSError('Unexpected end-of-file for cancer data file')
        # append line of data to each list
        else:
            smoking_data.append(s_line.strip().split(','))
            cancer_data.append(c_line.strip().split(','))
    # return list of data from each file
    return (smoking_data, cancer_data)

def calculateCorrelation(smoking_data, cancer_data):
    '''
    Calculates and returns the correlation value for the data
    provided in lists smoking_data and cancer_data
    '''
    # init
    sum_smoking_vals = sum_cancer_vals = 0
    sum_smoking_sqrd = sum_cancer_sqrd = 0
    sum_products = 0
    # calculate intermediate correlation values
    num_values = len(smoking_data)
    for k in range(0, num_values):
        sum_smoking_vals = sum_smoking_vals + float(smoking_data[k][1])
        sum_cancer_vals = sum_cancer_vals + float(cancer_data[k][1])
        sum_smoking_sqrd = sum_smoking_sqrd + \
            float(smoking_data[k][1]) ** 2
        sum_cancer_sqrd = sum_cancer_sqrd + \
            float(cancer_data[k][1]) ** 2
        sum_products = sum_products + float(smoking_data[k][1]) * \
            float(cancer_data[k][1])
    # calculate and display correlation value
    numer = (num_values * sum_products) - \
        (sum_smoking_vals * sum_cancer_vals)
    denom = math.sqrt(abs(
        ((num_values * sum_smoking_sqrd) - (sum_smoking_vals ** 2)) *
        ((num_values * sum_cancer_sqrd) - (sum_cancer_vals ** 2))
    ))
    return numer / denom
Let's just focus on getting the data into a format that is easy to work with. The code below will get you a dictionary of the form ...
smokers_cancer_data = {
    'America': {
        'smokers': 23.3,
        'cancer': 73.3
    },
    'Italy': {
        'smokers': 24.2,
        'cancer': 43.9
    },
    ...
}
Once you have this you can get any values you need and perform your calculations. See the code below.
def read_data(filename: str) -> dict:
    with open(filename, 'r') as file:
        next(file)  # Skip the header
        data = dict()
        for line in file:
            cleaned_line = line.rstrip()
            # Skip blank lines
            if cleaned_line:
                data_item = cleaned_line.split(',')
                data[data_item[0]] = float(data_item[1])
    return data

# Load data into python dictionaries
smokers_data = read_data('smokersData.txt')
cancer_data = read_data('lungCancerData.txt')

# Build one dictionary that is easy to work with
smokers_cancer_data = dict()
for (key, value) in smokers_data.items():
    if key in cancer_data:
        smokers_cancer_data[key] = {
            'smokers': smokers_data[key],
            'cancer': cancer_data[key]
        }

print(smokers_cancer_data)
For example, if you want to calculate the sum of the smoker and cancer values:

smokers_total = 0
cancer_total = 0
for (key, value) in smokers_cancer_data.items():
    smokers_total += value['smokers']
    cancer_total += value['cancer']
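To finish the job, the correlation itself can be computed from the matched pairs. Here is a minimal Pearson sketch; the commented line showing how to build `pairs` assumes the `smokers_cancer_data` dictionary built above.

```python
import math

def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    syy = sum(y * y for _, y in pairs)
    sxy = sum(x * y for x, y in pairs)
    numer = n * sxy - sx * sy
    denom = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numer / denom

# pairs = [(v['smokers'], v['cancer']) for v in smokers_cancer_data.values()]
# print(pearson(pairs))
```

Because only countries present in both files made it into the combined dictionary, the unmatched countries (Germany, France, China, Korea) are automatically excluded from the calculation.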
This builds a list of all the countries that appear in both files, along with their data:

l3 = []
with open('smoking.txt', 'r') as f1, open('cancer.txt', 'r') as f2:
    l1, l2 = f1.readlines(), f2.readlines()
for s1 in l1:
    for s2 in l2:
        if s1.split(',')[0] == s2.split(',')[0]:
            cty = s1.split(',')[0]
            smk = s1.split(',')[1].strip()
            cnr = s2.split(',')[1].strip()
            l3.append(f"{cty}: smoking: {smk}, cancer: {cnr}")
print(l3)
Output (matched countries come out in the order they appear in the smoking file):
['America: smoking: 23.3, cancer: 73.3', 'Italy: smoking: 24.2, cancer: 43.9', 'Russia: smoking: 23.7, cancer: 60.8', 'Spain: smoking: 17, cancer: 77.9']
I am trying to find the mean of an array created from data in a CSV file using Python. Data in the array is included between a range of values, so it does not include all the values in the column of the CSV. My current code that creates the array is shown below. Several arrays have been created, but I only need to find the mean of the array called "T07s". I am consistently getting the error "cannot perform reduce with flexible type" when using the function np.mean(T07s)
import csv
import numpy as np

class dataPoint:
    def __init__(self, V, T07, T19, T27, Time):
        self.V = V
        self.T07 = T07
        self.T19 = T19
        self.T27 = T27
        self.Time = Time

dataPoints = []
with open("data_final.csv") as csvfile:
    reader = csv.reader(csvfile)
    next(reader)
    for row in reader:
        if 229 <= float(row[2]) <= 231:
            temp = dataPoint(row[1], row[12], row[24], row[32], row[0].split(" ")[1])
            dataPoints.append(temp)
T07s = np.array([x.T07 for x in dataPoints])
The data included in T07s is shown below:
for x in T07s:
    print(x)
37.2
539
435.6
717.4
587
757.9
861.8
1024.2
325
117.9
136.3
167.8
809
405.3
405.1
112.7
1317.1
1731.8
1080.2
1208.6
1212.6
1363.8
1715.3
2376.4
2563.9
2998.4
2934.7
2862.4
390.8
2332.2
2121
2237.6
2334.1
2082.2
1892.1
1888.8
1960.6
1329.1
1657.2
2042.4
1417.5
977.3
1442.8
561.2
500.3
413.3
324.1
693.7
750
865.7
434.2
635.2
815.7
171.4
829.3
815.3
774.8
1411.6
1685.1
1345.1
1193.2
1674.9
1636.4
1389.8
753.3
1102.8
908.3
1223.2
1199.4
1040.7
1040.9
824.7
620
795.7
810.4
378.8
643.2
441.8
682.8
417.8
515.6
2354.7
1938.8
1512.4
1933.5
1739.8
2281.9
1997.5
2833.4
182.8
202.4
217.3
234.2
741.9
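The "cannot perform reduce with flexible type" error means the array holds strings, not numbers: csv.reader returns every field as text, so `x.T07` is a string and NumPy builds a string array. Casting when the array is constructed fixes `np.mean`. A minimal sketch (the class and sample values stand in for the question's data):

```python
import numpy as np

class DataPoint:
    """Stripped-down stand-in for the question's dataPoint class."""
    def __init__(self, T07):
        self.T07 = T07

# csv.reader yields strings, so T07 arrives as text like "37.2".
dataPoints = [DataPoint("37.2"), DataPoint("539"), DataPoint("435.6")]

# dtype=float converts every element; float(x.T07) per element works too.
T07s = np.array([x.T07 for x in dataPoints], dtype=float)
print(T07s.mean())
```

With the string array, `np.mean` has no numeric reduce to perform, which is exactly the error reported; after the cast the mean is an ordinary float.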
A simpler solution with pandas (note the parentheses around each comparison: & binds more tightly than >= and <=):

import pandas as pd

data = pd.read_csv('data_final.csv')
data_filtered = data[(data.iloc[:, 2] >= 229) & (data.iloc[:, 2] <= 231)]
print(data_filtered['T07'].mean())
I have 2 CSV files. One has city name, population and humidity. In the second, cities are mapped to states. I want to get the state-wise total population and average humidity. Can someone help? Here is an example:
CSV 1:
CityName,population,humidity
Austin,1000,20
Sanjose,2200,10
Sacramento,500,5
CSV 2:
State,city name
Ca,Sanjose
Ca,Sacramento
Texas,Austin
I would like to get this output (population summed and humidity averaged per state):
Ca,2700,7.5
Texas,1000,20
The above solution doesn't work because a dictionary can hold only one value per key. I gave up and finally used a loop. The code below is working; the input is shown as well.
csv1
state_name,city_name
CA,sacramento
utah,saltlake
CA,san jose
Utah,provo
CA,sanfrancisco
TX,austin
TX,dallas
OR,portland
CSV2
city_name population humidity
sacramento 1000 1
saltlake 300 5
san jose 500 2
provo 100 7
sanfrancisco 700 3
austin 2000 4
dallas 2500 5
portland 300 6
import csv
from pandas import read_csv  # read_csv/.loc usage below implies pandas

def mapping_within_dataframe(self, file1, file2, file3):
    self.csv1 = file1
    self.csv2 = file2
    self.outcsv = file3
    one_state_data = 0
    outfile = csv.writer(open(self.outcsv, 'w'), delimiter=',')
    state_city = read_csv(self.csv1)
    city_data = read_csv(self.csv2)
    all_state = list(set(state_city.state_name))
    for one_state in all_state:
        one_state_cities = list(state_city.loc[state_city.state_name == one_state, "city_name"])
        one_state_data = 0
        for one_city in one_state_cities:
            one_city_data = city_data.loc[city_data.city_name == one_city, "population"].sum()
            one_state_data = one_state_data + one_city_data
        print(one_state, one_state_data)
        outfile.writerows(whatever)
def output(file1, file2):
    f = lambda x: x.strip()  # strips newline and whitespace characters
    with open(file1) as cities:
        with open(file2) as states:
            states_dict = {}
            cities_dict = {}
            for line in states:
                line = line.split(',')
                states_dict[f(line[0])] = f(line[1])
            for line in cities:
                line = line.split(',')
                cities_dict[f(line[0])] = (int(f(line[1])), int(f(line[2])))
            for state, city in states_dict.items():
                try:
                    print(state, cities_dict[city])
                except KeyError:
                    pass

output(CSV1, CSV2)  # these are the names of the files
This gives the output you wanted. Just make sure the names of cities in both files are the same in terms of capitalization.
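Alternatively, the requested state totals fall out of a pandas merge plus groupby. The sketch below assumes the column headers shown in the question (`CityName`, `population`, `humidity` in one file; `State`, `city name` in the other); file names are up to you.

```python
import pandas as pd

def state_summary(cities: pd.DataFrame, states: pd.DataFrame) -> pd.DataFrame:
    """Sum population and average humidity per state.

    cities: columns CityName, population, humidity
    states: columns State, "city name"
    """
    # Inner merge keeps only cities present in both tables.
    merged = states.merge(cities, left_on="city name", right_on="CityName")
    return merged.groupby("State").agg(
        population=("population", "sum"),
        humidity=("humidity", "mean"),
    )

# Usage (file names are assumptions):
# print(state_summary(pd.read_csv("csv1.csv"), pd.read_csv("csv2.csv")))
```

As in the hand-rolled version, matching is exact, so city names must agree in spelling and capitalization across the two files.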
I have code which gives me a list like this:

Name    id number   week number
Piata   4           6
Mali    2           20,5
Goerge  5           4
Gooki   3           24,64,6
Mali    5           45,9
Piata   6           1
Piata   12          2,7,8,27,16
etc..
with the below code:
import csv
import datetime
from datetime import date
from collections import defaultdict

datedict = defaultdict(set)
with open('d:/info.csv', 'r') as csvfile:
    filereader = csv.reader(csvfile, 'excel')
    # passing the header
    read_header = False
    start_date = date(year=2009, month=1, day=1)
    # print((seen_date - start_date).days)
    tdic = {}
    for row in filereader:
        if not read_header:
            read_header = True
            continue
        # reading the rest of the rows
        name, id, firstseen = row[0], row[1], row[3]
        try:
            seen_date = datetime.datetime.strptime(firstseen, '%d/%m/%Y').date()
            deltadays = (seen_date - start_date).days
            deltaweeks = deltadays // 7 + 1
            key = name, id
            currentvalue = tdic.get(key, set())
            currentvalue.add(deltaweeks)
            tdic[key] = currentvalue
        except ValueError:
            print('Date value error')
Now I want to convert this into a list that gives me the number of ids for each name together with all of its week numbers, like below:

Name    number of ids   weeknumbers
Mali    2               20,5,45,9
Piata   3               1,6,2,7,8,27,16
Goerge  1               4
Gooki   1               24,64,6

Can anyone help me with writing the code for this part?
Since it looks like your csv file has headers (which you are currently ignoring) why not use a DictReader instead of the standard reader class? If you don't supply fieldnames the DictReader will assume the first line contains them, which will also save you from having to skip the first line in your loop.
This seems like a great opportunity to use defaultdict and Counter from the collections module.
import csv
import datetime
from datetime import date
from collections import defaultdict, Counter

datedict = defaultdict(set)
namecounter = Counter()

with open('d:/info.csv', 'r') as csvfile:
    filereader = csv.DictReader(csvfile)
    start_date = date(year=2009, month=1, day=1)
    for row in filereader:
        name, id, firstseen = row['name'], row['id'], row['firstseen']
        try:
            seen_date = datetime.datetime.strptime(firstseen, '%d/%m/%Y').date()
        except ValueError:
            print('Date value error')
            continue  # skip rows with unparseable dates
        deltadays = (seen_date - start_date).days
        deltaweeks = deltadays // 7 + 1
        datedict[name].add(deltaweeks)
        namecounter.update([name])  # without putting name in a list, update would count each character
This assumes that (name, id) is unique. If this is not the case then you can use another defaultdict for namecounter. I've also moved the try-except statement so it is more explicit about what you are testing.
Given that:
tdict = {('Mali', 5): set([9, 45]), ('Gooki', 3): set([24, 64, 6]), ('Goerge', 5): set([4]), ('Mali', 2): set([20, 5]), ('Piata', 4): set([4]), ('Piata', 6): set([1]), ('Piata', 12): set([8, 16, 2, 27, 7])}
then to output the result above:
names = {}
for ((name, id), more_weeks) in tdict.items():
    (ids, weeks) = names.get(name, (0, set()))
    ids = ids + 1
    weeks = weeks.union(more_weeks)
    names[name] = (ids, weeks)

for (name, (ids, weeks)) in names.items():
    print("%s, %s, %s" % (name, ids, weeks))