I need to create two different CSV files from two MySQL databases --- Done!
Then take those files and compare them --- Done!
Then take the output and format it to display the Keys and Values of both of the files that are the Same, Different, and Missing in those fields and display them in a Tabular format --- Banging my head against a wall!!!
My code is below; the output is all on one line and separated in a weird way:
import easygui
import pandas as pd

def compare_dicts(dict1, dict2):
    match = []
    same = []
    missing = []
    for key1, value1 in dict1.items():
        if key1 in dict2:
            if dict2[key1] == value1:
                same.append((key1, value1))
            else:
                match.append((key1, value1, dict2[key1]))
        else:
            missing.append((key1, value1))
    for key2, value2 in dict2.items():
        if key2 not in dict1:
            missing.append((key2, value2))
    return match, same, missing

def save_to_csv(data, file_path):
    df = pd.DataFrame(data)
    df.to_csv(file_path, index=False)

file_path1 = easygui.fileopenbox(default='*.csv')
file_path2 = easygui.fileopenbox(default='*.csv')
df1 = pd.read_csv(file_path1)
df2 = pd.read_csv(file_path2)
dict1 = df1.to_dict()
dict2 = df2.to_dict()
match, same, missing = compare_dicts(dict1, dict2)
save_path = easygui.filesavebox(default='compare_results.csv')
save_to_csv(match, save_path)
print("Match:", match)
print("Same:", same)
print("Missing:", missing)
The first file reads as follows:
first_name last_name phone_number
Paquito Asch 252-383-5824
Ulick Valentinetti 424-573-9867
Loydie Snoday 324-342-5445
Jeana Crigin 155-132-5153
Chelsy Faraker 232-784-3138
Abbye Fulle 312-910-1686
Lanna Island 889-685-3615
Erny Geering 978-743-7561
Bruce Arnal 339-742-1475
Ralina Dohmann 255-878-3225
The second:
first_name last_name phone_number
Paquito Asch 252-383-5823
Ulick Valentinetti 424-573-9867
Loydie Snoday 324-342-5445
Jeana Crigin 155-132-5153
Chelsy Faraker
Abbye Fulle 312-910-1634
Lanna Island 889-685-3615
Erny Geering 978-743-7542
Bruce Arnal
Ralina Dohmann 255-878-3225
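Incidentally, df.to_dict() is column-oriented by default ({column: {row_index: value}}), so compare_dicts ends up comparing whole columns rather than individual records, which is why the printout comes out as one strange line. A minimal row-wise sketch, assuming first_name and last_name together identify a record (the file names here are placeholders):

import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
keys = ['first_name', 'last_name']

# An outer merge keeps every key from both files and labels its origin in '_merge'.
merged = df1.merge(df2, on=keys, how='outer', suffixes=('_1', '_2'), indicator=True)

in_both = merged['_merge'] == 'both'
same = merged[in_both & (merged['phone_number_1'] == merged['phone_number_2'])]
different = merged[in_both & (merged['phone_number_1'] != merged['phone_number_2'])]
missing = merged[~in_both]  # names present in only one file

print('Same:\n', same[keys + ['phone_number_1']].to_string(index=False))
print('Different:\n', different.to_string(index=False))
print('Missing:\n', missing.to_string(index=False))

Note that blank phone numbers are read in as NaN and therefore land under Different rather than Missing; adjust that split to taste.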
I am attempting to create a data dictionary that does not include all of the columns in the source csv file. I have managed to create one that does include all the columns, but want to exclude some of them.
The code I am using is this:
import csv
import numpy as np

input_file = csv.DictReader(open(DATA_FILE))
fieldnames = input_file.fieldnames
data_large_countries = {fn: [] for fn in fieldnames}

for line in input_file:
    for k, v in line.items():
        if v == '':
            v = 0
        try:
            data_large_countries[k].append(int(v))
        except ValueError:
            try:
                data_large_countries[k].append(float(v))
            except ValueError:
                data_large_countries[k].append(v)

for k, v in data_large_countries.items():
    data_large_countries[k] = np.array(v)

print(data_large_countries.keys())
with the output:
dict_keys(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases', 'new_cases_smoothed', 'total_deaths', 'new_deaths', 'new_deaths_smoothed', 'total_cases_per_million', 'new_cases_per_million', 'new_cases_smoothed_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients', 'hosp_patients_per_million', 'weekly_icu_admissions', 'weekly_icu_admissions_per_million', 'weekly_hosp_admissions', 'weekly_hosp_admissions_per_million', 'total_tests', 'new_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'total_boosters', 'new_vaccinations', 'new_vaccinations_smoothed', 'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred', 'new_vaccinations_smoothed_per_million', 'new_people_vaccinated_smoothed', 'new_people_vaccinated_smoothed_per_hundred', 'stringency_index', 'population', 'population_density', 'median_age', 'aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty', 'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers', 'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand', 'life_expectancy', 'human_development_index', 'excess_mortality_cumulative_absolute', 'excess_mortality_cumulative', 'excess_mortality', 'excess_mortality_cumulative_per_million'])
I only need 6 of these keys in my data dictionary. How do I amend my code to get only the keys I want?
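One way to do it, sticking close to the structure of your loop: build the comprehension only from the columns you want, and iterate over those keys instead of every item in the row. The six names in WANTED below are placeholders; substitute the ones you actually need.

import csv
import numpy as np

# Placeholder column names; replace with the six you want to keep.
WANTED = {'iso_code', 'continent', 'location', 'date', 'total_cases', 'population'}

input_file = csv.DictReader(open(DATA_FILE))
data_large_countries = {fn: [] for fn in input_file.fieldnames if fn in WANTED}

for line in input_file:
    for k in data_large_countries:  # only the wanted columns
        v = line[k]
        if v == '':
            v = 0
        try:
            data_large_countries[k].append(int(v))
        except ValueError:
            try:
                data_large_countries[k].append(float(v))
            except ValueError:
                data_large_countries[k].append(v)

for k, v in data_large_countries.items():
    data_large_countries[k] = np.array(v)

print(data_large_countries.keys())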
I'm having some performance issues with the code below, mostly because of the apply function I'm using on a huge dataframe. I want to update the semi_dict dictionary with some other data that I'm calculating with some functions. Is there any way to improve this?
from functools import partial

def my_function_1(semi_dict, row):
    # do some calculation/other stuff based on the row data and append it to the dictionary
    random_dict = dict(data=some_data, more_data=more_data)
    semi_dict["data"].append(random_dict)

def my_function_2(semi_dict, row):
    # do some calculation/other stuff based on the row data and append it to the dictionary
    random_dict = dict(data=some_data, more_data=more_data)
    semi_dict["data2"].append(random_dict)

dictionary_list = []
for v in values:
    df_1_rows = df_1_rows[(df_1_rows.values == v)]
    df_2_rows = df_2_rows[(df_2_rows.values == v)]
    semi_dict = dict(value=v, data=[], data2=[])
    function = partial(my_function_1, semi_dict)
    function_2 = partial(my_function_2, semi_dict)
    df_1_rows.apply(lambda row: function(row), axis=1)
    df_2_rows.apply(lambda row: function_2(row), axis=1)
    dictionary_list.append(semi_dict)
This answer uses dictionary merge from How to merge dictionaries of dictionaries?, but depending on your use case, you might not need it in the end:
import pandas as pd
import random

len_df = 10
row_values = list("ABCD")
extra_col_values = list("12345")

df_1 = pd.DataFrame([[random.choice(row_values), random.choice(extra_col_values)] for _ in range(len_df)], columns=['col1', 'extra1'])
df_2 = pd.DataFrame([[random.choice(row_values), random.choice(extra_col_values)] for _ in range(len_df)], columns=['col2', 'extra2'])

def make_dict(df):
    # some calculations on the df
    return {
        'data': df.head(1).values.tolist(),
    }

def make_dict_2(df):
    # some calculations on the df
    return {
        'data_2': df.head(1).values.tolist(),
    }

def merge(a, b, path=None):
    "merges b into a, taken from https://stackoverflow.com/questions/7204805/how-to-merge-dictionaries-of-dictionaries"
    if path is None:
        path = []
    for key in b:
        if key in a:
            if isinstance(a[key], dict) and isinstance(b[key], dict):
                merge(a[key], b[key], path + [str(key)])
            elif a[key] == b[key]:
                pass  # same leaf value
            else:
                raise Exception('Conflict at %s' % '.'.join(path + [str(key)]))
        else:
            a[key] = b[key]
    return a

dict1 = df_1.groupby('col1').apply(make_dict).to_dict()
dict2 = df_2.groupby('col2').apply(make_dict_2).to_dict()
result = merge(dict1, dict2)
result
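For reference: groupby('col1').apply(make_dict) returns a Series indexed by the group key, so .to_dict() yields something like {'A': {'data': [...]}, 'B': {'data': [...]}}, and merge then folds the 'data_2' entries from dict2 in under the same keys.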
I have an Excel file containing data like in that picture.
"doc_id" refers to the document ID where the text comes from. In our example, we have 4 documents (doc_id from 0 to 3).
I want to get the values of "text" in the first 5 pages of each document OR before Table of Contents.
With our example, it should return:
"A0","A1","B1","A3"
(Note that we don't want B0, C0, D0, C1 because they occur after Table of Contents of that document, and we don't want A2 and B3 because they have page_id >= 5)
I don't understand how we can create a condition to "break" the iteration in each doc_id once we find Table of Contents or page_id == 5, and move on to the next doc_id.
I tried like this and I'm stuck.
import pandas as pd

data = pd.read_csv('book2.csv')
test_data = data['text']
doc_id = data['doc_id']
page_id = data['page_id']

def TOC(text):
    return 'content' in text

def new_doc():
    if i == 0:
        return False
    elif doc_id[i] != doc_id[i-1]:
        return True

i = 0
while i < len(test_data):
    stop = 0
    while stop == 0 and not new_doc():
        if TOC(test_data[i]):
            print('toc')
            stop = 1
        else:
            print(doc_id[i], test_data[i])
        i += 1
Appreciate your help. Thanks!
See if this helps
a = df[df.page_id < 5]

def tex(x):
    try:
        if x.any():
            i = x.index[x.str.contains('Table')][0]
    except IndexError:
        i = x.index[-1] + 1
    return i

a[a.index < a.groupby('doc_id')['text'].transform(tex)]['text'].to_list()
Output
['A0', 'A1', 'B1', 'A3']
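In case the one-liner is opaque: tex receives the text column of one doc_id group and returns a single cut-off index, either the index of the first row containing 'Table' or, failing that, one past the group's last index. transform broadcasts that scalar back over the group's rows, so the a.index < ... mask keeps only the rows before each document's Table of Contents.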
You have to iterate through the whole document:
import pandas as pd

data = pd.read_csv('book2.csv')[['page_id', 'doc_id', 'text']]

curr_doc_id = -1
before_toc = False

for i, row in data.iterrows():
    if curr_doc_id < row.doc_id:
        curr_doc_id = row.doc_id
        before_toc = True
    if row.text == "Table of Contents":
        before_toc = False
    if before_toc and row.page_id < 5:
        print(row)
*code wasn't tested
I am new to Python, and this is sample code I got online.
I have two big data CSV files, one from the database and another from the company metadata. I would like to compare specific columns in both tables and generate a new CSV file that shows me where the missing records in the metadata are. Keep in mind that the two CSV files do not have the same number of columns, and I want to analyse specific columns in both files.
These are the two csv files:
csv1 copied from an Excel sheet:

start_time                  end_time                    aitechid   hh_village  grpdetails1/farmername  grpdetails1/farmermobile
2016-11-26T14:01:47.329+03  2016-11-26T14:29:05.042+03  AI00001    2447        KahsuGebru              919115604
2016-11-26T19:34:42.159+03  2016-11-26T20:39:27.430+03  936891238  2473        Moto Aleka              914370833
2016-11-26T12:13:23.094+03  2016-11-26T14:25:19.178+03  914127382  2390        Hagos                   914039654
2016-11-30T14:31:28.223+03  2016-11-30T14:56:33.144+03  920784222  384         Mohammed Ali            923456788
2016-11-30T14:22:38.631+03  2016-11-30T15:06:44.199+03  912320358  378         Habtamu Nuru            913856087
2016-11-29T03:41:36.532+03  2016-11-29T16:33:12.632+03  914763134  2301        Are gaining Giday       0
2016-11-29T16:21:05.012+03  2016-11-29T16:37:27.934+03  914763134  2290        G                       912345678
2016-11-30T17:23:34.145+03  2016-11-30T18:00:32.142+03  914763134  2291        Haile tesfu             0
2016-11-30T20:37:54.657+03  2016-11-30T20:56:16.472+03  914763134  2300        Negative Abay           933082495
2016-11-30T21:00:22.063+03  2016-11-30T21:18:44.478+03  914763134  2291        Niguel Amare            914270455
csv2 copied from an Excel sheet:

farmermobile
941807851
946741296
9
920212218
915
939555303
961579437
919961811
100004123
972635273
918166831
961579437
I have tried this code but I am not getting the expected output:
import csv

def get_key(row):
    return row["!Sample_title"], row["!Sample_geo_accession"]

def load_csv(filename):
    """Put csv data into a dict that maps title/geo to the complete row."""
    d = {}
    with open(filename) as f:
        for row in csv.DictReader(f, delimiter=","):
            key = get_key(row)
            assert key not in d
            d[key] = row
    return d

def diffs(old, new):
    yield from added_or_removed("ADDED", new.keys() - old.keys(), new)
    yield from added_or_removed("REMOVED", old.keys() - new.keys(), old)
    yield from changed(old, new)

def compare_row(key, old, new):
    i = -1
    for i, line in enumerate(diffs(old, new)):
        if not i:
            print("/".join(key))
        print(" " + line)
    if i >= 0:
        print()

def added_or_removed(state, keys, d):
    items = sorted((key, d[key]) for key in keys)
    for key, value in items:
        yield "{:10}: {:30} | {:30}".format(state, key, value)

def changed(old, new):
    common_columns = old.keys() & new.keys()
    for column in sorted(common_columns):
        oldvalue = old[column]
        newvalue = new[column]
        if oldvalue != newvalue:
            yield "{:10}: {:30} | {:30} | {:30}".format(
                "CHANGED",
                column,
                oldvalue.ljust(30),
                newvalue.ljust(30))

if __name__ == "__main__":
    oldcsv = load_csv("/media/dmogaka/DATA/week4/combine201709.csv")
    newcsv = load_csv("/media/dmogaka/DATA/week4/combinedmissingrecords.csv")
    # title/geo pairs that occur in both files:
    common = oldcsv.keys() & newcsv.keys()
    for key in sorted(common):
        compare_row(key, oldcsv[key], newcsv[key])
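If the goal is just the farmermobile values from the database export that have no match in the metadata file, a short pandas sketch may be closer to what you need (the file names and output path are placeholders; the column names are taken from the headers pasted above):

import pandas as pd

db = pd.read_csv('combine201709.csv')             # database export (csv1)
meta = pd.read_csv('combinedmissingrecords.csv')  # company metadata (csv2)

# Compare as strings so differently typed numbers still match up.
db_mobiles = db['grpdetails1/farmermobile'].astype(str)
known = set(meta['farmermobile'].astype(str))

missing = db[~db_mobiles.isin(known)]
missing.to_csv('missing_in_metadata.csv', index=False)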
I have code which gives me a list like this:

Name    id number   week number
Piata   4           6
Mali    2           20,5
Goerge  5           4
Gooki   3           24,64,6
Mali    5           45,9
Piata   6           1
Piata   12          2,7,8,27,16   etc.
with the below code:
import csv
import datetime
from datetime import date
from collections import defaultdict

datedict = defaultdict(set)

with open('d:/info.csv', 'r') as csvfile:
    filereader = csv.reader(csvfile, 'excel')
    # skipping the header
    read_header = False
    start_date = date(year=2009, month=1, day=1)
    # print((seen_date - start_date).days)
    tdic = {}
    for row in filereader:
        if not read_header:
            read_header = True
            continue
        # reading the rest of the rows
        name, id, firstseen = row[0], row[1], row[3]
        try:
            seen_date = datetime.datetime.strptime(firstseen, '%d/%m/%Y').date()
            deltadays = (seen_date - start_date).days
            deltaweeks = deltadays / 7 + 1
            key = name, id
            currentvalue = tdic.get(key, set())
            currentvalue.add(deltaweeks)
            tdic[key] = currentvalue
        except ValueError:
            print('Date value error')
            pass
Right now I want to turn this into a list that gives me, for each name, the number of ids and all of its week numbers, like the list below:

Name    number of ids   week numbers
Mali    2               20,5,45,9
Piata   3               1,6,2,7,8,27,16
Goerge  1               4
Gooki   1               24,64,6
Can anyone help me with writing the code for this part?
Since it looks like your csv file has headers (which you are currently ignoring) why not use a DictReader instead of the standard reader class? If you don't supply fieldnames the DictReader will assume the first line contains them, which will also save you from having to skip the first line in your loop.
This seems like a great opportunity to use defaultdict and Counter from the collections module.
import csv
import datetime
from datetime import date
from collections import defaultdict, Counter

datedict = defaultdict(set)
namecounter = Counter()

with open('d:/info.csv', 'r') as csvfile:
    filereader = csv.DictReader(csvfile)
    start_date = date(year=2009, month=1, day=1)
    for row in filereader:
        name, id, firstseen = row['name'], row['id'], row['firstseen']
        try:
            seen_date = datetime.datetime.strptime(firstseen, '%d/%m/%Y').date()
        except ValueError:
            print('Date value error')
            continue  # skip rows with a bad date; seen_date would be undefined below
        deltadays = (seen_date - start_date).days
        deltaweeks = deltadays / 7 + 1
        datedict[name].add(deltaweeks)
        namecounter.update([name])  # without putting name into a list, update would count each character
This assumes that (name, id) is unique. If this is not the case, then you can use another defaultdict for namecounter. I've also moved the try-except statement so it is more explicit about what you are testing.
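A tiny self-contained sketch of that variant, with made-up rows standing in for the csv loop, collecting the distinct ids per name in a second defaultdict so duplicate ids are only counted once:

from collections import defaultdict

ids_per_name = defaultdict(set)
weeks_per_name = defaultdict(set)

# Made-up (name, id, week) rows for illustration.
for name, id_, week in [('Mali', 2, 20), ('Mali', 2, 5), ('Mali', 5, 45), ('Mali', 5, 9)]:
    ids_per_name[name].add(id_)
    weeks_per_name[name].add(week)

for name in ids_per_name:
    print(name, len(ids_per_name[name]), sorted(weeks_per_name[name]))
# Mali 2 [5, 9, 20, 45]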
Given that:
tdict = {('Mali', 5): set([9, 45]), ('Gooki', 3): set([24, 64, 6]), ('Goerge', 5): set([4]), ('Mali', 2): set([20, 5]), ('Piata', 4): set([4]), ('Piata', 6): set([1]), ('Piata', 12): set([8, 16, 2, 27, 7])}
then to output the result above:
names = {}
for (name, id), more_weeks in tdict.items():
    ids, weeks = names.get(name, (0, set()))
    ids = ids + 1
    weeks = weeks.union(more_weeks)
    names[name] = (ids, weeks)

for name, (ids, weeks) in names.items():
    print("%s, %s, %s" % (name, ids, weeks))