I need help looping through nested dictionaries that I have created in order to answer some problems. My code, which splits the data into two different dictionaries and adds items to them, is as follows:
Link to csv :
https://docs.google.com/document/d/1v68_QQX7Tn96l-b0LMO9YZ4ZAn_KWDMUJboa6LEyPr8/edit?usp=sharing
import csv

region_data = {}
country_data = {}
answers = []
data = []
is_country = False

f = open('dph_SYB60_T03_Population Growth, Fertility and Mortality Indicators.csv')
reader = csv.DictReader(f)
for line in reader:
    # Collect every row into a standard dict
    data.append(dict(line))

# Loop through the rows and pull out the fields of interest
for i in data:
    # The Region/Country/Area name
    location = i['Region/Country/Area']
    # The year
    years = i['Year']
    i_d = i['ID']
    info = i['Footnotes']
    series = i['Series']
    value = float(i['Value'])
    stats = {i['Series']: i['Value']}
    # Rows from ID '4' onward are countries rather than regions
    if i['ID'] == '4':
        is_country = True
    if is_country:
        if location not in country_data:
            country_data[location] = {}
        if years not in country_data[location]:
            country_data[location][years] = {}
        if series not in country_data[location][years]:
            country_data[location][years][series] = value
    else:
        if location not in region_data:
            region_data[location] = {}
        if years not in region_data[location]:
            region_data[location][years] = {}
        if series not in region_data[location][years]:
            region_data[location][years][series] = value
When I print the region_data dictionary, the output shows each region as a key in a dict, with the years as keys in that region's nested dict, and so on.
I want to understand how I can loop through the data and answer a question like:
Which region had the largest numeric decrease in Maternal mortality ratio from 2005 to 2015?
where "Maternal mortality ratio (deaths per 100,000 population)" is a key within the dictionary.
Build a dataframe
Use pandas for that and read your file according to this answer.
import pandas as pd
filename = 'dph_SYB60_T03_Population Growth, Fertility and Mortality Indicators.csv'
df = pd.read_csv(filename)
Build a pivot table
Then you can build a pivot table with "Region/Country/Area" as the index, "Series" as the columns, and "max" as the aggregate function.
pivot = df.pivot_table(index='Region/Country/Area', columns='Series', values='Value', aggfunc='max')
Sort by your series of interest
Then sort your pivot table by the series name of interest, using the "ascending" argument.
df_sort = pivot.sort_values(by='Maternal mortality ratio (deaths per 100,000 population)', ascending=False)
Extract the greatest value in the first row.
Finally you will have the answer to your question.
df_sort['Maternal mortality ratio (deaths per 100,000 population)'].head(1)
Region/Country/Area
Sierra Leone 1986.0
Name: Maternal mortality ratio (deaths per 100,000 population), dtype: float64
Warning: some of your regions have records outside 2005-2015, so you should filter your data to values between 2005 and 2015.
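As a sketch of that filtering step, and of the "largest numeric decrease" the question actually asks about (this assumes read_csv parses Year as an integer), you could pivot on the year and subtract the two columns:

import pandas as pd

filename = 'dph_SYB60_T03_Population Growth, Fertility and Mortality Indicators.csv'
df = pd.read_csv(filename)

series_name = 'Maternal mortality ratio (deaths per 100,000 population)'

# Keep only the series of interest and the two endpoint years
subset = df[(df['Series'] == series_name) & (df['Year'].isin([2005, 2015]))]

# One row per region, one column per year
by_year = subset.pivot_table(index='Region/Country/Area',
                             columns='Year', values='Value')

# Largest numeric decrease from 2005 to 2015 sorts to the top
decrease = (by_year[2005] - by_year[2015]).sort_values(ascending=False)
print(decrease.head(1))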
If you prefer to loop through dictionaries in Python 3.x, you can use the .items() method of each dictionary and nest three loops.
With a main dictionary called here dict_total, this code will do it.
out_region = None
out_value = None
sel_serie = 'Maternal mortality ratio (deaths per 100,000 population)'
min_year = 2005
max_year = 2015

for reg, dict_reg in dict_total.items():
    print(reg)
    for year, dict_year in dict_reg.items():
        # The year keys are strings, so convert before comparing
        if min_year <= int(year) <= max_year:
            print(year)
            for serie, value in dict_year.items():
                if serie == sel_serie and value is not None:
                    print('{} {}'.format(serie, value))
                    if out_value is None or out_value < value:
                        out_value = value
                        out_region = reg

print('Region: {}\nSerie: {} Value: {}'.format(out_region, sel_serie, out_value))
I have a dataset of customers and products, and I would like to know which combinations of products are most popular with customers, and to display that in a table (like a traditional mileage chart or another neat format).
Example dataset:
Example output:
I can tell that the most popular combination of products is P1 with P2 and the least popular is P1 with P3. My actual dataset is of course much larger in terms of customers and products.
I'd also be keen to hear ideas on better output visualisations, especially as I can't figure out how best to display three-way or four-way combinations.
Thank you
I have a full code example that may work for what you are doing, or at least give you some ideas on how to move forward.
This script uses OpenPyXl to read the info from the first sheet. It is turned into a dictionary whose keys are strings of the combinations. The combinations are then counted, and the counts are placed into a second sheet (see image).
Results:
The Code:
from openpyxl import load_workbook
from collections import Counter

# Load the workbook and the sheets to read from and write to
data_wb = load_workbook(filename=r"C:\Users\---Place your loc here---\SO_excel.xlsx")
data_ws = data_wb['Sheet1']
results_ws = data_wb['Sheet2']

# Find the last used row in the data sheet
data_max_rows = data_ws.max_row

# Collect each customer's products into a single "_"-joined string
customer_dict = {}
for row in data_ws.iter_rows(min_row=2, max_col=2, max_row=data_max_rows):
    # Column 1 holds the customer, column 2 the product
    row_list = [cell.value for cell in row]
    customer_cell = row_list[0]
    product_cell = row_list[1]
    # Start a new combination string, or append to the existing one
    if customer_cell not in customer_dict:
        customer_dict[customer_cell] = product_cell
    else:
        customer_dict[customer_cell] += "_" + product_cell

# Count how often each combination string occurs
count_dict = Counter(customer_dict.values())

# Column titles
results_ws.cell(1, 1).value = "Combination"
results_ws.cell(1, 2).value = "Occurrences"

# Place the combinations and their counts into the results sheet
count = 2
for key, value in count_dict.items():
    results_ws.cell(count, 1).value = key
    results_ws.cell(count, 2).value = value
    count += 1

data_wb.save(filename=r"C:\Users\---Place your loc here---\SO_excel.xlsx")
data_wb.close()
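If pandas is an option, a minimal sketch of the same idea would be the following (the column names 'Customer' and 'Product' are assumptions; the original sheet may use different headers):

import pandas as pd

df = pd.read_excel('SO_excel.xlsx', sheet_name='Sheet1')

# Sort each customer's products before joining, so "P2_P1" counts as "P1_P2"
combos = (df.groupby('Customer')['Product']
            .apply(lambda s: '_'.join(sorted(s)))
            .value_counts())
print(combos)

Sorting the products before joining makes the combination order-insensitive, which the string-append approach above is not.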
I'm still new to Python, so forgive me if my code seems rather messy or out of place. However, I need help with an assignment for university: how can I find specific items in a CSV file? Here is what the assignment says:
Allow the user to type in a year, then, find the average life expectancy for that year. Then find the country with the minimum and the one with the maximum life expectancies for that year.
country = []
digit_code = []
year = []
life_expectancy = []
count = 0

with open("life-expectancy.csv") as lifefile:
    for line in lifefile:
        count += 1
        if count != 1:  # skip the header row
            parts = line.strip().split(",")
            country.append(parts[0])
            digit_code.append(parts[1])
            year.append(parts[2])
            life_expectancy.append(float(parts[3]))

highest_expectancy = max(life_expectancy)
country_with_highest = country[life_expectancy.index(max(life_expectancy))]
print(f"The country that has the highest life expectancy is {country_with_highest} at {highest_expectancy}!")

lowest_expectancy = min(life_expectancy)
country_with_lowest = country[life_expectancy.index(min(life_expectancy))]
print(f"The country that has the lowest life expectancy is {country_with_lowest} at {lowest_expectancy}!")
It looks like you only want the first and fourth tokens from each row in your CSV. Therefore, let's simplify it like this:
Hong Kong,,,85.29
Japan,,,85.03
Macao,,,84.68
Switzerland,,,84.25
Singapore,,,84.07
You can then process it like this:
FILE = 'life-expectancy.csv'
data = []
with open(FILE) as csv:
for line in csv:
tokens = line.split(',')
data.append((float(tokens[3]), tokens[0]))
hi = max(data)
lo = min(data)
print(f'The country with the highest life expectancy {hi[0]:.2f} is {hi[1]}')
print(f'The country with the lowest life expectancy {lo[0]:.2f} is {lo[1]}')
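The assignment also asks for a user-typed year and the average life expectancy for that year. A minimal sketch under the same four-column layout (country, code, year, value), assuming the original file keeps its header row:

FILE = 'life-expectancy.csv'

target = input('Enter a year: ')

rows = []
with open(FILE) as f:
    next(f)  # skip the header row (drop this line if the file has none)
    for line in f:
        tokens = line.strip().split(',')
        if tokens[2] == target:
            rows.append((float(tokens[3]), tokens[0]))

if rows:
    average = sum(value for value, _ in rows) / len(rows)
    hi = max(rows)
    lo = min(rows)
    print(f'Average life expectancy in {target}: {average:.2f}')
    print(f'Highest: {hi[1]} at {hi[0]:.2f}')
    print(f'Lowest: {lo[1]} at {lo[0]:.2f}')
else:
    print(f'No records found for {target}')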
I have a dataframe which consists of 5 million name entries. The structure is as follows:
dataframe
What one can read from this dataframe is that, for instance, the name Mary was given to 14 babies in the state Alaska (AK) in the year 1910. But the name Mary was also given to newborns in the other states and in the following years as well.
What I would like to identify is: what is the most given name in that particular dataset overall, and how often was that name assigned?
I have tried this:
import pandas as pd
from collections import defaultdict

df = pd.read_csv("names.csv")
mask = df[["Name", "Count"]]

counter = 0
dd = defaultdict(int)
for pos, data in mask.iterrows():
    name = data["Name"]
    dd[name] = dd[name] + data["Count"]
    counter += 1
    if counter == 100000:
        break
print("Done!")

freq_name = 0
name = ""
for key, value in dd.items():
    if freq_name < value:
        freq_name = value
        name = key
print(name)
This code works pretty well, but only for up to 100,000 rows. When I use it on the full dataset it takes ages.
Any idea or hint what I could improve would be greatly appreciated.
As suggested in the comments, you can use something like this:
df = pd.read_csv("names.csv")
name, total_count = max(df.groupby('Name').Count.sum().items(), key=lambda x: x[1])
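An equivalent, arguably more readable spelling of the same groupby uses idxmax on the summed counts:

import pandas as pd

df = pd.read_csv("names.csv")

totals = df.groupby('Name')['Count'].sum()
name = totals.idxmax()      # the most given name
total_count = totals.max()  # how often it was assigned
print(name, total_count)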
I'm having an issue with two functions I have defined in Python. Both functions have similar operations in the first few lines of the body, yet one runs fine and the other produces a KeyError. I will explain more below, but here are the two functions first.
import pandas as pd

# Define a function that counts claims whose decider id was the dealer,
# normalized by business amount
def decider(df):
    # Subset dataframe by date
    df_sub = df[(df['vehicle_repair_date'] >= Q1_sd) & (df['vehicle_repair_date'] <= Q1_ed)]
    # Get the dealer id
    did = df_sub['dealer_id'].unique()
    # Subset further: keep only records where 'dealer_decide' equals 1
    df_dealer_decide = df_sub[df_sub['dealer_decide'] == 1]
    # Count the number of unique warranty claims
    dealer_decide_count = df_dealer_decide['warranty_claim_number'].nunique()
    # Get the total sales amount for that dealer
    total_sales = float(df_sub['amount'].max())
    # Warranty claims decided by the dealer per $100k in dealer sales
    decider_count_phk = dealer_decide_count * (100000 / total_sales)
    # Create a dictionary to store results
    output_dict = dict()
    output_dict['decider_phk'] = decider_count_phk
    output_dict['dealer_id'] = did
    output_dict['total_claims_dealer_dec_Q1_2019'] = dealer_decide_count
    output_dict['total_sales2019'] = total_sales
    # Convert the resulting dictionary to a dataframe
    sum_df = pd.DataFrame.from_dict(output_dict)
    # Return the summarized dataframe
    return sum_df

# Apply the 'decider' function to each dealer in dataframe 'data'
decider_count = data.groupby('dealer_id').apply(decider)
# Define a function that looks at the percentage change between 2018Q4 and 2019Q1
# in terms of the number of claims processed
def turnover(df):
    # Subset dealer records for Q1
    df_subQ1 = df[(df['vehicle_repair_date'] >= Q1_sd) & (df['vehicle_repair_date'] <= Q1_ed)]
    # Subset dealer records for Q4
    df_subQ4 = df[(df['vehicle_repair_date'] >= Q4_sd) & (df['vehicle_repair_date'] <= Q4_ed)]
    # Get the dealer id
    did = df_subQ1['dealer_id'].unique()
    # Get the unique number of claims for Q1
    unique_Q1 = df_subQ1['warranty_claim_number'].nunique()
    # Get the unique number of claims for Q4
    unique_Q4 = df_subQ4['warranty_claim_number'].nunique()
    # Determine percent change from Q4 to Q1
    percent_change = round((1 - (unique_Q1 / unique_Q4)) * 100, ndigits=1)
    # Create a dictionary to store results
    output_dict = dict()
    output_dict['nclaims_Q1_2019'] = unique_Q1
    output_dict['nclaims_Q4_2018'] = unique_Q4
    output_dict['dealer_id'] = did
    output_dict['quarterly_pct_change'] = percent_change
    # Convert the dictionary to a dataframe and return it, as in decider
    return pd.DataFrame.from_dict(output_dict)

# Apply the 'turnover' function to each dealer in the 'data' dataframe
dealer_turnover = data.groupby('dealer_id').apply(turnover)
Each function is applied to the exact same dataset, and I obtain the dealer id (the variable did in the function body) in the same way. I am also using the same groupby-then-apply code, but when I run the two functions, decider runs as expected while turnover gives the following error:
KeyError: 'dealer_id'
At first I thought it might be a scoping issue, but that doesn't really make sense, so if anyone can shed some light on what might be happening, I would greatly appreciate it.
Thanks,
Curtis
IIUC, you are applying the turnover function after the decider function. You are getting the KeyError since dealer_id is present as the index and not as a column.
Try replacing
decider_count = data.groupby('dealer_id').apply(decider)
with
decider_count = data.groupby('dealer_id', as_index=False).apply(decider)
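A minimal illustration of that failure mode, on a made-up frame rather than the asker's data:

import pandas as pd

df = pd.DataFrame({'dealer_id': [1, 1, 2], 'amount': [10, 20, 30]})

summed = df.groupby('dealer_id').sum()  # 'dealer_id' moves into the index
# summed['dealer_id']                   # would raise KeyError: 'dealer_id'

summed = summed.reset_index()           # move it back to a column
print(summed['dealer_id'])              # works again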
I'm working with a big dataset in Excel, in which I'm trying to sort, per index value, the top 25 rows by a number.
The dataset looks like this:
The Final PAC ID is the company number and changes (this does not show in the given data). The PAC contribution is the number I want to sort by.
So, for example, there will be 50 contributions by company C00003590 to different candidates, each with an amount 'PAC contribution'; I would like to get the top 25 contributions per company.
I've tried working with dictionaries, creating a dictionary for each company and adding in the candidate numbers as a string key, and the contribution as a value.
The code I have so far is the following (this might be the completely wrong way to go about it though):
import pandas as pd

df1 = pd.read_excel('Test2.xlsx')

dict_company = {}
k1 = df1['Final PAC ID'].astype(str)
k2 = df1['Candidate ID'].astype(str)

# First pass: create an empty inner dict for each company
for each in range(0, 100):
    if k1[each] not in dict_company:
        dict_company[k1[each]] = {}
    if each % 50 == 0:
        print(each)
        print(dict_company)

# Second pass: candidate id as key, rounded contribution as value
for each in range(0, 100):
    dict_company[k1[each]][k2[each]] = round(float(df1['PAC contribution'][each]))
    if each % 50:
        print(each)
        print(dict_company)
I think you need nlargest:
df1 = df.groupby('Final PAC ID')['PAC contribution'].nlargest(25)
If you need all columns:
cols = df.columns[~df.columns.isin(['PAC contribution', 'Final PAC ID'])].tolist()
df1 = (df.set_index(cols)
         .groupby('Final PAC ID')['PAC contribution']
         .nlargest(25)
         .reset_index())
Another solution (can be slower):
df1 = df.sort_values('PAC contribution', ascending=False).groupby('Final PAC ID').head(25)
Finally, save to Excel with to_excel:
df1.to_excel('filename.xlsx')
You can use groupby in conjunction with a dictionary comprehension here. The result is a dictionary containing your company names as keys and the sub dataframes with top 25 payments as values:
def aggregate(sub_df):
    return sub_df.sort_values('PAC contribution', ascending=False).head(25)

grouped = df.groupby('Final PAC ID')
results = {company: aggregate(sub_df)
           for company, sub_df in grouped}
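To pull one company's top rows back out of that dictionary (using the example company id from the question):

# 'C00003590' is the example company id mentioned in the question
top_contributions = results['C00003590']
print(top_contributions.head())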