I understand what I am doing wrong, but not how to fix it.
I am trying to write "Z" to Excel and save once I have looped through all my list entries, but the index in the DataFrame resets on every iteration and overwrites the previous data, so I only see my last list entry and not all 5.
Any help is appreciated.
orig_df = (pd.read_excel("abc.xlsx"))
writer = pd.ExcelWriter('NEW_Frame.xlsx', engine='xlsxwriter')
List1 = ['Market_A', 'Market_B', 'Market_C', 'Market_D', 'Market_E']
new_df = pd.DataFrame(columns=['Location','Data1','Data2','Data3'], index=range(5))
for i in range(len(List1)) :
    M = List1[i]
    P = List1[i]
    M = abc[abc.Location.str.contains(M)]
    Z = [{'Location': P , 'Data1': abc['Data1'].sum(), 'Data2': abc['Data2'].sum(),
          'Data3': abc['Data3'].sum(),}]
    Z = pd.DataFrame(Z)
    Z.to_excel(writer, sheet_name=P)
    i += 1
writer.save()
Since I don't have your input file, I am just saving a small test dataframe, but I had no trouble saving all 5 sheets with this:
writer = pd.ExcelWriter('NEW_Frame.xlsx', engine='xlsxwriter')
for sheet_name in ['Market_A', 'Market_B', 'Market_C', 'Market_D', 'Market_E']:
    df = pd.DataFrame({'test1': [1,1,1], 'test2': [2,2,2]})
    df.to_excel(writer, sheet_name=sheet_name)
# close() writes the file; writer.save() was removed in pandas 2.0
writer.close()
Could you try this and see if you have any issues?
orig_df = pd.read_excel("abc.xlsx")
List1 = ['Market_A', 'Market_B', 'Market_C', 'Market_D', 'Market_E']
# Collect one value per market in each list, then build the frame once.
location, data1, data2, data3 = [], [], [], []
for P in List1:
    # Rows whose Location contains this market name
    # (the original summed the unfiltered frame, which looks unintended).
    M = orig_df[orig_df.Location.str.contains(P)]
    location.append(P)
    data1.append(M['Data1'].sum())
    data2.append(M['Data2'].sum())
    data3.append(M['Data3'].sum())
Z = pd.DataFrame({'Location': location, 'Data1': data1, 'Data2': data2, 'Data3': data3})
# Write once, after the loop; the context manager saves the file on exit.
# The sheet name 'Summary' is arbitrary.
with pd.ExcelWriter('NEW_Frame.xlsx', engine='xlsxwriter') as writer:
    Z.to_excel(writer, sheet_name='Summary', index=False)
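If it helps to try the idea without the input file, here is a minimal sketch of the "collect rows, build the frame once" pattern. The sample `orig_df` contents and the names `markets`, `value_cols` and `summary` are assumptions made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel("abc.xlsx")
orig_df = pd.DataFrame({
    'Location': ['Market_A North', 'Market_A South', 'Market_B', 'Market_C'],
    'Data1': [1, 2, 3, 4],
    'Data2': [10, 20, 30, 40],
    'Data3': [100, 200, 300, 400],
})

markets = ['Market_A', 'Market_B', 'Market_C']
value_cols = ['Data1', 'Data2', 'Data3']

rows = []
for market in markets:
    # Keep only the rows whose Location contains the market name
    subset = orig_df[orig_df['Location'].str.contains(market, regex=False)]
    # One summary row per market: the name plus the column sums
    rows.append({'Location': market, **subset[value_cols].sum().to_dict()})

# Build the DataFrame once, after the loop, so nothing is overwritten
summary = pd.DataFrame(rows)
print(summary)

# Writing it out is then a single call (needs an Excel engine installed):
# summary.to_excel('NEW_Frame.xlsx', index=False)
```

Because every market ends up as a row of the same frame, there is no per-iteration index to reset.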
Let's say I have the following dataframe:
import pandas as pd
data = {'Flag':['a', 'b', 'a', 'b'],
'Item':['ball', 'car', 'pen', 'candy'],
'Char1':[0, 0, 0, 0],
'Char2':[23, 21, 19, 13],
'Char3':[40, 43, 60, 70]}
df = pd.DataFrame(data)
Now, let's perform some calculation:
df['Char1_avg'] = df.apply(lambda x: df[df.Flag == x.Flag].Char1.mean(), axis=1)
df['Char1_std'] = df.apply(lambda x: df[df.Flag == x.Flag].Char1.std(), axis=1)
df['Char2_avg'] = df.apply(lambda x: df[df.Flag == x.Flag].Char2.mean(), axis=1)
df['Char2_std'] = df.apply(lambda x: df[df.Flag == x.Flag].Char2.std(), axis=1)
df['Char3_avg'] = df.apply(lambda x: df[df.Flag == x.Flag].Char3.mean(), axis=1)
df['Char3_std'] = df.apply(lambda x: df[df.Flag == x.Flag].Char3.std(), axis=1)
Finally let's create the following dictionary:
Flag_list = ['a','b']
sum_dict = {'Flag':Flag_list,
'Char1_average':df['Char1_avg'].head(2).tolist(),
'Char1_std':df['Char1_std'].head(2).tolist(),
'Char2_average':df['Char2_avg'].head(2).tolist(),
'Char2_std':df['Char2_std'].head(2).tolist(),
'Char3_average':df['Char3_avg'].head(2).tolist(),
'Char3_std':df['Char3_std'].head(2).tolist()}
This works fine and produces the correct dictionary, but I need a function that does the same thing, so I wrote the following code:
def fnctn(dataf):
    param_list=["Char1", "Char2", 'Char3']
    for param in param_list:
        dataf[f'{param}_avg'] = dataf.apply(lambda x: dataf[dataf.Flag == x.Flag][f'{param}'].mean(), axis=1)
        dataf[f'{param}_StDev'] = dataf.apply(lambda x: dataf[dataf.Flag == x.Flag][f'{param}'].std(), axis=1)
        sum_dict = {'Flag':Flag_list,
                    f'{param}_average':dataf[f'{param}_avg'].head(2).tolist(),
                    f'{param}_std':dataf[f'{param}_StDev'].head(2).tolist()}
    ref_avg_values = pd.DataFrame(sum_dict)
dataf = df.copy()
fnctn(dataf)
But this time the dictionary I get is wrong: it contains only the values of the last iteration.
How can I get the same dictionary as in the previous case?
sum_dict is rebuilt from scratch on every pass, so only the last parameter survives. You have to merge each iteration's values into a single dictionary so that everything computed inside the for loop is kept.
Here is a solution:
def fnctn(dataf):
    param_list=["Char1", "Char2", 'Char3']
    dictie={}
    for param in param_list:
        dataf[f'{param}_avg'] = dataf.apply(lambda x: dataf[dataf.Flag == x.Flag][f'{param}'].mean(), axis=1)
        dataf[f'{param}_StDev'] = dataf.apply(lambda x: dataf[dataf.Flag == x.Flag][f'{param}'].std(), axis=1)
        sum_dict = {'Flag':Flag_list,
                    f'{param}_average':dataf[f'{param}_avg'].head(2).tolist(),
                    f'{param}_std':dataf[f'{param}_StDev'].head(2).tolist()}
        # Accumulate this parameter's columns instead of overwriting them
        dictie.update(sum_dict)
    return pd.DataFrame(dictie)
dataf = df.copy()
fnctn(dataf)
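As a possible alternative, the per-flag statistics can be computed directly with `groupby(...).agg(...)`, which avoids both the row-wise `apply` and the `head(2)` step (that step only works because the first two rows happen to be flags 'a' and 'b'). This is a sketch rather than the approach from the answer above; the function name `summarize` and its `params` argument are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Flag': ['a', 'b', 'a', 'b'],
    'Item': ['ball', 'car', 'pen', 'candy'],
    'Char1': [0, 0, 0, 0],
    'Char2': [23, 21, 19, 13],
    'Char3': [40, 43, 60, 70],
})

def summarize(dataf, params=('Char1', 'Char2', 'Char3')):
    # One row per Flag, with a (parameter, statistic) column pair for each
    stats = dataf.groupby('Flag')[list(params)].agg(['mean', 'std'])
    # Flatten the two-level column index into names like 'Char1_average'
    stats.columns = [
        f"{param}_{'average' if stat == 'mean' else 'std'}"
        for param, stat in stats.columns
    ]
    return stats.reset_index()

ref_avg_values = summarize(df)
print(ref_avg_values)
```

The result has one row per flag and the same column names as the hand-built dictionary, without depending on the row order of the input.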
I am trying to select the values from the top 3 records of each group in a python sorted dataframe and put them into new columns. I have a function that is processing each group but I am having difficulties finding the right method to extract, rename the series, then combine the result as a single series to return.
Below is a simplified example of an input dataframe (df_in) and the expected output (df_out):
import pandas as pd
data_in = { 'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
'Qty': [15 , 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame (data_in, columns = ['Product', 'Price', 'Qty'])
Below are two of the functions I have tested. I am looking for a more efficient option that works, especially if I have to process many more columns and records.
The function best3_prices_v1 works, but every column and variable has to be specified explicitly, which becomes a problem as I add more columns.
def best3_prices_v1(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x) # number of records
    if (recs == 1):
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif (recs == 2):
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    # 1st best
    d['Price_1'] = best_price_lv1['Price']
    d['Qty_1'] = best_price_lv1['Qty']
    # 2nd best
    d['Price_2'] = best_price_lv2['Price']
    d['Qty_2'] = best_price_lv2['Qty']
    # 3rd best
    d['Price_3'] = best_price_lv3['Price']
    d['Qty_3'] = best_price_lv3['Qty']
    # return combined results as a series
    return pd.Series(d, index=['Price_1', 'Qty_1', 'Price_2', 'Qty_2', 'Price_3', 'Qty_3'])
Code to call the function:
# sort dataframe by Product and Price
df_in.sort_values(by=['Product', 'Price'], ascending=True, inplace=True)
# get best 3 prices and qty as new columns
df_out = df_in.groupby(['Product']).apply(best3_prices_v1).reset_index()
Second attempt, meant to shorten the code and avoid an explicit name for each variable. It is incomplete and does not work.
def best3_prices_v2(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x) # number of records
    if (recs == 1):
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif (recs == 2):
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    stats_columns = ['Price', 'Qty']
    # get records values for best 3 prices
    d_lv1 = best_price_lv1[stats_columns]
    d_lv2 = best_price_lv2[stats_columns]
    d_lv3 = best_price_lv3[stats_columns]
    # How to rename (keys?) or combine values to return?
    lv1_stats_columns = [c + '_1' for c in stats_columns]
    lv2_stats_columns = [c + '_2' for c in stats_columns]
    lv3_stats_columns = [c + '_3' for c in stats_columns]
    # return combined results as a series
    return pd.Series(d, index=lv1_stats_columns + lv2_stats_columns + lv3_stats_columns)
Let's unstack():
df_in = (df_in.set_index([df_in.groupby('Product').cumcount().add(1), 'Product'])
              .unstack(0, fill_value=0))
df_in.columns = [f"{x}_{y}" for x, y in df_in]
df_in = df_in.reset_index()
Or via pivot():
df_in = (df_in.assign(key=df_in.groupby('Product').cumcount().add(1))
              # pivot() arguments are keyword-only in pandas 2.0+
              .pivot(index='Product', columns='key', values=['Price', 'Qty'])
              # fillna's downcast= is deprecated in recent pandas, so Qty comes
              # back as float here; cast it afterwards if integers are needed
              .fillna(0))
df_in.columns = [f"{x}_{y}" for x, y in df_in]
df_in = df_in.reset_index()
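Note that both snippets above keep every row of each group, so product A (four rows) would produce `Price_4` and `Qty_4` columns. A minimal sketch that caps each group at its three lowest prices first, assuming the question's sample data; the names `top3`, `rank` and `wide` are made up for illustration:

```python
import pandas as pd

df_in = pd.DataFrame({
    'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
    'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
    'Qty': [15, 13, 14, 10, 5, 2, 1],
})

# Sort so the best (lowest) prices come first, then keep at most 3 per product
top3 = df_in.sort_values(['Product', 'Price']).groupby('Product').head(3)

# Number the rows within each product: 1, 2, 3
top3 = top3.assign(rank=top3.groupby('Product').cumcount() + 1)

# Spread the ranks into columns; ranks a product does not have become 0
wide = top3.pivot(index='Product', columns='rank', values=['Price', 'Qty']).fillna(0)

# Flatten ('Price', 1) into 'Price_1', and so on
wide.columns = [f"{stat}_{rank}" for stat, rank in wide.columns]
df_out = wide.reset_index()
print(df_out)
```

Adding another statistic then only means adding its column name to the `values` list.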
Based on @AnuragDabas's pivot solution and @ceruler's feedback above, I can now expand it to a more general problem.
New dataframe with more groups and columns:
data_in = { 'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
'Model': ['A1', 'A1', 'A1', 'A2', 'B1', 'C1', 'C1'],
'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
'Qty': [15 , 13, 14, 10, 5, 2, 1],
'Ratings': [9, 7, 8, 10, 6, 7, 8 ]}
df_in = pd.DataFrame (data_in, columns = ['Product', 'Model' ,'Price', 'Qty', 'Ratings'])
group_list = ['Product', 'Model']
stats_list = ['Price','Qty', 'Ratings']
df_out = df_in.groupby(group_list).head(3)
df_out = (df_out.assign(key=df_out.groupby(group_list).cumcount().add(1))
                # pivot() arguments are keyword-only in pandas 2.0+
                .pivot(index=group_list, columns='key', values=stats_list)
                .fillna(0))
df_out.columns=[f"{x}_{y}" for x,y in df_out]
df_out = df_out.reset_index()
def frame(dt_type, start_year, end_year, columns_req):
    frame = pd.DataFrame()
    for i in range (start_year, end_year):
        file_name = f"{dt_type} {i}"
        dataframe = pd.read_csv(BytesIO(uploaded["%s.csv"%file_name]))
        if len(columns_req) == 1:
            df = pd.DataFrame(data, columns= [columns_req[0]])
        if len(columns_req) == 2:
            df = pd.DataFrame(data, columns= [columns_req[0], columns_req[1]])
        if len(columns_req) == 3:
            df = pd.DataFrame(data, columns= [columns_req[0], columns_req[1], columns_req[2]])
        if len(columns_req) == 4:
            df = pd.DataFrame(data, columns= [columns_req[0], columns_req[1], columns_req[2], columns_req[3]])
        frame = frame.append(dataframe, ignore_index=True)
    return (frame)
As you can see, the chain of if statements is repetitive and feels odd. I am new to programming. Is there any way to reduce that whole bunch of code?
You could pass the list itself instead of using all those if conditions:
df = pd.DataFrame(data, columns=columns_req)
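Going one step further, here is a sketch of how the whole function might look with the list passed straight through. It makes a few assumptions: `files` is a hypothetical stand-in for the question's `uploaded` dict (file name to raw bytes), `load_years` is a made-up name, and `pd.concat` replaces `DataFrame.append`, which was removed in pandas 2.0.

```python
from io import BytesIO

import pandas as pd

def load_years(files, dt_type, start_year, end_year, columns_req):
    """Read '<dt_type> <year>.csv' for each year and stack the requested columns."""
    frames = []
    for year in range(start_year, end_year):
        raw = files[f"{dt_type} {year}.csv"]
        # usecols takes the list as-is, whatever its length
        part = pd.read_csv(BytesIO(raw), usecols=columns_req)
        # usecols keeps the file's column order, so reorder to the requested one
        frames.append(part[columns_req])
    # Concatenate once at the end instead of appending inside the loop
    return pd.concat(frames, ignore_index=True)

# Tiny in-memory example standing in for uploaded files
files = {
    "sales 2019.csv": b"a,b,c\n1,2,3\n4,5,6\n",
    "sales 2020.csv": b"a,b,c\n7,8,9\n",
}
result = load_years(files, "sales", 2019, 2021, ["c", "a"])
print(result)
```

As in the original, `end_year` is exclusive because it is passed to `range`.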
I have a dataframe as below:
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
To get the percentiles of sales for each state, I have written the code below:
pct_list1 = []
pct_list2 = []
for i in df['state'].unique().tolist():
    pct_list1.append(i)
    for j in range(0,101,10):
        pct_list1.append(np.percentile(df[df['state'] == i]['sales'],j))
    pct_list2.append(pct_list1)
    pct_list1 = []
colnm_list1 = []
for k in range(0,101,10):
    colnm_list1.append('perct_'+str(k))
colnm_list2 = ['state'] + colnm_list1
df1 = pd.DataFrame(pct_list2)
df1.columns = colnm_list2
df1
Can we optimize this code?
I feel that we can also use:
# quantile(0) is the 0th percentile (the original passed 0.1 for the perct_0 column)
df1 = df[['state','sales']].groupby('state').quantile(0).reset_index(level=0)
df1.columns = ['state','perct_0']
for i in range(10,101,10):
    df1.loc[:,('perct_'+str(i))] = df[['state','sales']].groupby('state').quantile(i / 100).reset_index(level=0)['sales']
If there are any other alternatives, please help.
Thanks.
How about this?
quants = np.arange(.1,1,.1)
pd.concat([df.groupby('state')['sales'].quantile(x) for x in quants],axis=1,keys=[str(x) for x in quants])
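A variant of the same idea, sketched under the assumption that the `perct_0` to `perct_100` column names from the question are wanted: `quantile` on a grouped Series accepts a list of quantiles, so no `concat` is needed, and the column names can be built from the integer percentiles instead of from floats such as `0.30000000000000004`. The seeded generator is only there to make the example reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'office_id': list(range(1, 7)) * 2,
    'sales': rng.integers(100000, 999999, size=12),
})

pcts = list(range(0, 101, 10))  # 0, 10, ..., 100

# One pass: the result is indexed by (state, quantile), then quantiles become columns
df1 = (df.groupby('state')['sales']
         .quantile([p / 100 for p in pcts])
         .unstack())
df1.columns = [f'perct_{p}' for p in pcts]
df1 = df1.reset_index()
print(df1)
```

With the default linear interpolation this should agree with the `np.percentile` loop in the question.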
I have posted below some code that works fine. At the moment it opens two .csv files, 'CMF.csv' and 'D65.csv', and then performs some math on them.
Here's the simple structure of those files :
'CMF.csv' (wavelength, x, y, z)
400,1.879338E-02,2.589775E-03,8.508254E-02
410,8.277331E-02,1.041303E-02,3.832822E-01
420,2.077647E-01,2.576133E-02,9.933444E-01
...etc
'D65.csv': (wavelength, a, b)
400,82.7549,14.708
410,91.486,17.6753
420,93.4318,20.995
...etc
I have a 3rd file data.csv, with this structure (serialNumber, wavelength, measurement, name) :
0,400,2.21,1
0,410,2.22,1
0,420,2.22,1
...
1,400,2.21,2
1,410,2.22,2
1,420,2.22,2
...etc
What I would like to do is write a few lines of code that perform math on all the series in the last file (a series is defined by its serial number and its name).
For example, I need a loop that will perform, for each name or serial number and for each wavelength, the operation:
x * a * measurement
I tried to load data.csv with the csv reader like the other files, but I couldn't.
Any ideas?
Thanks
import csv
with open('CMF.csv') as cmf:
    reader = csv.reader(cmf)
    dict_cmf = dict()
    for row in reader:
        dict_cmf[float(row[0])] = row
with open('D65.csv') as d65:
    reader = csv.reader(d65)
    dict_d65 = dict()
    for row in reader:
        dict_d65[float(row[0])] = row
with open('data.csv') as sp:
    reader = csv.reader(sp)
    dict_sp = dict()
    for row in reader:
        dict_sp[float(row[0])] = row
Y = 0
Y_total = 0
X = 0
X_total = 0
Z = 0
Z_total = 0
# step of 10 nm; note that range() stops before 700
for i in range(400, 700, 10):
    X = float(dict_cmf[i][1]) * float(dict_d65[i][1])
    X_total = X_total + X
    Y = float(dict_cmf[i][2]) * float(dict_d65[i][1])
    Y_total = Y_total + Y
    Z = float(dict_cmf[i][3]) * float(dict_d65[i][1])
    Z_total = Z_total + Z
wp_X = 100 * X_total / Y_total
wp_Y = 100 * Y_total / Y_total
wp_Z = 100 * Z_total / Y_total
print(Y_total)
print("D65_CMF_2006_10_deg white point = ")
print(wp_X, wp_Y, wp_Z)
I get this:
Traceback (most recent call last):
  File "C:\Users\gary\Documents\eclipse\Spectro\1illum_XYZ2006_D65_numpy.py", line 24, in <module>
    dict_sp[row[0]] = row
IndexError: list index out of range
pandas is well suited to this. You can read the files into DataFrames and merge them, replacing your code with this:
import pandas
cmf = pandas.read_csv('CMF.csv', names=['wavelength', 'x', 'y', 'z'])
d65 = pandas.read_csv('D65.csv', names=['wavelength', 'a', 'b'])
data = pandas.read_csv('data.csv', names=['serialNumber', 'wavelength', 'measurement', 'name'])
lookup = pandas.merge(cmf, d65, on='wavelength')
merged = pandas.merge(data, lookup, on='wavelength')
# Multiply x, y and z by a row by row, then sum each column
totals = lookup[['x', 'y', 'z']].mul(lookup['a'], axis=0).sum()
# Scale by 100 to match the white point in the question
wps = 100 * totals / totals['y']
print(totals['y'])
print("D65_CMF_2006_10_deg white point = ")
print(wps)
Now, that doesn't do the last bit where you want to calculate extra values for each measurement. You can do this by adding a column to merged, like this:
merged['newcol'] = merged.x * merged.a * merged.measurement
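To make the merge step concrete, here is a self-contained sketch that uses the sample rows from the question as in-memory CSV text instead of files. The per-series sum at the end is an assumption about what might be wanted next, and the names `cmf_csv`, `d65_csv`, `data_csv` and `per_series` are made up for illustration.

```python
from io import StringIO

import pandas as pd

cmf_csv = """400,1.879338E-02,2.589775E-03,8.508254E-02
410,8.277331E-02,1.041303E-02,3.832822E-01
420,2.077647E-01,2.576133E-02,9.933444E-01
"""
d65_csv = """400,82.7549,14.708
410,91.486,17.6753
420,93.4318,20.995
"""
data_csv = """0,400,2.21,1
0,410,2.22,1
0,420,2.22,1
1,400,2.21,2
1,410,2.22,2
1,420,2.22,2
"""

cmf = pd.read_csv(StringIO(cmf_csv), names=['wavelength', 'x', 'y', 'z'])
d65 = pd.read_csv(StringIO(d65_csv), names=['wavelength', 'a', 'b'])
data = pd.read_csv(StringIO(data_csv),
                   names=['serialNumber', 'wavelength', 'measurement', 'name'])

# One lookup row per wavelength, then attach it to every measurement row
lookup = pd.merge(cmf, d65, on='wavelength')
merged = pd.merge(data, lookup, on='wavelength')

# x * a * measurement for each (series, wavelength) row
merged['newcol'] = merged.x * merged.a * merged.measurement

# Optional: total per series, identified by serial number and name
per_series = merged.groupby(['serialNumber', 'name'])['newcol'].sum()
print(per_series)
```

Each measurement row picks up the x and a values for its wavelength, so the product is computed for all series at once without an explicit loop.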
One or more of the lines in data.csv does not contain what you think it does (a blank line, for example, gives an empty row). Try putting your statement inside a try...except block to see what the problem is:
with open('data.csv') as sp:
    reader = csv.reader(sp)
    dict_sp = dict()
    for row in reader:
        try:
            dict_sp[float(row[0])] = row
        except IndexError:
            print('The problematic row is:')
            print(row)
            raise
A proper debugger would also be helpful in this kind of situation.
pandas is probably the better way to go, but if you'd like to see it in vanilla Python, have a look at this example:
import csv
from collections import defaultdict
# data[wavelength] -> {'x': ..., 'y': ..., 'z': ..., 'a': ..., 'b': ...}
data = defaultdict(dict)
for fname, cols in [('CMF.csv', ('x', 'y', 'z')), ('D65.csv', ('a', 'b'))]:
    with open(fname) as ifile:
        reader = csv.reader(ifile)
        for row in reader:
            wl, values = int(row[0]), row[1:]
            data[wl].update(zip(cols, map(float, values)))
# measurements[serial][wavelength] -> {'measurement': ..., 'name': ...}
measurements = defaultdict(dict)
with open('data.csv') as ifile:
    reader = csv.reader(ifile)
    cols = ('measurement', 'name')
    for serial, wl, me, name in reader:
        measurements[int(serial)][int(wl)] = dict(zip(cols, (float(me), str(name))))
for serial in sorted(measurements.keys()):
    for wl in sorted(measurements[serial].keys()):
        me = measurements[serial][wl]['measurement']
        print(me * data[wl]['x'] * data[wl]['a'])
This stores x, y, z, a and b together in a dictionary nested inside another dictionary keyed by wavelength (there is no apparent reason to keep these values in separate dicts).
The measurements are stored in a two-level dictionary keyed first by serial and then by wavelength. This way you can iterate over all serials and all their wavelengths, as shown in the last part of the code.
As for your specific calculations on the data in your example, this can be done quite easily with this structure:
tot_x = sum(v['x']*v['a'] for v in data.values())
tot_y = sum(v['y']*v['a'] for v in data.values())
tot_z = sum(v['z']*v['a'] for v in data.values())
wp_x = 100 * tot_x / tot_y
wp_y = 100 * tot_y / tot_y # Sure this is correct? It will always be 100
wp_z = 100 * tot_z / tot_y
print(wp_x, wp_y, wp_z) # approximately 798.56 100.0 3775.04
These are the dictionaries given the input file in your question:
>>> from pprint import pprint
>>> pprint(dict(data))
{400: {'a': 82.7549,
'b': 14.708,
'x': 0.01879338,
'y': 0.002589775,
'z': 0.08508254},
410: {'a': 91.486,
'b': 17.6753,
'x': 0.08277331,
'y': 0.01041303,
'z': 0.3832822},
420: {'a': 93.4318,
'b': 20.995,
'x': 0.2077647,
'y': 0.02576133,
'z': 0.9933444}}
>>> pprint(dict(measurements))
{0: {400: {'measurement': 2.21, 'name': '1'},
410: {'measurement': 2.22, 'name': '1'},
420: {'measurement': 2.22, 'name': '1'}},
1: {400: {'measurement': 2.21, 'name': '2'},
410: {'measurement': 2.22, 'name': '2'},
420: {'measurement': 2.22, 'name': '2'}}}