Python: looping through 2 dataframes with thresholds and calculating revenue, stuck

I am trying to solve a business problem using Python but am having difficulty coming up with a script for it. I have tried to loop through the dataframe using df.iterrows(), but I am totally stuck because I just don't know how to proceed.
We process volumes in production orders of one type of resource that we need to process FIFO (first in, first out). Each lot has a certain volume and price; after using up a lot we start on the next lot (FIFO).
Question: How can I automate the calculation of the column Revenu? Can you come up with some Python code that I can use to automate this process? Would you use a while or for loop, and would you iterate through the dataframe?
Below I posted a screenshot of the solution: on the left the production orders, and on the right the volume and price per lot.
Below the image I posted two dictionaries containing the data from the screenshot.
Would really appreciate your help...
{'Productionorder': {0: 'Productionorder 1',
1: 'Productionorder 2',
2: 'Productionorder 3',
3: 'Productionorder 4',
4: 'Productionorder 5',
5: 'Productionorder 6',
6: 'Productionorder 7',
7: 'Productionorder 8',
8: 'Productionorder 9',
9: 'Productionorder 10',
10: 'Productionorder 11',
11: 'Productionorder 12',
12: 'Productionorder 13',
13: 'Productionorder 14',
14: 'Productionorder 15',
15: 'Productionorder 16',
16: 'Productionorder 17',
17: 'Productionorder 18',
18: 'Productionorder 19',
19: 'Productionorder 20',
20: 'Productionorder 21',
21: 'Productionorder 22'},
'Processed volume': {0: 810,
1: 3240,
2: 3177,
3: 1620,
4: 6480,
5: 5120,
6: 10880,
7: 13770,
8: 21060,
9: 4860,
10: 810,
11: 1620,
12: 15390,
13: 15390,
14: 6800,
15: 4480,
16: 10200,
17: 16650,
18: 2550,
19: 9050,
20: 9900,
21: 3200},
'Lotno.': {0: 1,
1: 1,
2: 1,
3: 1,
4: 2,
5: 2,
6: 2,
7: 2,
8: 2,
9: 2,
10: 2,
11: 2,
12: 2,
13: 3,
14: 3,
15: 3,
16: 3,
17: 3,
18: 3,
19: 3,
20: 4,
21: 4},
'Left of Lotno.': {0: 8490,
1: 5250,
2: 2073,
3: 453,
4: 75973,
5: 70853,
6: 59973,
7: 46203,
8: 25143,
9: 20283,
10: 19473,
11: 17853,
12: 2463,
13: 52073,
14: 45273,
15: 40793,
16: 30593,
17: 13943,
18: 11393,
19: 2343,
20: 38443,
21: 35243},
'Revenu': {0: 1741.5,
1: 6966.0,
2: 6830.549999999999,
3: 3483.0,
4: 10315.800000000001,
5: 7936.0,
6: 16864.0,
7: 21343.5,
8: 32643.0,
9: 7533.0,
10: 1255.5,
11: 2511.0,
12: 23854.5,
13: 20622.750000000004,
14: 8840.0,
15: 5824.0,
16: 13260.0,
17: 21645.0,
18: 3315.0,
19: 11765.0,
20: 12492.15,
21: 4000.0}}
{'Date': {0: Timestamp('2021-01-01 00:00:00'),
1: Timestamp('2021-01-02 00:00:00'),
2: Timestamp('2021-01-03 00:00:00'),
3: Timestamp('2021-01-04 00:00:00')},
'Lotno.': {0: 1, 1: 2, 2: 3, 3: 4},
'Volume': {0: 9300, 1: 82000, 2: 65000, 3: 46000},
'Price': {0: 2.15, 1: 1.55, 2: 1.3, 3: 1.25}}

Assuming you have two dataframes:
- one for the Production Orders
- another for the Lot Details
the following function should allow you to calculate the revenues (along with the 'Lotno.' and 'Left of Lotno.' intermediary columns).
Requirements for each dataframe:
The Production Orders DataFrame must:
- contain a column titled 'Processed volume'
- have an index of consecutive integers starting at 0.
The Lot Details DataFrame must:
- contain the columns ['Lotno.', 'Volume', 'Price']
- have at least one row
- have its rows ordered in the order of expected depletion.
In the event that the quantity available in the lots is depleted, no additional revenue is generated.
def fill_revenue(df1_orig, df2):
    """
    df1_orig is the Production Orders DataFrame
    df2 is the Lot Details DataFrame
    The returned DataFrame is based on a copy of df1_orig
    """
    df1 = df1_orig.copy()
    # Create empty columns for the calculated fields
    df1['Lotno.'] = None
    df1['Left of Lotno.'] = None
    df1['Revenu'] = None

    def recursive_revenu_calc(order_volume, current_lot, current_lot_quantity, return_dict=None):
        """A function used to update the new values of a row"""
        if return_dict is None:
            return_dict = {'Revenu': 0}
        return_dict.update({'Lotno.': current_lot, 'Left of Lotno.': current_lot_quantity})
        lot_info = df2.loc[df2['Lotno.'] == current_lot].iloc[0]
        # start calculation
        if current_lot_quantity > order_volume:
            return_dict['Revenu'] += order_volume * lot_info['Price']
            current_lot_quantity -= order_volume
            order_volume = 0
            return_dict['Left of Lotno.'] = current_lot_quantity
        else:
            return_dict['Revenu'] += current_lot_quantity * lot_info['Price']
            order_volume -= current_lot_quantity
            try:
                lot_info = df2.iloc[df2.index.get_loc(lot_info.name) + 1]
            except IndexError:
                return_dict['Left of Lotno.'] = 0
                return return_dict
            current_lot = lot_info['Lotno.']
            current_lot_quantity = lot_info['Volume']
            recursive_revenu_calc(order_volume, current_lot, current_lot_quantity, return_dict)
        return return_dict

    # update each row of the Production Orders DataFrame
    for idx, row in df1.iterrows():
        order_volume = row['Processed volume']
        current_lot = df2.iloc[0]['Lotno.'] if idx == 0 else df1.iloc[idx - 1]['Lotno.']
        current_lot_quantity = df2.iloc[0]['Volume'] if idx == 0 else df1.iloc[idx - 1]['Left of Lotno.']
        update_dict = recursive_revenu_calc(order_volume, current_lot, current_lot_quantity)
        for key, value in update_dict.items():
            df1.loc[idx, key] = value
    return df1
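A minimal usage sketch, assuming the two dictionaries posted in the question are stored in orders_dict and lots_dict (those names are mine, not from the question):

import pandas as pd

# Rebuild the two dataframes from the dictionaries posted above;
# only the input columns are needed, the rest gets calculated
df_orders = pd.DataFrame(orders_dict)[['Productionorder', 'Processed volume']]
df_lots = pd.DataFrame(lots_dict)

result = fill_revenue(df_orders, df_lots)
print(result)  # includes the calculated 'Lotno.', 'Left of Lotno.' and 'Revenu' columns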

How to get cumulative sum over index and columns in pandas? [duplicate]

This question already has an answer here: How can I use cumsum within a group in Pandas? (1 answer)
Closed 6 months ago.
I have a periodic table that includes the premium in different categories over a year for different companies. The dataframe looks like the below:

   Company  Type             Month  Year  Ferdi Grup  Premium
1  Allianz  Birikimli Hayat  1      2022  Ferdi       325
2  Allianz  Birikimli Hayat  2      2022  Ferdi       476
3  Axa      Birikimli Hayat  3      2022  Ferdi       687
I want to get a table where I can see the premium cumulated over 'Company' and 'Year'. For each month I want to see premium cumulated from the beginning of the year.
This is the regular sum operation, which works well in this case:
data.pivot_table(
    columns='Company',
    index='Month',
    values='Premium',
    aggfunc=np.sum
)
However, when I change to np.cumsum, the result is a Series. I want a cumulated pivot table for each year, adding each month's value to the previous ones. How can I do that?
Expected output:

   Company  Month  Year  Premium
1  Allianz  1      2022  325
2  Allianz  2      2022  801
3  Axa      3      2022  687
So, this is the original data I am working with:
{'Company': {0: 'AgeSA',
1: 'Türkiye',
2: 'Türkiye',
3: 'AgeSA',
4: 'AgeSA',
5: 'Türkiye',
6: 'AgeSA',
7: 'Türkiye',
8: 'Türkiye',
9: 'AgeSA',
10: 'Türkiye',
11: 'Türkiye',
12: 'AgeSA',
13: 'Türkiye',
14: 'Türkiye',
15: 'AgeSA',
16: 'AgeSA',
17: 'Türkiye',
18: 'AgeSA',
19: 'Türkiye',
20: 'Türkiye',
21: 'AgeSA',
22: 'Türkiye',
23: 'Türkiye'},
'Type': {0: 'Birikimli Hayat',
1: 'Birikimli Hayat',
2: 'Sadece Yaşam Teminatlı',
3: 'Karma Sigorta',
4: 'Yıllık Vefat',
5: 'Yıllık Vefat',
6: 'Uzun Süreli Vefat',
7: 'Uzun Süreli Vefat',
8: 'Birikimli Hayat',
9: 'Yıllık Vefat',
10: 'Yıllık Vefat',
11: 'Uzun Süreli Vefat',
12: 'Birikimli Hayat',
13: 'Birikimli Hayat',
14: 'Sadece Yaşam Teminatlı',
15: 'Karma Sigorta',
16: 'Yıllık Vefat',
17: 'Yıllık Vefat',
18: 'Uzun Süreli Vefat',
19: 'Uzun Süreli Vefat',
20: 'Birikimli Hayat',
21: 'Yıllık Vefat',
22: 'Yıllık Vefat',
23: 'Uzun Süreli Vefat'},
'Month': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 1,
9: 1,
10: 1,
11: 1,
12: 2,
13: 2,
14: 2,
15: 2,
16: 2,
17: 2,
18: 2,
19: 2,
20: 2,
21: 2,
22: 2,
23: 2},
'Year': {0: 2022,
1: 2022,
2: 2022,
3: 2022,
4: 2022,
5: 2022,
6: 2022,
7: 2022,
8: 2022,
9: 2022,
10: 2022,
11: 2022,
12: 2022,
13: 2022,
14: 2022,
15: 2022,
16: 2022,
17: 2022,
18: 2022,
19: 2022,
20: 2022,
21: 2022,
22: 2022,
23: 2022},
'Ferdi Grup': {0: 'Ferdi',
1: 'Ferdi',
2: 'Ferdi',
3: 'Ferdi',
4: 'Ferdi',
5: 'Ferdi',
6: 'Ferdi',
7: 'Ferdi',
8: 'Grup',
9: 'Grup',
10: 'Grup',
11: 'Grup',
12: 'Ferdi',
13: 'Ferdi',
14: 'Ferdi',
15: 'Ferdi',
16: 'Ferdi',
17: 'Ferdi',
18: 'Ferdi',
19: 'Ferdi',
20: 'Grup',
21: 'Grup',
22: 'Grup',
23: 'Grup'},
'Premium': {0: 936622.43,
1: 14655.67,
2: 8496.0,
3: 124768619.29,
4: 6651019.24,
5: 11055383.530005993,
6: 54273212.457471885,
7: 22163192.66,
8: 81000.95,
9: 9338009.52,
10: 251790130.54997802,
11: 140949274.79999998,
12: 910808.77,
13: 8754.71,
14: 7128.0,
15: 129753498.31,
16: 8015974.454128993,
17: 16776490.000003006,
18: 67607915.34000003,
19: 24683694.700000003,
20: 60887.56,
21: 1497105.2458709963,
22: 195019190.297756,
23: 167424048.43},
'cumsum': {0: 936622.43,
1: 14655.67,
2: 23151.67,
3: 125705241.72000001,
4: 132356260.96000001,
5: 11078535.200005993,
6: 186629473.4174719,
7: 33241727.860005993,
8: 33322728.810005993,
9: 195967482.9374719,
10: 285112859.35998404,
11: 426062134.159984,
12: 196878291.7074719,
13: 426070888.869984,
14: 426078016.869984,
15: 326631790.0174719,
16: 334647764.4716009,
17: 442854506.869987,
18: 402255679.8116009,
19: 467538201.569987,
20: 467599089.129987,
21: 403752785.05747193,
22: 662618279.427743,
23: 830042327.857743}}
This is the result of a regular sum pivot:

Company               AgeSA             Türkiye
Month
1         195967482.9374719    426062134.159984
2        207785302.12000003  403980193.69775903
When I use the suggested code as below:
df_2 = data.copy()
df_2['cumsum'] = df_2.groupby(['Company', 'Year'])[['Premium']].cumsum()
df_2.sort_values(['Company', 'Year', 'cumsum']).reset_index(drop=True)
each line gets a cumsum value from the lines above it, it seems. For me to be able to get the table I need, I have to take the max in each group again in a pivot_table:
df_2.pivot_table(
    index=['Year', 'Month'],
    values=['Premium', 'cumsum'],
    columns='Company',
    aggfunc={'Premium': 'sum', 'cumsum': 'max'}
)
which finally gets me to the result. Is it that difficult to get the cumsum table in pandas, or am I just doing it the hard way?
Your dataframe is already in the right format, so why do you want to pivot it again?
I think what you are searching for is a pandas groupby:
df['cumsum_by_group'] = df.groupby(['Company', 'Year'])['Premium'].cumsum()
Output:

   Company  Type             Month  Year  Ferdi Grup  Premium  cumsum_by_group
1  Allianz  Birikimli Hayat  1      2022  Ferdi       325      325
2  Allianz  Birikimli Hayat  2      2022  Ferdi       476      801
3  Axa      Birikimli Hayat  3      2022  Ferdi       687      687
To calculate the cumulative sum over multiple columns of a dataframe, you can use pandas.DataFrame.groupby and pandas.DataFrame.cumsum combined.
Assuming that data is the dataframe that holds the original dataset, use the code below:
data['Premium'] = data.groupby(['Company', 'Year'])['Premium'].cumsum()
out = data[['Company', 'Month', 'Year', 'Premium']]  # to select the specific columns
>>> print(out)
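For what it's worth, a minimal single-chain sketch of my own (not from the answers above) that lands directly in the month-by-company pivot shape: aggregate to month level first, cumulate within each Year/Company group, then unstack:

out = (
    data.groupby(['Year', 'Month', 'Company'])['Premium'].sum()
        .groupby(level=['Year', 'Company']).cumsum()   # running total within each year/company
        .unstack('Company')                            # companies become columns
)
print(out)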

Python nested for loop and if statement trouble

I am looking to build a dictionary of lists that meet the following criteria:
Item0 in list1 == Item0 in list2, Item1 in list1 == Item1 in list2, and Date2 in list1 < Date2 in list2.
Running the code as is gives me one list in the dict. The one list is the same even if I change the if statement to > instead of <.
Everything prior to this (see below) looks correct:
for li in Liststr2:
    for lr in Liststr1A:
        if lr[0] == li[0] and lr[1] == li[1] and lr[2] > li[2]:
Also, lr[2] and li[2] are dtype <M8[ns], if that makes a difference.
df = {'Position': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1, 7: 2, 8: 1, 9: 1, 10: 1, 11: 2, 12: 1, 13: 2, 14: 1, 15: 2, 16: 1, 17: 2, 18: 1, 19: 2, 20: 1}, 'Location': {0: 'AB1', 1: 'AB2', 2: 'AB3', 3: 'AB4', 4: 'AB4', 5: 'AB4', 6: 'AB4', 7: 'AB4', 8: 'AB4', 9: 'AB2', 10: 'AB5', 11: 'AB4', 12: 'AB4', 13: 'AB6', 14: 'AB6', 15: 'AB6', 16: 'AB6', 17: 'AB6', 18: 'AB6', 19: 'AB1', 20: 'AB1'}, 'DATE': {0: Timestamp('2021-05-22 18:00:00'), 1: Timestamp('2021-05-21 13:00:00'), 2: Timestamp('2021-05-24 12:23:00'), 3: Timestamp('2021-05-23 12:25:00'), 4: Timestamp('2021-05-23 12:25:00'), 5: Timestamp('2021-05-23 12:25:00'), 6: Timestamp('2021-05-23 12:25:00'), 7: Timestamp('2021-05-23 12:25:00'), 8: Timestamp('2021-05-23 12:25:00'), 9: Timestamp('2021-05-21 18:00:00'), 10: Timestamp('2021-05-21 18:00:00'), 11: Timestamp('2021-05-24 14:08:00'), 12: Timestamp('2021-05-24 14:08:00'), 13: Timestamp('2021-05-24 16:35:00'), 14: Timestamp('2021-05-24 16:35:00'), 15: Timestamp('2021-05-24 16:35:00'), 16: Timestamp('2021-05-24 16:35:00'), 17: Timestamp('2021-05-24 19:48:00'), 18: Timestamp('2021-05-24 19:48:00'), 19: Timestamp('2021-05-25 23:45:00'), 20: Timestamp('2021-05-25 23:45:00')}, 'Item Numbers': {0: '788-33', 1: '07-1', 2: '5214-3', 3: '003', 4: '003', 5: '009J', 6: '009J', 7: '009J', 8: '009J', 9: '07-1', 10: '68-302', 11: '6-5213', 12: '6-5214', 13: '1-801', 14: '1-801', 15: '1-801', 16: '1-801', 17: '4-008', 18: '4-008', 19: 'A-001', 20: 'A-001'}}
finaltemp = []
Finallist = {}
str1Temp = []
str2Temp = []
NaValues = []
Liststr2 = []
Liststr1 = []
Listna = []
n = 0
for col, row in df.iterrows():
    col1Temp = row['col1']
    col2Temp = row['col2']
    col3temp = row['col3']
    col4Temp = row['col4']
    if col4Temp == None:
        NaValues = [col1Temp, col3temp, col2Temp]
        Listna.append(NaValues)
    if col4Temp == 'str1':
        str1Temp = [col1Temp, col3temp, col2Temp]
        Liststr1.append(str1Temp)
    if col4Temp == 'str2':
        str2Temp = [col1Temp, col3temp, col2Temp]
        Liststr2.append(str2Temp)
    for li in Liststr2:
        for lr in Liststr1:
            if lr[0] == li[0] and lr[1] == li[1] and lr[2] > li[2]:
                finaltemp = [lr[0], lr[1], lr[2]]
                n = +1
                key = 'Bad' + str(n)
                def t(): return {key: finaltemp}
                Finallist.update(t())
print(Finallist)
This simplifies your final loop, which as I said should be at the left margin, not indented one step:
Liststr1A = Liststr1[:10]
for li in Liststr2:
    for lr in Liststr1A:
        if lr[0] == li[0] and lr[1] == li[1] and lr[2] > li[2]:
            Finallist['Bad' + str(len(Finallist) + 1)] = lr[:]
print(Finallist)
It's not clear to me why you want Finallist to be a dictionary, since you want incrementing keys. Why not just make it a list and use Finallist.append?
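For instance, a minimal sketch of that list-based variant (same variable names as above; the list name is my own):

# Append each match instead of overwriting numbered dictionary keys
bad_matches = []
for li in Liststr2:
    for lr in Liststr1A:
        if lr[0] == li[0] and lr[1] == li[1] and lr[2] > li[2]:
            bad_matches.append(lr[:])
print(bad_matches)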

How to clip multiple columns in a pandas dataframe

Here is the df:
{'Type 1': {1: 123.0,
2: 123.0,
3: 123.0,
4: 123.0,
5: 123.0,
6: 45.0,
7: 45.0,
8: 45.0,
9: 45.0,
10: 9.5,
11: 9.5,
12: 9.5,
13: 2.34,
14: 2.34,
15: 2.34},
'Type 2': {1: 0,
2: 0,
3: -90,
4: -90,
5: -90,
6: -90,
7: -90,
8: -270,
9: -270,
10: -270,
11: -270,
12: 180,
13: 180,
14: 181,
15: 181},
'Type 3': {1: 0,
2: 0,
3: 0,
4: 0,
5: 55,
6: 55,
7: 55,
8: 55,
9: 55,
10: 9,
11: 9,
12: 3,
13: 3,
14: 3,
15: 3},
'Type 4': {1: 5.0,
2: 5.0,
3: 5.0,
4: 5.0,
5: 10.0,
6: 123.0,
7: 12.0,
8: 23.0,
9: 16.0,
10: 3.14,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 18.0},
'Type 5': {1: 65536,
2: 65536,
3: 65536,
4: 65536,
5: 78888888,
6: 665,
7: 665,
8: 665,
9: 665,
10: 665,
11: 665,
12: 665,
13: 665,
14: 665,
15: 665},
'Type 6': {1: 3.4124,
2: 3.4124,
3: 3.4124,
4: 3.4124,
5: 3.4124,
6: 3.4124,
7: 3.4124,
8: 3.4124,
9: 3.4124,
10: 3.4124,
11: 3.4124,
12: 3.4124,
13: 3.4124,
14: 3.4124,
15: 3.4124},
'Type 7': {1: 0,
2: 0,
3: 2,
4: 2,
5: 2,
6: 1,
7: 1,
8: 1,
9: 1,
10: 10,
11: 10,
12: 9,
13: 9,
14: -5,
15: -5},
'Type 8': {1: 'convert the string to 0 and non-zero value to 1',
2: 'convert the string to 0 and non-zero value to 1',
3: 'convert the string to 0 and non-zero value to 1',
4: 'convert the string to 0 and non-zero value to 1',
5: 'convert the string to 0 and non-zero value to 1',
6: 'convert the string to 0 and non-zero value to 1',
7: 'convert the string to 0 and non-zero value to 1',
8: 'convert the string to 0 and non-zero value to 1',
9: 'convert the string to 0 and non-zero value to 1',
10: 'convert the string to 0 and non-zero value to 1',
11: 'convert the string to 0 and non-zero value to 1',
12: 'convert the string to 0 and non-zero value to 1',
13: 'convert the string to 0 and non-zero value to 1',
14: 'convert the string to 0 and non-zero value to 1',
15: 'convert the string to 0 and non-zero value to 1'},
'Type 9': {1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 1,
7: 1,
8: 0,
9: 0,
10: 8,
11: 8,
12: 0,
13: 0,
14: 45,
15: 45}}
Each column in the dataframe has a lower and an upper limit, as given in the lists below, e.g.:
lower_limit = [3,-90,0,0,0,1,0,0,0] #Type 1 lower limit is 3...
upper_limit = [100,90,50,100,65535,3,1,1,1] #Type 1 upper limit is 100...
lower_limit = pd.Series(lower_limit)
upper_limit = pd.Series(upper_limit)
df.clip(lower_limit, upper_limit, axis=1)
But this returns every element as NaN, whereas the expected result is to clip each column based on the upper and lower limits in the lists.
Using a for loop, I was able to make the necessary changes, but it was extremely slow when the size of df was huge.
I understand clipping is the faster way to modify df, but it doesn't work as expected. I am making some mistake in it; any advice on alternative, faster ways of clipping the columns?
From the documentation, lower and upper must be float or array-like, not Series.
You could do
lower_limit = [3,-90,0,0,0,1,0,'',0] #Type 1 lower limit is 3...
upper_limit = [100,90,50,100,65535,3,1,'',1] #Type 1 upper limit is 100...
df.clip(lower_limit, upper_limit, axis=1)
but column Type 8 is a string column, so you'd get an empty column with clip. You can fix that with
lower_limit = [3,-90,0,0,0,1,0,df['Type 8'].min(),0]
upper_limit = [100,90,50,100,65535,3,1,df['Type 8'].max(),1]
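As a side note (my own suggestion, not part of the answer above): the all-NaN result most likely comes from index alignment, since a plain pd.Series built from the list carries an integer index (0-8) that doesn't match the column labels. A sketch that gives the limit Series the column labels as its index and clips only the numeric columns:

import pandas as pd

num_cols = df.columns.drop('Type 8')  # leave the string column untouched
# Limit values align with the columns because the index matches the labels
lower = pd.Series([3, -90, 0, 0, 0, 1, 0, 0], index=num_cols)
upper = pd.Series([100, 90, 50, 100, 65535, 3, 1, 1], index=num_cols)
df[num_cols] = df[num_cols].clip(lower, upper, axis=1)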

How to remove duplicates based on lower frequency [duplicate]

This question already has answers here: Get the row(s) which have the max value in groups using groupby (15 answers)
Closed 2 years ago.
I have a table that looks like this.
I want to be able to keep the ids for brands that have the highest freq. For example, in the case of audi both ids have the same frequency, so keep only one. In the case of mercedes-benz, keep the latter one, since it has frequency 7.
This is my dataframe:
{'Brand':
{0: 'audi',
1: 'audi',
2: 'bmw',
3: 'dacia',
4: 'fiat',
5: 'ford',
6: 'ford',
7: 'honda',
8: 'honda',
9: 'hyundai',
10: 'kia',
11: 'mercedes-benz',
12: 'mercedes-benz',
13: 'nissan',
14: 'nissan',
15: 'opel',
16: 'renault',
17: 'renault',
18: 'renault',
19: 'renault',
20: 'toyota',
21: 'toyota',
22: 'volvo',
23: 'vw',
24: 'vw',
25: 'vw',
26: 'vw'},
'id':
{0: 'audi_a4_dynamic_2016_otomatik',
1: 'audi_a6_standart_2015_otomatik',
2: 'bmw_5 series_executive_2016_otomatik',
3: 'dacia_duster_laureate_2017_manuel',
4: 'fiat_egea_easy_2017_manuel',
5: 'ford_focus_trend x_2015_manuel',
6: 'ford_focus_trend x_2015_otomatik',
7: 'honda_civic_eco elegance_2017_otomatik',
8: 'honda_cr-v_executive_2018_otomatik',
9: 'hyundai_tucson_elite plus_2017_otomatik',
10: 'kia_sportage_concept plus_2015_otomatik',
11: 'mercedes-benz_c-class_amg_2016_otomatik',
12: 'mercedes-benz_e-class_edition e_2015_otomatik',
13: 'nissan_qashqai_black edition_2014_manuel',
14: 'nissan_qashqai_sky pack_2015_otomatik',
15: 'opel_astra_edition_2016_manuel',
16: 'renault_clio_joy_2016_manuel',
17: 'renault_kadjar_icon_2015_otomatik',
18: 'renault_kadjar_icon_2016_otomatik',
19: 'renault_mégane_touch_2017_otomatik',
20: 'toyota_corolla_touch_2015_otomatik',
21: 'toyota_corolla_touch_2016_otomatik',
22: 'volvo_s60_advance_2018_otomatik',
23: 'vw_jetta_comfortline_2013_otomatik',
24: 'vw_passat_highline_2017_otomatik',
25: 'vw_tiguan_sport&style_2012_manuel',
26: 'vw_tiguan_sport&style_2013_manuel'},
'freq': {0: 4,
1: 4,
2: 7,
3: 4,
4: 4,
5: 4,
6: 4,
7: 4,
8: 4,
9: 4,
10: 4,
11: 4,
12: 7,
13: 4,
14: 4,
15: 4,
16: 4,
17: 4,
18: 4,
19: 4,
20: 4,
21: 4,
22: 4,
23: 4,
24: 7,
25: 4,
26: 4}}
Edit: I tried one of the answers and got an extra level of header.
You need to pandas.groupby on Brand and then aggregate with respect to the maximal frequency.
Something like this should work:
df.groupby('Brand')[['id', 'freq']].agg({'freq': 'max'})
To get your result, run:
result = df.groupby('Brand', as_index=False).apply(
    lambda grp: grp[grp.freq == grp.freq.max()].iloc[0])
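A common alternative idiom for "keep the max-freq row per group" (a sketch along the lines of the linked duplicate) is to sort by freq and drop duplicate brands, keeping the first (highest) row per brand:

result = (df.sort_values('freq', ascending=False)  # highest freq first
            .drop_duplicates('Brand')              # keep one row per brand
            .sort_values('Brand')
            .reset_index(drop=True))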

left and right justify columns correctly in pandas dataframe

I have the following dictionary, where each key in the dictionary is associated with a dataframe.
data['total_brands'] = pd.DataFrame({'total_brands': {0: 164}})
data['new_portfolios_added'] = pd.DataFrame({'new_portfolios_added': {0: 3}})
data['total_updated_portfolios'] = pd.DataFrame({'total_updated_portfolios': {0: 1}})
data['family_per_brand'] = pd.DataFrame({'brand_name': {0: 'Morningstar',
1: 'Vanguard',
2: 'WisdomTree',
3: 'State Street',
4: 'First Trust',
5: 'Franklin Templeton Investments',
6: 'Logicly',
7: 'Nuveen',
8: 'Scott Burns',
9: 'Paul Merriman',
10: 'Fidelity',
11: 'FlexShares',
12: 'Alpha Architect',
13: 'Rick Ferri',
14: 'Craig Israelsen',
15: 'Rajan Subramanian',
16: 'Goldman Sachs',
17: 'JPMorgan',
18: 'Xtrackers',
19: 'PIMCO',
20: 'John Hancock',
21: 'Hartford',
22: 'Invesco',
23: 'Schwab'},
'family_per_brand': {0: 7,
1: 6,
2: 5,
3: 5,
4: 4,
5: 4,
6: 3,
7: 3,
8: 2,
9: 2,
10: 2,
11: 1,
12: 1,
13: 1,
14: 1,
15: 1,
16: 0,
17: 0,
18: 0,
19: 0,
20: 0,
21: 0,
22: 0,
23: 0}})
Now, I want to send all my data in text format within the body of an email, with the data frames looking presentable. I searched around Stack Overflow and found these functions to help with my case:
blanks = r'^ *([a-zA-Z_0-9-]*) .*$'
blanks_comp = re.compile(blanks)

def find_index_in_line(line):
    index = 0
    spaces = False
    for ch in line:
        if ch == ' ':
            spaces = True
        elif spaces:
            break
        index += 1
    return index

def pretty_to_string(df):
    lines = df.to_string().split('\n')
    header = lines[0]
    m = blanks_comp.match(header)
    indices = []
    if m:
        st_index = m.start(1)
        indices.append(st_index)
    non_header_lines = lines[1:len(lines)]
    for line in non_header_lines:
        index = find_index_in_line(line)
        indices.append(index)
    mn = np.min(indices)
    newlines = []
    for l in lines:
        newlines.append(l[mn:len(l)])
    return '\n'.join(newlines) if df.shape[0] > 1 else ':'.join(newlines)
Then I tried:
final = "\n".join(pretty_to_string(data[key]) for key in data.keys())
print(final)
This gives me the following output, which is visually not appealing, as you can see from the attachment.
Ideally I would want 164 under total_brands, 3 under new_portfolios_added, and 1 under total_updated_portfolios, all aligned to the right.
Ideally I would want the dataframe with the column "brand_name" aligned below the "total_updated_portfolios" tab.
Perhaps saving to a CSV, then opening it in Excel and copying the table into the email would be fastest/easiest. That method often preserves the formatting you select, depending on your email client. Since data here is a dict of dataframes, write each one out:
for key in data:
    data[key].to_csv(key + '.csv')
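Alternatively, a minimal plain-text sketch (my own suggestion): DataFrame.to_string already right-aligns numeric columns, so joining the rendered frames may be presentable enough for an email body:

# Render each dataframe as aligned text and stack the blocks with blank lines
body = "\n\n".join(data[key].to_string(index=False) for key in data)
print(body)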
