Pandas dataframe: applying logic to column calculations - python

Hi, I have a huge dataframe with the following structure:
ticker calendar-date last-update Assets Ebitda .....
0 a 2001-06-30 2001-09-14 110 1000 .....
1 a 2001-09-30 2002-01-22 0 -8 .....
2 a 2001-09-30 2002-02-01 0 800 .....
3 a 2001-12-30 2002-03-06 120 0 .....
4 b 2001-06-30 2001-09-18 110 0 .....
5 b 2001-06-30 2001-09-27 110 30 .....
6 b 2001-09-30 2002-01-08 140 35 .....
7 b 2001-12-30 2002-03-08 120 40 .....
..
What I want, for each ticker, is to create new columns with the % change in Assets and Ebitda from the previous calendar-date (t-1) and the one before that (t-2) for each row.
But here come the problems:
1) As you can see, calendar-date values are not always unique within a ticker, since there can be several last-update entries for the same calendar-date, but I always want the change since the last calendar-date, not since the last last-update.
2) There are rows with 0 values; in that case I want to use the last observed value to calculate the % change. If I only had one stock that would be easy, I would just ffill the values, but since I have many tickers I cannot do that safely: I could pad a value from ticker 'a' into ticker 'b', and that is not what I want.
I guess this could be solved by writing a function with if statements to handle the exceptions, but maybe there is a good way to handle this inside pandas, multi-indexing perhaps? The truth is that I have no idea how to approach this task. Can anybody help?
Thanks

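For reference, a minimal version of the sample frame can be rebuilt like this (a sketch: the extra columns hinted at by ..... are omitted, and the dates are kept as strings for brevity):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ticker':        list('aaaabbbb'),
    'calendar-date': ['2001-06-30', '2001-09-30', '2001-09-30', '2001-12-30',
                      '2001-06-30', '2001-06-30', '2001-09-30', '2001-12-30'],
    'last-update':   ['2001-09-14', '2002-01-22', '2002-02-01', '2002-03-06',
                      '2001-09-18', '2001-09-27', '2002-01-08', '2002-03-08'],
    'Assets':        [110, 0, 0, 120, 110, 110, 140, 120],
    'Ebitda':        [1000, -8, 800, 0, 0, 30, 35, 40],
})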
Step 1
sort_values to ensure proper ordering for later manipulation
icols = ['ticker', 'calendar-date', 'last-update']
df.sort_values(icols, inplace=True)
Step 2
groupby 'ticker', replace zeros with NaN, and forward fill
vcols = ['Assets', 'Ebitda']
# treat zeros as missing, then forward fill within each ticker
temp = df.groupby('ticker')[vcols].apply(lambda x: x.replace(0, np.nan).ffill())
d1 = df.assign(**temp.to_dict('list'))
d1
ticker calendar-date last-update Assets Ebitda
0 a 2001-06-30 2001-09-14 110.0 1000.0
1 a 2001-09-30 2002-01-22 110.0 -8.0
2 a 2001-09-30 2002-02-01 110.0 800.0
3 a 2001-12-30 2002-03-06 120.0 800.0
4 b 2001-06-30 2001-09-18 110.0 NaN
5 b 2001-06-30 2001-09-27 110.0 30.0
6 b 2001-09-30 2002-01-08 140.0 35.0
7 b 2001-12-30 2002-03-08 120.0 40.0
NOTE: The first 'Ebitda' for 'b' is NaN because there was nothing to forward fill from.
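An equivalent way to write Step 2 that avoids apply entirely is to mask the zeros first and then forward fill per group (a sketch; GroupBy.ffill preserves the original index):
masked = df[vcols].replace(0, np.nan)
d1 = df.copy()
d1[vcols] = masked.groupby(df['ticker']).ffill()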
Step 3
groupby ['ticker', 'calendar-date'] and take the last row of each group. Because we sorted above, the last row is the most recently updated one.
d2 = d1.groupby(icols[:2])[vcols].last()
Step 4
groupby again, this time just by 'ticker', which is in the index of d2, and take the pct_change
d3 = d2.groupby(level='ticker').pct_change()
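For reference, d3 at this point holds one row per (ticker, calendar-date) pair; its values match the _pct columns in the final output below:
                        Assets    Ebitda
ticker calendar-date
a      2001-06-30          NaN       NaN
       2001-09-30     0.000000 -0.200000
       2001-12-30     0.090909  0.000000
b      2001-06-30          NaN       NaN
       2001-09-30     0.272727  0.166667
       2001-12-30    -0.142857  0.142857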
Step 5
join back with df
df.join(d3, on=icols[:2], rsuffix='_pct')
ticker calendar-date last-update Assets Ebitda Assets_pct Ebitda_pct
0 a 2001-06-30 2001-09-14 110 1000 NaN NaN
1 a 2001-09-30 2002-01-22 0 -8 0.000000 -0.200000
2 a 2001-09-30 2002-02-01 0 800 0.000000 -0.200000
3 a 2001-12-30 2002-03-06 120 0 0.090909 0.000000
4 b 2001-06-30 2001-09-18 110 0 NaN NaN
5 b 2001-06-30 2001-09-27 110 30 NaN NaN
6 b 2001-09-30 2002-01-08 140 35 0.272727 0.166667
7 b 2001-12-30 2002-03-08 120 40 -0.142857 0.142857
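Putting the five steps together, the whole pipeline reads as follows (same assumptions as above; this is just the steps chained, not a different method):
icols = ['ticker', 'calendar-date', 'last-update']
vcols = ['Assets', 'Ebitda']

df = df.sort_values(icols)                                    # Step 1
filled = df.copy()
filled[vcols] = (df[vcols].replace(0, np.nan)
                          .groupby(df['ticker']).ffill())     # Step 2
per_date = filled.groupby(icols[:2])[vcols].last()            # Step 3
pct = per_date.groupby(level='ticker').pct_change()           # Step 4
result = df.join(pct, on=icols[:2], rsuffix='_pct')           # Step 5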

Related

How to adjust subtotal columns in pandas using groupby?

I'm exporting dataframes to Excel after joining them. When I calculate a subtotal with groupby on the joined dataframe, the word "Subtotal" ends up in the index column, as in the output of the code below. Is there any way to move it into the code column and renumber the index?
Here is the code:
def subtotal(df__, sheet_name):  # renamed from str to avoid shadowing the built-in
    container = []
    for key, group in df__.groupby('key'):
        group.loc['subtotal'] = group[['quantity', 'quantity2', 'quantity3']].sum()
        container.append(group)
    df_subtotal = pd.concat(container)
    df_subtotal.loc['GrandTotal'] = df__[['quantity', 'quantity2', 'quantity3']].sum()
    print(df_subtotal)
    # writer is an ExcelWriter defined in the enclosing scope
    return df_subtotal.to_excel(writer, sheet_name=sheet_name)
Use np.where() to fill the NaNs in the code column with the values from df.index, then assign a new index array to df.index.
import numpy as np
df['code'] = np.where(df['code'].isna(), df.index, df['code'])
df.index = np.arange(1, len(df) + 1)
print(df)
code key product quntity1 quntity2 quntity3
1 cs01767 a apple-a 10 0 10.0
2 Subtotal NaN NaN 10 0 10.0
3 cs0000 b bannana-a 50 10 40.0
4 cs0000 b bannana-b 0 0 0.0
5 cs0000 b bannana-c 0 0 0.0
6 cs0000 b bannana-d 80 20 60.0
7 cs0000 b bannana-e 0 0 0.0
8 cs01048 b bannana-f 0 0 NaN
9 cs01048 b bannana-g 0 0 0.0
10 Subtotal NaN NaN 130 30 100.0
11 cs99999 c melon-a 50 10 40.0
12 cs99999 c melon-b 20 20 0.0
13 cs01188 c melon-c 10 0 10.0
14 Subtotal NaN NaN 80 30 50.0
15 GrandTotal NaN NaN 220 60 160.0
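A pandas-only spelling of the same fill, for what it's worth, is Series.where, which keeps each value where the condition holds and substitutes the index label otherwise:
# keep code where it is not NaN, otherwise take the index label
df['code'] = df['code'].where(df['code'].notna(), df.index)
df.index = np.arange(1, len(df) + 1)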

How to loc 5 rows before and 5 rows after value 1 in column

I have a dataframe, and I want to locate the 5 rows before and the 5 rows after each row where the flag value is 1.
df = pd.DataFrame({'A': [2, 1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})
Expected output:
df1_before = pd.DataFrame({'A': [1, 3, 4, 7, 8],
                           'flag': [0, 0, 0, 0, 1]})
df1_after = pd.DataFrame({'A': [8, 11, 1, 15, 20],
                          'flag': [1, 1, 1, 0, 0]})
The same process should be done for each of the three rows where flag is 1.
I think one easy way is to loop over the index where the flag is 1 and select the rows you want with loc:
l = len(df)
for idx in df[df.flag.astype(bool)].index:
    dfb = df.loc[max(idx - 4, 0):idx]   # the 5 rows up to and including idx
    dfa = df.loc[idx:min(idx + 4, l)]   # idx and the 4 rows after it
    # do stuff
The min and max functions ensure the boundaries are not overrun in case there is a flag=1 within the first or last 5 rows. Note also that with loc, if you want 5 rows, you need to use +/-4 on idx to get the right segment.
That said, depending on what your actual #do stuff is, you might want to change tactics. Let's say, for example, you want to calculate the difference between the sum of A over the 5 rows after and the 5 rows before: you could use rolling and shift:
df['roll'] = df.rolling(5)['A'].sum()
df.loc[df.flag.astype(bool), 'diff_roll'] = df['roll'].shift(-4) - df['roll']
print (df)
A flag roll diff_roll
0 2 0 NaN NaN
1 1 0 NaN NaN
2 3 0 NaN NaN
3 4 0 NaN NaN
4 7 0 17.0 NaN
5 8 1 23.0 32.0 # = 55 - 23: 55 is the sum of A over df_after, 23 over df_before
6 11 1 33.0 29.0
7 1 1 31.0 36.0
8 15 0 42.0 NaN
9 20 0 55.0 NaN
10 15 0 62.0 NaN
11 16 0 67.0 NaN
12 87 0 153.0 NaN
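If the goal of #do stuff is just to keep the segments around, the loop can collect them into a dict keyed by the flag position (a sketch; windows is a name introduced here):
windows = {}
l = len(df)
for idx in df[df.flag.astype(bool)].index:
    windows[idx] = (df.loc[max(idx - 4, 0):idx],   # the 5 rows up to and including idx
                    df.loc[idx:min(idx + 4, l)])   # idx and the 4 rows after it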

Concat dataframes row-wise and merge rows if they exist

I have two dataframes:
Df_1:
A B C D
1 10 nan 20 30
2 20 30 20 10
Df_2:
A B
1 10 40
2 30 70
I want to merge them and have this final dataframe.
A B C D
1 10 40 20 30
2 20 30 20 10
3 30 70 nan nan
How do I do that?
Looking at the expected result, I think the index in the second row of Df_2 should be 3 (instead of 2).
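With that correction, the two frames can be rebuilt like this (a minimal sketch):
import numpy as np
import pandas as pd

Df_1 = pd.DataFrame({'A': [10, 20], 'B': [np.nan, 30],
                     'C': [20, 20], 'D': [30, 10]}, index=[1, 2])
Df_2 = pd.DataFrame({'A': [10, 30], 'B': [40, 70]}, index=[1, 3])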
Run Df_1.combine_first(Df_2).
The result is:
A B C D
1 10.0 40.0 20.0 30.0
2 20.0 30.0 20.0 10.0
3 30.0 70.0 NaN NaN
i.e. due to the possible NaN values, the column types are coerced to float.
But if you want, you can revert this where possible by applying to_numeric:
Df_1.combine_first(Df_2).apply(pd.to_numeric, downcast='integer')

Merge Columns with the Same name in the same dataframe if null

I have a dataframe that looks like this
Depth DT DT DT GR GR GR
1 100 NaN 45 NaN 100 50 NaN
2 200 NaN 45 NaN 100 50 NaN
3 300 NaN 45 NaN 100 50 NaN
4 400 NaN NaN 50 100 50 NaN
5 500 NaN NaN 50 100 50 NaN
I need to merge the same-named columns into one, keeping the first non-null value across them for each row.
In the end the data frame should look like
Depth DT GR
1 100 45 100
2 200 45 100
3 300 45 100
4 400 50 100
5 500 50 100
I am a beginner in pandas. I tried drop_duplicates, but it couldn't do what I wanted. Any suggestions?
IIUC, you can do:
(df.set_index('Depth')
   .groupby(level=0, axis=1).first()
   .reset_index())
Output:
Depth DT GR
0 100 45.0 100.0
1 200 45.0 100.0
2 300 45.0 100.0
3 400 50.0 100.0
4 500 50.0 100.0
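One caveat: groupby(..., axis=1) is deprecated in recent pandas releases, so on newer versions the same result can be had by grouping the transposed frame instead (a sketch):
out = (df.set_index('Depth')
         .T
         .groupby(level=0).first()   # first non-null value per duplicated column name
         .T
         .reset_index())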

pandas - add a column of the mean of the last 3 elements in groupby

I have a dataframe of several columns, which I sorted, grouped by index, and for which I calculated the difference between each row and the previous one in the group. Next I want to add a column with the mean of the last 3 differences. For example:
index A B A_diff B_diff A_diff_last3mean B_diff_last3mean
1111 1 2 0 0 NaN NaN
1111 1 2 0 0 NaN NaN
1111 2 4 1 2 0.33 0.67
1111 4 6 2 2 1 1.33
2222 5 7 NaN NaN NaN NaN #index changed
2222 2 8 -3 1 NaN NaN
I managed to create such columns using
df = df.join(df.groupby(['index'], sort=False, as_index=False).diff(), rsuffix='_diff')
y = df.groupby(['index'], sort=False, as_index=False).nth([-1, -2, -3])
z = y.groupby(['index'], sort=False, as_index=False).mean()
but that creates an aggregated dataframe, and I need the values merged back into the original one. I tried the .transform() function and did not get far. Would really appreciate your help.
import io
import pandas as pd
data = io.StringIO('''\
group A B
1111 1 2
1111 1 2
1111 2 4
1111 4 6
2222 5 7
2222 2 8
''')
df = pd.read_csv(data, delim_whitespace=True)
diff = (df.groupby('group')
          .diff()
          .fillna(0)
          .add_suffix('_diff'))
df = df.join(diff)
last3mean = (df.groupby('group')[diff.columns]
               .rolling(3).mean()
               .reset_index(drop=True)
               .add_suffix('_last3mean'))
df = df.join(last3mean)
print(df)
Output:
group A B A_diff B_diff A_diff_last3mean B_diff_last3mean
0 1111 1 2 0.0 0.0 NaN NaN
1 1111 1 2 0.0 0.0 NaN NaN
2 1111 2 4 1.0 2.0 0.333333 0.666667
3 1111 4 6 2.0 2.0 1.000000 1.333333
4 2222 5 7 0.0 0.0 NaN NaN
5 2222 2 8 -3.0 1.0 NaN NaN
Notes:
Although index is a perfectly valid column name, pandas DataFrames have indices too. To avoid confusion, I have renamed that column to group.
In your desired output, you seem to have filled the NaNs in columns A_diff and B_diff for group 1111 but not for group 2222. The first line in your code snippet does not perform such filling. I have filled them all with the .fillna(0) in the definition of diff, but you can drop that if you want.
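Since the question mentions .transform(), the rolling step can also be written with it; transform keeps the original index, so the reset_index(drop=True) becomes unnecessary (equivalent under the same assumptions):
last3mean = (df.groupby('group')[diff.columns]
               .transform(lambda s: s.rolling(3).mean())
               .add_suffix('_last3mean'))
df = df.join(last3mean)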
