I have two DataFrames as follows:
df1:
           A      B      C      D
index
0      10000  20000  30000  40000
df2:
         time  type     A     B    C     D
index
0      5/2020  unit  1000  4000  900   200
1      6/2020  unit  7000  2000  600  4000
I want to divide df1.iloc[0] by all rows in df2 to get the following:
df:
         time  type     A   B      C    D
index
0      5/2020  unit    10   5  33.33  200
1      6/2020  unit  1.43  10     50   10
I tried to use df1.iloc[0].div(df2.iloc[:]) but that gave me NaNs for all rows other than the 0 index.
Any suggestions would be greatly appreciated. Thank you.
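For reproducibility, here is a minimal sketch of the two frames as I read them from the question (the index name 'index' is assumed):
import pandas as pd

df1 = pd.DataFrame({'A': [10000], 'B': [20000], 'C': [30000], 'D': [40000]})
df2 = pd.DataFrame({'time': ['5/2020', '6/2020'], 'type': ['unit', 'unit'],
                    'A': [1000, 7000], 'B': [4000, 2000],
                    'C': [900, 600], 'D': [200, 4000]})
df1.index.name = df2.index.name = 'index'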
Let us do
df2.update(df2.loc[:,df1.columns].rdiv(df1.iloc[0]))
df2
Out[861]:
time type A B C D
0 5/2020 unit 10.000000 5.0 33.333333 200.0
1 6/2020 unit 1.428571 10.0 50.000000 10.0
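Equivalently (a sketch assuming the same df1/df2 as above), you can divide directly, because pandas aligns a Series against a DataFrame's columns and broadcasts over the rows, and then assign the block back:
df2[df1.columns] = df1.iloc[0] / df2[df1.columns]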
Another way to do it, using numpy divide:
import numpy as np
df2.update(np.divide(df1.to_numpy(), df2.loc[:, df1.columns]))
df2
time type A B C D
0 5/2020 unit 10.000000 5.0 33.333333 200.0
1 6/2020 unit 1.428571 10.0 50.000000 10.0
You can use div.
df = df2.apply(lambda x:df1.iloc[0].div(x[df1.columns]), axis=1)
print(df)
A B C D
index
0 10.000000 5.0 33.333333 200.0
1 1.428571 10.0 50.000000 10.0
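If you also want to keep the time and type columns as in the expected output, one option (a sketch, assuming df1 and df2 as above) is to write the result back into a copy of df2:
out = df2.copy()
out[df1.columns] = df2.apply(lambda x: df1.iloc[0].div(x[df1.columns]), axis=1)
print(out)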
I have two dataframes:
Df_1:
    A    B   C   D
1  10  nan  20  30
2  20   30  20  10
Df_2:
    A   B
1  10  40
2  30  70
I want to merge them and have this final dataframe.
    A   B    C    D
1  10  40   20   30
2  20  30   20   10
3  30  70  nan  nan
How do I do that?
Looking at the expected result, I think the index in the second row of Df_2 should be 3 (instead of 2).
Run Df_1.combine_first(Df_2).
The result is:
A B C D
1 10.0 40.0 20.0 30.0
2 20.0 30.0 20.0 10.0
3 30.0 70.0 NaN NaN
i.e. due to possible NaN values, the column dtypes are coerced to float.
But if you want, you can revert this where possible by applying to_numeric:
Df_1.combine_first(Df_2).apply(pd.to_numeric, downcast='integer')
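A minimal end-to-end sketch, assuming the corrected Df_2 index (1 and 3) mentioned above:
import numpy as np
import pandas as pd

Df_1 = pd.DataFrame({'A': [10, 20], 'B': [np.nan, 30],
                     'C': [20, 20], 'D': [30, 10]}, index=[1, 2])
Df_2 = pd.DataFrame({'A': [10, 30], 'B': [40, 70]}, index=[1, 3])

out = Df_1.combine_first(Df_2).apply(pd.to_numeric, downcast='integer')
print(out)  # A and B come back as integers; C and D stay float because of the NaNs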
I have a dataframe df like:
GROUP TYPE COUNT
A 1 5
A 2 10
B 1 3
B 2 9
C 1 20
C 2 100
I would like to add a row for each group containing the quotient of the COUNT where TYPE equals 1 and the COUNT where TYPE equals 2 for that GROUP, like so:
GROUP  TYPE  COUNT
    A     1      5
    A     2     10
    A            .5
    B     1      3
    B     2      9
    B           .33
    C     1     20
    C     2    100
    C            .2
Thanks in advance.
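For reference, a minimal sketch of the example frame as given:
import pandas as pd

df = pd.DataFrame({'GROUP': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'TYPE': [1, 2, 1, 2, 1, 2],
                   'COUNT': [5, 10, 3, 9, 20, 100]})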
df2 = df.pivot(index='GROUP', columns='TYPE', values='COUNT')
df2['div'] = df2[1]/df2[2]
df2.reset_index().melt('GROUP').sort_values('GROUP')
Output:
GROUP TYPE value
0 A 1 5.000000
3 A 2 10.000000
6 A div 0.500000
1 B 1 3.000000
4 B 2 9.000000
7 B div 0.333333
2 C 1 20.000000
5 C 2 100.000000
8 C div 0.200000
My approach is to reshape the dataframe by pivoting, so every TYPE gets its own column. The division is then very easy, and melting reshapes it back to the original layout. In my opinion this is also a very readable solution.
Of course, if you prefer np.nan to 'div' in the TYPE column, you can replace it very easily, but I'm not sure if that's what you want.
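For instance, that replacement could look like this (a sketch; out is just an assumed name for the melted result above):
import numpy as np

out = df2.reset_index().melt('GROUP').sort_values('GROUP')
out['TYPE'] = out['TYPE'].replace('div', np.nan)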
s = df[df.TYPE.isin([1, 2])].sort_values(['GROUP', 'TYPE']).groupby('GROUP').COUNT.apply(lambda x: x.iloc[0]/x.iloc[1])
# sort and filter the original df first, so the rows are ordered and only TYPE 1 and 2 remain
pd.concat([df, s.reset_index()]).sort_values('GROUP')
# concat the result back
Out[77]:
COUNT GROUP TYPE
0 5.000000 A 1.0
1 10.000000 A 2.0
0 0.500000 A NaN
2 3.000000 B 1.0
3 9.000000 B 2.0
1 0.333333 B NaN
4 20.000000 C 1.0
5 100.000000 C 2.0
2 0.200000 C NaN
You can do:
import numpy as np
import pandas as pd

def add_quotient(x):
    # use a copy of the group's last row as a template for the new row
    last_row = x.iloc[-1].copy()
    last_row['COUNT'] = x[x.TYPE == 1].COUNT.min() / x[x.TYPE == 2].COUNT.max()
    last_row['TYPE'] = np.nan
    # append the quotient row (pd.concat, since DataFrame.append is deprecated)
    return pd.concat([x, last_row.to_frame().T.infer_objects()])

print(df.groupby('GROUP').apply(add_quotient))
Output
GROUP TYPE COUNT
GROUP
A 0 A 1.0 5.000000
1 A 2.0 10.000000
1 A NaN 0.500000
B 2 B 1.0 3.000000
3 B 2.0 9.000000
3 B NaN 0.333333
C 4 C 1.0 20.000000
5 C 2.0 100.000000
5 C NaN 0.200000
Note that the function selects the min of the TYPE == 1 rows and the max of the TYPE == 2 rows, in case there is more than one value per group. And TYPE is set to np.nan, but that can easily be changed.
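If the extra GROUP level that groupby().apply() adds to the index is unwanted, a possible follow-up (a sketch) is:
out = df.groupby('GROUP').apply(add_quotient).reset_index(drop=True)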
Here's a way: first use sort_values by ['GROUP', 'TYPE'], ensuring that TYPE 1 comes before TYPE 2 within each GROUP, and then groupby GROUP.
Then use first and nth(1) to compute the quotient and outer-merge with df:
g = df.sort_values(['GROUP', 'TYPE']).groupby('GROUP')
s = (g.first()/ g.nth(1)).COUNT.reset_index()
df.merge(s, on = ['GROUP','COUNT'], how='outer').fillna(' ').sort_values('GROUP')
   GROUP  TYPE       COUNT
0      A     1    5.000000
1      A     2   10.000000
6      A          0.500000
2      B     1    3.000000
3      B     2    9.000000
7      B          0.333333
4      C     1   20.000000
5      C     2  100.000000
8      C          0.200000
I have a pandas dataframe:
SrNo value
a nan
1 100
2 200
3 300
b nan
1 500
2 600
3 700
c nan
1 900
2 1000
I want my final dataframe as:
value new_col
100 a
200 a
300 a
500 b
600 b
700 b
900 c
1000 c
i.e. for SrNo 'a', the values under 'a' should get 'a' in a new column, and similarly for 'b' and 'c'.
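For reference, a minimal sketch of the frame (assuming SrNo is stored as strings, which the string-based answers below rely on):
import numpy as np
import pandas as pd

df = pd.DataFrame({'SrNo': ['a', '1', '2', '3', 'b', '1', '2', '3', 'c', '1', '2'],
                   'value': [np.nan, 100, 200, 300, np.nan, 500, 600, 700,
                             np.nan, 900, 1000]})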
Create the new column with where using an isnull condition on value, then use ffill to replace the NaNs by forward filling.
Last, remove the NaN rows with dropna and the helper column with drop:
print (df['SrNo'].where(df['value'].isnull()))
0 a
1 NaN
2 NaN
3 NaN
4 b
5 NaN
6 NaN
7 NaN
8 c
9 NaN
10 NaN
Name: SrNo, dtype: object
df['new_col'] = df['SrNo'].where(df['value'].isnull()).ffill()
df = df.dropna().drop(columns='SrNo')
print (df)
value new_col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c
Here's one way
In [2160]: df.assign(
               new_col=df.SrNo.str.extract(r'(\D+)', expand=False).ffill()
           ).dropna().drop(columns='SrNo')
Out[2160]:
value new_col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c
Another way: replace the numbers with NaN and ffill()
df['col'] = df['SrNo'].replace('([0-9]+)', np.nan, regex=True).ffill()
df = df.dropna(subset=['value']).drop(columns='SrNo')
Output:
value col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c
I have a pandas DataFrame and would like to get the column-wise mean(), as below.
A B C D
1 10 100 1000 10000
2 20 200 2000 20000
3 30 300 3000 30000
4 40 400 4000 40000
5 50 500 5000 50000
Answer:
A B C D
30 300 3000 30000
Please suggest a way to do it.
I have tried df.mean() and other variations of it.
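A minimal sketch of the setup (values taken from the table above); note that df.mean() by itself returns a Series indexed by the column names, which is why the answer below reshapes it into a one-row DataFrame:
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40, 50],
                   'B': [100, 200, 300, 400, 500],
                   'C': [1000, 2000, 3000, 4000, 5000],
                   'D': [10000, 20000, 30000, 40000, 50000]},
                  index=[1, 2, 3, 4, 5])
print(df.mean())  # Series with index A, B, C, D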
Use to_frame with T (transpose):
print (df.mean().to_frame().T)
A B C D
0 30.0 300.0 3000.0 30000.0
Or:
print (pd.DataFrame(df.mean().values.reshape(1,-1), columns=df.columns))
A B C D
0 30.0 300.0 3000.0 30000.0
Or:
print (pd.DataFrame(np.mean(df.values, axis=0).reshape(1,-1), columns=df.columns))
A B C D
0 30.0 300.0 3000.0 30000.0
Hi, I have a huge dataframe with the following structure:
ticker calendar-date last-update Assets Ebitda .....
0 a 2001-06-30 2001-09-14 110 1000 .....
1 a 2001-09-30 2002-01-22 0 -8 .....
2 a 2001-09-30 2002-02-01 0 800 .....
3 a 2001-12-30 2002-03-06 120 0 .....
4 b 2001-06-30 2001-09-18 110 0 .....
5 b 2001-06-30 2001-09-27 110 30 .....
6 b 2001-09-30 2002-01-08 140 35 .....
7 b 2001-12-30 2002-03-08 120 40 .....
..
What I want, for each ticker, is to create new columns with the % change in Assets and Ebitda from the previous calendar-date (t-1) and the one before that (t-2), for each row.
But here come the problems:
1) As you can see, calendar-date values (per ticker) are not always unique, since there can be several last-update rows for the same calendar-date, but I always want the change since the last calendar-date, not since the last last-update.
2) There are rows with 0 values; in that case I want to use the last observed value to calculate the % change. If I only had one stock that would be easy, I would just ffill the values, but since I have many tickers I cannot do this safely, because I could pad a value from ticker 'a' into ticker 'b', and that is not what I want.
I guess this could be solved by writing a function with if statements to handle the data exceptions, or maybe there is a good way to handle this inside pandas... maybe multi-indexing? The truth is that I have no idea how to approach this task; can anybody help?
Thanks
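For reference, a minimal sketch of the sample data (only the columns shown above; dates kept as strings, which is enough here since the ISO-like format sorts chronologically):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ticker': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'calendar-date': ['2001-06-30', '2001-09-30', '2001-09-30', '2001-12-30',
                      '2001-06-30', '2001-06-30', '2001-09-30', '2001-12-30'],
    'last-update': ['2001-09-14', '2002-01-22', '2002-02-01', '2002-03-06',
                    '2001-09-18', '2001-09-27', '2002-01-08', '2002-03-08'],
    'Assets': [110, 0, 0, 120, 110, 110, 140, 120],
    'Ebitda': [1000, -8, 800, 0, 0, 30, 35, 40],
})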
Step 1
sort_values to ensure proper ordering for later manipulation
icols = ['ticker', 'calendar-date', 'last-update']
df.sort_values(icols, inplace=True)
Step 2
groupby 'ticker', replace the zeros with NaN, and forward fill
vcols = ['Assets', 'Ebitda']
temp = df.groupby('ticker')[vcols].apply(lambda x: x.replace(0, np.nan).ffill())
d1 = df.assign(**temp.to_dict('list'))
d1
ticker calendar-date last-update Assets Ebitda
0 a 2001-06-30 2001-09-14 110.0 1000.0
1 a 2001-09-30 2002-01-22 110.0 -8.0
2 a 2001-09-30 2002-02-01 110.0 800.0
3 a 2001-12-30 2002-03-06 120.0 800.0
4 b 2001-06-30 2001-09-18 110.0 NaN
5 b 2001-06-30 2001-09-27 110.0 30.0
6 b 2001-09-30 2002-01-08 140.0 35.0
7 b 2001-12-30 2002-03-08 120.0 40.0
NOTE: The first 'Ebitda' for 'b' is NaN because there was nothing to forward fill from.
Step 3
groupby ['ticker', 'calendar-date'] and grab the last row of each group. Because we sorted above, the last row is the most recently updated one.
d2 = d1.groupby(icols[:2])[vcols].last()
Step 4
groupby again, this time just by 'ticker' which is in the index of d2, and take the pct_change
d3 = d2.groupby(level='ticker').pct_change()
Step 5
join back with df
df.join(d3, on=icols[:2], rsuffix='_pct')
ticker calendar-date last-update Assets Ebitda Assets_pct Ebitda_pct
0 a 2001-06-30 2001-09-14 110 1000 NaN NaN
1 a 2001-09-30 2002-01-22 0 -8 0.000000 -0.200000
2 a 2001-09-30 2002-02-01 0 800 0.000000 -0.200000
3 a 2001-12-30 2002-03-06 120 0 0.090909 0.000000
4 b 2001-06-30 2001-09-18 110 0 NaN NaN
5 b 2001-06-30 2001-09-27 110 30 NaN NaN
6 b 2001-09-30 2002-01-08 140 35 0.272727 0.166667
7 b 2001-12-30 2002-03-08 120 40 -0.142857 0.142857