How to merge two dataframes with different times and sizes - Python

I am trying to merge these two dataframes together and preserve all the rows and columns. They have different times under the column 'time', so I want to merge them in a way that is time-sequential.
df1:
time  run_id  weight
0     H1      500
24    H1      400
48    H1      300
0     H2      900
24    H2      800
48    H2      700
df2:
time  run_id  totalizer
0.5   H1      100
10    H1      200
40    H1      300
60    H1      400
0.5   H2      900
5     H2      1000
35    H2      1100
70    H2      1200
How do I merge these two tables into:
time  run_id  weight  totalizer
0     H1      500
0.5   H1              100
10    H1              200
24    H1      400
40    H1              300
48    H1      300
60    H1      400
0     H2      900
0.5   H2              900
5     H2              1000
24    H2      800
35    H2              1100
48    H2      700
70    H2      1200
I tried:
mergedf = df1.merge(df2, how='outer')
but it stacked df1 on top of df2.
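For reference, here is a minimal snippet that rebuilds the two example frames from the tables above, so the answers below can be reproduced:

import pandas as pd

df1 = pd.DataFrame({
    'time':   [0, 24, 48, 0, 24, 48],
    'run_id': ['H1', 'H1', 'H1', 'H2', 'H2', 'H2'],
    'weight': [500, 400, 300, 900, 800, 700],
})

df2 = pd.DataFrame({
    'time':      [0.5, 10, 40, 60, 0.5, 5, 35, 70],
    'run_id':    ['H1', 'H1', 'H1', 'H1', 'H2', 'H2', 'H2', 'H2'],
    'totalizer': [100, 200, 300, 400, 900, 1000, 1100, 1200],
})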

One option is to use combine_first:
cols = ["run_id", "time"]

out = (
    df1.set_index(cols)
       .combine_first(df2.set_index(cols))
       .reset_index()
       .sort_values(by=cols)
       [["time", "run_id", "weight", "totalizer"]]
)

Output:
print(out)
time run_id weight totalizer
0 0.0 H1 500.0 NaN
1 0.5 H1 NaN 100.0
2 10.0 H1 NaN 200.0
3 24.0 H1 400.0 NaN
4 40.0 H1 NaN 300.0
5 48.0 H1 300.0 NaN
6 60.0 H1 NaN 400.0
7 0.0 H2 900.0 NaN
8 0.5 H2 NaN 900.0
9 5.0 H2 NaN 1000.0
10 24.0 H2 800.0 NaN
11 35.0 H2 NaN 1100.0
12 48.0 H2 700.0 NaN
13 70.0 H2 NaN 1200.0

You could simply add a line after what you already have:
mergedf = df1.merge(df2, how='outer')              # your current code
mergedf = mergedf.sort_values(['run_id', 'time'])  # add this (sort_values returns a new frame)
Read more here: https://stackoverflow.com/a/17141755/2650341

You can use pandas' merge_ordered:
df_merged = pd.merge_ordered(df1, df2, on=['run_id', 'time'])

Related

Merge Columns with the Same name in the same dataframe if null

I have a dataframe that looks like this
   Depth   DT   DT   DT   GR  GR   GR
1    100  NaN   45  NaN  100  50  NaN
2    200  NaN   45  NaN  100  50  NaN
3    300  NaN   45  NaN  100  50  NaN
4    400  NaN  NaN   50  100  50  NaN
5    500  NaN  NaN   50  100  50  NaN
I need to merge the same-named columns into one, keeping, for each row, the first non-null value across those columns.
In the end the data frame should look like
Depth DT GR
1 100 45 100
2 200 45 100
3 300 45 100
4 400 50 100
5 500 50 100
I am a beginner in pandas. I tried but wasn't successful; drop_duplicates couldn't do what I wanted. Any suggestions?
IIUC, you can do:
(df.set_index('Depth')
   .groupby(level=0, axis=1).first()
   .reset_index())
output:
Depth DT GR
0 100 45.0 100.0
1 200 45.0 100.0
2 300 45.0 100.0
3 400 50.0 100.0
4 500 50.0 100.0
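If you are on a newer pandas where groupby with axis=1 is deprecated, an equivalent sketch of the same idea is to transpose, group the duplicated column labels, and transpose back:

out = (df.set_index('Depth')
         .T.groupby(level=0).first()   # first non-null value per duplicated column label
         .T.reset_index())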

How to group by and make sum from different sources in a pandas dataframe?

I have a dataframe df that contains the number of transactions between companies.
df
  Receiver Payer  Amount
0     0045  xx04     300
1     5400  zz03     600
2     5400  0045     100
3     xx04  5400     400
For each company I would like to count the amounts in and out, distinguishing between companies whose IDs contain only numbers and companies with non-numeric IDs. I would like to return something like:
df1
     ID  In_0  In_1  Out_0  Out_1
0  0045     0   300    100      0
1  5400   100   600      0    400
2  zz03     0     0    600      0
3  xx04   400     0    300      0
For now I have just tried a simple groupby for the total amount between each pair of companies, for instance:
df.groupby(['Receiver', 'Payer'], as_index = False)['Amount'].sum()
I think you need a little logic and some reshaping of your dataframe.
import numpy as np

df_out = df.rename(columns={'Receiver': 'IN', 'Payer': 'OUT'})
# flag whether the counterparty ID contains any non-digit character
df_out['IN_TYPE'] = df_out['OUT'].str.contains(r'\D').astype(int).astype(str)
df_out['OUT_TYPE'] = df_out['IN'].str.contains(r'\D').astype(int).astype(str)
df_out = df_out.melt(['Amount', 'IN_TYPE', 'OUT_TYPE'], value_name='ID')
df_out['Cols'] = df_out['variable'] + '_' + np.where(df_out['variable'] == 'IN', df_out['IN_TYPE'], df_out['OUT_TYPE'])
df_out = df_out.groupby(['ID', 'Cols'])['Amount'].sum().unstack().fillna(0).reset_index()
print(df_out)
Output:
Cols ID IN_0 IN_1 OUT_0 OUT_1
0 0045 0.0 300.0 100.0 0.0
1 5400 100.0 600.0 0.0 400.0
2 xx04 400.0 0.0 300.0 0.0
3 zz03 0.0 0.0 600.0 0.0
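If you need the exact column names from your expected output (In_0/In_1/Out_0/Out_1 rather than IN_0/IN_1/OUT_0/OUT_1), a final rename along these lines would do it:

df_out = df_out.rename(columns={'IN_0': 'In_0', 'IN_1': 'In_1',
                                'OUT_0': 'Out_0', 'OUT_1': 'Out_1'})
df_out.columns.name = None  # drop the leftover 'Cols' axis name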

Map value from one row as a new column in pandas

I have a pandas dataframe:
SrNo value
a nan
1 100
2 200
3 300
b nan
1 500
2 600
3 700
c nan
1 900
2 1000
I want my final dataframe to be:
value new_col
100 a
200 a
300 a
500 b
600 b
700 b
900 c
1000 c
i.e. for SrNo 'a', the values under 'a' should have 'a' in a new column, and similarly for 'b' and 'c'.
Create the new column with where, using an isnull condition, then use ffill to replace the NaNs by forward filling.
Last, remove the NaN rows with dropna and the helper column with drop:
print (df['SrNo'].where(df['value'].isnull()))
0 a
1 NaN
2 NaN
3 NaN
4 b
5 NaN
6 NaN
7 NaN
8 c
9 NaN
10 NaN
Name: SrNo, dtype: object
df['new_col'] = df['SrNo'].where(df['value'].isnull()).ffill()
df = df.dropna().drop(columns='SrNo')
print (df)
value new_col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c
Here's one way
In [2160]: df.assign(
               new_col=df.SrNo.str.extract(r'(\D+)', expand=False).ffill()
           ).dropna().drop(columns='SrNo')
Out[2160]:
value new_col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c
Another way: replace the numbers with NaN and ffill():
import numpy as np

df['col'] = df['SrNo'].replace(r'([0-9]+)', np.nan, regex=True).ffill()
df = df.dropna(subset=['value']).drop(columns='SrNo')
Output:
value col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c

Pandas multi column mean

I have a pandas DataFrame and would like to get the column-wise mean(), as below.
A B C D
1 10 100 1000 10000
2 20 200 2000 20000
3 30 300 3000 30000
4 40 400 4000 40000
5 50 500 5000 50000
Answer:
A B C D
30 300 3000 30000
Please suggest a way to do it.
I have tried df.mean() and other variations of it.
Add to_frame and transpose with T:
print (df.mean().to_frame().T)
A B C D
0 30.0 300.0 3000.0 30000.0
Or:
print (pd.DataFrame(df.mean().values.reshape(1,-1), columns=df.columns))
A B C D
0 30.0 300.0 3000.0 30000.0
Or:
print (pd.DataFrame(np.mean(df.values, axis=0).reshape(1,-1), columns=df.columns))
A B C D
0 30.0 300.0 3000.0 30000.0
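Another option that keeps the result as a one-row frame (labelled by the aggregation name rather than 0) is agg with a list:
print (df.agg(['mean']))
          A      B       C        D
mean   30.0  300.0  3000.0  30000.0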

Pandas dataframe applying logic to columns calculations

Hi I have a huge dataframe with the following structure:
ticker calendar-date last-update Assets Ebitda .....
0 a 2001-06-30 2001-09-14 110 1000 .....
1 a 2001-09-30 2002-01-22 0 -8 .....
2 a 2001-09-30 2002-02-01 0 800 .....
3 a 2001-12-30 2002-03-06 120 0 .....
4 b 2001-06-30 2001-09-18 110 0 .....
5 b 2001-06-30 2001-09-27 110 30 .....
6 b 2001-09-30 2002-01-08 140 35 .....
7 b 2001-12-30 2002-03-08 120 40 .....
..
What I want, for each ticker, is to create new columns with the % change in Assets and Ebitda since the last calendar-date (t-1) and the one before that (t-2), for each row.
But here come the problems:
1) As you can see, calendar-date values (per ticker) are not always unique, since there can be more than one last-update for the same calendar-date, but I always want the change since the last calendar-date, not since the last last-update.
2) There are rows with 0 values; in that case I want to use the last observed value to calculate the % change. If I only had one stock that would be easy, I would just ffill the values, but since I have many tickers I cannot do this safely, as I could pad a value from ticker 'a' into ticker 'b', which is not what I want.
I guess this could be solved by creating a function with if statements to handle the exceptions, or maybe there is a good way to handle this inside pandas... maybe multi-indexing? The truth is that I have no idea how to approach this task. Can anybody help?
Thanks
Step 1
sort_values to ensure proper ordering for later manipulation
icols = ['ticker', 'calendar-date', 'last-update']
df.sort_values(icols, inplace=True)
Step 2
groupby 'ticker' and replace zeros and forward fill
vcols = ['Assets', 'Ebitda']
temp = df.groupby('ticker')[vcols].apply(lambda x: x.replace(0, np.nan).ffill())
d1 = df.assign(**temp.to_dict('list'))
d1
ticker calendar-date last-update Assets Ebitda
0 a 2001-06-30 2001-09-14 110.0 1000.0
1 a 2001-09-30 2002-01-22 110.0 -8.0
2 a 2001-09-30 2002-02-01 110.0 800.0
3 a 2001-12-30 2002-03-06 120.0 800.0
4 b 2001-06-30 2001-09-18 110.0 NaN
5 b 2001-06-30 2001-09-27 110.0 30.0
6 b 2001-09-30 2002-01-08 140.0 35.0
7 b 2001-12-30 2002-03-08 120.0 40.0
NOTE: The first 'Ebitda' for 'b' is NaN because there was nothing to forward fill from.
Step 3
groupby ['ticker', 'calendar-date'] and grab the last row of each group. Because we sorted above, the last row will be the most recently updated row.
d2 = d1.groupby(icols[:2])[vcols].last()
Step 4
groupby again, this time just by 'ticker' which is in the index of d2, and take the pct_change
d3 = d2.groupby(level='ticker').pct_change()
Step 5
join back with df
df.join(d3, on=icols[:2], rsuffix='_pct')
ticker calendar-date last-update Assets Ebitda Assets_pct Ebitda_pct
0 a 2001-06-30 2001-09-14 110 1000 NaN NaN
1 a 2001-09-30 2002-01-22 0 -8 0.000000 -0.200000
2 a 2001-09-30 2002-02-01 0 800 0.000000 -0.200000
3 a 2001-12-30 2002-03-06 120 0 0.090909 0.000000
4 b 2001-06-30 2001-09-18 110 0 NaN NaN
5 b 2001-06-30 2001-09-27 110 30 NaN NaN
6 b 2001-09-30 2002-01-08 140 35 0.272727 0.166667
7 b 2001-12-30 2002-03-08 120 40 -0.142857 0.142857
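The question also asks for the change versus two calendar-dates back (t-2). Under the same setup, a sketch of that is to take pct_change with periods=2 on d2 and join it the same way:

d3_t2 = d2.groupby(level='ticker').pct_change(periods=2)
df.join(d3_t2, on=icols[:2], rsuffix='_pct_t2')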
