Merge columns with the same name in the same dataframe if null - python

I have a dataframe that looks like this
  Depth   DT   DT   DT   GR  GR   GR
1   100  NaN   45  NaN  100  50  NaN
2   200  NaN   45  NaN  100  50  NaN
3   300  NaN   45  NaN  100  50  NaN
4   400  NaN  NaN   50  100  50  NaN
5   500  NaN  NaN   50  100  50  NaN
I need to merge the same-named columns into one, keeping the first non-null value across them for each row.
In the end the data frame should look like
Depth DT GR
1 100 45 100
2 200 45 100
3 300 45 100
4 400 50 100
5 500 50 100
I am a beginner in pandas. I tried but wasn't successful; I tried drop_duplicates but it couldn't do what I wanted. Any suggestions?

IIUC, you can do:
(df.set_index('Depth')
   .groupby(level=0, axis=1).first()
   .reset_index())
output:
Depth DT GR
0 100 45.0 100.0
1 200 45.0 100.0
2 300 45.0 100.0
3 400 50.0 100.0
4 500 50.0 100.0
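Note that newer pandas versions deprecate the axis=1 keyword in groupby. If you run into that warning, one equivalent workaround (same idea, just routed through a transpose) is:
# Group the transposed frame by column name, take the first non-null value
# per Depth, then transpose back; fine here since every column is numeric.
out = (df.set_index('Depth')
         .T.groupby(level=0).first()
         .T.reset_index())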

How to aggregate if a condition is met in a transposed dataset?

I am trying to figure out how to "aggregate" this transposed dataset. I am not sure if aggregate is the right word, because the math is happening across rows. I have a dataframe that looks similar to this:
EDIT: There are multiple cases of the same value in "Date". The data is transposed to the person ID. There are also Date1-5 columns. The date referenced in the table below is the one I ultimately hope to aggregate by for the created NRev1-NRev# values.
Date    Return1 Return2 Return3 Return4 Return5 Rev1 Rev2 Rev3 Rev4 Rev5
2020-1  0       1       2       3       4       100  500  100  200  300
2020-2  5       6       7       8       nan     200  120  100  200  nan
2020-3  2       3       7       9       nan     100  0    100  200  nan
and I am trying to create additional revenue columns based upon the values of the return columns, while adding together the values from Rev1-Rev5.
The resulting columns would look as follows:
Date    NRev0 NRev1 NRev2 NRev3 NRev4 NRev5 NRev6 NRev7 NRev8 NRev9
2020-1  100   500   100   200   300   0     0     0     0     0
2020-2  0     0     0     0     0     200   120   100   200   0
2020-3  0     0     100   0     0     0     0     100   0     200
Essentially, what I'm looking to do is create a new set of "NRev" variables, named from the row values of "Return". So if Return1 = 4, for instance, NRev4 would equal the value of Rev1. The return values will change over time, but the number of return columns will always match the number of revenue columns. So theoretically, if there were a maximum value of 100 across all "Return" columns, a corresponding "NRev100" column would be created and filled with the corresponding revenue value.
In SPSS, I am able to create the columns using this code, but it is not pythonic, and the number of return and rev columns will increase over time, as will the return values:
if return1=0 NRev0= NRev0+Rev1.
if return1=1 NRev1= NRev1+Rev1.
if return1=2 NRev2= NRev2+Rev1.
if return1=3 NRev3= NRev3+Rev1.
if return1=4 NRev4= NRev4+Rev1.
if return2=0 NRev0= NRev0+Rev2.
if return2=1 NRev1= NRev1+Rev2.
if return2=2 NRev2= NRev2+Rev2.
if return2=3 NRev3= NRev3+Rev2.
if return2=4 NRev4= NRev4+Rev2.
if return3=0 NRev0= NRev0+Rev3.
if return3=1 NRev1= NRev1+Rev3.
if return3=2 NRev2= NRev2+Rev3.
if return3=3 NRev3= NRev3+Rev3.
if return3=4 NRev4= NRev4+Rev3.
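For reference, the same logic written as a plain (unvectorized) Python loop might look roughly like the sketch below; naive_nrev is just a hypothetical helper name, and the Return1..ReturnN / Rev1..RevN column layout is taken from the question:
import re
import pandas as pd

def naive_nrev(df):
    # Which Return/Rev pairs exist (the question says the counts always match)
    nums = sorted(int(c[len('Return'):]) for c in df.columns
                  if re.fullmatch(r'Return\d+', c))
    max_ret = int(df[[f'Return{i}' for i in nums]].max().max())
    out = pd.DataFrame(0.0, index=df.index,
                       columns=[f'NRev{k}' for k in range(max_ret + 1)])
    for i in nums:
        for idx in df.index:
            ret, rev = df.at[idx, f'Return{i}'], df.at[idx, f'Rev{i}']
            if pd.notna(ret):                     # skip missing Return/Rev pairs
                out.at[idx, f'NRev{int(ret)}'] += rev
    return pd.concat([df[['Date']], out], axis=1)
The vectorized pandas approach below avoids these nested loops entirely.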
We can do some reshaping with pd.wide_to_long, then pivot_table back to the wide format. This lets us align each Return value with its Rev value and turn the Return values into the new column labels. Some cleanup with add_prefix and rename_axis polishes the output:
new_df = (
    pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
      .dropna()
      .astype({'Return': int})
      .pivot_table(index='Date', columns='Return', values='Rev', fill_value=0)
      .add_prefix('NRev')
      .rename_axis(columns=None)
      .reset_index()
)
new_df:
Date NRev0 NRev1 NRev2 NRev3 NRev4 NRev5 NRev6 NRev7 NRev8 NRev9
0 2020-1 100 500 100 200 300 0 0 0 0 0
1 2020-2 0 0 0 0 0 200 120 100 200 0
2 2020-3 0 0 100 0 0 0 0 100 0 200
wide_to_long gives:
Return Rev
Date K
2020-1 1 0.0 100.0 # Corresponding Return index and Rev are in the same row
2020-2 1 5.0 200.0
2020-3 1 2.0 100.0
2020-1 2 1.0 500.0
2020-2 2 6.0 120.0
2020-3 2 3.0 0.0
2020-1 3 2.0 100.0
2020-2 3 7.0 100.0
2020-3 3 7.0 100.0
2020-1 4 3.0 200.0
2020-2 4 8.0 200.0
2020-3 4 9.0 200.0
2020-1 5 4.0 300.0
2020-2 5 NaN NaN
2020-3 5 NaN NaN # These NaN are Not Needed
Removing the NaN rows and converting Return back to int:
(pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
   .dropna()
   .astype({'Return': int}))
Return Rev
Date K
2020-1 1 0 100.0
2020-2 1 5 200.0
2020-3 1 2 100.0
2020-1 2 1 500.0
2020-2 2 6 120.0
2020-3 2 3 0.0
2020-1 3 2 100.0
2020-2 3 7 100.0
2020-3 3 7 100.0
2020-1 4 3 200.0
2020-2 4 8 200.0
2020-3 4 9 200.0
2020-1 5 4 300.0
Then this can easily be moved back to wide with a pivot_table:
(pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
   .dropna()
   .astype({'Return': int})
   .pivot_table(index='Date', columns='Return', values='Rev', fill_value=0))
Return 0 1 2 3 4 5 6 7 8 9
Date
2020-1 100 500 100 200 300 0 0 0 0 0
2020-2 0 0 0 0 0 200 120 100 200 0
2020-3 0 0 100 0 0 0 0 100 0 200
The rest is just cosmetic changes to the DataFrame.
If dates are duplicated, wide_to_long cannot be used, but we can reshape manually: split the column names with str.extract, then use set_index + stack:
# Set the index column
new_df = df.set_index('Date')
# Build the MultiIndex columns manually, splitting names like 'Return5' into ('Return', '5')
new_df.columns = pd.MultiIndex.from_frame(
    new_df.columns.str.extract(r'(.*?)(\d+)$')
)
# Stack, then the rest is the same
new_df = (
    new_df.stack()
          .dropna()
          .astype({'Return': int})
          .pivot_table(index='Date', columns='Return', values='Rev',
                       fill_value=0, aggfunc='first')
          .add_prefix('NRev')
          .rename_axis(columns=None)
          .reset_index()
)
Sample DF with duplicate dates:
df = pd.DataFrame({'Date': ['2020-1', '2020-2', '2020-2'],
                   'Return1': [0, 5, 0],
                   'Return2': [1, 6, 1],
                   'Return3': [2, 7, 2],
                   'Return4': [3, 8, 3],
                   'Return5': [4.0, np.nan, 4.0],
                   'Rev1': [100, 200, 100],
                   'Rev2': [500, 120, 0],
                   'Rev3': [100, 100, 100],
                   'Rev4': [200, 200, 200],
                   'Rev5': [300.0, np.nan, np.nan]})
df
Date Return1 Return2 Return3 Return4 Return5 Rev1 Rev2 Rev3 Rev4 Rev5
0 2020-1 0 1 2 3 4.0 100 500 100 200 300.0
1 2020-2 5 6 7 8 NaN 200 120 100 200 NaN
2 2020-2 0 1 2 3 4.0 100 0 100 200 NaN
new_df
Date NRev0 NRev1 NRev2 NRev3 NRev4 NRev5 NRev6 NRev7 NRev8
0 2020-1 100 500 100 200 300 0 0 0 0
1 2020-2 100 0 100 200 0 200 120 100 200

Python loop for calculating sum of column values in pandas

I have the below data frame:
a
100
200
200
b
20
30
40
c
400
50
I need help calculating the sum of values for each item and placing it in a second column, which ideally should look like below:
a 500
100
200
200
b 90
20
30
40
c 450
400
50
If you need sums per group, convert column col to numeric, build the groups by forward filling the non-numeric values, and use GroupBy.transform:
s = pd.to_numeric(df['col'], errors='coerce')
mask = s.isna()
df.loc[mask, 'new'] = s.groupby(df['col'].where(mask).ffill()).transform('sum')
print (df)
col new
0 a 500.0
1 100 NaN
2 200 NaN
3 200 NaN
4 b 90.0
5 20 NaN
6 30 NaN
7 40 NaN
8 c 450.0
9 400 NaN
10 50 NaN
Or, if you prefer empty strings instead of NaN for the remaining rows:
new = s.groupby(df['col'].where(mask).ffill()).transform('sum')
df['new'] = np.where(mask, new.astype(int), '')
print (df)
col new
0 a 500
1 100
2 200
3 200
4 b 90
5 20
6 30
7 40
8 c 450
9 400
10 50
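For completeness, a minimal self-contained run of the first snippet might look like this (the column name col is assumed from the printed output above):
import pandas as pd

df = pd.DataFrame({'col': ['a', 100, 200, 200, 'b', 20, 30, 40, 'c', 400, 50]})

s = pd.to_numeric(df['col'], errors='coerce')    # letters become NaN
mask = s.isna()                                  # True on the group labels
groups = df['col'].where(mask).ffill()           # carry each label down its block
df.loc[mask, 'new'] = s.groupby(groups).transform('sum')
print(df)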

How to add values from another DataFrame onto rows where a column matches?

I have two DataFrames:
a = pd.DataFrame()
a['id'] = range(0,100)
a['N'] = 100
b = pd.DataFrame()
b['id'] = 3*np.arange(0,100)
b['N'] = 50
What I want to do, is for rows in a where the 'id' matches the 'id' of a row in b to add b['N']. With a very inefficient and poorly-coded for-loop, that would be something like:
for idx in a[a.id.isin(b.id)].index:
    a.loc[idx, 'N'] = a.loc[idx, 'N'] + b.loc[b.id == a.loc[idx, 'id'], 'N'].iloc[0]
Is there a way to do the above, but with efficient DataFrame operations? For example, a better way might be to take only the rows in a and b that have matching 'id', sort them both ascending (so that they are exactly the same ids in the same order), and then just add the 'N' columns. This would require us to select the rows, sort them, add them, and finally concatenate back with the rows of a that didn't have a matching 'id' in b, which also seems inefficient. What is the recommended way of doing this in pandas/numpy?
Assuming "id" is unique, you can use Series.map and add the mapped values:
a['N'] = a['N'].add(a['id'].map(b.set_index('id')['N']), fill_value=0)
a
id N
0 0 150.0
1 1 100.0
2 2 100.0
3 3 150.0
4 4 100.0
.. .. ...
95 95 100.0
96 96 150.0
97 97 100.0
98 98 100.0
99 99 150.0
[100 rows x 2 columns]
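If 'id' could repeat in b (it does not in the sample above, where 3*np.arange gives unique values), one option is to aggregate b first and then map the totals the same way:
# Hypothetical variant: collapse duplicate ids in b by summing before mapping
b_totals = b.groupby('id')['N'].sum()
a['N'] = a['N'].add(a['id'].map(b_totals), fill_value=0)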
IIUC you can simply do a merge and then sum:
a = a.merge(b, on="id", how="left")
a["result"] = a[["N_x", "N_y"]].sum(axis=1)
print (a)
id N_x N_y result
0 0 100 50.0 150.0
1 1 100 NaN 100.0
2 2 100 NaN 100.0
3 3 100 50.0 150.0
4 4 100 NaN 100.0
.. .. ... ... ...
95 95 100 NaN 100.0
96 96 100 50.0 150.0
97 97 100 NaN 100.0
98 98 100 NaN 100.0
99 99 100 50.0 150.0

Concat dataframes row-wise and merge rows if they exist

I have two dataframes:
Df_1:
A B C D
1 10 nan 20 30
2 20 30 20 10
Df_2:
A B
1 10 40
2 30 70
I want to merge them and have this final dataframe.
A B C D
1 10 40 20 30
2 20 30 20 10
3 30 70 nan nan
How do I do that?
Looking at the expected result, I think the index in the second row of Df_2 should be 3 (instead of 2).
Run Df_1.combine_first(Df_2).
The result is:
A B C D
1 10.0 40.0 20.0 30.0
2 20.0 30.0 20.0 10.0
3 30.0 70.0 NaN NaN
i.e. due to possible NaN values, the type of columns is coerced to float.
But if you want, you can revert this where possible, by applying to_numeric:
Df_1.combine_first(Df_2).apply(pd.to_numeric, downcast='integer')
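A minimal, self-contained sketch of that answer (with the second index of Df_2 set to 3, as suggested above) could look like:
import numpy as np
import pandas as pd

Df_1 = pd.DataFrame({'A': [10, 20], 'B': [np.nan, 30],
                     'C': [20, 20], 'D': [30, 10]}, index=[1, 2])
Df_2 = pd.DataFrame({'A': [10, 30], 'B': [40, 70]}, index=[1, 3])

result = Df_1.combine_first(Df_2).apply(pd.to_numeric, downcast='integer')
print(result)   # columns C and D stay float because of the NaN in row 3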

Pandas dataframe applying logic to columns calculations

Hi, I have a huge dataframe with the following structure:
ticker calendar-date last-update Assets Ebitda .....
0 a 2001-06-30 2001-09-14 110 1000 .....
1 a 2001-09-30 2002-01-22 0 -8 .....
2 a 2001-09-30 2002-02-01 0 800 .....
3 a 2001-12-30 2002-03-06 120 0 .....
4 b 2001-06-30 2001-09-18 110 0 .....
5 b 2001-06-30 2001-09-27 110 30 .....
6 b 2001-09-30 2002-01-08 140 35 .....
7 b 2001-12-30 2002-03-08 120 40 .....
..
What I want is, for each ticker, to create new columns with the % change in Assets and Ebitda from the last calendar-date (t-1) and the last calendar-date (t-2) for each row.
But here come the problems:
1) As you can see, calendar-dates (per ticker) are not always unique, since there can be more than one last-update for the same calendar-date, but I always want the change since the last calendar-date and not since the last last-update.
2) There are rows with 0 values; in that case I want to use the last observed value to calculate the % change. If I only had one stock that would be easy, I would just ffill the values, but since I have many tickers I cannot perform this operation safely, because I could pad a value from ticker 'a' into ticker 'b', and that is not what I want.
I guess this could be solved by creating a function with if statements to handle the data exceptions, or maybe there is a good way to handle this inside pandas... maybe multi-indexing? The truth is that I have no idea how to approach this task. Can anybody help?
Thanks
Step 1
sort_values to ensure proper ordering for later manipulation
icols = ['ticker', 'calendar-date', 'last-update']
df.sort_values(icols, inplace=True)
Step 2
groupby 'ticker', replace zeros with NaN, and forward fill
vcols = ['Assets', 'Ebitda']
temp = df.groupby('ticker')[vcols].apply(lambda x: x.replace(0, np.nan).ffill())
d1 = df.assign(**temp.to_dict('list'))
d1
ticker calendar-date last-update Assets Ebitda
0 a 2001-06-30 2001-09-14 110.0 1000.0
1 a 2001-09-30 2002-01-22 110.0 -8.0
2 a 2001-09-30 2002-02-01 110.0 800.0
3 a 2001-12-30 2002-03-06 120.0 800.0
4 b 2001-06-30 2001-09-18 110.0 NaN
5 b 2001-06-30 2001-09-27 110.0 30.0
6 b 2001-09-30 2002-01-08 140.0 35.0
7 b 2001-12-30 2002-03-08 120.0 40.0
NOTE: The first 'Ebitda' for 'b' is NaN because there was nothing to forward fill from.
Step 3
groupby ['ticker', 'calendar-date'] and grab the last row. Because we sorted above, the last row will be the most recently updated row.
d2 = d1.groupby(icols[:2])[vcols].last()
Step 4
groupby again, this time just by 'ticker' which is in the index of d2, and take the pct_change
d3 = d2.groupby(level='ticker').pct_change()
Step 5
join back with df
df.join(d3, on=icols[:2], rsuffix='_pct')
ticker calendar-date last-update Assets Ebitda Assets_pct Ebitda_pct
0 a 2001-06-30 2001-09-14 110 1000 NaN NaN
1 a 2001-09-30 2002-01-22 0 -8 0.000000 -0.200000
2 a 2001-09-30 2002-02-01 0 800 0.000000 -0.200000
3 a 2001-12-30 2002-03-06 120 0 0.090909 0.000000
4 b 2001-06-30 2001-09-18 110 0 NaN NaN
5 b 2001-06-30 2001-09-27 110 30 NaN NaN
6 b 2001-09-30 2002-01-08 140 35 0.272727 0.166667
7 b 2001-12-30 2002-03-08 120 40 -0.142857 0.142857
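If it helps to see the whole pipeline in one place, the five steps can be folded into a single helper. add_pct_change is a hypothetical name, and the column names follow the answer above:
import numpy as np
import pandas as pd

def add_pct_change(df, vcols=('Assets', 'Ebitda')):
    vcols = list(vcols)
    icols = ['ticker', 'calendar-date', 'last-update']
    d1 = df.sort_values(icols).copy()
    # Step 2: treat zeros as missing and forward fill within each ticker
    d1[vcols] = (d1.groupby('ticker', group_keys=False)[vcols]
                   .apply(lambda x: x.replace(0, np.nan).ffill()))
    # Steps 3-4: keep the latest update per (ticker, calendar-date), then take
    # the per-ticker percentage change
    pct = (d1.groupby(icols[:2])[vcols].last()
             .groupby(level='ticker').pct_change())
    # Step 5: join the percentage changes back onto the original rows
    return df.join(pct, on=icols[:2], rsuffix='_pct')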
