Subtract between two pandas data frames with different lengths - python

I need to perform a subtraction between two distinct data frames.
I tried the following code:
df_sw['Apropriacao_total'] = df_sw_ant.merge(
    df_sw, how='left',
    right_on=['Data_posicao', 'Ativo', 'Data_vencimento'],
    left_on=['Data_posicao', 'Ativo', 'Data_vencimento'],
)['Apropriacao_conjunta'].sub(['Apropriacao'], axis=1)
Below are samples of the data frames:
df_sw (6 cols, 62 rows):
Data_posicao Ativo Data_vencimento Apropriacao Apropriacao_conjunta
0 2017-07-03 RXU7 2017-09-07 -631.17 -631.17
1 2017-07-04 RXU7 2017-09-07 -828.59 -828.59
...
22 2017-07-05 GCQ7 2017-07-31 1820.06 1820.06
...
53 2017-07-18 CNHBRL 2017-09-28 1431.82 1431.82
df_sw_ant (6 cols, 32 rows):
Data_swap Data_posicao Ativo Data_vencimento Apropriacao_swap
0 2017-07-03 2017-06-30 RXU7 2017-09-07 -333.66
1 2017-07-04 2017-07-03 RXU7 2017-09-07 -631.17
...
22 2017-07-05 2017-07-04 GCQ7 2017-07-31 720.06
...
29 2017-07-20 2017-07-19 CNHBRL 2017-09-28 -157.30
Question:
How do I perform the subtraction df_sw['Apropriacao_conjunta'] - df_sw_ant['Apropriacao_swap'] where:
df_sw['Data_posicao'] == df_sw_ant['Data_swap'] and df_sw['Ativo'] == df_sw_ant['Ativo'] and df_sw['Data_vencimento'] == df_sw_ant['Data_vencimento']?
The subtraction should be performed along axis=1.

You can try the following and see if it works for you:
# merge and save to new dataframe
df_merged = df_sw_ant.merge(df_sw, how='left',
                            on=['Data_posicao', 'Ativo', 'Data_vencimento'])
# save subtracted result to a new column
df_merged['Sub_Value'] = df_merged['Apropriacao_conjunta'] - df_merged['Apropriacao']
The Sub_Value column in df_merged will then hold the result of subtracting the two columns.
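Note that the question's conditions match df_sw['Data_posicao'] against df_sw_ant['Data_swap'], so the two key lists are not identical. A minimal sketch honoring those conditions (assuming the underscored column names shown in the samples):
# join each df_sw row to the prior-day swap row described in the question
df_merged = df_sw.merge(df_sw_ant, how='left',
                        left_on=['Data_posicao', 'Ativo', 'Data_vencimento'],
                        right_on=['Data_swap', 'Ativo', 'Data_vencimento'])
df_merged['Sub_Value'] = df_merged['Apropriacao_conjunta'] - df_merged['Apropriacao_swap']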

Related

Drop Rows in Pandas DataFrame When Items in Column Match Items in List

I have a pandas df with 5181 rows and a column of customer names, and I have a separate list of 383 customer names from within that column whose corresponding rows I want to drop from the df. I tried to write a piece of code that would iterate through all the names in the customer column and drop each row whose customer name matches one on the list. My result is TypeError: 'NoneType' object is not subscriptable.
The list is called Retail_Customer_Tracking and the df is called df_final and looks like:
index Customer First_Order_Date Last_Order_Date
0 0 0 2022-09-15 2022-09-15
1 1 287 2018-02-19 2020-11-30
2 2 606 2017-10-31 2017-12-07
3 3 724 2021-12-28 2022-09-15
4 4 1025 2015-08-13 2015-08-13
... ... ... ... ...
5176 5176 tulips little pop up shop 2021-10-25 2022-10-08
5177 5177 unboxed 2021-06-24 2022-10-10
5178 5178 upMADE 2021-09-10 2022-03-31
5179 5179 victorias floral design 2021-07-12 2021-07-12
5180 5180 vintique marketplace 2021-03-16 2022-10-15
5181 rows × 4 columns
The code I wrote looks like:
i = 0
for x in Retail_Customer_Tracking:
    while i < 5182:
        if df_final["Customer"].iloc[i] == x:
            df_final = df_final.drop(df_final[i], axis=0, inplace=True)
        else:
            i = i + 1
I was hoping that the revised df_final would not have the rows I wanted to drop...
I'm very new at coding, and any help would be greatly appreciated. Thanks!
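The TypeError most likely comes from drop(..., inplace=True), which returns None; assigning that back to df_final makes the next df_final["Customer"] lookup fail. A vectorized filter avoids the loops entirely; a minimal sketch using the names from the question:
# keep only the rows whose Customer is NOT in the tracking list
df_final = df_final[~df_final["Customer"].isin(Retail_Customer_Tracking)]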

Creating dataframe from dict

First, I start by creating a list with some values:
list = ['SBSP3.SA', 'CSMG3.SA', 'CGAS5.SA']
I create an empty dictionary because that's the only way I found to read the several .csv files I want as dataframes. Then I do a for loop to store my .csv files in the dictionary:
d = {}
d = {ticker: pd.read_csv('{}.csv'.format(ticker)) for ticker in list}
After that, I can only access each dataframe by indexing the dictionary with its keys:
d['SBSP3.SA'].head(5)
Date High Low Open Close Volume Adj Close
0 2017-01-02 14.70 14.60 14.64 14.66 7525700.0 13.880955
1 2017-01-03 15.65 14.95 14.95 15.50 39947800.0 14.676315
2 2017-01-04 15.68 15.31 15.45 15.50 37071700.0 14.676315
3 2017-01-05 15.91 15.62 15.70 15.75 47586300.0 14.913031
4 2017-01-06 15.92 15.50 15.78 15.66 25592000.0 14.827814
I can't, for example, do:
df = pd.DataFrame(d)
My question is:
Can I merge all these dataframes that I threw into the dictionary d with axis=1 to view them as one?
After racking my brain a lot here, I managed to put all the dataframes together, but I lost their keys and could not tell which is which, since the column names are all the same.
Can I put these keys in the column names?
Example:
Date High_SBSP3.SA Low_SBSP3.SA Open_SBSP3.SA Close_SBSP3.SA Volume_SBSP3.SA Adj Close_SBSP3.SA
0 2017-01-02 14.70 14.60 14.64 14.66 7525700.0 13.880955
1 2017-01-03 15.65 14.95 14.95 15.50 39947800.0 14.676315
2 2017-01-04 15.68 15.31 15.45 15.50 37071700.0 14.676315
3 2017-01-05 15.91 15.62 15.70 15.75 47586300.0 14.913031
4 2017-01-06 15.92 15.50 15.78 15.66 25592000.0 14.827814
Don't use list as a variable name; it shadows the built-in list.
You don't need a dictionary: a simple list is enough to store all your dataframes.
Call pd.concat on this list; it will properly concatenate the dataframes one below the other, as long as they have the same column names.
ticker_list = ['SBSP3.SA', 'CSMG3.SA', 'CGAS5.SA']
pd_list = [pd.read_csv('{}.csv'.format(ticker)) for ticker in ticker_list]
df = pd.concat(pd_list)
Use df = pd.concat(pd_list, ignore_index=True) if you want to reset the indices when concatenating.
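That said, pd.concat can also produce the axis=1 layout the question asks about directly from the dictionary, keeping the keys. A sketch, assuming d is the {ticker: DataFrame} dict from the question:
import pandas as pd

# index each frame by Date so concat aligns rows on dates instead of positions
frames = {ticker: df.set_index('Date') for ticker, df in d.items()}
combined = pd.concat(frames, axis=1)  # the dict keys become the outer column level
# flatten the MultiIndex into 'High_SBSP3.SA'-style names
combined.columns = ['{}_{}'.format(col, ticker) for ticker, col in combined.columns]
print(combined.head())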
pd.merge will do what you want (including renaming columns), but since it only allows merging two frames at a time, the column names will not stay consistent across repeated merges. Thus you need to rename the columns manually beforehand.
import pandas as pd
from functools import reduce

ticker_list = ['SBSP3.SA', 'CSMG3.SA', 'CGAS5.SA']
pd_list = [pd.read_csv('{}.csv'.format(ticker)) for ticker in ticker_list]

# suffix every column except the date column with its ticker
for idx, df in enumerate(pd_list):
    old_names = df.columns[1:]
    new_names = list(map(lambda x: x + '_' + ticker_list[idx], old_names))
    zipped = dict(zip(old_names, new_names))
    df.rename(zipped, axis=1, inplace=True)

def dfmerge(x, y):
    return pd.merge(x, y, on="date")

df = reduce(dfmerge, pd_list)
print(df)
Output (with my data):
date High_SBSP3.SA Low_SBSP3.SA Open_SBSP3.SA High_CSMG3.SA Low_CSMG3.SA Open_CSMG3.SA High_CGAS5.SA Low_CGAS5.SA Open_CGAS5.SA
0 2017-01-02 1 2 3 1 2 3 1 2 3
1 2017-01-03 4 5 6 4 5 6 4 5 6
2 2017-01-04 7 8 9 7 8 9 7 8 9

Get data using row / col reference from two column values in another data frame

df1
Date APA AR BP-GB CDEV ... WLL WPX XEC XOM CL00-USA
0 2018-01-01 42.22 19.00 5.227 19.80 ... 26.48 14.07 122.01 83.64 60.42
1 2018-01-02 44.30 19.78 5.175 20.00 ... 27.37 14.31 125.51 85.03 60.37
2 2018-01-03 45.33 19.78 5.242 20.33 ... 27.99 14.39 126.20 86.70 61.63
3 2018-01-04 46.84 19.80 5.300 20.37 ... 28.11 14.44 128.66 86.82 62.01
4 2018-01-05 46.39 19.44 5.296 20.12 ... 27.79 14.24 127.82 86.75 61.44
df2
Date Ticker Event_Type Event_Description Price add
0 2018-11-19 XEC M&A REN 88.03 1
1 2018-03-28 CXO M&A RSPP 143.25 1
2 2018-08-14 FANG M&A EGN 133.75 1
3 2019-05-09 OXY M&A APC 56.33 1
4 2019-08-26 PDCE M&A SRCI 29.65 1
My goal is to update df2['add'] by using df2['Ticker'] and df2['Date'] to pull the value from df1. For example, the first row in df2 is XEC on 2018-11-19, so the code needs to look at the XEC column of df1 and pull the value from the row where df1['Date'] is 2018-11-19.
My attempt was:
df_Events['add'] = df_Prices.loc[[df_Prices['Date']==df_Events['Date']],[df_Prices.columns==df_Events['Ticker']]]
Try:
df2 = df2.drop(columns='add').merge(
    df1.melt(id_vars='Date', value_vars=df1.columns.tolist()[1:],
             var_name='Ticker', value_name='add'),
    on=['Date', 'Ticker'], how='left')
This reshapes df1's ticker columns into a single column and then merges the values in that column into df2.
One more approach may be as below (I had started looking at it, so I am putting it here even though you have accepted an answer).
First, convert the dates into datetime objects in both dataframes and set the date as the index, but only in the first one (code below):
df1['Date']=pd.to_datetime(df1['Date'])
df1.set_index('Date',inplace=True)
df2['Date']=pd.to_datetime(df2['Date'])
Then use apply to get the value for each row.
df2['add'] = df2.apply(lambda x: df1.loc[x['Date'], x['Ticker']], axis=1)
This will work only if the dates and values for all tickers exist in both dataframes (otherwise it will throw a KeyError).
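If some date/ticker pairs may be missing, a guarded variant of the same lookup (a sketch that returns NaN instead of raising) could look like:
import numpy as np

def lookup_price(row):
    # return NaN instead of raising KeyError when a date/ticker pair is absent
    try:
        return df1.loc[row['Date'], row['Ticker']]
    except KeyError:
        return np.nan

df2['add'] = df2.apply(lookup_price, axis=1)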

pandas dataframe: perform calculations on columns

New to pandas and new to stackoverflow (really), any suggestions are highly appreciated!
I have this dataframe df:
col1 col2 col3
Date
2017-08-24 100 101 105
2017-08-23 102 102 107
2017-08-22 101 100 106
2017-08-21 103 99 106
2017-08-18 103 98 108
...
Now I'd like to perform some calculations with the values of each column, e.g. calculate the logarithm of each value.
I thought it would be a good idea to loop over the columns and create a new temporary data frame with the resulting columns.
This new data frame should look like this, e.g.:
col1 RN LOG
Date
2017-08-24 100 1 2
2017-08-23 102 2 2.008600
2017-08-22 101 3 2.004321
2017-08-21 103 4 2.012837
2017-08-18 103 5 2.012837
So I tried this for-loop:
for column in df:
    tmp_df = df[column]
    tmp_df['RN'] = range(1, len(tmp_df) + 1)  # to create a new column with the row number
    tmp_df['LOG'] = np.log(df[column])  # to create a new column with the LOG
However, this doesn't print the new columns next to col1, but one below the other. The result looks like this:
Name: col1, Length: 86, dtype: object
Date
2017-08-24 00:00:00 100
2017-08-23 00:00:00 102
2017-08-22 00:00:00 101
2017-08-21 00:00:00 103
2017-08-18 00:00:00 103
RN,"range(1, 86)"
LOG,"Date
2017-08-24 2
2017-08-23 2,008600
2017-08-22 2,004321
2017-08-21 2,012837
2017-08-18 2,012837
00:00:00 was added to the date in the first part...
I also tried something with assign:
tmp_df = tmp_df.assign(LN=np.log(df[column]))
But this results in: AttributeError: 'Series' object has no attribute 'assign'
It'd really be great if someone could point me in the right direction.
Thanks!
Your for loop is a good idea, but you need to create the new columns as pandas Series aligned to the frame's index, this way:
for column in df:
    df['RN ' + column] = pd.Series(range(1, len(df[column]) + 1), index=df.index)
    df['Log ' + column] = pd.Series(np.log(df[column]))
Now I figured it out. :)
import pandas as pd
import numpy as np
...
for column in df:
    tmp_res = pd.DataFrame(data=df[column])
    newcol = range(1, len(df) + 1)
    tmp_res = tmp_res.assign(RN=newcol)
    newcol2 = np.log(df[column])
    tmp_res = tmp_res.assign(LN=newcol2)
This prints all columns next to each other:
col1 RN LOG
Date
2017-08-24 100 1 2
2017-08-23 102 2 2.008600
2017-08-22 101 3 2.004321
2017-08-21 103 4 2.012837
2017-08-18 103 5 2.012837
Now I can go on processing them or put them all in a csv / excel file.
Thanks for all your suggestions!
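For reference, the same per-column frames can be collected with a dict comprehension so none of them is overwritten between loop iterations; a sketch using the df from the question:
import numpy as np
import pandas as pd

per_column = {
    column: pd.DataFrame({
        column: df[column],
        'RN': range(1, len(df) + 1),
        'LOG': np.log(df[column]),
    }, index=df.index)
    for column in df
}
print(per_column['col1'].head())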

Efficient way of merging data frames on custom conditions

Below are the two data frames I have: esh -> earnings surprise history,
and sph -> stock price history.
earnings surprise history
ticker reported_date reported_time_code eps_actual
0 ABC 2017-10-05 AMC 1.01
1 ABC 2017-07-04 BMO 0.91
2 ABC 2017-03-03 BMO 1.08
3 ABC 2016-10-02 AMC 0.5
stock price history
ticker date adj_open ad_close
0 ABC 2017-10-06 12.10 13.11
1 ABC 2017-12-05 11.11 11.87
2 ABC 2017-12-04 12.08 11.40
3 ABC 2017-12-03 12.01 13.03
..
101 ABC 2017-07-04 9.01 9.59
102 ABC 2017-07-03 7.89 8.19
I'd like to build a new dataframe by merging the two datasets so that it has the columns shown below. If the reported_time_code in the earnings surprise history is AMC, the record referred to in the stock price history should be the next day; if the reported_time_code is BMO, the record referred to should be the same day. If I used a straight merge on the reported_date column of esh and the date column of sph, it would break the above conditions. I'm looking for an efficient way of transforming the data.
Here is the resultant transformed data set
ticker date adj_open ad_close eps_actual
0 ABC 2017-10-06 12.10 13.11 1.01
101 ABC 2017-07-04 9.01 9.59 0.91
Let's add a new 'date' column to the earnings history dataframe based on reported_time_code using np.where, drop the unwanted columns, and then merge with the stock price history dataframe:
esh['reported_date'] = pd.to_datetime(esh.reported_date)
sph['date'] = pd.to_datetime(sph.date)
esh_new = esh.assign(date=np.where(esh.reported_time_code == 'AMC',
                                   esh.reported_date + pd.DateOffset(days=1),
                                   esh.reported_date)).drop(['reported_date', 'reported_time_code'], axis=1)
sph.merge(esh_new, on=['ticker', 'date'])
Output:
ticker date adj_open ad_close eps_actual
0 ABC 2017-10-06 12.10 13.11 1.01
1 ABC 2017-07-04 9.01 9.59 0.91
It is great that your offset is only one day. Then you can do something like the following:
mask = esh['reported_time_code'] == 'AMC'
# The mask is basically an array of 0s and 1s; all we have to do is convert
# them into timedelta objects standing for the number of days to offset.
offset = mask.values.astype('timedelta64[D]')
# The 'D' inside the brackets is the unit of time to attach: we want [D]ays.
esh['date'] = esh['reported_date'] + offset  # reported_date must be datetime64
result = esh.merge(sph, on=['ticker', 'date']).drop(
    ['reported_date', 'reported_time_code'], axis=1)
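As a quick check of the mask-to-timedelta trick on two of the sample rows (a standalone sketch, not the full frames):
import pandas as pd

s = pd.DataFrame({'reported_time_code': ['AMC', 'BMO'],
                  'reported_date': pd.to_datetime(['2017-10-05', '2017-07-04'])})
offset = (s['reported_time_code'] == 'AMC').values.astype('timedelta64[D]')
print(s['reported_date'] + offset)
# the AMC row shifts to 2017-10-06 (next day); the BMO row stays 2017-07-04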
