I have the following two datasets:
df_ff.head()
Out[382]:
Date Mkt-RF SMB HML RF
0 192607 2.96 -2.38 -2.73 0.22
1 192608 2.64 -1.47 4.14 0.25
2 192609 0.36 -1.39 0.12 0.23
3 192610 -3.24 -0.13 0.65 0.32
4 192611 2.53 -0.16 -0.38 0.31
df_ibm.head()
Out[384]:
Date Open High ... Close Adj_Close Volume
0 2012-01-01 178.518158 184.608032 ... 184.130020 128.620193 115075689
1 2012-02-01 184.713196 190.468445 ... 188.078400 131.378296 82435156
2 2012-03-01 188.556412 199.923523 ... 199.474182 139.881134 92149356
3 2012-04-01 199.770554 201.424469 ... 197.973236 138.828659 90586736
4 2012-05-01 198.068832 199.741867 ... 184.416824 129.322250 89961544
Regarding the type of the date variable, we have the following:
df_ff.dtypes
Out[383]:
Date int64
df_ibm.dtypes
Out[385]:
Date datetime64[ns]
I would like to merge (in SQL terms: inner join) these two data sets and am therefore writing:
testMerge = pd.merge(df_ibm, df_ff, on = 'Date')
This yields the error:
ValueError: You are trying to merge on datetime64[ns] and int64 columns. If you wish to proceed you should use pd.concat
The merge does not work because of the different formats of the date variable. Any tips on how I could solve this? My first thought was to translate the dates in the df_ff data set from the format "192607" to the format "1926-07-01", but I did not manage to do it.
Use pd.to_datetime:
df['Date2'] = pd.to_datetime(df['Date'].astype(str), format="%Y%m")
print(df)
# Output
Date Date2
0 192607 1926-07-01
1 192608 1926-08-01
2 192609 1926-09-01
3 192610 1926-10-01
4 192611 1926-11-01
The first step is to convert df_ff's Date column to datetime64[ns] so that both Date columns have the same dtype:
df_ff['Date'] = pd.to_datetime(df_ff['Date'].astype(str), format='%Y%m')
Then set Date as the index in both dataframes (merging on indexes is more efficient):
df_ff = df_ff.set_index('Date')
df_ibm = df_ibm.set_index('Date')
Finally, pd.merge the two DataFrames:
out = pd.merge(df_ff, df_ibm, left_index=True, right_index=True)
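As a side note, once both frames are indexed by Date, an index join gives the same result. A minimal equivalent sketch (assuming df_ff and df_ibm are indexed as above; the column sets do not overlap, so no suffixes are needed):
# Inner join on the shared Date index; equivalent to the pd.merge call above
out = df_ff.join(df_ibm, how='inner')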
I have a problem with looking for gaps in my pandas dataframe.
So I have this data in my df:
date        name    score
2020-01-01  FEAT_1  0.64
2020-01-01  FEAT_2  0.17
2020-01-01  FEAT_3  0.09
2020-01-01  FEAT_4  0.07
2020-01-01  FEAT_5  0.03
2020-01-02  FEAT_1  0.90
2020-01-02  FEAT_2  0.30
2020-01-02  FEAT_3  0.20
2020-01-02  FEAT_4  0.10
2020-01-02  FEAT_6  0.02
What I want to do is fill in rows for dates that are missing some FEAT_N. For example, 2020-01-01 has FEAT_1 through FEAT_5 but not FEAT_6, and I would like to add a new row there with the same date, FEAT_6 as the name, and score = 0. The FEAT_N values are the .unique() values of the feature name column.
I don't know how to deal with this problem. My main issue is that I don't know how to check whether a given date in the date column has a row for each FEAT_N value.
Thanks guys.
You can easily do this by using a MultiIndex in pandas.
1. Set the index to a MultiIndex of "date" and "name".
2. Create a new MultiIndex that is the cartesian product of the current MultiIndex levels.
3. Reindex your data, filling with 0.
df = df.set_index(["date", "name"])
new_index = pd.MultiIndex.from_product(df.index.levels)
new_df = df.reindex(new_index, fill_value=0)
print(new_df)
score
date name
2020-01-01 FEAT_1 0.64
FEAT_2 0.17
FEAT_3 0.09
FEAT_4 0.07
FEAT_5 0.03
FEAT_6 0.00
2020-01-02 FEAT_1 0.90
FEAT_2 0.30
FEAT_3 0.20
FEAT_4 0.10
FEAT_5 0.00
FEAT_6 0.02
Below is the exact same approach, but intended to please the method chaining addicts:
new_df = (
df.set_index(["date", "name"])
.pipe(lambda df:
df.reindex(
pd.MultiIndex.from_product(df.index.levels),
fill_value=0
)
)
)
print(new_df)
score
date name
2020-01-01 FEAT_1 0.64
FEAT_2 0.17
FEAT_3 0.09
FEAT_4 0.07
FEAT_5 0.03
FEAT_6 0.00
2020-01-02 FEAT_1 0.90
FEAT_2 0.30
FEAT_3 0.20
FEAT_4 0.10
FEAT_5 0.00
FEAT_6 0.02
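A small variant of the same reindex idea, in case you prefer to build the full index from the original columns rather than from df.index.levels (a sketch only, assuming the same df as above):
# Cartesian product of all observed dates and feature names
full_index = pd.MultiIndex.from_product(
    [df["date"].unique(), df["name"].unique()], names=["date", "name"]
)
new_df = df.set_index(["date", "name"]).reindex(full_index, fill_value=0)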
One possible approach is to create a "grid" dataframe containing all the date-name pairs you want, outer-join it with your dataframe, and then set the missing values to your default (0):
import pandas as pd
keys1 = df['date'].unique()
keys2 = df['name'].unique()
# get all possible date-name pairs
default = [[k1, k2] for k1 in keys1 for k2 in keys2]
df_default = pd.DataFrame(default, columns=['date', 'name'])
# outer merge
df_complete = pd.merge(df, df_default, on=['date', 'name'], how='outer')
# fill in the missing value with a reasonable default
df_complete['score'] = df_complete['score'].fillna(0)
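If you also want the rows ordered like the MultiIndex output above, a small follow-up step (just a sketch) is to sort by the key columns:
# Restore a deterministic ordering after the outer merge
df_complete = df_complete.sort_values(['date', 'name']).reset_index(drop=True)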
If these values are required, I'd suggest maybe turning them into columns, e.g. date | FEAT_1_score | FEAT_2_score, etc.
Then, you can fill those columns with either the scores from your current data or the default value 0.
Here's a rough approximation of the code:
import pandas as pd

data = {'date': ['2020-01-01', '2020-01-01'], 'name': ["FEAT_1", "FEAT_3"], 'score': ['0.5', '0.6']}
df = pd.DataFrame(data)

def index_value_or_zero(x, df, feat):
    # Look up the score for this row's date and the requested feature name
    index_value = df.loc[(df["date"] == x["date"]) & (df["name"] == feat), "score"]
    if len(index_value) == 0:
        return 0
    return index_value.iloc[0]

df["FEAT_1_score"] = df.apply(lambda x: index_value_or_zero(x, df, "FEAT_1"), axis=1)
df["FEAT_2_score"] = df.apply(lambda x: index_value_or_zero(x, df, "FEAT_2"), axis=1)
df["FEAT_3_score"] = df.apply(lambda x: index_value_or_zero(x, df, "FEAT_3"), axis=1)
df = df.drop(columns=["score", "name"])
df = df.drop_duplicates()
df
returns:
date FEAT_1_score FEAT_2_score FEAT_3_score
0 2020-01-01 0.5 0 0.6
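As a note on this design choice: pandas' built-in pivot can produce the same wide layout without the apply loop. A minimal sketch, assuming the original long-format df with date, name and score columns (the _score suffix is just an illustrative rename):
# One column per feature name, missing combinations filled with 0
wide = df.pivot(index='date', columns='name', values='score').fillna(0)
wide = wide.add_suffix('_score').reset_index()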
I want to filter a df. I have copied a small snippet of the output here, including the desired result. I basically want to filter on one of the index rows, removing the columns which do not match my criterion: I would like to keep only the columns whose 5 Day Change is >= 10%.
I tried df = df.loc["5 Day Change" >= .1] but it didn't work, and I am not sure how to make it work.
                      1      2           3           4
15/10/2020        23.53  15.06  396.700012  348.380005
16/10/2020        23.35  15.09  398.519989  348.399994
5 Day Change  -0.049654   0.12       0.009       0.256
10 Day Change -0.014768   0.01       0.110       0.030
Desired result:
2 4
15/10/2020 15.06 348.380005
16/10/2020 15.09 348.399994
5 Day Change 0.12 0.256
10 Day Change 0.01 0.03
I would use .T to transpose the dataframe and then filter by row rather than by column, since filtering rows is easier in pandas. Then transpose it back:
In[1]:
1 2 3 4
15/10/2020 23.530000 15.06 396.700012 348.380005
16/10/2020 23.350000 15.09 398.519989 348.399994
5 Day Change -0.049654 0.12 0.009000 0.256000
10 Day Change -0.014768 0.01 0.110000 0.030000
df = df.T
df = df[df['5 Day Change'] >= .1].T
df
Out[1]:
2 4
15/10/2020 15.06 348.380005
16/10/2020 15.09 348.399994
5 Day Change 0.12 0.256000
10 Day Change 0.01 0.030000
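Alternatively, the same idea works without the double transpose, since a single row can be used as a boolean mask over the columns. A minimal sketch:
mask = df.loc['5 Day Change'] >= .1   # Series indexed by the column labels
df = df.loc[:, mask]                  # keep only the columns where the mask is True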
I have two dataframes. One has the workdays and the stock price for the Apple stock. The other holds quarterly data on the EPS. The lists of dates differ, but both are in chronological order. I want to add the chronological values of the eps frame to the existing price dataframe.
date close
0 2020-07-06 373.85
1 2020-07-02 364.11
2 2020-07-01 364.11
3 2020-06-30 364.80
4 2020-06-29 361.78
... ... ...
9969 1980-12-18 0.48
9970 1980-12-17 0.46
9971 1980-12-16 0.45
9972 1980-12-15 0.49
9973 1980-12-12 0.51
EPS:
date eps
0 2020-03-28 2.59
1 2019-12-28 5.04
2 2019-09-28 3.05
3 2019-06-29 2.22
4 2019-03-30 2.48
... ... ...
71 2002-06-29 0.09
72 2002-03-30 0.11
73 2001-12-29 0.11
74 2001-09-29 -0.11
75 2001-06-30 0.17
So my result should look something like this:
close eps
date
...
2020-04-01 240.91 NaN
2020-03-31 254.29 NaN
2020-03-30 254.81 NaN
2020-03-28 NaN 2.59
2020-03-27 247.74 NaN
2020-03-26 258.44 NaN
Notice the date "2020-03-28", which previously existed only in the eps frame and now sits neatly where it belongs.
However, I can't get it to work. At first I thought there must be a simple join, merge or concat that fits the data where it belongs, in chronological order, but so far I couldn't find it.
My failed attempts:
pd.concat([df, eps], axis=0, sort=True) - simply appends the two dataframes
pd.merge_ordered(df, eps, fill_method="ffill", left_by="date") - simply ignores the eps dates
The goal is to plot this dataframe with two graphs: one for the stock price, and the other for the eps data.
I think you need:
pd.concat([df.set_index('date'), eps.set_index('date')]).sort_index(ascending=False)
You can simply sort the concatenated dataframe afterwards by index. Thanks to #jezrael for the tip!
pd.concat([df.set_index('date'), eps.set_index('date')]).sort_index(ascending=False)
For some reason, the sort argument in the concat function doesn't sort my dataframe.
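Since the stated goal is to plot the combined frame with two graphs, here is a minimal plotting sketch on top of the concatenated result (matplotlib is assumed, the name combined is just illustrative, and the date column is assumed to already be datetime):
import matplotlib.pyplot as plt

combined = pd.concat([df.set_index('date'), eps.set_index('date')]).sort_index(ascending=False)

fig, ax_price = plt.subplots()
ax_eps = ax_price.twinx()  # second y-axis for the sparse quarterly EPS values
ax_price.plot(combined.index, combined['close'], label='close')
ax_eps.scatter(combined.index, combined['eps'], color='red', label='eps')
ax_price.set_ylabel('close')
ax_eps.set_ylabel('eps')
plt.show()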
df1
Date APA AR BP-GB CDEV ... WLL WPX XEC XOM CL00-USA
0 2018-01-01 42.22 19.00 5.227 19.80 ... 26.48 14.07 122.01 83.64 60.42
1 2018-01-02 44.30 19.78 5.175 20.00 ... 27.37 14.31 125.51 85.03 60.37
2 2018-01-03 45.33 19.78 5.242 20.33 ... 27.99 14.39 126.20 86.70 61.63
3 2018-01-04 46.84 19.80 5.300 20.37 ... 28.11 14.44 128.66 86.82 62.01
4 2018-01-05 46.39 19.44 5.296 20.12 ... 27.79 14.24 127.82 86.75 61.44
df2
Date Ticker Event_Type Event_Description Price add
0 2018-11-19 XEC M&A REN 88.03 1
1 2018-03-28 CXO M&A RSPP 143.25 1
2 2018-08-14 FANG M&A EGN 133.75 1
3 2019-05-09 OXY M&A APC 56.33 1
4 2019-08-26 PDCE M&A SRCI 29.65 1
My goal is to update df2['add'] by using df2['Ticker'] and df2['Date'] to pull the corresponding value from df1. For example, the first row in df2 is XEC on 2018-11-19, so the code needs to look at the df1['XEC'] column and pull the value from the row where df1['Date'] is 2018-11-19.
My attempt was:
df_Events['add'] = df_Prices.loc[[df_Prices['Date']==df_Events['Date']],[df_Prices.columns==df_Events['Ticker']]]
Try:
df2 = df2.drop(columns='add').merge(
    df1.melt(id_vars='Date', var_name='Ticker', value_name='add'),
    on=['Date', 'Ticker'], how='left'
)
This reshapes df1's ticker columns into a single column and then merges the values in that column into df2.
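For reference, a quick way to inspect the intermediate long-format frame (the name long_prices is just illustrative):
long_prices = df1.melt(id_vars='Date', var_name='Ticker', value_name='add')
print(long_prices.head())  # columns: Date, Ticker, add -- one row per (date, ticker) pair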
One more approach may be as below (I had started looking at it, so I am putting it here even though you have accepted the answer).
First convert the dates into datetime objects in both dataframes and set the date as the index only in the first one (code below):
df1['Date']=pd.to_datetime(df1['Date'])
df1.set_index('Date',inplace=True)
df2['Date']=pd.to_datetime(df2['Date'])
Then use apply to get the values for each of the columns.
df2['add']=df2.apply(lambda x: df1.loc[(x['Date']),(x['Ticker'])], axis=1)
This will work only if the dates and values for all tickers exist in both dataframes (otherwise it will throw a KeyError).
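If some (Date, Ticker) pairs might be missing from df1, a KeyError-free variant (a sketch only, assuming df1 is indexed by Date as above and df2['Date'] is datetime) is to stack df1 into a Series and reindex it with the pairs from df2:
pairs = pd.MultiIndex.from_frame(df2[['Date', 'Ticker']])
df2['add'] = df1.stack().reindex(pairs).to_numpy()  # missing pairs become NaN instead of raising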
Below are the two data frames I have: esh -> earnings surprise history and sph -> stock price history.
earnings surprise history
ticker reported_date reported_time_code eps_actual
0 ABC 2017-10-05 AMC 1.01
1 ABC 2017-07-04 BMO 0.91
2 ABC 2017-03-03 BMO 1.08
3 ABC 2016-10-02 AMC 0.5
stock price history
ticker date adj_open ad_close
0 ABC 2017-10-06 12.10 13.11
1 ABC 2017-12-05 11.11 11.87
2 ABC 2017-12-04 12.08 11.40
3 ABC 2017-12-03 12.01 13.03
..
101 ABC 2017-07-04 9.01 9.59
102 ABC 2017-07-03 7.89 8.19
I would like to build a new dataframe by merging the two datasets; it should have the columns shown below. If the reported_time_code from the earnings surprise history is AMC, the record taken from the stock price history should be the next day; if the reported_time_code is BMO, the record should be from the same day. A straight merge on the reported_date column of esh and the date column of sph breaks these conditions. I am looking for an efficient way of transforming the data.
Here is the resultant transformed data set
ticker date adj_open ad_close eps_actual
0 ABC 2017-10-06 12.10 13.11 1.01
101 ABC 2017-07-04 9.01 9.59 0.91
Let's add a new column, 'date', to the earnings history dataframe (eh) based on reported_time_code using np.where, drop the unwanted columns, then merge it with the stock price history dataframe (sph):
import numpy as np

eh['reported_date'] = pd.to_datetime(eh.reported_date)
sph['date'] = pd.to_datetime(sph.date)
eh_new = eh.assign(date=np.where(eh.reported_time_code == 'AMC',
eh.reported_date + pd.DateOffset(days=1),
eh.reported_date)).drop(['reported_date','reported_time_code'],axis=1)
sph.merge(eh_new, on=['ticker','date'])
Output:
ticker date adj_open ad_close eps_actual
0 ABC 2017-10-06 12.10 13.11 1.01
1 ABC 2017-07-04 9.01 9.59 0.91
It is great that your offset is only one day. Then you can do something like the following:
mask = esh['reported_time_code'] == 'AMC'
# The mask is basically an array of 0s and 1s; all we have to do is convert it
# into timedelta objects standing for the number of days to offset.
offset = mask.values.astype('timedelta64[D]')
# The D inside the brackets stands for the unit of time: [D]ays.
esh['date'] = esh['reported_date'] + offset
result = esh.merge(sph, on=['ticker', 'date']).drop(['reported_date', 'reported_time_code'], axis=1)