I have two dataframes loaded from CSV files:
time_df: consists of all the dates I want, as shown below
0 2017-01-31
1 2017-01-26
2 2017-01-12
3 2017-01-09
4 2017-01-02
price_df: consists of other fields and many dates that I do not need
Date NYSEARCA:TPYP NYSEARCA:MENU NYSEARCA:SLYV NYSEARCA:CZA
0 2017-01-31 NaN 16.56 117.75 55.96
1 2017-01-26 NaN 16.68 116.89 55.84
2 2017-01-27 NaN 16.70 118.47 56.04
3 2017-01-12 NaN 16.81 119.14 56.13
5 2017-01-09 NaN 16.91 120.00 56.26
6 2017-01-08 NaN 16.91 120.00 56.26
7 2017-01-02 NaN 16.91 120.00 56.26
My aim is to delete the rows in price_df whose dates do not appear in time_df.
I tried:
del price_df['Date'] if price_df['Date']!=time_df['Date']
but that is not valid, so I tried to print:
print(price_df['Date'] != time_df['Date'])
but it raises the following error: Can only compare identically-labeled Series objects
The element-wise comparison fails because the two 'Date' Series have different lengths and index labels, so pandas refuses to compare them. This sounds like a problem an inner join can fix:
time_df.merge(price_df, on='Date',copy=False)
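A minimal, self-contained sketch of that approach (the file names and the read_csv step are assumptions for illustration; the important point is that both 'Date' columns are parsed to the same dtype before merging):
import pandas as pd

# hypothetical file names, standing in for the question's CSVs
time_df = pd.read_csv('time.csv', parse_dates=['Date'])
price_df = pd.read_csv('price.csv', parse_dates=['Date'])

# an inner merge keeps only the price_df rows whose Date also appears in time_df
result = time_df.merge(price_df, on='Date', how='inner')
An equivalent filter that avoids the merge entirely is price_df[price_df['Date'].isin(time_df['Date'])].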
I have one dataframe_1 as
date
0 2020-01-01
1 2020-01-02
2 2020-01-03
3 2020-01-04
4 2020-01-05
and another dataframe_2 as
date source dest price
634647 2020-09-18 EUR USD 1.186317
634648 2020-09-19 EUR USD 1.183970
634649 2020-09-20 EUR USD 1.183970
I want to merge them on 'date', but the problem is that dataframe_1's last date is '2021-02-15' while dataframe_2's last date is '2021-02-01'.
I want the resulting dataframe to keep every date from dataframe_1, with NaN where dataframe_2 has no data:
date source dest price
634647 2021-02-01 EUR USD 1.186317
634648 2021-02-02 NaN NaN NaN
634649 2021-02-03 NaN NaN NaN
...
2021-02-13 NaN NaN NaN
2021-02-14 NaN NaN NaN
2021-02-15 NaN NaN NaN
But I am not able to do it using pd.merge. Please ignore the indices in the dataframes.
Thanks a lot in advance.
You can use join to do it:
df1.set_index('date').join(df2.set_index('date'))
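A minimal sketch of how this keeps every date from dataframe_1, with NaN where dataframe_2 has nothing (the sample values below are invented just to make the snippet runnable; df1 and df2 stand in for dataframe_1 and dataframe_2):
import pandas as pd

df1 = pd.DataFrame({'date': pd.date_range('2021-01-28', '2021-02-15')})
df2 = pd.DataFrame({'date': pd.date_range('2021-01-28', '2021-02-01'),
                    'source': 'EUR', 'dest': 'USD',
                    'price': [1.1863, 1.1840, 1.1840, 1.1851, 1.1872]})

# join is a left join by default, so all df1 dates survive;
# dates absent from df2 get NaN in source/dest/price
result = df1.set_index('date').join(df2.set_index('date')).reset_index()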
I am trying to remove the comma separator from values in a Pandas dataframe so that I can convert them to integers. I have been using the following method:
df_orders['qty'] = df_orders['qty'].str.replace(',','')
However, this seems to return NaN for some values that did not originally contain ','. I have included a sample of my input data and current output below:
Input:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A 18
667919 2020-10-13 A 5
674990 2020-10-12 A 2
703901 2020-10-09 A 1
715411 2020-10-08 A 1
721557 2020-10-07 A 31
740515 2020-10-06 A 49
752670 2020-10-05 A 4
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A 2
969909 2020-09-07 A 3
1021548 2020-08-31 A 2
1032254 2020-08-30 A 8
1077443 2020-08-25 A 5
1089670 2020-08-24 A 24
1098843 2020-08-23 A 16
1102025 2020-08-22 A 23
1179347 2020-08-12 A 1
1305700 2020-07-29 A 1
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
Current Output:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A NaN
667919 2020-10-13 A NaN
674990 2020-10-12 A NaN
703901 2020-10-09 A NaN
715411 2020-10-08 A NaN
721557 2020-10-07 A NaN
740515 2020-10-06 A NaN
752670 2020-10-05 A NaN
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A NaN
969909 2020-09-07 A NaN
1021548 2020-08-31 A NaN
1032254 2020-08-30 A NaN
1077443 2020-08-25 A NaN
1089670 2020-08-24 A NaN
1098843 2020-08-23 A NaN
1102025 2020-08-22 A NaN
1179347 2020-08-12 A NaN
1305700 2020-07-29 A NaN
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
I have had a look around but can't seem to find what is causing this behaviour.
I was able to reproduce your issue:
# toy df
df
qty
0 1
1 2,
2 3
df['qty'].str.replace(',', '')
0 NaN
1 2
2 NaN
Name: qty, dtype: object
I created df by doing this:
df = pd.DataFrame({'qty': [1, '2,', 3]})
In other words, your column has mixed data types: some values are integers while others are strings. When you apply .str methods to a mixed-type column, the non-string values are turned into NaN, which is pandas' way of saying "it doesn't make sense to run a string method on an int".
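A quick way to confirm the mix (this check is my addition, not part of the original answer) is to count the Python types present in the column:
df['qty'].map(type).value_counts()
# <class 'int'>    2
# <class 'str'>    1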
You may fix this by converting the entire column to string, then back to int:
df['qty'].astype(str).str.replace(',', '').astype(int)
Or if you want something a little more robust, try
df['qty'] = pd.to_numeric(
    df['qty'].astype(str).str.extract(r'(\d+)', expand=False), errors='coerce')
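Applied to the toy frame, the first fix restores a clean integer column (a small demonstration; note that the extract-based variant keeps only the first run of digits, so a value like '1,234' would come out as 1 rather than 1234):
df = pd.DataFrame({'qty': [1, '2,', 3]})
df['qty'] = df['qty'].astype(str).str.replace(',', '').astype(int)
df['qty'].tolist()
# [1, 2, 3]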
Basically, what I'm trying to accomplish is to fill in the missing dates (creating new DataFrame rows) for each product, then create a new column holding the cumulative sum of column 'A' (example shown below).
The data is a DataFrame with a (product, date) MultiIndex.
Basically I would like to apply this answer to a MultiIndex DataFrame, using only the rightmost index level (the dates), and then compute a cumulative sum (np.cumsum) for each product across all dates.
A
product date
0 2017-01-02 1
2017-01-03 2
2017-01-04 2
2017-01-05 1
2017-01-06 4
2017-01-07 1
2017-01-10 7
1 2018-06-29 1
2018-06-30 4
2018-07-01 1
2018-07-02 1
2018-07-04 2
What I want to accomplish (efficiently) is:
A CumSum
product date
0 2017-01-02 1 1
2017-01-03 2 3
2017-01-04 2 5
2017-01-05 1 6
2017-01-06 4 10
2017-01-07 1 11
2017-01-08 0 11
2017-01-09 0 11
2017-01-10 7 18
1 2018-06-29 1 1
2018-06-30 4 5
2018-07-01 1 6
2018-07-02 1 7
2018-07-03 0 7
2018-07-04 2 9
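(For anyone who wants to reproduce the snippets below, a construction sketch of the input frame; the values are copied from the table above and only the construction code itself is assumed:)
import pandas as pd

dates0 = pd.to_datetime(['2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
                         '2017-01-06', '2017-01-07', '2017-01-10'])
dates1 = pd.to_datetime(['2018-06-29', '2018-06-30', '2018-07-01',
                         '2018-07-02', '2018-07-04'])
idx = pd.MultiIndex.from_tuples([(0, d) for d in dates0] + [(1, d) for d in dates1],
                                names=['product', 'date'])
df = pd.DataFrame({'A': [1, 2, 2, 1, 4, 1, 7, 1, 4, 1, 1, 2]}, index=idx)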
There are two ways.
One way:
Use groupby with apply, combined with resample and cumsum. Finally, pd.concat the result with df.A and fillna with 0:
s = (df.reset_index(0).groupby('product').apply(lambda x: x.resample(rule='D')
.asfreq(0).A.cumsum()))
pd.concat([df.A, s.rename('cumsum')], axis=1).fillna(0)
Out[337]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
Another way:
You need two groupbys: the first for the resample, the second for the cumsum. Finally, use pd.concat and fillna with 0:
s1 = df.reset_index(0).groupby('product').resample(rule='D').asfreq(0).A
pd.concat([df.A, s1.groupby(level=0).cumsum().rename('cumsum')], axis=1).fillna(0)
Out[351]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
Pandas: select DF rows based on another DF is the closest answer I can find to my question, but I don't believe it quite solves it.
Anyway, I am working with two very large pandas dataframes (so speed is a consideration), df_emails and df_trips, both of which are already sorted by CustID and then by date.
df_emails includes the date we sent a customer an email and it looks like this:
CustID DateSent
0 2 2018-01-20
1 2 2018-02-19
2 2 2018-03-31
3 4 2018-01-10
4 4 2018-02-26
5 5 2018-02-01
6 5 2018-02-07
df_trips includes the dates a customer came to the store and how much they spent, and it looks like this:
CustID TripDate TotalSpend
0 2 2018-02-04 25
1 2 2018-02-16 100
2 2 2018-02-22 250
3 4 2018-01-03 50
4 4 2018-02-28 100
5 4 2018-03-21 100
6 8 2018-01-07 200
Basically, what I need to do is find the number of trips and total spend for each customer in between each email sent. If it is the last time an email is sent for a given customer, I need to find the total number of trips and total spend after the email, but before the end of the data (2018-04-01). So the final dataframe would look like this:
CustID DateSent NextDateSentOrEndOfData TripsBetween TotalSpendBetween
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 2018-04-01 0.0 0.0
3 4 2018-01-10 2018-02-26 0.0 0.0
4 4 2018-02-26 2018-04-01 2.0 200.0
5 5 2018-02-01 2018-02-07 0.0 0.0
6 5 2018-02-07 2018-04-01 0.0 0.0
Though I have tried my best to do this in a Python/pandas-friendly way, the only accurate solution I have been able to implement uses np.where, shifting, and looping. The solution looks like this:
df_emails["CustNthVisit"] = df_emails.groupby("CustID").cumcount()+1
df_emails["CustTotalVisit"] = df_emails.groupby("CustID")["CustID"].transform('count')
df_emails["NextDateSentOrEndOfData"] = pd.to_datetime(df_emails["DateSent"].shift(-1)).where(df_emails["CustNthVisit"] != df_emails["CustTotalVisit"], pd.to_datetime('04-01-2018'))
for i in df_emails.index:
    df_emails.at[i, "TripsBetween"] = len(df_trips[
        (df_trips["CustID"] == df_emails.at[i, "CustID"])
        & (df_trips["TripDate"] > df_emails.at[i, "DateSent"])
        & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])])
for i in df_emails.index:
    df_emails.at[i, "TotalSpendBetween"] = df_trips[
        (df_trips["CustID"] == df_emails.at[i, "CustID"])
        & (df_trips["TripDate"] > df_emails.at[i, "DateSent"])
        & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])
    ].TotalSpend.sum()
df_emails.drop(['CustNthVisit',"CustTotalVisit"], axis=1, inplace=True)
However, a %%timeit has revealed that this takes 10.6ms on just the seven rows shown above, which makes this solution pretty much infeasible on my actual datasets of about 1,000,000 rows. Does anyone know a solution here that is faster and thus feasible?
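(For reference, the two sample frames can be constructed like this; the values are copied from the tables above and only the construction code is assumed:)
import pandas as pd

df_emails = pd.DataFrame({
    'CustID': [2, 2, 2, 4, 4, 5, 5],
    'DateSent': pd.to_datetime(['2018-01-20', '2018-02-19', '2018-03-31',
                                '2018-01-10', '2018-02-26',
                                '2018-02-01', '2018-02-07'])})
df_trips = pd.DataFrame({
    'CustID': [2, 2, 2, 4, 4, 4, 8],
    'TripDate': pd.to_datetime(['2018-02-04', '2018-02-16', '2018-02-22',
                                '2018-01-03', '2018-02-28', '2018-03-21',
                                '2018-01-07']),
    'TotalSpend': [25, 100, 250, 50, 100, 100, 200]})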
Add the next-send date column to the emails:
df_emails["NextDateSent"] = df_emails.groupby("CustID")["DateSent"].shift(-1)
Sort both frames for merge_asof, then merge each trip to the nearest preceding email to create a trip lookup table:
df_emails = df_emails.sort_values("DateSent")
df_trips = df_trips.sort_values("TripDate")
df_lookup = pd.merge_asof(df_trips, df_emails, by="CustID", left_on="TripDate",right_on="DateSent", direction="backward")
Aggregate the lookup table for the data you want.
df_lookup = df_lookup.loc[:, ["CustID", "DateSent", "TotalSpend"]].groupby(["CustID", "DateSent"]).agg(["count","sum"])
Left join it back to the email table.
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
I chose to leave NaNs as NaNs because I don't like filling in default values too early: you can always do that later if you prefer, but once defaults are in you can't easily tell apart the values that existed from the ones that didn't.
CustID DateSent NextDateSent (TotalSpend, count) (TotalSpend, sum)
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 NaT NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 NaT 2.0 200.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 NaT NaN NaN
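If you want the exact columns the question asks for, a short follow-up sketch (the renaming relies on the column order shown in the output above, and the end-of-data fill value is the question's 2018-04-01; both choices are mine, not part of the original answer):
# rename the five columns, in the order shown in the output above
df_merge.columns = ["CustID", "DateSent", "NextDateSentOrEndOfData",
                    "TripsBetween", "TotalSpendBetween"]
# replace the NaT/NaN placeholders with the end-of-data date and zeros
df_merge["NextDateSentOrEndOfData"] = (
    df_merge["NextDateSentOrEndOfData"].fillna(pd.Timestamp("2018-04-01")))
df_merge[["TripsBetween", "TotalSpendBetween"]] = (
    df_merge[["TripsBetween", "TotalSpendBetween"]].fillna(0))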
This would be an easy case for merge_asof if it could handle the max_date cutoff, so instead I take the long way:
max_date = pd.to_datetime('2018-04-01')
# set_index for easy extraction by id
df_emails.set_index('CustID', inplace=True)
# we want this later in the final output
df_emails['NextDateSentOrEndOfData'] = df_emails.groupby('CustID').shift(-1).fillna(max_date)
# cuts function for groupby
def cuts(df):
    custID = df.CustID.iloc[0]
    bins = list(df_emails.loc[[custID], 'DateSent']) + [max_date]
    return pd.cut(df.TripDate, bins=bins, right=False)
# bin the dates:
s = df_trips.groupby('CustID', as_index=False, group_keys=False).apply(cuts)
# aggregate the info:
new_df = (df_trips.groupby([df_trips.CustID, s])
.TotalSpend.agg(['sum', 'size'])
.reset_index()
)
# get the right limit:
new_df['NextDateSentOrEndOfData'] = new_df.TripDate.apply(lambda x: x.right)
# drop the unnecessary info
new_df.drop('TripDate', axis=1, inplace=True)
# merge:
df_emails.reset_index().merge(new_df,
on=['CustID','NextDateSentOrEndOfData'],
how='left'
)
Output:
CustID DateSent NextDateSentOrEndOfData sum size
0 2 2018-01-20 2018-02-19 125.0 2.0
1 2 2018-02-19 2018-03-31 250.0 1.0
2 2 2018-03-31 2018-04-01 NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 2018-04-01 200.0 2.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 2018-04-01 NaN NaN
Preface: I'm newish, but have searched for hours here and in the pandas documentation without success. I've also read Wes's book.
I am modeling stock market data for a hedge fund and have a simple MultiIndexed DataFrame with tickers, dates (daily), and fields. The sample here is from Bloomberg: three months (Dec. 2016 through Feb. 2017) and three tickers (AAPL, IBM, MSFT).
import numpy as np
import pandas as pd
import os
# get data from Excel
curr_directory = os.getcwd()
filename = 'Sample Data File.xlsx'
filepath = os.path.join(curr_directory, filename)
df = pd.read_excel(filepath, sheet_name='Sheet1', index_col=[0, 1], usecols='A:D')
# sort
df.sort_index(inplace=True)
# sample of the data
df.head(15)
Out[4]:
PX_LAST PX_VOLUME
Security Name date
AAPL US Equity 2016-12-01 109.49 37086862
2016-12-02 109.90 26527997
2016-12-05 109.11 34324540
2016-12-06 109.95 26195462
2016-12-07 111.03 29998719
2016-12-08 112.12 27068316
2016-12-09 113.95 34402627
2016-12-12 113.30 26374377
2016-12-13 115.19 43733811
2016-12-14 115.19 34031834
2016-12-15 115.82 46524544
2016-12-16 115.97 44351134
2016-12-19 116.64 27779423
2016-12-20 116.95 21424965
2016-12-21 117.06 23783165
df.tail(15)
Out[5]:
PX_LAST PX_VOLUME
Security Name date
MSFT US Equity 2017-02-07 63.43 20277226
2017-02-08 63.34 18096358
2017-02-09 64.06 22644443
2017-02-10 64.00 18170729
2017-02-13 64.72 22920101
2017-02-14 64.57 23108426
2017-02-15 64.53 17005157
2017-02-16 64.52 20546345
2017-02-17 64.62 21248818
2017-02-21 64.49 20655869
2017-02-22 64.36 19292651
2017-02-23 64.62 20273128
2017-02-24 64.62 21796800
2017-02-27 64.23 15871507
2017-02-28 63.98 23239825
When I calculate daily price changes, like this, it seems to work, only the first day is NaN, as it should be:
df.head(5)
Out[7]:
PX_LAST PX_VOLUME px_change_%
Security Name date
AAPL US Equity 2016-12-01 109.49 37086862 NaN
2016-12-02 109.90 26527997 0.003745
2016-12-05 109.11 34324540 -0.007188
2016-12-06 109.95 26195462 0.007699
2016-12-07 111.03 29998719 0.009823
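(The question doesn't show the code that produced px_change_%; one plausible per-ticker computation, included here purely as an assumption and not the poster's original code, would be:)
# assumed: percentage change computed within each ticker group
df['px_change_%'] = df.groupby(level=0)['PX_LAST'].pct_change()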
But the daily 30-day volume doesn't. It should be NaN only for the first 29 days of each ticker, but it is NaN everywhere:
# daily change from 30 day volume - doesn't work
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean()
df['volume_change_%'] = (df['PX_VOLUME'] - df['30_day_volume']) / df['30_day_volume']
df.iloc[:,3:].tail(40)
Out[12]:
30_day_volume volume_change_%
Security Name date
MSFT US Equity 2016-12-30 NaN NaN
2017-01-03 NaN NaN
2017-01-04 NaN NaN
2017-01-05 NaN NaN
2017-01-06 NaN NaN
2017-01-09 NaN NaN
2017-01-10 NaN NaN
2017-01-11 NaN NaN
2017-01-12 NaN NaN
2017-01-13 NaN NaN
2017-01-17 NaN NaN
2017-01-18 NaN NaN
2017-01-19 NaN NaN
2017-01-20 NaN NaN
2017-01-23 NaN NaN
2017-01-24 NaN NaN
2017-01-25 NaN NaN
2017-01-26 NaN NaN
2017-01-27 NaN NaN
2017-01-30 NaN NaN
2017-01-31 NaN NaN
2017-02-01 NaN NaN
2017-02-02 NaN NaN
2017-02-03 NaN NaN
2017-02-06 NaN NaN
2017-02-07 NaN NaN
2017-02-08 NaN NaN
2017-02-09 NaN NaN
2017-02-10 NaN NaN
2017-02-13 NaN NaN
2017-02-14 NaN NaN
2017-02-15 NaN NaN
2017-02-16 NaN NaN
2017-02-17 NaN NaN
2017-02-21 NaN NaN
2017-02-22 NaN NaN
2017-02-23 NaN NaN
2017-02-24 NaN NaN
2017-02-27 NaN NaN
2017-02-28 NaN NaN
As pandas seems to have been designed with finance in mind, I'm surprised this isn't more straightforward.
Edit: I've tried some other approaches as well.
I tried converting the data into a Panel (3D), but didn't find any built-in window functions there other than converting to a DataFrame and back, so no advantage.
I tried creating a pivot table, but couldn't find a way to reference just the first level of the MultiIndex; df.index.levels[0] and ...levels[1] weren't working.
Thanks!
Can you try the following to see if it works?
df['30_day_volume'] = df.groupby(level=0)['PX_VOLUME'].rolling(window=30).mean().values
df['volume_change_%'] = (df['PX_VOLUME'] - df['30_day_volume']) / df['30_day_volume']
I can verify that Allen's answer works when using pandas_datareader, after adjusting the index level used in the groupby to match the datareader's MultiIndex ordering.
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2016, 12, 1)
end = datetime.datetime(2017, 2, 28)
data = web.DataReader(['AAPL', 'IBM', 'MSFT'], 'yahoo', start, end).to_frame()
data['30_day_volume'] = data.groupby(level=1).rolling(window=30)['Volume'].mean().values
data['volume_change_%'] = (data['Volume'] - data['30_day_volume']) / data['30_day_volume']
# double-check that it computed starting at 30 trading days.
data.loc['2017-1-17':'2017-1-30']
The original poster might try editing this line:
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean()
to the following, using mean().values:
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean().values
Without .values the data don't get properly aligned, resulting in NaNs: the grouped rolling mean comes back with an extra group level prepended to its index, so it no longer matches the frame's (ticker, date) MultiIndex when assigned back, and every value becomes NaN. Using .values discards that index and assigns the numbers positionally, which works here because the frame is already sorted by ticker and date.
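If you would rather keep index alignment instead of relying on row order, a sketch of an alternative (assuming a pandas version that has Series.droplevel, i.e. 0.24 or later):
# drop the duplicated group level so the rolling result's index matches df's
df['30_day_volume'] = (
    df.groupby(level=0)['PX_VOLUME']
      .rolling(window=30)
      .mean()
      .droplevel(0)
)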