Python DataFrame rolling median with group by

I have a dataframe with three columns: date, commodity, and value. I want to add another column, median_20, the rolling median of the last 20 days for each commodity in the df. I also want to add columns that show the value n days before; for example, the lag_1 column shows the value 1 day before for a given commodity, lag_2 the value 2 days before, and so on. My df is quite big (>2 million rows). (Note that in the sample code below, the grouping column is named market and the value column commodity.)
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort('date')
date market commodity
0 2017-01-01 GOLD -1.239422
0 2017-01-01 SILVER -0.209840
1 2017-01-02 SILVER 0.146293
1 2017-01-02 GOLD 1.422454
2 2017-01-03 GOLD 0.453222
...

Try:
import pandas as pd
import numpy as np
# create dataframe
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort_values(by='date').reset_index(drop=True)
# create columns
# rolling within each group returns a (market, row) MultiIndex,
# so drop the group level to align the result with df's index
df['median_20_temp'] = df.groupby('market')['commodity'].rolling(20).median().reset_index(level=0, drop=True)
df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
df['lag_1'] = df.groupby('market')['commodity'].shift(1)
df['lag_2'] = df.groupby('market')['commodity'].shift(2)
df.drop(['median_20_temp'], axis=1, inplace=True)
Edit:
The following should work with pandas 0.16.2:
import numpy as np
import pandas as pd
np.random.seed(123)
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort('date').reset_index(drop=True)
# create columns
df['median_20_temp'] = df.groupby('market')['commodity'].apply(lambda s: pd.rolling_median(s, 20))
df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
df['lag_1'] = df.groupby('market')['commodity'].shift(1)
df['lag_2'] = df.groupby('market')['commodity'].shift(2)
df.drop(['median_20_temp'], axis=1, inplace=True)
I hope this helps.

I am sure there is a more efficient way; meanwhile, try this solution:
for commo in df.market.unique():
    mask = df.market == commo
    df.loc[mask, 'lag_1'] = df.loc[mask, 'commodity'].shift(1)
    # pd.rolling_median was removed in pandas 0.18; Series.rolling is the current API
    df.loc[mask, 'median_20'] = df.loc[mask, 'commodity'].rolling(20).median()

Related

How do I remove rows of a Pandas DataFrame based on a certain condition?

import datetime
import yfinance as yf
import numpy as np
import pandas as pd
ETF_DB = ['QQQ', 'EGFIX']
fundsret = yf.download(ETF_DB, start=datetime.date(2020,12,31), end=datetime.date(2022,4,30), interval='1mo')['Adj Close'].pct_change()
df = pd.DataFrame(fundsret)
df
This gives me a table of monthly returns (screenshot omitted) that also contains stray mid-month rows. I'm trying to remove the rows in the dataframe that aren't month end, such as the row 2021-03-22. How do I have the dataframe remove the rows where the date doesn't end in '01'?
df.reset_index(inplace=True)
# Convert the date to datetime64
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
# select only rows where day == 1
filtered = df.loc[df['Date'].dt.day == 1]
Did you mean month start?
You can use:
df = df[df.index.day==1]
reproducible example:
df = pd.DataFrame(columns=['A', 'B'],
                  index=['2021-01-01', '2021-02-01', '2021-03-01',
                         '2021-03-22', '2021-03-31'])
df.index = pd.to_datetime(df.index, dayfirst=False)
output:
A B
2021-01-01 NaN NaN
2021-02-01 NaN NaN
2021-03-01 NaN NaN
end of month
For the end of the month, you can add 1 day and check whether this jumps into the next month:
end = (df.index+pd.Timedelta('1d')).month != df.index.month
df = df[end]
or add an offset and check if the value is unchanged:
end = df.index == (df.index + pd.offsets.MonthEnd(0))
df = df[end]
output:
A B
2021-03-31 NaN NaN
import pandas as pd

# Dummy data
data = {
    'Date': ['2021-01-01', '2022-03-01', '2023-04-22', '2023-04-01'],
    'Name': ['A', 'B', 'C', 'D']
}
# Making a DataFrame
df = pd.DataFrame(data)
# Date pattern required: keep only dates whose day part is '01'
pattern = r'(\d{4})-(\d{2})-01'
new_df = df[df['Date'].str.match(pattern)]
print(new_df)

Filter dataframe based on corresponding rows in another one

I would like to create df3, where the url comes from df1 and the traffic value from the corresponding rows in df2.
Current code:
import pandas as pd
data = [['http://url1.com'], ['http://url3.com']]
data_2 = [[{'url':'http://url1.com', 'traffic':100}], [{'url':'http://url2.com', 'traffic':200}], [{'url':'http://url3.com', 'traffic':300}]]
df1 = pd.DataFrame(data=data, columns=['url'])
df2 = pd.DataFrame(data=data_2, columns=['url', 'traffic'])
df3 = pd.merge(left=df1, right=df2, on='url')
Expected output:
url traffic
0 http://url1.com 100
1 http://url3.com 300
Current output:
ValueError: 2 columns passed, passed data had 1 columns
The ValueError comes from building df2 out of single-element lists of dicts, so unwrap the dicts first. Regarding https and http, you need to normalize the scheme and make sure you overwrite the dataframe:
import pandas as pd
data = [['https://url1.com'], ['https://url3.com']]
data_2 = [[{'url':'http://url1.com', 'traffic':100}], [{'url':'http://url2.com', 'traffic':200}], [{'url':'http://url3.com', 'traffic':300}]]
df1 = pd.DataFrame(data=data, columns=['url'])
df2 = pd.DataFrame([row[0] for row in data_2])  # unwrap the single-dict rows
df1 = df1.replace(to_replace='https', value='http', regex=True)  # normalize the scheme before merging
df3 = pd.merge(left=df1, right=df2, on='url')
print(df3)
url traffic
0 http://url1.com 100
1 http://url3.com 300

Changing pandas dataframe by reference

I have two large DataFrames that I don't want to copy, but I want to apply the same change to both. How can I do this properly? The following is similar to what I want to do, but on a smaller scale: it only rebinds the temporary loop variable df to the filtered result, whereas I want both DataFrames themselves to be changed:
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
for df in [df1, df2]:
    df = df[df['a'] < 3]
We can use query with inplace=True:
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
for df in [df1, df2]:
    df.query('a<3', inplace=True)
df1
a
0 1
1 2
df2
a
0 0
1 1
I don't think this is the best solution, but it should do the job.
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
dfs = [df1, df2]
for i, df in enumerate(dfs):
    dfs[i] = df[df['a'] < 3]
dfs[0]
a
0 1
1 2

python dataframe concatenate based on a chosen date

Say I have the following variables and dataframe:
a = '2020-04-23 14:00:00+00:00','2020-04-23 13:00:00+00:00','2020-04-23 12:00:00+00:00','2020-04-23 11:00:00+00:00','2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 04:00:00+00:00'
b = '2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 05:00:00+00:00','2020-04-23 04:00:00+00:00','2020-04-23 03:00:00+00:00','2020-04-23 02:00:00+00:00','2020-04-23 01:00:00+00:00'
aa = 7105.50,6923.50,6692.50,6523.00,6302.5,6081.5,6262.0,6451.50,6369.50,6110.00
bb = 6386.00,6221.00,6505.00,6534.70,6705.00,6535.00,7156.50,7422.00,7608.50,8098.00
import pandas as pd

df1 = pd.DataFrame()
df1['timestamp'] = a
df1['price'] = aa
df2 = pd.DataFrame()
df2['timestamp'] = b
df2['price'] = bb
print(df1)
print(df2)
I am trying to concatenate the rows as follows:
the top row of df1 down to '2020-04-23 08:00:00+00:00'
'2020-04-23 07:00:00+00:00' down to the last row of df2
For illustration purposes, this is what the resulting dataframe should look like:
c = '2020-04-23 14:00:00+00:00','2020-04-23 13:00:00+00:00','2020-04-23 12:00:00+00:00','2020-04-23 11:00:00+00:00','2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 05:00:00+00:00','2020-04-23 04:00:00+00:00','2020-04-23 03:00:00+00:00','2020-04-23 02:00:00+00:00','2020-04-23 01:00:00+00:00'
cc = 7105.50,6923.50,6692.50,6523.00,6302.5,6081.5,6262.0,6534.70,6705.00,6535.00,7156.50,7422.00,7608.50,8098.00
df = pd.DataFrame()
df['timestamp'] = c
df['price'] = cc
print(df)
Any ideas?
You can convert the timestamp columns to datetime objects with pd.to_datetime, and then use boolean indexing and pd.concat to select and merge them:
df1.timestamp = pd.to_datetime(df1.timestamp)
df2.timestamp = pd.to_datetime(df2.timestamp)
dfs = [df1.loc[df1.timestamp >= pd.to_datetime("2020-04-23 08:00:00+00:00"), :],
       df2.loc[df2.timestamp <= pd.to_datetime("2020-04-23 07:00:00+00:00"), :]]
df_conc = pd.concat(dfs)

Faster update of a pandas DataFrame

I have a DataFrame named df with columns GENDER, AGE, ID, and others, and another DataFrame named df_2 with only three columns: GENDER, AGE, and ID. I want to update the values of GENDER and AGE in df with the values from df_2.
So my idea is:
import tqdm

df_id = df.ID.tolist()
df_2_id = df_2.ID.tolist()
df = df.set_index('ID')
df_2 = df_2.set_index('ID')
# all the ids in df_2_id are in df_id
for id in tqdm.tqdm_notebook(df_2_id):
    df.loc[id, 'GENDER'] = df_2.loc[id, 'GENDER']
    df.loc[id, 'AGE'] = df_2.loc[id, 'AGE']
However, the for loop only runs at about 17.2 iterations per second, so updating the data takes around 2 hours. How can I make it faster?
I think you first need the intersection of the indices and can then set the values:
idx = df.index.intersection(df_2.index)
df.loc[idx, 'GENDER'] = df_2['GENDER']
df.loc[idx, 'AGE'] = df_2['AGE']
Or concat them together and remove duplicates, keep last value:
df = pd.concat([df, df_2])
df = df[~df.index.duplicated(keep='last')]
Similar solution:
df = pd.concat([df, df_2]).reset_index().drop_duplicates('ID', keep='last')
Sample:
df = pd.DataFrame({'ID':list('abcdef'),
                   'AGE':[5,3,6,9,2,4],
                   'GENDER':list('aaabbb')})
#print (df)
df_2 = pd.DataFrame({'ID':list('def'),
                     'AGE':[90,20,40],
                     'GENDER':list('eee')})
#print (df_2)
df = df.set_index('ID')
df_2 = df_2.set_index('ID')
idx = df.index.intersection(df_2.index)
df.loc[idx, 'GENDER'] = df_2['GENDER']
df.loc[idx, 'AGE'] = df_2['AGE']
print (df)
AGE GENDER
ID
a 5 a
b 3 a
c 6 a
d 90 e
e 20 e
f 40 e
