A few operations with df.groupby()

A few operations with df.groupby() - python

I working with a forex dataset, trying to fill in my dataframe with open, high, low, close updated every tick.
Here is my code:
import pandas as pd
# pandas settings
pd.set_option('display.max_columns', 320)
pd.set_option('display.max_rows', 320)
pd.set_option('display.width', 320)
# creating dataframe
df = pd.read_csv('https://www.dropbox.com/s/tcek3kmleklgxm5/eur_usd_lastweek.csv?dl=1', names=['timestamp', 'ask', 'bid', 'avol', 'bvol'], parse_dates=[0], header=0)
df['spread'] = df.ask - df.bid
df['symbol'] = 'EURUSD'
times = pd.DatetimeIndex(df.timestamp)
# parameters for df.groupby()
df['date'] = times.date
df['hour'] = times.hour
# 1h candles updated every tick
df['candle_number'] = '...'
df['1h_open'] = '...'
df['1h_high'] = '...'
df['1h_low'] = '...'
df['1h_close'] = '...'
# print(df)
grouped = df.groupby(['date', 'hour'])
for idx, x in enumerate(grouped):
print(idx)
print(x)
So as you can see, with for loop I'm getting groups.
Now I want to fill the following columns in my dataframe:
idx be my df['candle_number']
df['1h_open'] must be equal to the very first df.bid in the group
df['1h_high'] = the highest number in df.bid up until current row (so for instance if there are 350 rows in the group, for 20th value
we count the highest number from 0-20 span, on 215th value we the
highest value from 0-215 span which can be completely different.
df['1h_low'] = lowest value up until the current iteration (same approach as for the above)
I hope it's not too confusing =)
Cheers

It's convinient to reindex on date and hour:
df_new = df.set_index(['date', 'hour'])
Then apply groupby functions aggregating by index:
df_new['candle_number'] = df_new.groupby(level=[0,1]).ngroup()
df_new['1h_open'] = df_new.groupby(level=[0,1])['bid'].first()
df_new['1h_high'] = df_new.groupby(level=[0,1])['bid'].cummax()
df_new['1h_low'] = df_new.groupby(level=[0,1])['bid'].cummin()
you can reset_index() back to a flat dataframe.

Related

How can I speed up a multi-column loop?

I have a ~8million-ish row data frame consisting of sales for 615 products across 16 stores each day for five years.
I need to make new column/s that consists of the sales shifted back from 1 to 7 days. I've decided to sort the data frame by date, product and location. The I concatenate item and location as its own column.
Using that column I loop through each unique item/location concatenation and make the shifted sales columns. This code is below:
import pandas as pd
#sort values by item, location, date
df = df.sort_values(['date', 'product', 'location'])
df['sort_values'] = df['product']+"_"+df['location']
df1 = pd.DataFrame()
z = 0
for i in list(df['sort_values'].unique()):
df_ = df[df['sort_values']==i]
df_ = df_.sort_values('ORD_DATE')
df_['eaches_1'] = df_['eaches'].shift(-1)
df_['eaches_2'] = df_['eaches'].shift(-2)
df_['eaches_3'] = df_['eaches'].shift(-3)
df_['eaches_4'] = df_['eaches'].shift(-4)
df_['eaches_5'] = df_['eaches'].shift(-5)
df_['eaches_6'] = df_['eaches'].shift(-6)
df_['eaches_7'] = df_['eaches'].shift(-7)
df1 = pd.concat((df1, df_))
z+=1
if z % 100 == 0:
print(z)
The above code gets me exactly what I want, but takes FOREVER to complete. Is there a faster way to accomplish what I want?

How to create lag feature in pandas in this case?

I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])["Value1","Value2"].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])["Value1","Value2"].mean().reset_index()
df = pd.merge(df,dateGroupGroup,on=["date"],how="left",suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now, I want to create lag features for Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
For example, a new column named Value1_by_Date_lag1 and Value1_bySector_lag1.
And this new column will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
Basically in Value1_by_Date_lag1, the date "15/03" will contain the value "281.75" which is for the date "14/03" (lag of 1 shift).
Basically in Value1_bySector_lag1, the date "15/03" and Sector "Medical" will contain the value "275.0", which is the value for "14/03" and "Medical" rows.
I hope, the question is clear and gave you all the details.

Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date - if not you will have to sort before generating the lagged date. It will work whether or not all the dates are consecutive.

I found 1 inefficient solution (slow and memory intensive).
Lag of "date" group
cols = ["Value1_byDate","Value2_byDate"]
temp = df[["date"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp.date = temp.date.shift(-1-i)
df = pd.merge(df,temp,on="date",how="left",suffixes=["","_lag"+str(i+1)])
Lag of "date" and "Sector" group
cols = ["Value1_bySector","Value2_bySector"]
temp = df[["date","Sector"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp[["Value1_bySector","Value2_bySector"]] = temp.groupby("Sector")["Value1_bySector","Value2_bySector"].shift(1+1)
df = pd.merge(df,temp,on=["date","Sector"],how="left",suffixes=["","_lag"+str(i+1)])
Is there a more simple solution?

Create pandas column of pd.date_range

I have data like this:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
I would like to create a third column which contains a date range created by pd.date_range, using 'date' as the start date and 'n' as the number of periods.
So the first entry should be:
pd.date_range(dt.datetime(2018,8,25), periods=10, freq='d')
(I have a list of "target" dates, and my goal is to check whether the date_range contains any of those target dates).
I tried this:
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'],
x['n'],
freq='d'))
But this gives a KeyError: ('date', 'occurred at index date')
Any idea on how to do this without using a for loop, or is there a better solution altogether?

You can solve your problem without creating date range or day columns. To check if a target date in tgt belongs to a date range specified by rows of df, you can calculate the end of date range, and then check if each date in tgt falls in between the start and end of a time interval. The code below implements this, and produces "target_date" column identical to the one in your own answer:
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
df["daterange_end"] = df.apply(lambda x: x["date"] + pd.Timedelta(days=x["n"]), axis=1)
tgt = [dt.datetime(2018,8,26)]
df['target_date'] = 0
df.loc[(tgt[0] > df.date) &(tgt[0] < df.daterange_end),"target_date"] = 1
print(df)
# date n daterange_end target_date
# 0 2018-08-25 10 2018-09-04 1
# 1 2018-07-21 7 2018-07-28 0

You should add axis=1 in apply
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'], x['n'], freq='d'), axis=1)

I came up with a solution that works (but I'm sure there's a nicer way...)
# define target
tgt = [dt.datetime(2018,8,26)]
# find max n
max_n = max(df['n'])
# create that many columns and increment the day
for i in range(max_n):
df['date_{}'.format(i)] = df['date'] + dt.timedelta(days=i)
new_cols = ['date_{}'.format(n) for n in range(max_n)]
# check each one and replace with a 1 if it matches the "tgt"
df['target_date'] = 0
for col in new_cols:
df['target_date'] = np.where(df[col].isin(tgt),
1,
df['target_date'])
# drop intermediate cols
df = df[[i for i in df.columns if not i in new_cols]]

Pandas and stocks: From daily values (in columns) to monthly values (in rows)

I am having trouble reformatting a dataframe.
My input is a day value rows by symbols columns (each symbol has different dates with it's values):
Input
code to generate input
data = [("01-01-2010", 15, 10), ("02-01-2010", 16, 11), ("03-01-2010", 16.5, 10.5)]
labels = ["date", "AAPL", "AMZN"]
df_input = pd.DataFrame.from_records(data, columns=labels)
The needed output is (month row with new row for each month):
Needed output
code to generate output
data = [("01-01-2010","29-01-2010", "AAPL", 15, 20), ("01-01-2010","29-01-2010", "AMZN", 10, 15),("02-02-2010","30-02-2010", "AAPL", 20, 32)]
labels = ['bd start month', 'bd end month','stock', 'start_month_value', "end_month_value"]
df = pd.DataFrame.from_records(data, columns=labels)
Meaning (Pseudo code)
1. for each row take only non nan values to create a new "row" (maybe dictionary with the date as the index and the [stock, value] as the value.
2. take only rows that are business start of month or business end of month.
3. write those rows to a new datatframe.
I have read several posts like this and this and several more.
All treat with dataframe of the same "type" and just resampling while I need to change to structure...
My code so far
# creating the new index with business days
df1 =pd.DataFrame(range(10000), index = pd.date_range(df.iloc[0].name, periods=10000, freq='D'))
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df2 = df1.resample(bmth_us).mean()
# creating the new index interseting my old one (daily) with the monthly index
new_index = df.index.intersection(df2.index)
# selecting only the rows I want
df = df.loc[new_index]
# creating a dict that will be my new dataset
new_dict = collections.OrderedDict()
# iterating over the rows and adding to dictionary
for index, row in df.iterrows():
# print index
date = df.loc[index].name
# values are the not none values
values = df.loc[index][~df.loc[index].isnull().values]
new_dict[date]=values
# from dict to list
data=[]
for key, values in new_dict.iteritems():
for i in range(0, len(values)):
date = key
stock_name = str(values.index[i])
stock_value = values.iloc[i]
row = (key, stock_name, stock_value)
data.append(row)
# from the list to df
labels = ['date','stock', 'value']
df = pd.DataFrame.from_records(data, columns=labels)
df.to_excel("migdal_format.xls")
Current output I get
One big problem:
I only get value of the stock on the start of month day.. I need start and end so I can calculate the stock gain on this month..
One smaller problem:
I am sure this is not the cleanest and fastest code :)
Thanks a lot!

So I have found a way.
looping through each column
groupby month
taking the first and last value I have in that month
calculate return
df_migdal = pd.DataFrame()
for col in df_input.columns[0:]:
stock_position = df_input.loc[:,col]
name = stock_position.name
name = re.sub('[^a-zA-Z]+', '', name)
name = name[0:-4]
stock_position=stock_position.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])
stock_position["name"] = name
stock_position["return"] = ((stock_position["last"] / stock_position["first"]) - 1) * 100
stock_position.dropna(inplace=True)
df_migdal=df_migdal.append(stock_position)
df_migdal=df_migdal.round(decimals=2)
I tried I way cooler way, but did not know how to handle the ,multi index I got... I needed that for each column, to take the two sub columns and create a third one from some lambda function.
df_input.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])

Deleting the same outliers in two timeseries

I have a question about eliminating outliers from two-time series. One time series includes spot market prices and the other includes power outputs. The two series are from 2012 to 2016 and are both CSV files with the with a timestamp and then a value. As example for the power output: 2012-01-01 00:00:00,2335.2152646951617 and for the price: 2012-01-01 00:00:00,17.2
Because the spot market prices are very volatile and have a lot of outliers, I have filtered them. For the second time series, I have to delete the values with the same timestamp, which were eliminated in the time series of the prices. I thought about generating a list with the deleted values and writing a loop to delete the values with the same timestamp in the second time series. But so far that has not worked and I'm not really on. Does anyone have an idea?
My python code looks as follow:
import pandas as pd
import matplotlib.pyplot as plt
power_output = pd.read_csv("./data/external/power_output.csv", delimiter=",", parse_dates=[0], index_col=[0])
print(power_output.head())
plt.plot(power_output)
spotmarket = pd.read_csv("./data/external/spotmarket_dhp.csv", delimiter=",", parse_dates=[0], index_col=[0])
print(spotmarket.head())
r = spotmarket['price'].pct_change().dropna() * 100
print(r)
plt.plot(r)
Q1 = r.quantile(.25)
Q3 = r.quantile(.75)
q1 = Q1-2*(Q3-Q1)
q3 = Q3+2*(Q3-Q1)
a = r[r.between(q1, q3)]
print(a)
plt.plot(a)
Can somebody help me?

If your question is about how to compare two timestamps you can have a look at this.
Basically you could do:
out = r[~r.between(q1, q3)] # negation of your between to get the outliers
df=pd.merge(spotmarker,out,on=['date'],how="outer",indicator=True)
df=df[df['_merge']=='left_only']
Which is a merge operation that conserves only those rows that are only present in the left dataframe

The following suggestion is based on an answer of mine from a previous post.
You can solve your problem by merging both of your series and storing them in pandas dataframe. Then you can use any desired technique to identify and remove outliers. Take a look at the post mentioned above.
Here is my take on your particular problem using a snippet that can handle more than one series:
Since I don't have access to your data, the following snippet will produce two series where one of them has a distinctive outlier:
def sample(colname):
base = 100
nsample = 20
sigma = 10
# Basic df with trend and sinus seasonality
trend1 = np.linspace(0,1, nsample)
y1 = np.sin(trend1)
dates = pd.date_range(pd.datetime(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
# Gaussian Noise with amplitude sigma
df['y2'] = sigma * np.random.normal(size=nsample)
df['y3'] = df['y2'] + base + (np.sin(trend1))
df['trend2'] = 1/(np.cos(trend1)/1.05)
df['y4'] = df['y3'] * df['trend2']
df=df['y4'].to_frame()
df.columns = [colname]
return(df)
df_sample1 = sample(colname = 'series1')
df_sample2 = sample(colname = 'series2')
df_sample2['series2'].iloc[10] = 800
df_sample1.plot()
df_sample2.plot()
Series 1 - No outliers
Series 2 - A distinctive outlier
Now you can merge those series like this:
# Merge dataframes
df_merged = pd.merge(df_sample1, df_sample2, how='outer', left_index=True, right_index=True)
df_merged.plot()
What is considered an outlier will depend full on the nature of your dataset. In this case, you can set the level for identifying outliers using sscipy.zscore(). In the following case, every observation with a difference that exceeds 3 is considered an outlier.
# A function for removing outliers
def noSpikes(df, level, keepFirst):
# 1. Get some info about the original data:
##%%
#df = df_merged
#level = 3
#keepFirst = True
##%%
firstVal = df[:1]
colNames = df.columns
colNumber = len(df.columns)
#cleanBy = 'Series1'
# 2. Take the first difference and
df_diff = df.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)
# 7. Keep only the original columns (drop the diffs)
df_out = df_out.ix[:,:colNumber]
# 8. Fill missing values
df_complete = df_out.fillna(axis=0, method='ffill')
# 9. Reset column names
df_complete.columns = colNames
# Keep the first value
if keepFirst:
df_complete.iloc[0] = firstVal.iloc[0]
return(df_complete)
df_clean = noSpikes(df = df_merged, level = 3, keepFirst = True)
df_clean.plot()
Let me know how this works out for you.
Here's the whole thing for an easy copy-paste:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample(colname):
base = 100
nsample = 20
sigma = 10
# Basic df with trend and sinus seasonality
trend1 = np.linspace(0,1, nsample)
y1 = np.sin(trend1)
dates = pd.date_range(pd.datetime(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
# Gaussian Noise with amplitude sigma
df['y2'] = sigma * np.random.normal(size=nsample)
df['y3'] = df['y2'] + base + (np.sin(trend1))
df['trend2'] = 1/(np.cos(trend1)/1.05)
df['y4'] = df['y3'] * df['trend2']
df=df['y4'].to_frame()
df.columns = [colname]
return(df)
df_sample1 = sample(colname = 'series1')
df_sample2 = sample(colname = 'series2')
df_sample2['series2'].iloc[10] = 800
df_sample1.plot()
df_sample2.plot()
# Merge dataframes
df_merged = pd.merge(df_sample1, df_sample2, how='outer', left_index=True, right_index=True)
df_merged.plot()
# A function for removing outliers
def noSpikes(df, level, keepFirst):
# 1. Get some info about the original data:
firstVal = df[:1]
colNames = df.columns
colNumber = len(df.columns)
#cleanBy = 'Series1'
# 2. Take the first difference and
df_diff = df.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)
# 7. Keep only the original columns (drop the diffs)
df_out = df_out.ix[:,:colNumber]
# 8. Fill missing values
df_complete = df_out.fillna(axis=0, method='ffill')
# 9. Reset column names
df_complete.columns = colNames
# Keep the first value
if keepFirst:
df_complete.iloc[0] = firstVal.iloc[0]
return(df_complete)
df_clean = noSpikes(df = df_merged, level = 3, keepFirst = True)
df_clean.plot()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

A few operations with df.groupby() - python

Related

How can I speed up a multi-column loop?

How to create lag feature in pandas in this case?

Create pandas column of pd.date_range

Pandas and stocks: From daily values (in columns) to monthly values (in rows)

Deleting the same outliers in two timeseries

Categories

Resources