I have DataFrame like below:
df = pd.DataFrame({"data" : ["25.01.2020", and many more other dates...]})
df["data"] = pd.to_datetime(df["data"], format = "%d%m%Y")
And I have a series of special dates like below:
special_date = pd.Series(pd.to_datetime(["16.01.2020",
"27.01.2020",
and many more other dates...], dayfirst=True))
And I need to calculate 2 more columns in this DataFrame:
col1 = number of weeks to the next special date
col2 = number of weeks after las special date
So I need results like below:
col1 = 1 because next special date after 25.01 is 27.01 so it is the same week
col2 = 2 because last special date before 25.01 is 16.01 so i is 2 weeks ago
*please be aware that I have many more dates, so code needs to work for more dates than only 2 special dates or only 1 data from df.
You can use broadcasting to create a matrix of time deltas and than calculate the minima for your new columns
import numpy as np, pandas as pd
df = pd.DataFrame({'data': pd.to_datetime(["01.01.2020","25.01.2020","20.02.2020"], dayfirst=True)})
s = pd.Series(pd.to_datetime(["16.01.2020","27.01.2020","08.02.2020","19.02.2020"], dayfirst=True))
delta = (s.to_numpy()[:,None] - df['data'].to_numpy()).astype('timedelta64[D]') / np.timedelta64(1, 'D')
n = np.min( delta, 0, where=delta> 0, initial=np.inf)
p = np.min(-delta, 0, where=delta<=0, initial=np.inf)
df['next'] = np.ceil(n/7) #consider np.floor
df['prev'] = np.ceil(p/7)
Alternatively to using the where argument you could perform the steps by hand:
n = delta.copy(); n[delta<=0] = np.inf; n = np.abs(np.min(n,0))
p = delta.copy(); p[delta> 0] = -np.inf; p = np.abs(np.min(-p,0))
Related
I have a ~8million-ish row data frame consisting of sales for 615 products across 16 stores each day for five years.
I need to make new column/s that consists of the sales shifted back from 1 to 7 days. I've decided to sort the data frame by date, product and location. The I concatenate item and location as its own column.
Using that column I loop through each unique item/location concatenation and make the shifted sales columns. This code is below:
import pandas as pd
#sort values by item, location, date
df = df.sort_values(['date', 'product', 'location'])
df['sort_values'] = df['product']+"_"+df['location']
df1 = pd.DataFrame()
z = 0
for i in list(df['sort_values'].unique()):
df_ = df[df['sort_values']==i]
df_ = df_.sort_values('ORD_DATE')
df_['eaches_1'] = df_['eaches'].shift(-1)
df_['eaches_2'] = df_['eaches'].shift(-2)
df_['eaches_3'] = df_['eaches'].shift(-3)
df_['eaches_4'] = df_['eaches'].shift(-4)
df_['eaches_5'] = df_['eaches'].shift(-5)
df_['eaches_6'] = df_['eaches'].shift(-6)
df_['eaches_7'] = df_['eaches'].shift(-7)
df1 = pd.concat((df1, df_))
z+=1
if z % 100 == 0:
print(z)
The above code gets me exactly what I want, but takes FOREVER to complete. Is there a faster way to accomplish what I want?
I have below dataframe called "df" and calculating the sum by unique id called "Id".
Can anyone help me in optimizing the code i have tried.
import pandas as pd
from datetime import datetime, timedelta
df= {'Date':['2019-01-11 10:23:45','2019-01-09 10:23:45', '2019-01-11 10:27:45',
'2019-01-11 10:25:45', '2019-01-11 10:30:45', '2019-01-11 10:35:45',
'2019-02-09 10:25:45'],
'Id':['100','200','300','100','100', '100','200'],
'Amount':[200,400,330,100,300,200,500],
}
df= pd.DataFrame(df)
df["Date"] = pd.to_datetime(df['Date'])
You can try to use groupby, after this each adjust within sub-groupby not to the whole df
s = {}
for x , y in df.groupby(['Id','NCC']):
for i in y.index:
start_date = y['Date'][i] - timedelta(seconds=300)
end_date = y['Date'][i]
mask = (y['Date'] >= start_date) & (y['Date'] < end_date)
count = y.loc[mask]
count = count.loc[(y['Sys'] == 1)]
if len(count) == 0:
s.update({i : 0})
else:
s.update({i : count['Amount'].sum()})
df['New']=pd.Series(s)
If the original data frame has 2 million rows, it would probably be faster to convert the 'Date' column to an index and sort it. Then you can sub select each 5-minute interval:
df = df.set_index('Date').sort_index()
df['Sum_Amt'] = 0
for end in df.index:
start = end - pd.Timedelta('5min')
current_window = df[start : end] # data frame with 5-minute look-back
sum_amt = <calc logic applied to `current_window` goes here>
df.at[end, 'Sum_Amt'] = sum_amt
print(current_window)
print()
I'm not following the logic for calculating Sum_Amt, so I left that out.
I have a PostgreSQL database that has data similar to:
date, character varying, character varying, integer[]
In the interger array column is stored a list of values: 1,2,3,4,5
I'm using pd.read_sql to read the data into a dataframe.
So I have a dataframe with a date column, several string columns, and then a column with a list of intergers.
The array values are regularly used in numpy arrays to do vector math.
In the past I couldn't find a way to convert the list column to a numpy array column without looping and recreating the dataframe row by row.
As an example:
import pandas as pd
import numpy as np
col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
new_df = pd.DataFrame(columns=df.columns)
for i in range(len(df)):
new_df.loc[i, ['Description','Measures']] = [df.at[i,'Description'], np.array(df.at[i,'Measures'])]
print(new_df)
This looping could be over a few thousand rows.
More recently I figured out that if I could do a single line conversion of Series -> list -> nparray -> list -> Series and achieve the result a lot more efficiently.
import pandas as pd
import numpy as np
col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
df['NParray'] = pd.Series(list(np.array(list(np.array(df['Measures'])))))
df.drop(['Measures'], axis=1, inplace=True)
print(df)
print(type(df['NParray'][0]))
I read about and tried to use Series.array and Series.to_numpy, but they don't really achieve what I'm trying to do.
So, the question is:
Is there a method to convert a pd.Series of lists to a numpy array as I'm trying to do?
Is there any easier way to mass convert these lists to numpy arrays?
I was hoping for something like simple like:
df['NParray'] =np.asarray(df['Measures'])
df['NParray'] =np.array(df['Measures'])
df['NParray'] =df['Measures'].array
df['NParray'] =df['Measures'].to_numpy()
But these have different functions and do not work for my purpose.
------------Edited after testing------------------------------------------------
I setup a small test to see what the difference in timings and efficiency would be:
import pandas as pd
import numpy as np
def get_dataframe():
col1 = ['String data'] * 10000
col2 = [list(range(0,5000))] * 10000
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
return(df)
def old_looping(df):
new_df = pd.DataFrame(columns=df.columns)
starttime = pd.datetime.now()
for i in range(len(df)):
new_df.loc[i, ['Description','Measures']] = [df.at[i,'Description'], np.array(df.at[i,'Measures'])]
endtime = pd.datetime.now()
duration = endtime - starttime
print('Looping', duration)
def series_transforms(df):
starttime = pd.datetime.now()
df['NParray'] = pd.Series(list(np.array(list(np.array(df['Measures'])))))
df.drop(['Measures'], axis=1, inplace=True)
endtime = pd.datetime.now()
duration = endtime - starttime
print('Transforms', duration)
def use_apply(df):
starttime = pd.datetime.now()
df['Measures'] = df['Measures'].apply(np.array)
endtime = pd.datetime.now()
duration = endtime - starttime
print('Apply', duration)
def run_test(tests):
for i in range(tests):
construct_df = get_dataframe()
old_looping(construct_df)
for i in range(tests):
construct_df = get_dataframe()
series_transforms(construct_df)
for i in range(tests):
construct_df = get_dataframe()
use_apply(construct_df)
run_test(5)
With 10,000 rows the results were:
Transforms 3.945816
Transforms 3.968821
Transforms 3.891866
Transforms 3.859437
Transforms 3.860590
Apply 4.218867
Apply 4.015742
Apply 4.046986
Apply 3.906360
Apply 3.890740
Looping 27.662418
Looping 27.814523
Looping 27.298895
Looping 27.565626
Looping 27.222970
Transforming through Series-List-NP Array-List-Series is negligibly faster than using Apply. Apply is definitely shorter code and possibly easier to understand.
Increasing the number of rows or array length will increase the times by the same magnitude.
Easiest might be to go with apply to convert to the np.array: df['Measures'].apply(np.array)
Full example:
import pandas as pd
import numpy as np
col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
display(df.Measures)
df['NParray'] = df['Measures'].apply(np.array)
df.drop(['Measures'], axis=1, inplace=True)
print(df)
print(type(df['NParray'][0]))
I have data like this:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
I would like to create a third column which contains a date range created by pd.date_range, using 'date' as the start date and 'n' as the number of periods.
So the first entry should be:
pd.date_range(dt.datetime(2018,8,25), periods=10, freq='d')
(I have a list of "target" dates, and my goal is to check whether the date_range contains any of those target dates).
I tried this:
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'],
x['n'],
freq='d'))
But this gives a KeyError: ('date', 'occurred at index date')
Any idea on how to do this without using a for loop, or is there a better solution altogether?
You can solve your problem without creating date range or day columns. To check if a target date in tgt belongs to a date range specified by rows of df, you can calculate the end of date range, and then check if each date in tgt falls in between the start and end of a time interval. The code below implements this, and produces "target_date" column identical to the one in your own answer:
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
df["daterange_end"] = df.apply(lambda x: x["date"] + pd.Timedelta(days=x["n"]), axis=1)
tgt = [dt.datetime(2018,8,26)]
df['target_date'] = 0
df.loc[(tgt[0] > df.date) &(tgt[0] < df.daterange_end),"target_date"] = 1
print(df)
# date n daterange_end target_date
# 0 2018-08-25 10 2018-09-04 1
# 1 2018-07-21 7 2018-07-28 0
You should add axis=1 in apply
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'], x['n'], freq='d'), axis=1)
I came up with a solution that works (but I'm sure there's a nicer way...)
# define target
tgt = [dt.datetime(2018,8,26)]
# find max n
max_n = max(df['n'])
# create that many columns and increment the day
for i in range(max_n):
df['date_{}'.format(i)] = df['date'] + dt.timedelta(days=i)
new_cols = ['date_{}'.format(n) for n in range(max_n)]
# check each one and replace with a 1 if it matches the "tgt"
df['target_date'] = 0
for col in new_cols:
df['target_date'] = np.where(df[col].isin(tgt),
1,
df['target_date'])
# drop intermediate cols
df = df[[i for i in df.columns if not i in new_cols]]
I have a question about eliminating outliers from two-time series. One time series includes spot market prices and the other includes power outputs. The two series are from 2012 to 2016 and are both CSV files with the with a timestamp and then a value. As example for the power output: 2012-01-01 00:00:00,2335.2152646951617 and for the price: 2012-01-01 00:00:00,17.2
Because the spot market prices are very volatile and have a lot of outliers, I have filtered them. For the second time series, I have to delete the values with the same timestamp, which were eliminated in the time series of the prices. I thought about generating a list with the deleted values and writing a loop to delete the values with the same timestamp in the second time series. But so far that has not worked and I'm not really on. Does anyone have an idea?
My python code looks as follow:
import pandas as pd
import matplotlib.pyplot as plt
power_output = pd.read_csv("./data/external/power_output.csv", delimiter=",", parse_dates=[0], index_col=[0])
print(power_output.head())
plt.plot(power_output)
spotmarket = pd.read_csv("./data/external/spotmarket_dhp.csv", delimiter=",", parse_dates=[0], index_col=[0])
print(spotmarket.head())
r = spotmarket['price'].pct_change().dropna() * 100
print(r)
plt.plot(r)
Q1 = r.quantile(.25)
Q3 = r.quantile(.75)
q1 = Q1-2*(Q3-Q1)
q3 = Q3+2*(Q3-Q1)
a = r[r.between(q1, q3)]
print(a)
plt.plot(a)
Can somebody help me?
If your question is about how to compare two timestamps you can have a look at this.
Basically you could do:
out = r[~r.between(q1, q3)] # negation of your between to get the outliers
df=pd.merge(spotmarker,out,on=['date'],how="outer",indicator=True)
df=df[df['_merge']=='left_only']
Which is a merge operation that conserves only those rows that are only present in the left dataframe
The following suggestion is based on an answer of mine from a previous post.
You can solve your problem by merging both of your series and storing them in pandas dataframe. Then you can use any desired technique to identify and remove outliers. Take a look at the post mentioned above.
Here is my take on your particular problem using a snippet that can handle more than one series:
Since I don't have access to your data, the following snippet will produce two series where one of them has a distinctive outlier:
def sample(colname):
base = 100
nsample = 20
sigma = 10
# Basic df with trend and sinus seasonality
trend1 = np.linspace(0,1, nsample)
y1 = np.sin(trend1)
dates = pd.date_range(pd.datetime(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
# Gaussian Noise with amplitude sigma
df['y2'] = sigma * np.random.normal(size=nsample)
df['y3'] = df['y2'] + base + (np.sin(trend1))
df['trend2'] = 1/(np.cos(trend1)/1.05)
df['y4'] = df['y3'] * df['trend2']
df=df['y4'].to_frame()
df.columns = [colname]
return(df)
df_sample1 = sample(colname = 'series1')
df_sample2 = sample(colname = 'series2')
df_sample2['series2'].iloc[10] = 800
df_sample1.plot()
df_sample2.plot()
Series 1 - No outliers
Series 2 - A distinctive outlier
Now you can merge those series like this:
# Merge dataframes
df_merged = pd.merge(df_sample1, df_sample2, how='outer', left_index=True, right_index=True)
df_merged.plot()
What is considered an outlier will depend full on the nature of your dataset. In this case, you can set the level for identifying outliers using sscipy.zscore(). In the following case, every observation with a difference that exceeds 3 is considered an outlier.
# A function for removing outliers
def noSpikes(df, level, keepFirst):
# 1. Get some info about the original data:
##%%
#df = df_merged
#level = 3
#keepFirst = True
##%%
firstVal = df[:1]
colNames = df.columns
colNumber = len(df.columns)
#cleanBy = 'Series1'
# 2. Take the first difference and
df_diff = df.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)
# 7. Keep only the original columns (drop the diffs)
df_out = df_out.ix[:,:colNumber]
# 8. Fill missing values
df_complete = df_out.fillna(axis=0, method='ffill')
# 9. Reset column names
df_complete.columns = colNames
# Keep the first value
if keepFirst:
df_complete.iloc[0] = firstVal.iloc[0]
return(df_complete)
df_clean = noSpikes(df = df_merged, level = 3, keepFirst = True)
df_clean.plot()
Let me know how this works out for you.
Here's the whole thing for an easy copy-paste:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample(colname):
base = 100
nsample = 20
sigma = 10
# Basic df with trend and sinus seasonality
trend1 = np.linspace(0,1, nsample)
y1 = np.sin(trend1)
dates = pd.date_range(pd.datetime(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
# Gaussian Noise with amplitude sigma
df['y2'] = sigma * np.random.normal(size=nsample)
df['y3'] = df['y2'] + base + (np.sin(trend1))
df['trend2'] = 1/(np.cos(trend1)/1.05)
df['y4'] = df['y3'] * df['trend2']
df=df['y4'].to_frame()
df.columns = [colname]
return(df)
df_sample1 = sample(colname = 'series1')
df_sample2 = sample(colname = 'series2')
df_sample2['series2'].iloc[10] = 800
df_sample1.plot()
df_sample2.plot()
# Merge dataframes
df_merged = pd.merge(df_sample1, df_sample2, how='outer', left_index=True, right_index=True)
df_merged.plot()
# A function for removing outliers
def noSpikes(df, level, keepFirst):
# 1. Get some info about the original data:
firstVal = df[:1]
colNames = df.columns
colNumber = len(df.columns)
#cleanBy = 'Series1'
# 2. Take the first difference and
df_diff = df.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)
# 7. Keep only the original columns (drop the diffs)
df_out = df_out.ix[:,:colNumber]
# 8. Fill missing values
df_complete = df_out.fillna(axis=0, method='ffill')
# 9. Reset column names
df_complete.columns = colNames
# Keep the first value
if keepFirst:
df_complete.iloc[0] = firstVal.iloc[0]
return(df_complete)
df_clean = noSpikes(df = df_merged, level = 3, keepFirst = True)
df_clean.plot()