Averaging every two consecutive index values (every 2 min) in a pandas dataframe - python

I know there are similar questions that have already been answered. However, I can't seem to troubleshoot why none of the solutions are working for me.
My sample dataset:
TimeStamp 340 341 342
10:27:00 1.953036 2.110234 1.981548
10:28:00 1.973408 2.046361 1.806923
10:29:00 0.000000 0.000000 0.014881
10:30:00 2.567976 3.169928 3.479591
I want to find the mean of the data every two minutes for each column. While df.groupby promises a neat solution, it makes my TimeStamp column disappear for some reason. Any help is greatly appreciated.
Expected output:
TimeStamp 340 341 342
10:27:30 1.963222 2.078298 1.894235
10:29:30 1.283988 1.584964 1.747236
Attempted code:
import pandas as pd
import numpy as np
path = '/Users/username/Desktop/Model/'
file1 = 'filename.csv'
df = pd.read_csv(path + file1, skipinitialspace = True)
df['TimeStamp'] = pd.to_timedelta(df['TimeStamp'])
df['TimeStamp'] = df['TimeStamp'].dt.floor('min')
df.set_index('TimeStamp')
rowF = len(df['TimeStamp'])
# Average every two min
newdf = df.groupby(np.arange(len(df.index))//2).mean()
print(newdf)

Set the time as index:
df.set_index(pd.to_timedelta(df.TimeStamp), inplace=True)
And then use resample and aggregate every two minutes:
df.resample("2min").mean().reset_index()
# TimeStamp 340 341 342
#0 10:27:00 1.963222 2.078298 1.894235
#1 10:29:00 1.283988 1.584964 1.747236
#2 10:31:00 NaN NaN NaN
Drop the last observation with iloc:
df.resample("2min").mean().reset_index().iloc[:-1]
# TimeStamp 340 341 342
#0 10:27:00 1.963222 2.078298 1.894235
#1 10:29:00 1.283988 1.584964 1.747236
If you prefer to shift the TimeStamp by 30 seconds:
(df.resample("2min").mean().reset_index()
.assign(TimeStamp = lambda x: x.TimeStamp + pd.Timedelta('30 seconds'))
.iloc[:-1])
# TimeStamp 340 341 342
#0 10:27:30 1.963222 2.078298 1.894235
#1 10:29:30 1.283988 1.584964 1.747236
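As a footnote on the attempted code: df.set_index('TimeStamp') returns a new frame unless it is assigned back or called with inplace=True, and .mean() after the pairwise groupby silently drops the non-numeric TimeStamp column, which is why it disappeared. A minimal sketch that keeps it, assuming df holds timedeltas in TimeStamp as in the attempt (on recent pandas, 'mean' on a timedelta column yields the midpoints 10:27:30 and 10:29:30):
import numpy as np
pairs = np.arange(len(df)) // 2  # 0, 0, 1, 1, ...
newdf = df.groupby(pairs).agg({c: 'mean' for c in df.columns})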

Related

How to generate a rolling mean for a specific date range and location with pandas

I have a large data set with names of stores, dates and profits.
My data set is not the most organized but I now have it in this df.
df
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
I proudly created a function to get each day into one df by itself until I realized it would be very time consuming to do one for each day.
def avg(n):
    return df.loc[df['Date'] == "May" + " " + str(n) + " " + str(2018)]
where n is the day I want to get, so that function gets me just the dates I want.
What I really need is a way to take all the dates I want in a list and build a df for each day. I tried doing this, but it did not work out.
def avg(n):
    dlist = []
    for i in n:
        dlist = df.loc[df['Date'] == "May" + " " + str(i) + " " + str(2018)]
        dlist = pd.DataFrame(dlist)
        dlist.append(i)
    return dlist
df2=avg([21,23,24,25])
My goal there was to have all the dates of (21,23,24,25) for the May
into its own series of df.
But it was a total fail; I got this error:
cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I am not sure if it's also possible to add a rolling average or mean to columns for each day of (21, 23, 24, 25), but that's where the analysis will conclude.
output desired
Store Date Profit Rolling Mean
ABC May 1 2018 234 250
XYZ May 1 2018 410 401
AZY May 1 2018 145 415
where the rolling mean is for the past 30 days. Above all, I would like to have each day in its own df so I can save it to a csv file at the end.
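(The error above, incidentally, comes from dlist.append(i): i is a plain int, and DataFrame.append only accepts Series/DataFrame-like objects. Selecting all the wanted days at once avoids the loop entirely; a minimal sketch, assuming Date has already been parsed with pd.to_datetime and wanted_days is the hypothetical list of days:)
wanted_days = [21, 23, 24, 25]
may_rows = df[(df.Date.dt.month == 5) & (df.Date.dt.day.isin(wanted_days))]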
Rolling Mean:
The example data given in the question has dates in the format May 1 2018, which can't be used for rolling; rolling requires a datetime index.
Instead of string-splitting the original Date column, it should be converted to datetime using df.Date = pd.to_datetime(df.Date), which gives dates in the format 2018-05-01.
With a properly formatted datetime column, use df['Day'] = df.Date.dt.day and df['Month'] = df.Date.dt.month_name() to get Day and Month columns, if desired.
Original data:
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
Transformed Original Data:
df.Date = pd.to_datetime(df.Date)
df['Day'] = df.Date.dt.day
df['Month'] = df.Date.dt.month_name()
Store Date Profit Day Month
ABC 2018-05-01 234 1 May
XYZ 2018-05-01 410 1 May
AZY 2018-05-01 145 1 May
ABC 2018-05-02 234 2 May
XYZ 2018-05-02 410 2 May
AZY 2018-05-02 145 2 May
Rolling Example:
The example dataset is insufficient to produce a 30-day rolling average
In order to have a 30-day rolling mean, there needs to be more than 30 days of data for each store (i.e. on the 31st day, you get the 1st mean, for the previous 30 days)
The following example sets up a dataframe consisting of every day in 2018, a random profit between 100 and 1000, and a random store chosen from ['ABC', 'XYZ', 'AZY'].
Extended Sample:
import pandas as pd
import random
import numpy as np
from datetime import datetime, timedelta
list_of_dates = [date for date in np.arange(datetime(2018, 1, 1), datetime(2019, 1, 1), timedelta(days=1)).astype(datetime)]
df = pd.DataFrame({'Store': [random.choice(['ABC', 'XYZ', 'AZY']) for _ in range(365)],
                   'Date': list_of_dates,
                   'Profit': [np.random.randint(100, 1001) for _ in range(365)]})
Store Date Profit
ABC 2018-01-01 901
AZY 2018-01-02 540
AZY 2018-01-03 417
XYZ 2018-01-04 280
XYZ 2018-01-05 384
XYZ 2018-01-06 104
XYZ 2018-01-07 691
ABC 2018-01-08 376
XYZ 2018-01-09 942
XYZ 2018-01-10 297
df.set_index('Date', inplace=True)
df_rolling = df.groupby(['Store']).rolling(30).mean()
df_rolling.rename(columns={'Profit': '30-Day Rolling Mean'}, inplace=True)
df_rolling.reset_index(inplace=True)
df_rolling.head():
Note: the first 30 days for each store will be NaN
Store Date 30-Day Rolling Mean
ABC 2018-01-01 NaN
ABC 2018-01-03 NaN
ABC 2018-01-07 NaN
ABC 2018-01-11 NaN
ABC 2018-01-13 NaN
df_rolling.tail():
Store Date 30-Day Rolling Mean
XYZ 2018-12-17 556.966667
XYZ 2018-12-18 535.633333
XYZ 2018-12-19 534.733333
XYZ 2018-12-24 551.066667
XYZ 2018-12-27 572.033333
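One caveat, as a hedged aside: rolling(30) is a 30-row window, not a 30-calendar-day window, and since each store only has data on some days the two differ. A sketch of a true 30-day window, assuming the Date index set above (time-based windows default to min_periods=1, so the leading NaN rows disappear):
df_rolling_days = df.groupby('Store').rolling('30D').mean()
df_rolling_days.rename(columns={'Profit': '30-Day Rolling Mean'}, inplace=True)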
Plot:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
g = sns.lineplot(x='Date', y='30-Day Rolling Mean', data=df_rolling, hue='Store')
for item in g.get_xticklabels():
    item.set_rotation(60)
plt.show()
Alternatively: A dataframe for each store:
It's also possible to create a separate dataframe for each store and put it inside a dict
This alternative makes it easier to plot a more detailed graph with less code
import pandas as pd
import random
import numpy as np
from datetime import datetime, timedelta
list_of_dates = [date for date in np.arange(datetime(2018, 1, 1), datetime(2019, 1, 1), timedelta(days=1)).astype(datetime)]
df = pd.DataFrame({'Store': [random.choice(['ABC', 'XYZ', 'AZY']) for _ in range(365)],
                   'Date': list_of_dates,
                   'Profit': [np.random.randint(100, 1001) for _ in range(365)]})
df_dict = dict()
for store in df.Store.unique():
    df_dict[store] = df[['Date', 'Profit']][df.Store == store]
    df_dict[store].set_index('Date', inplace=True)
    df_dict[store]['Profit: 30-Day Rolling Mean'] = df_dict[store].rolling(30).mean()
print(df_dict.keys())
>>> dict_keys(['ABC', 'XYZ', 'AZY'])
print(df_dict['ABC'].head())
Plot:
import matplotlib.pyplot as plt
_, axes = plt.subplots(1, 1, figsize=(13, 8), sharex=True)
for k, v in df_dict.items():
    axes.plot(v['Profit'], marker='.', linestyle='-', linewidth=0.5, label=k)
    axes.plot(v['Profit: 30-Day Rolling Mean'], marker='o', markersize=4, linestyle='-', linewidth=0.5, label=f'{k} Rolling')
axes.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.ylabel('Profit ($)')
plt.xlabel('Date')
plt.title('Recorded Profit vs. 30-Day Rolling Mean of Profit')
plt.show()
Get a dataframe for a specific month:
Recall, this is randomly generated data, so the stores don't have data for every day of the month.
may_df = dict()
for k, v in df_dict.items():
    v.reset_index(inplace=True)
    may_df[k] = v[v.Date.dt.month_name() == 'May']
    may_df[k].set_index('Date', inplace=True)
print(may_df['XYZ'])
Plot: May data only:
Save dataframes:
pandas.DataFrame.to_csv()
# may_df is a dict of dataframes, so save each store's frame separately
for store, store_df in may_df.items():
    store_df.reset_index(inplace=True)
    store_df.to_csv(f'may_{store}.csv', index=False)
A simple solution may be groupby()
Check out this example:
import pandas as pd
listt = [['a', 2, 3],
         ['b', 5, 7],
         ['a', 3, 9],
         ['a', 1, 3],
         ['b', 9, 4],
         ['a', 4, 7],
         ['c', 7, 2],
         ['a', 2, 5],
         ['c', 4, 7],
         ['b', 5, 5]]
my_df = pd.DataFrame(listt)
my_df.columns = ['Class', 'Day_1', 'Day_2']
my_df.groupby('Class')['Day_1'].mean()
Output:
Class
a 2.400000
b 6.333333
c 5.500000
Name: Day_1, dtype: float64
Note: similarly, you can group your data by Date and get the average of your Profit, as in the sketch below.
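A minimal sketch of that suggestion, assuming the question's Store/Date/Profit columns:
df['Date'] = pd.to_datetime(df['Date'])
daily_avg = df.groupby('Date')['Profit'].mean()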

Create new velocity column from distance and datetime data

I am attempting to create a downward-velocity model for offshore drilling. It uses a Depth variable (which increases by 1 foot per row) and DateTime data, which is more intermittent and is updated only at each foot of depth:
Dept DateTime
1141 5/24/2017 04:31
1142 5/24/2017 04:32
1143 5/24/2017 04:40
1144 5/24/2017 04:42
1145 5/25/2017 04:58
I am trying to get something like this, where Velocity is the change in Dept divided by the DateTime gap for each row.
If you are happy to use a 3rd party library, this is straightforward with Pandas:
import pandas as pd
# read file into dataframe
df = pd.read_csv('file.csv')
# convert series to datetime
df['DateTime'] = pd.to_datetime(df['DateTime'])
# perform calculation
df['Velocity'] = df['Dept'].diff() / (df['DateTime'].diff().dt.total_seconds() / 60)
# export to csv
df.to_csv('file_out.csv', index=False)
print(df)
# Dept DateTime Velocity
# 0 1141 2017-05-24 04:31:00 NaN
# 1 1142 2017-05-24 04:32:00 1.000000
# 2 1143 2017-05-24 04:40:00 0.125000
# 3 1144 2017-05-24 04:42:00 0.500000
# 4 1145 2017-05-25 04:58:00 0.000687
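For reference, here is a self-contained variant with the question's data inlined, so no file.csv is needed; the units are feet per minute, since Dept is in feet and the divisor is elapsed minutes:
import pandas as pd
df = pd.DataFrame({'Dept': [1141, 1142, 1143, 1144, 1145],
                   'DateTime': pd.to_datetime(['5/24/2017 04:31', '5/24/2017 04:32',
                                               '5/24/2017 04:40', '5/24/2017 04:42',
                                               '5/25/2017 04:58'])})
# depth change per elapsed minute
df['Velocity'] = df['Dept'].diff() / (df['DateTime'].diff().dt.total_seconds() / 60)
print(df)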

Trouble in plotting dates in PyPlot

I am trying to plot a simple time-series. Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
%matplotlib inline
df = pd.read_csv("sample.csv", parse_dates=['t'])
df[['sq', 'iq', 'rq']] = df[['sq', 'iq', 'rq']].apply(pd.to_numeric, errors='coerce')
df = df.fillna(0)
df.set_index('t')
This is part of the output. Then I plot:
df[['t','sq']].plot()
plt.show()
As you can see, the x-axis in the plot above is not the dates I intended it to show. When I change the plotting call as below, I get the following gibberish plot, although the x-axis is now correct.
df[['t','sq']].plot(x = 't')
plt.show()
Any tips on what I am doing wrong? Please comment and let me know if you need more information about the problem. Thanks in advance.
I think your problem is that although you have parsed the t column, it is not of datetime type. Try the following:
# Set t to date-time and then to index
df['t'] = pd.to_datetime(df['t'])
df.set_index('t', inplace=True)
Reading your comment and the answer you have added, someone may conclude that this kind of problem can only be solved by specifying a parser in pd.read_csv(). So here is proof that my solution works in principle. Looking at what you have posted, the other problem with your code is the way you have specified the plot command. Once t has become the index, you only need to select columns other than t for the plot command.
import pandas as pd
import matplotlib.pyplot as plt
# Read data from file
df = pd.read_csv('C:\\datetime.csv', parse_dates=['Date'])
# Convert Date to date-time and set as index
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df.plot(marker='D')
plt.xlabel('Date')
plt.ylabel('Number of Visitors')
plt.show()
df
Out[37]:
Date Adults Children Seniors
0 2018-01-05 309 240 296
1 2018-01-06 261 296 308
2 2018-01-07 273 249 338
3 2018-01-08 311 250 244
4 2018-01-08 272 234 307
df
Out[39]:
Adults Children Seniors
Date
2018-01-05 309 240 296
2018-01-06 261 296 308
2018-01-07 273 249 338
2018-01-08 311 250 244
2018-01-08 272 234 307
The issue turned out to be incorrect parsing of dates, as pointed out in an answer above. However, the solution for it was to pass a date_parser to the read_csv method call:
from datetime import datetime as dt
dtm = lambda x: dt.strptime(str(x), "%Y-%m-%d")
df = pd.read_csv("sample.csv", parse_dates=['t'], infer_datetime_format = True, date_parser= dtm)

Pandas Frequency Conversion

I'm trying to find out if it is possible to use data.asfreq(MonthEnd()) on data that was not created with date_range.
What I'm trying to achieve: I run a CSV query with the following code:
import numpy as np
import pandas as pd
data = pd.read_csv("https://www.quandl.com/api/v3/datasets/FRED/GDPC1.csv?api_key=", parse_dates=True)
data.columns = ["period", "integ"]
data['period'] = pd.to_datetime(data['period'], infer_datetime_format=True)
Then I want to assign frequency to my 'period' column by doing this:
tdelta = data.period[1] - data.period[0]
data.period.freq = tdelta
And some print commands:
print(data)
print(data.period.freq)
print(data.dtypes)
Returns:
..........
270 1948-07-01 2033.2
271 1948-04-01 2021.9
272 1948-01-01 1989.5
273 1947-10-01 1960.7
274 1947-07-01 1930.3
275 1947-04-01 1932.3
276 1947-01-01 1934.5
[277 rows x 2 columns]
-92 days +00:00:00
period datetime64[ns]
integ float64
dtype: object
I can also parse the original 'DATE' column by making it 'index':
data = pd.read_csv("https://www.quandl.com/api/v3/datasets/FRED/GDPC1.csv?api_key=", parse_dates=True, index_col='DATE')
What I want to do is just to convert the quarterly data into monthly rows. For example:
270 1948-07-01 2033.2
271 1948-06-01 NaN
272 1948-05-01 NaN
273 1948-04-01 2021.9
274 1948-03-01 NaN
275 1948-02-01 NaN
276 1948-01-01 1989.5
......and so on.......
I'm eventually trying to do this by using ts.asfreq(MonthBegin()) and ts.asfreq(MonthBegin(), method='pad'), so far unsuccessfully. I get the following error:
NameError: name 'MonthBegin' is not defined
My question is: can I use asfreq if I don't use date_range to create the frame? Is there some way to 'pass' my date column to the function? If this is not the solution, is there any other easy way to convert quarterly to monthly frequency?
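(As an aside, the NameError simply means MonthBegin was never imported; it lives in pandas.tseries.offsets:)
from pandas.tseries.offsets import MonthBegin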
Use a TimeGrouper:
import pandas as pd
periods = ['1948-07-01', '1948-04-01', '1948-01-01', '1947-10-01',
           '1947-07-01', '1947-04-01', '1947-01-01']
integs = [2033.2, 2021.9, 1989.5, 1960.7, 1930.3, 1932.3, 1934.5]
df = pd.DataFrame({'period': pd.to_datetime(periods), 'integ': integs})
df = df.set_index('period')
df = df.groupby(pd.TimeGrouper('MS')).sum().sort_index(ascending=False)
EDIT: You can also use resample instead of a TimeGrouper:
df.resample('MS').sum().sort_index(ascending=False)
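Note for modern pandas: pd.TimeGrouper has since been removed, and pd.Grouper(freq='MS') is the drop-in replacement. To answer the original question directly, asfreq works on any sorted DatetimeIndex regardless of how it was created; a sketch, assuming df right after the set_index('period') step above:
df.groupby(pd.Grouper(freq='MS')).sum().sort_index(ascending=False)
monthly = df.sort_index().asfreq('MS')  # quarterly -> monthly, NaN in the inserted months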

Column Manipulations with date-Time Pandas

I am trying to do some column manipulations involving rows and columns at the same time, including date-time series, in Pandas. Traditionally, with no series involved, Python dictionaries are great, but with Pandas this is a new thing for me.
Input files (N of them): File1.csv, File2.csv, File3.csv, ..., Filen.csv
File1.csv        File2.csv        File3.csv
Ids,Date-time-1  Ids,Date-time-2  Ids,Date-time-1
56,4568          645,5545         25,54165
45,464           458,546
I am trying to merge the Date-time column of all the files into a big data file with respect to Ids
Ids,Date-time-ref,Date-time-1,date-time-2
56,100,4468,NAN
45,150,314,NAN
645,50,NAN,5495
458,200,NAN,346
25,250,53915,NAN
Check for the date-time column; if it is not present, create it, and then fill the values with respect to Ids by subtracting the date-time-ref value of the respective Ids from the current date-time value.
Fill empty places with NAN, and if the next file has a value for that cell, replace the NAN with the new value.
If it were a straight column subtraction it would be pretty easy, but doing it in sync with the date-time series and with respect to Ids seems a bit confusing.
Appreciate some suggestions to begin with. Thanks in advance.
Here is one way to do it.
import pandas as pd
import numpy as np
from io import StringIO
# your csv file contents
csv_file1 = 'Ids,Date-time-1\n56,4568\n45,464\n'
csv_file2 = 'Ids,Date-time-2\n645,5545\n458,546\n'
# add a duplicated Ids record for testing purpose
csv_file3 = 'Ids,Date-time-1\n25,54165\n645, 4354\n'
csv_file_all = [csv_file1, csv_file2, csv_file3]
# read csv into df using list comprehension
# I use a buffer here; replace StringIO with your file path
df_all = [pd.read_csv(StringIO(csv_file)) for csv_file in csv_file_all]
# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
Out[206]:
Date-time-1 Date-time-2
Ids
56 4568 NaN
45 464 NaN
645 NaN 5545
458 NaN 546
25 54165 NaN
645 4354 NaN
# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    return group.fillna(method='ffill').iloc[-1]
# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)
Out[207]:
Date-time-1 Date-time-2
Ids
25 54165 NaN
45 464 NaN
56 4568 NaN
458 NaN 546
645 4354 5545
# do the subtraction
master_csv_file = 'Ids,Date-time-ref\n56,100\n45,150\n645,50\n458,200\n25,250\n'
df_master = pd.read_csv(StringIO(master_csv_file), index_col=['Ids']).sort_index()
# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)
# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
Out[208]:
Date-time-ref Date-time-1 Date-time-2
Ids
25 250 53915 NaN
45 150 314 NaN
56 100 4468 NaN
458 200 NaN 346
645 50 4304 5495
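One small modernization, if running this on pandas 2.x where fillna(method='ffill') is deprecated: the duplicate-merging helper can be written with .ffill() instead:
def apply_func(group):
    # forward-fill within the duplicated-Ids group, then keep the last row
    return group.ffill().iloc[-1]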
