I have a summary dataframe. I want to extract quarterly data and export it to the quarterly folders created already.
My code:
ad = pd.DataFrame({"sensor_value":[10,20]},index=['2019-01-01 05:00:00','2019-06-01 05:00:00'])
ad =
                     sensor_value
2019-01-01 05:00:00            10
2019-06-01 05:00:00            20
ad.index = pd.to_datetime(ad.index,format = '%Y-%m-%d %H:%M:%S')
# create quarter column
ad['quarter'] = ad.index.to_period('Q')
ad =
                     sensor_value quarter
2019-01-01 05:00:00            10  2019Q1
2019-06-01 05:00:00            20  2019Q2
# quarters list
qt_list = ad['quarter'].unique()
# extract data for each quarter and store it in the corresponding folder that already exists
fold_location = 'C:\\Data'
for i in qt_list:
    auxdf = ad[ad['quarter'] == i]
    save_loc = fold_location + '\\' + str(i)
    auxdf.to_csv(save_loc + '\\' + 'Sensor_1minData_%s.csv' % i)
Is there a better way of doing it?
Thanks
You can use groupby, with something like:
for quarter, df in ad.groupby('quarter'):
    df.to_csv(f"C:\\Data\\{quarter}\\Sensor_1minData_{quarter}.csv")
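A self-contained sketch of the groupby approach, writing to a temporary directory so it runs anywhere (with your pre-existing quarterly folders you would drop the `os.makedirs` call and use your real root path):

```python
import os
import tempfile

import pandas as pd

# Toy data mirroring the question's summary dataframe.
ad = pd.DataFrame(
    {"sensor_value": [10, 20]},
    index=pd.to_datetime(["2019-01-01 05:00:00", "2019-06-01 05:00:00"]),
)
ad["quarter"] = ad.index.to_period("Q")

# Temporary directory stands in for the real folder root.
root = tempfile.mkdtemp()
for quarter, group in ad.groupby("quarter"):
    folder = os.path.join(root, str(quarter))
    os.makedirs(folder, exist_ok=True)  # no-op if the folder already exists
    group.drop(columns="quarter").to_csv(
        os.path.join(folder, f"Sensor_1minData_{quarter}.csv")
    )

print(sorted(os.listdir(root)))  # ['2019Q1', '2019Q2']
```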
I have 3 columns in the dataset to which I want to add dates:

Date                  temperature  humidity
2015-01-01 00:00:00   5.9          NA
2015-01-01 01:00:00   5.5          NA
⋮                     ⋮            ⋮
2015-01-01 23:00:00   7            NA
I want to add 2 months of dates, from 1st May to 31 July, to the Date column. With hourly steps it would look something like this:
Date                  temperature  humidity
⋮                     ⋮            ⋮
2015-01-01 23:00:00   7            NA
2015-05-01 00:00:00   ..           NA
2015-05-01 01:00:00   ..           NA
⋮                     ⋮            ⋮
until I get to:

Date                  temperature  humidity
⋮                     ⋮            ⋮
2015-07-31 23:00:00   ..           NA
I've tried:
import datetime

date = datetime.datetime(2015, 3, 31, 23, 0, 0)
for i in range(32):
    date += datetime.timedelta(hours=1)
    print(date)
Is there an easier way to do it?
Well, you have a start with the iteration using datetime.datetime and datetime.timedelta.
You need to accumulate those dates in a list, rather than printing them:
listofdates = []
date = datetime.datetime(2015, 3, 31, 23, 0, 0)
for i in range(32):
    date += datetime.timedelta(hours=1)
    listofdates.append(date)
And then, create a new dataframe from the existing one (let's call it df) and this list of dates. To do so, you can use pd.concat, which builds one dataframe from two.
So you need a dataframe built from your new list of dates, with the same column name:
newlines = pd.DataFrame({'Date':listofdates})
Which gives
Date
2015-04-01 00:00:00
⋮
2015-04-02 07:00:00
(Note that it starts at 4/1 00:00, not 3/31 23:00, because you add the timedelta before appending.)
We can concatenate your dataframe with this one (missing columns will be filled with NA) like this
newdf = pd.concat([df, newlines])
One last remark that I kept for the end to avoid confusion:
I would have stored the timedelta once and for all, rather than creating a new one on each iteration (it is not that expensive, but still).
So, altogether:
date = datetime.datetime(2015, 3, 31, 23, 0, 0)
dt = datetime.timedelta(hours=1)
listofdates = []
for i in range(32):
    date += dt
    listofdates.append(date)
newlines = pd.DataFrame({'Date': listofdates})
newdf = pd.concat([df, newlines])
For this kind of usage, you can also build the list directly with a list comprehension:
listofdates = [date + k*dt for k in range(1, 33)]
Or using numpy:
listofdates = date + np.arange(1, 33)*dt
Which allows for a one-liner:
newdf = pd.concat([df, pd.DataFrame({'Date': date + np.arange(1, 33)*dt})])
But don't try to understand this one until you have understood the longer version described above.
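Putting the pieces together in one runnable sketch (the toy `df` below stands in for the question's data; column names assumed from the question):

```python
import datetime

import numpy as np
import pandas as pd

# Toy frame standing in for the question's df.
df = pd.DataFrame({
    "Date": [datetime.datetime(2015, 3, 31, 22, 0, 0),
             datetime.datetime(2015, 3, 31, 23, 0, 0)],
    "temperature": [6.1, 7.0],
})

date = datetime.datetime(2015, 3, 31, 23, 0, 0)
dt = datetime.timedelta(hours=1)

# 32 new hourly timestamps, starting one hour after `date`.
newlines = pd.DataFrame({"Date": date + np.arange(1, 33) * dt})

# Missing columns (temperature) are filled with NaN in the new rows.
newdf = pd.concat([df, newlines], ignore_index=True)

print(len(newdf))              # 34
print(newdf["Date"].iloc[-1])  # 2015-04-02 07:00:00
```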
I have a large data set from which I'm trying to produce a time series using ARIMA. However,
some of the data in the date column has multiple rows with the same date.
The dates were entered this way because the exact date of the event was not known; unknown dates were entered as the first of that month (biased). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not the same. I have the code to randomise a date, but I cannot find a way to replace the dates in the dataframe with the randomised ones.
import random
import time

def str_time_prop(start, end, time_format, prop):
    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(time_format, time.localtime(ptime))

def random_date(start, end, prop):
    return str_time_prop(start, end, '%Y-%m-%d', prop)

# check that the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
The code above generates a random date within a date range, but I'm struggling to find a way to replace the dates.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd
df = pd.DataFrame(
    {
        "date": [
            "2016-01-01",
            "2015-01-01",
            "2013-01-01",
            "2014-01-01",
            "2017-01-01",
            "2011-01-01",
        ],
        "value": [10035, 5397, 4567, 4343, 3981, 2049],
    }
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
    random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Output
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
Since the data in the date column has multiple rows with the same date, and you want to randomize the dates within the month, you could group by year and month and select only the rows whose day equals 1. Then use calendar.monthrange to find the last day of the month for that particular year, and use that information when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np
np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])
# group multiple rows with the same year and month, and day equal to 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day == 1])
FIRST_DAY = 2  # set for the desired range
df_list = []
for n, g in grouped:
    last_day = calendar.monthrange(n[0], n[1])[1]  # last day of this month and year
    g['New_Date'] = g['date'].apply(
        lambda d: d.replace(day=np.random.randint(FIRST_DAY, last_day + 1))
    )
    df_list.append(g)
new_df = pd.concat(df_list)
print(new_df)
Output from new_df:

        date    num    New_Date
2 2013-01-01   4567  2013-01-08
3 2014-01-01   4343  2014-01-21
1 2015-01-01   5397  2015-01-30
0 2016-01-01  10035  2016-01-16
4 2017-01-01   3981  2017-01-12
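A more compact, vectorized sketch of the same idea (column names `date`/`value` assumed, matching the toy frame above; `days_in_month` keeps each shifted day inside its own month, so no explicit monthrange loop is needed):

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# Toy data with placeholder first-of-month dates, as in the question.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-01-01", "2015-01-01", "2013-01-01", "2014-01-01"]
    ),
    "value": [10035, 5397, 4567, 4343],
})

# Only rows whose day is 1 are placeholders; shift each by a random
# number of days while staying inside its month (days_in_month is
# vectorized, so randint draws per-row upper bounds).
mask = df["date"].dt.day == 1
offset = np.random.randint(1, df.loc[mask, "date"].dt.days_in_month)
df.loc[mask, "date"] = df.loc[mask, "date"] + pd.to_timedelta(offset, unit="D")
```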
I have a column in a pandas dataframe that contains two types of information: 1. date and time, 2. company name. I have to split the column into two (date_time, full_company_name). First I tried to split the column based on character count (first 19 characters to one column, the rest to the other), but then I realized that the date is sometimes missing, so the split might not work. Then I tried using regex, but I can't seem to extract it correctly.
If the dates are all properly formatted, maybe you don't have to use regex
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})
df["date"] = pd.to_datetime(df.A.str[:19])
df["company"] = df.A.str[19:]
df
# A date company
# 0 2021-01-01 05:00:00Acme Industries 2021-01-01 05:00:00 Acme Industries
# 1 2021-01-01 06:00:00Acme LLC 2021-01-01 06:00:00 Acme LLC
OR
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*)")
Note:
If you have an option to avoid concatenating those strings to begin with, please do so. This is not a healthy habit.
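For completeness, `extract` returns one column per capturing group, so both pieces can be assigned back in a single step (a sketch using the same toy frame; the new column names match the question's desired output):

```python
import pandas as pd

df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})

# extract() returns a DataFrame with one column per capturing group,
# so both target columns can be filled in one assignment.
df[["date_time", "full_company_name"]] = df["A"].str.extract(
    r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*)"
)
df["date_time"] = pd.to_datetime(df["date_time"])

print(df["full_company_name"].tolist())  # ['Acme Industries', 'Acme LLC']
```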
Solution (not that pretty, but it gets the job done):
import pandas as pd
from datetime import datetime
import re
df = pd.DataFrame()
# creating a list of companies
companies = ['Google', 'Apple', 'Microsoft', 'Facebook', 'Amazon', 'IBM',
             'Oracle', 'Intel', 'Yahoo', 'Alphabet']
# creating a list of random datetime objects
dates = [datetime(year=2000 + i, month=1, day=1) for i in range(10)]
# creating the column named 'date_time/full_company_name'
df['date_time/full_company_name'] = [f'{str(dates[i])}{companies[i]}' for i in range(len(companies))]
# Before:
# date_time/full_company_name
# 2000-01-01 00:00:00Google
# 2001-01-01 00:00:00Apple
# 2002-01-01 00:00:00Microsoft
# 2003-01-01 00:00:00Facebook
# 2004-01-01 00:00:00Amazon
# 2005-01-01 00:00:00IBM
# 2006-01-01 00:00:00Oracle
# 2007-01-01 00:00:00Intel
# 2008-01-01 00:00:00Yahoo
# 2009-01-01 00:00:00Alphabet
new_rows = []
for row in df['date_time/full_company_name']:
    # extract the date_time from the row using regex
    date_time = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', row)
    # handle the case of an empty date_time
    date_time = date_time.group() if date_time else ''
    # extract the company name from where the date_time ends
    company_name = row[len(date_time):]
    # collect the extracted date_time and company_name
    new_rows.append([date_time, company_name])
# drop the column 'date_time/full_company_name'
df = df.drop(columns=['date_time/full_company_name'])
# add the new columns to the dataframe: 'date_time' and 'full_company_name'
df['date_time'] = [row[0] for row in new_rows]
df['full_company_name'] = [row[1] for row in new_rows]
# After:
# date_time full_company_name
# 2000-01-01 00:00:00 Google
# 2001-01-01 00:00:00 Apple
# 2002-01-01 00:00:00 Microsoft
# 2003-01-01 00:00:00 Facebook
# 2004-01-01 00:00:00 Amazon
# 2005-01-01 00:00:00 IBM
# 2006-01-01 00:00:00 Oracle
# 2007-01-01 00:00:00 Intel
# 2008-01-01 00:00:00 Yahoo
# 2009-01-01 00:00:00 Alphabet
Make the date group optional with ?, and leave the trailing .* outside any group so it isn't captured, i.e. use ?.* instead of (.*):
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?.*")
#df_test.head(3)
ID Datetime
0 18288 26-09-2014 00:00
1 18289 26-09-2014 01:00
2 18290 26-09-2014 02:00
#df_test['Datetime'] = pd.to_datetime(df_test['Datetime'])
#df_test = df_test.set_index('Datetime')
Datetime ID
2014-09-26 00:00:00 18288
2014-09-26 01:00:00 18289
2014-09-26 02:00:00 18290
# Converting to daily mean
df_test_daily = df_test.resample('D').mean()
# model.predict(df_test_daily)
After making the traffic prediction counts on the daily data, how can we convert them to hourly predictions?
Just change the 'D' to 'H':
# Converting to hourly mean
df_test_hourly = df_test.resample('H').mean()
More info here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
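Note that going from daily to hourly is upsampling, so the new hourly rows come out as NaN unless you pick a fill rule. A minimal sketch (toy values; the `prediction` name is assumed, not from the question):

```python
import pandas as pd

# Daily predictions, standing in for model output (values assumed).
daily = pd.Series(
    [24.0, 48.0],
    index=pd.date_range("2014-09-26", periods=2, freq="D"),
    name="prediction",
)

# Upsample to hourly; ffill repeats each day's value for its 24 hours,
# while interpolate() would instead draw a line between the daily points.
hourly = daily.resample("h").ffill()
print(len(hourly))  # 25
```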
I have a time series that I resampled into this dataframe df.
My data runs from 6 June to 28 June. I want to extend it from 1 June to 30 June; the count column should have 0 in the extended period only, keeping my real values from the 6th to the 28th.
Out[123]:
count
Timestamp
2009-06-07 02:00:00 1
2009-06-07 03:00:00 0
2009-06-07 04:00:00 0
2009-06-07 05:00:00 0
2009-06-07 06:00:00 0
I need to make the
start date: 2009-06-01 00:00:00
end date: 2009-06-30 23:00:00
so the data would look something like this:
count
Timestamp
2009-06-01 01:00:00 0
2009-06-01 02:00:00 0
2009-06-01 03:00:00 0
Is there an effective way to perform this? The only way I can think of is not that effective; I have been trying since yesterday. Please help.
index = pd.date_range('2009-06-01 00:00:00', '2009-06-30 23:00:00', freq='H')
df = pd.DataFrame(np.zeros((len(index), 1)), index=index)
df.columns = ['zeros']
result = pd.concat([df2, df])
result1 = pd.concat([df, result])
result1.fillna(0)
del result1['zeros']
You can create a new index with the desired start and end day/times, resample the time series data and aggregate by count, then set the index to the new index.
import pandas as pd
# create the index with the start and end times you want
t_index = pd.DatetimeIndex(pd.date_range(start='2009-06-01', end='2009-06-30 23:00:00', freq="1h"))
# create the data frame
df = pd.DataFrame([['2009-06-07 02:07:42'],
['2009-06-11 17:25:28'],
['2009-06-11 17:50:42'],
['2009-06-11 17:59:18']], columns=['daytime'])
df['daytime'] = pd.to_datetime(df['daytime'])
# resample the data to 1 hour, aggregate by counts,
# then reset the index and fill the na's with 0
df2 = df.resample('1h', on='daytime').count().reindex(t_index).fillna(0)
DatetimeIndex() no longer works with those arguments, raises __new__() got an unexpected keyword argument 'start'
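On recent pandas versions the explicit wrapper isn't needed at all, since date_range already returns a DatetimeIndex; a sketch of the same idea using size() for the hourly counts (so no dummy value column is required):

```python
import pandas as pd

# date_range already returns a DatetimeIndex, so no wrapper is needed.
t_index = pd.date_range(start="2009-06-01", end="2009-06-30 23:00:00", freq="h")

df = pd.DataFrame([["2009-06-07 02:07:42"],
                   ["2009-06-11 17:25:28"],
                   ["2009-06-11 17:50:42"],
                   ["2009-06-11 17:59:18"]], columns=["daytime"])
df["daytime"] = pd.to_datetime(df["daytime"])

# Hourly event counts over the full June range, 0 where nothing happened.
counts = df.resample("h", on="daytime").size().reindex(t_index, fill_value=0)
print(len(counts))  # 720
```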