I need to extract date features (Day, Week, Month, Year) from a date column of a pandas DataFrame using pandasql. I can't seem to find which dialect of SQL pandasql uses, so I'm not sure how to accomplish this. Has anyone else tried something similar?
Here is what I have so far:
#import the needed libraries
import numpy as np
import pandas as pd
import pandasql as psql
#establish dataset
doc = 'room_data.csv'
df = pd.read_csv(doc)
df.head()
df2 = psql.sqldf('''
SELECT
Timestamp
, EXTRACT (DAY FROM "Timestamp") AS Day --DOES NOT WORK IN THIS VERSION OF SQL
, Temperature
, Humidity
FROM df
''')
df2.head()
As far as I know, SQLite (which pandasql runs under the hood) does not support the EXTRACT() function.
You can try strftime('%d', Timestamp)
psql.sqldf('''SELECT
Timestamp
, strftime('%d', Timestamp) AS Day
, Temperature
, Humidity
FROM df
''')
Consider the below example which demonstrates the above query:
Example dataframe:
np.random.seed(123)
dates = pd.date_range('01-01-2020','01-05-2020',freq='H')
temp = np.random.randint(0,100,97)
humidity = np.random.randint(20,100,97)
df = pd.DataFrame({"Timestamp":dates,"Temperature":temp,"Humidity":humidity})
print(df.head())
Timestamp Temperature Humidity
0 2020-01-01 00:00:00 66 29
1 2020-01-01 01:00:00 92 43
2 2020-01-01 02:00:00 98 34
3 2020-01-01 03:00:00 17 58
4 2020-01-01 04:00:00 83 39
Working Query:
import pandasql as ps
query = '''SELECT
Timestamp
, strftime('%d', Timestamp) AS Day
, Temperature
, Humidity
FROM df'''
print(ps.sqldf(query).head())
Timestamp Day Temperature Humidity
0 2020-01-01 00:00:00.000000 01 66 29
1 2020-01-01 01:00:00.000000 01 92 43
2 2020-01-01 02:00:00.000000 01 98 34
3 2020-01-01 03:00:00.000000 01 17 58
4 2020-01-01 04:00:00.000000 01 83 39
You can find more date-extraction format codes in the SQLite strftime() documentation; common ones are shown below:
import pandasql as ps
query = '''SELECT
Timestamp
, strftime('%d', Timestamp) AS Day
, strftime('%m', Timestamp) AS Month
, strftime('%Y', Timestamp) AS Year
, strftime('%H', Timestamp) AS Hour
, Temperature
, Humidity
FROM df'''
print(ps.sqldf(query).head())
Timestamp Day Month Year Hour Temperature Humidity
0 2020-01-01 00:00:00.000000 01 01 2020 00 66 29
1 2020-01-01 01:00:00.000000 01 01 2020 01 92 43
2 2020-01-01 02:00:00.000000 01 01 2020 02 98 34
3 2020-01-01 03:00:00.000000 01 01 2020 03 17 58
4 2020-01-01 04:00:00.000000 01 01 2020 04 83 39
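The original question also asked for Week, which SQLite covers with the '%W' format code (week of year, 00-53, with Monday as the first day of the week). A minimal sketch using the stdlib sqlite3 module to demonstrate the same codes pandasql would execute:

```python
import sqlite3

# Run SQLite's strftime() codes on a sample timestamp; '%W' is the week of year.
conn = sqlite3.connect(":memory:")
row = conn.execute(
    "SELECT strftime('%d', ts) AS Day,"
    "       strftime('%W', ts) AS Week,"
    "       strftime('%m', ts) AS Month,"
    "       strftime('%Y', ts) AS Year "
    "FROM (SELECT '2020-01-01 05:00:00' AS ts)"
).fetchone()
print(row)  # ('01', '00', '01', '2020')
```

Note that 2020-01-01 falls in week '00' because '%W' counts days before the year's first Monday as week zero.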
If you don't need pandasql, plain pandas can extract the same features directly (assuming a datetime column named date):
df['year'] = pd.DatetimeIndex(df['date']).year
df['month'] = pd.DatetimeIndex(df['date']).month
df['day'] = pd.DatetimeIndex(df['date']).day
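Equivalently, the `.dt` accessor avoids constructing a DatetimeIndex for each feature; a quick sketch with hypothetical sample data:

```python
import pandas as pd

# Hypothetical two-row frame; .dt works on any datetime64 column.
df = pd.DataFrame({"date": pd.to_datetime(["2020-01-31", "2021-06-15"])})
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
print(df[["year", "month", "day"]].values.tolist())  # [[2020, 1, 31], [2021, 6, 15]]
```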
I'm new to pandas and am trying to do an aggregation. I converted the DataFrame's date column to datetime and added an index that changes for every day.
import datetime

model['time_only'] = [time.time() for time in model['date']]
model['date_only'] = [date.date() for date in model['date']]
model['cumsum'] = ((model['date_only'].diff() == datetime.timedelta(days=1))*1).cumsum()
def get_out_of_market_data(data):
    df = data.copy()
    start_market_time = datetime.time(hour=13, minute=30)
    end_market_time = datetime.time(hour=20, minute=0)
    df['time_only'] = [time.time() for time in df['date']]
    df['date_only'] = [date.date() for date in df['date']]
    cond = (start_market_time > df['time_only']) | (df['time_only'] >= end_market_time)
    return data[cond]
model['date'] = pd.to_datetime(model['date'])
new = model.drop(columns=['time_only', 'date_only'])
get_out_of_market_data(data=new).head(20)
What I get:
0 0 65.5000 65.50 65.5000 65.500 DD 1 125 65.500000 2016-01-04 13:15:00 0
26 26 62.7438 62.96 62.6600 62.956 DD 1639 174595 62.781548 2016-01-04 20:00:00 0
27 27 62.5900 62.79 62.5300 62.747 DD 2113 268680 62.650260 2016-01-04 20:15:00 0
28 28 62.7950 62.80 62.5400 62.590 DD 2652 340801 62.652640 2016-01-04 20:30:00 0
29 29 63.1000 63.12 62.7800 62.800 DD 6284 725952 62.963512 2016-01-04 20:45:00 0
30 30 63.2200 63.22 63.0700 63.080 DD 21 699881 63.070114 2016-01-04 21:00:00 0
31 31 63.2200 63.22 63.2200 63.220 DD 7 1973 63.220000 2016-01-04 22:00:00 0
32 32 63.4000 63.40 63.4000 63.400 DD 2 150 63.400000 2016-01-05 00:30:00 1
33 33 62.3700 62.37 62.3700 62.370 DD 3 350 62.370000 2016-01-05 11:00:00 1
34 34 62.1000 62.37 62.1000 62.370 DD 2 300 62.280000 2016-01-05 11:15:00 1
35 35 62.0800 62.08 62.0800 62.080 DD 1 100 62.080000 2016-01-05 11:45:00 1
The last two columns are the time interval from 20:00 to 13:30 and the index that changes with each day.
I tried to group by the last column over the interval from 20:00 one day to 13:00 the next, indexing each interval through the groupby.
I do not fully understand the method, but for example:
new.groupby(pd.Grouper(freq='17hours'))
How do I move the indexing to this interval?
You could try creating a new column to represent the market day each row belongs to. If the time is before 13:30:00, it belongs to the previous day's market day; otherwise it is today's market day. Then you can group by it. The code will be:
import datetime

def get_market_day(dt):
    if dt.time() < datetime.time(13, 30, 0):
        return dt.date() - datetime.timedelta(days=1)
    else:
        return dt.date()

df["market_day"] = df["dt"].map(get_market_day)
df.groupby("market_day").agg(...)
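A runnable sketch of this market-day assignment, using a hypothetical 'dt' column with three sample timestamps:

```python
import datetime

import pandas as pd

# Times before 13:30 are attributed to the previous day's market session.
def get_market_day(dt):
    if dt.time() < datetime.time(13, 30, 0):
        return dt.date() - datetime.timedelta(days=1)
    return dt.date()

df = pd.DataFrame({"dt": pd.to_datetime([
    "2016-01-04 20:15:00",  # after 13:30 -> the 2016-01-04 session
    "2016-01-05 00:30:00",  # before 13:30 -> still the 2016-01-04 session
    "2016-01-05 14:00:00",  # after 13:30 -> the 2016-01-05 session
])})
df["market_day"] = df["dt"].map(get_market_day)
print(df.groupby("market_day").size())
```

The first two rows land in the same group even though they fall on different calendar days, which is exactly the 20:00-to-13:30 bucketing the question asks for.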
I have a csv file in the format:
20 05 2019 12:00:00, 100
21 05 2019 12:00:00, 200
22 05 2019 12:00:00, 480
And I want to access the second variable; I've tried a variety of alterations but none have worked.
Initially I tried:
import pandas as pd
import numpy as np
col = [i for i in range(2)]
col[1] = "Power"
data = pd.read_csv('FILENAME.csv', names=col)
df1 = data.sum(data, axis=1)
df2 = np.cumsum(df1)
print(df2)
You can use the cumsum function:
data['Power'].cumsum()
Output:
0 100
1 300
2 780
Name: Power, dtype: int64
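A self-contained sketch of the same idea, with the CSV contents rebuilt inline as a stand-in for FILENAME.csv:

```python
import pandas as pd

# Rebuild the two-column CSV contents inline instead of reading FILENAME.csv.
data = pd.DataFrame({
    "Date": ["20 05 2019 12:00:00", "21 05 2019 12:00:00", "22 05 2019 12:00:00"],
    "Power": [100, 200, 480],
})
data["Running total"] = data["Power"].cumsum()
print(data["Running total"].tolist())  # [100, 300, 780]
```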
Use df.cumsum:
In [1820]: df = pd.read_csv('FILENAME.csv', names=col)
In [1821]: df
Out[1821]:
0 Power
0 20 05 2019 12:00:00 100
1 21 05 2019 12:00:00 200
2 22 05 2019 12:00:00 480
In [1823]: df['cumulative sum'] = df['Power'].cumsum()
In [1824]: df
Out[1824]:
0 Power cumulative sum
0 20 05 2019 12:00:00 100 100
1 21 05 2019 12:00:00 200 300
2 22 05 2019 12:00:00 480 780
I need a column for the df that will be used to group it by weeks.
The problem is that all the reports in Tableau are built using the following format for week: 2019-01-01, i.e. each week is represented by its first day (for Mon-Sun weeks).
Data:
cw = pd.DataFrame({"lead_date": ["2019-01-01 00:02:16", "2018-08-01 00:02:16", "2017-07-07 00:02:16", "2015-12-01 00:02:16", "2016-09-01 00:02:16"],
                   "name": ["aa", "bb", "cc", "dd", "EE"]})
My code:
# extracting
cw["week"] = cw["lead_date"].apply(lambda df: df.strftime("%W") )
cw["month"] = cw["lead_date"].apply(lambda df: df.strftime("%m") )
cw["year"] = cw["lead_date"].apply(lambda df: df.strftime("%Y") )
Output:
lead_date year month week
2019-01-01 00:02:16, 2019 , 01 , 00
-
-
-
etc..
Desired output:
having week in date format rather than just 00 or 01, etc.
lead_date year month week
2019-01-01 00:02:16, 2019 , 01 , 2019-01-01
2019-01-15 00:02:16, 2019 , 01 , 2019-01-14
2019-01-25 00:02:16, 2019 , 01 , 2019-01-21
2019-01-28 00:02:16, 2019 , 01 , 2019-01-21
You can do it like this:
from datetime import datetime, timedelta
cw['lead_date'].apply(lambda r: datetime.strptime(r, '%Y-%m-%d %H:%M:%S') - timedelta(days=datetime.strptime(r, '%Y-%m-%d %H:%M:%S').weekday()))
This will set every date to the starting day of its week. Note that the format string must include the time component to match values like '2019-01-01 00:02:16'.
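A runnable sketch of this approach, assuming lead_date holds strings with a time component, matching the question's sample values:

```python
from datetime import datetime, timedelta

import pandas as pd

cw = pd.DataFrame({"lead_date": ["2019-01-15 00:02:16", "2019-01-25 00:02:16"]})

def week_start(r):
    # Parse the string, then step back to the Monday of its week.
    d = datetime.strptime(r, "%Y-%m-%d %H:%M:%S")
    return (d - timedelta(days=d.weekday())).date()

weeks = cw["lead_date"].apply(week_start)
print(weeks.tolist())  # [datetime.date(2019, 1, 14), datetime.date(2019, 1, 21)]
```

This reproduces the desired-output rows where 2019-01-15 maps to 2019-01-14 and 2019-01-25 maps to 2019-01-21.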
You can do it as follows using pandas.DatetimeIndex.dayofweek and pandas.Timedelta()
(note that the first day of the week containing 2019-01-01 is 2018-12-31):
import pandas as pd
cw = pd.DataFrame({"lead_date" : pd.DatetimeIndex([
"2019-01-01 00:02:16", "2018-08-01 00:02:16" , "2017-07-07 00:02:16",
"2015-12-01 00:02:16", "2016-09-01 00:02:16"]),
"name": ["aa","bb","cc", "dd", "EE"]})
# extracting
cw["month"] = cw["lead_date"].apply(lambda df: df.strftime("%m") )
cw["year"] = cw["lead_date"].apply(lambda df: df.strftime("%Y") )
cw["week"] = (cw["lead_date"] - (cw["lead_date"].dt.dayofweek *
              pd.Timedelta(days=1))).values.astype('M8[D]')
print(cw[["lead_date", "year", "month", "week"]])
Out:
lead_date year month week
0 2019-01-01 00:02:16 2019 01 2018-12-31
1 2018-08-01 00:02:16 2018 08 2018-07-30
2 2017-07-07 00:02:16 2017 07 2017-07-03
3 2015-12-01 00:02:16 2015 12 2015-11-30
4 2016-09-01 00:02:16 2016 09 2016-08-29
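An alternative sketch: pandas can bucket dates into Mon-Sun weeks directly with to_period('W'), whose start_time recovers the Monday that begins each week:

```python
import pandas as pd

# to_period('W') uses Mon-Sun weeks by default; start_time is that week's Monday.
s = pd.to_datetime(pd.Series(["2019-01-01 00:02:16", "2018-08-01 00:02:16"]))
week_start = s.dt.to_period("W").dt.start_time.dt.date
print(week_start.tolist())  # [datetime.date(2018, 12, 31), datetime.date(2018, 7, 30)]
```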
I think this gets you the output you want:
cw = pd.DataFrame({ "lead_date" : [pd.to_datetime('2019-01-01 00:02:16'), pd.to_datetime('2018-08-01 00:02:16') , pd.to_datetime('2017-07-07 00:02:16'), pd.to_datetime('2015-12-01 00:02:16'), pd.to_datetime('2016-09-01 00:02:16')] ,
"name": ["aa","bb","cc", "dd", "EE"] })
cw["year"] = cw["lead_date"].apply(lambda df: df.strftime("%Y") )
cw["month"] = cw["lead_date"].apply(lambda df: df.strftime("%m") )
cw["week"] = cw["lead_date"].apply(lambda df: df.strftime("%Y-%m-%d") )
cw.drop(columns='name', inplace=True)
output:
lead_date year month week
0 2019-01-01 00:02:16 2019 01 2019-01-01
1 2018-08-01 00:02:16 2018 08 2018-08-01
2 2017-07-07 00:02:16 2017 07 2017-07-07
3 2015-12-01 00:02:16 2015 12 2015-12-01
4 2016-09-01 00:02:16 2016 09 2016-09-01
I am currently working on a dataset of 8,000 rows.
I want to split my date column into day, month and year columns. The dtype of the date column is object.
How do I convert the whole column this way?
A sample of the date of my dataset is shown below:
date
01-01-2016
01-01-2016
01-01-2016
01-01-2016
01-01-2016
df=pd.DataFrame(columns=['date'])
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
print(df)
dt=datetime.strptime('date',"%d-%m-%y")
print(dt)
This is the code I am using for the date splitting, but it raises the error below:
ValueError: time data 'date' does not match format '%d-%m-%y'
If you have pandas you can do this:
import pandas as pd
# Recreate your dataframe
df = pd.DataFrame(dict(date=['01-01-2016']*6))
df.date = pd.to_datetime(df.date)
# Create 3 new columns
df[['year','month','day']] = df.date.apply(lambda x: pd.Series(x.strftime("%Y,%m,%d").split(",")))
df
Returns
date year month day
0 2016-01-01 2016 01 01
1 2016-01-01 2016 01 01
2 2016-01-01 2016 01 01
3 2016-01-01 2016 01 01
4 2016-01-01 2016 01 01
5 2016-01-01 2016 01 01
Or without the formatting options:
df['year'],df['month'],df['day'] = df.date.dt.year, df.date.dt.month, df.date.dt.day
df
Returns
date year month day
0 2016-01-01 2016 1 1
1 2016-01-01 2016 1 1
2 2016-01-01 2016 1 1
3 2016-01-01 2016 1 1
4 2016-01-01 2016 1 1
5 2016-01-01 2016 1 1
I found this but can't get the syntax correct.
time.asctime(time.strptime('2017 28 1', '%Y %W %w'))
I want to set a new column to show the month in the format "201707" for July. It can be int64 or string; it doesn't have to be an actual readable date in the column.
My dataframe column ['Week'] is in the format 201729, i.e. YYYYWW.
dfAttrition_Billings_KPIs['Day_1'] = \
time.asctime(time.strptime(dfAttrition_Billings_KPIs['Week'].str[:4]
+ dfAttrition_Billings_KPIs['Month'].str[:-2] - 1 + 1', '%Y %W %w'))
So I want the rows that have week 201729 to show 201707 in a new month field; the output depends on each row's value in 'Week'.
I have a million records, so I would like to avoid row iteration, lambdas and slow functions where possible :)
Use to_datetime with the format parameter, appending '1' so each value parses as that week's Monday; then use strftime for the YYYYMM format:
df = pd.DataFrame({'date':[201729,201730,201735]})
df['date1']=pd.to_datetime(df['date'].astype(str) + '1', format='%Y%W%w')
df['date2']=pd.to_datetime(df['date'].astype(str) + '1', format='%Y%W%w').dt.strftime('%Y%m')
print (df)
date date1 date2
0 201729 2017-07-17 201707
1 201730 2017-07-24 201707
2 201735 2017-08-28 201708
If you need to convert from datetime to the custom weeks format:
df = pd.DataFrame({'date':pd.date_range('2017-01-01', periods=10)})
df['date3'] = df['date'].dt.strftime('%Y %W %w')
print (df)
date date3
0 2017-01-01 2017 00 0
1 2017-01-02 2017 01 1
2 2017-01-03 2017 01 2
3 2017-01-04 2017 01 3
4 2017-01-05 2017 01 4
5 2017-01-06 2017 01 5
6 2017-01-07 2017 01 6
7 2017-01-08 2017 01 0
8 2017-01-09 2017 02 1
9 2017-01-10 2017 02 2